mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-11-08 14:58:42 +00:00
e4e981b6f1
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1067 48356398-32a2-884e-a903-53898d9a118a
XHTML 1.1 and HTML Purifier

[needs updating, most of this is implemented]

Todo for XHTML 1.1 support <http://www.w3.org/TR/xhtml11/changes.html>
1. Scratch lang entirely in favor of xml:lang
2. Scratch name entirely in favor of id (partially-done)
3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>

HTML Purifier uses the modularization of XHTML
<http://www.w3.org/TR/xhtml-modularization/> to organize the internals
of HTMLDefinition in a more manageable and extensible fashion. Rather
than have one super-object, HTMLDefinition is split into HTMLModules,
each of which is responsible for defining elements, their attributes,
and other properties (for more in-depth coverage, see
/library/HTMLPurifier/HTMLModule.php's docblock comments).

The modules that W3C defines and we support are:

    * 5.1. Attribute Collections (technically not a module)
    * 5.2. Core Modules
        o 5.2.2. Text Module
        o 5.2.3. Hypertext Module
        o 5.2.4. List Module
    * 5.4. Text Extension Modules
        o 5.4.1. Presentation Module
        o 5.4.2. Edit Module
        o 5.4.3. Bi-directional Text Module
    * 5.6. Table Modules
        o 5.6.2. Tables Module
    * 5.7. Image Module
    * 5.18. Style Attribute Module

Modules that we don't support but could support are:

    * 5.6. Table Modules
        o 5.6.1. Basic Tables Module [?]
    * 5.8. Client-side Image Map Module [?]
    * 5.9. Server-side Image Map Module [?]
    * 5.12. Target Module [?]
    * 5.21. Name Identification Module [deprecated]
    * 5.22. Legacy Module [deprecated]

These modules will not be implemented because they are dangerous or
inapplicable to an XHTML fragment:

    * 5.2. Core Modules
        o 5.2.1. Structure Module
    * 5.3. Applet Module
    * 5.5. Forms Modules
        o 5.5.1. Basic Forms Module
        o 5.5.2. Forms Module
    * 5.10. Object Module
    * 5.11. Frames Module
    * 5.13. Iframe Module
    * 5.14. Intrinsic Events Module
    * 5.15. Metainformation Module
    * 5.16. Scripting Module
    * 5.17. Style Sheet Module
    * 5.19. Link Module
    * 5.20. Base Module

We will not be using W3C's XML Schemas or DTDs directly due to the lack
of robust tools for handling them (the main problem is that the
current parsers are mostly PHP 5 only and purely validating, not
correcting).

The abstraction of the HTMLDefinition creation process will also
contribute to a need for a caching system. Cache invalidation would be
difficult, but could be done by comparing the HTML and Attr config
namespaces with a copy that was packaged along with the serialized
HTMLDefinition object.

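The invalidation idea above could be sketched roughly as follows. This is
only an illustration, not the library's cache: the function names are
hypothetical, the configuration is modelled as a plain nested array, and
the directive values are placeholders.

```php
<?php
// Hypothetical sketch: none of these functions exist in HTML Purifier.

/** Extract only the config namespaces that influence the definition. */
function relevant_namespaces(array $config)
{
    $relevant = array();
    foreach (array('HTML', 'Attr') as $ns) {
        $relevant[$ns] = isset($config[$ns]) ? $config[$ns] : array();
        ksort($relevant[$ns]); // normalize key order for strict comparison
    }
    return $relevant;
}

/** Serialize a definition together with the namespaces it was built from. */
function pack_definition($definition, array $config)
{
    return serialize(array(
        'namespaces' => relevant_namespaces($config),
        'definition' => $definition,
    ));
}

/** Return the cached definition, or null if the packaged copy is stale. */
function unpack_definition($blob, array $config)
{
    $cached = unserialize($blob);
    if ($cached['namespaces'] !== relevant_namespaces($config)) {
        return null; // HTML or Attr settings changed: rebuild needed
    }
    return $cached['definition'];
}

// Assumed config layout: namespace => (directive => value)
$config = array(
    'HTML' => array('Strict' => false),
    'Attr' => array('EnableID' => false),
    'Core' => array('Encoding' => 'utf-8'),
);
$blob = pack_definition(array('elements' => array('p', 'b')), $config);

$same = unpack_definition($blob, $config); // cache hit

$changed = $config;
$changed['Attr']['EnableID'] = true;
$stale = unpack_definition($blob, $changed); // cache miss

$unrelated = $config;
$unrelated['Core']['Encoding'] = 'iso-8859-1';
$still_valid = unpack_definition($blob, $unrelated); // other namespaces ignored
```

Only the HTML and Attr namespaces take part in the comparison, so changes
elsewhere in the configuration do not throw the cached object away.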
== General Use-Case ==

The outward API of HTMLDefinition has been largely preserved, not
only for backwards-compatibility but also by design. To make changes,
HTMLDefinition can instead be retrieved "raw", in which case it loads a
structure that closely resembles the modules of XHTML 1.1. This structure
is very dynamic, making it easy to make cascading changes to global
content sets or remove elements in bulk.

However, once HTML Purifier needs the actual definition, it retrieves
a finalized version of HTMLDefinition. Finalizing the definition involves
processing the modules into a form that is optimized for multiple
calls. This final version is immutable and, even if it were editable,
would be extremely hard to change.

So, some code taking advantage of the XHTML modularization may look
like this:

<?php
$config = HTMLPurifier_Config::createDefault();
$def =& $config->getHTMLDefinition(true); // reference to raw
unset($def->modules['Hypertext']); // remove the 'a' (link) element
$purifier = new HTMLPurifier($config);
$purifier->purify($html); // now the definition is finalized
?>

== Inclusions ==

One of the nice features of HTMLDefinition is that piggy-backing off
of global attribute and content sets is extremely easy to do.

=== Attributes ===

HTMLModule->elements[$element]->attr stores attribute information for the
specific attributes of $element. This is quite close to the final
API that HTML Purifier interfaces with, but there's an important
extra feature: attr may also contain an array at index zero.

<?php
// $module is an HTMLModule instance
$module->elements[$element]->attr[0] = array('AttrSet');
?>

Rather than mapping the attribute key 0 to an AttrDef (as every other
key does), this array names the attribute collections that should
be merged into this element's attribute array.

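As a rough illustration of what resolving index zero means, the sketch
below merges the named collections into an element's attribute array. The
helper name and the sample collection are hypothetical (the real work is
done by the AttrCollections object), and letting the element's own
attributes win on a name clash is an assumption.

```php
<?php
// Hypothetical helper, not part of HTML Purifier.

/**
 * Merge the collections named under index 0 into an element's attributes.
 *
 * @param array $attr        element attributes, possibly with an index 0
 * @param array $collections collection name => (attribute name => definition)
 * @return array attributes with index 0 resolved and removed
 */
function merge_attr_collections(array $attr, array $collections)
{
    $includes = isset($attr[0]) ? $attr[0] : array();
    unset($attr[0]);
    foreach ($includes as $name) {
        if (!isset($collections[$name])) {
            continue; // unknown collection: nothing to merge
        }
        // Array union keeps the element's own entries when keys collide
        // (an assumption made for this sketch).
        $attr = $attr + $collections[$name];
    }
    return $attr;
}

// Sample data for illustration only
$collections = array(
    'Core' => array('class' => 'NMTOKENS', 'title' => 'Text'),
);
$attr = array(
    0      => array('Core'),
    'href' => 'URI',
);
$merged = merge_attr_collections($attr, $collections);
```

After the call, $merged holds href, class and title, and the index-zero
entry is gone.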
Furthermore, the value in an attribute key/value pair need not be a
fully fledged AttrDef object. It can also be a string, which
signifies an AttrDef that is looked up from a centralized registry,
AttrTypes. This allows more concise attribute definitions that look
more like W3C's declarations, and it offers a centralized point
for modifying the behavior of one attribute type. And, of course, the
old method of manually instantiating an AttrDef still works.

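The string lookup can be pictured with a small registry like the one
below. The classes SketchAttrDef and SketchAttrTypes are stand-ins invented
for this example; they are not the library's AttrDef or AttrTypes, and the
method names are assumptions.

```php
<?php
// Hypothetical stand-ins for AttrDef and the AttrTypes registry.

class SketchAttrDef
{
    public $type;

    public function __construct($type)
    {
        $this->type = $type;
    }
}

class SketchAttrTypes
{
    private $types = array();

    /** Register a definition under a type name. */
    public function set($name, SketchAttrDef $def)
    {
        $this->types[$name] = $def;
    }

    /** Resolve a value that is either a type name or a ready-made object. */
    public function resolve($value)
    {
        if ($value instanceof SketchAttrDef) {
            return $value; // manually instantiated: use as-is
        }
        if (!isset($this->types[$value])) {
            throw new InvalidArgumentException("Unknown attribute type: $value");
        }
        return $this->types[$value];
    }
}

$registry = new SketchAttrTypes();
$registry->set('URI', new SketchAttrDef('uri'));
$registry->set('Text', new SketchAttrDef('text'));

$attr = array(
    'href'  => 'URI',                       // string: looked up by name
    'title' => new SketchAttrDef('custom'), // object: the old method
);
foreach ($attr as $name => $value) {
    $attr[$name] = $registry->resolve($value);
}
```

Because every string resolves to the same registered object, replacing the
entry for one type name changes the behavior of every attribute that uses
it.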
=== Attribute Collections ===

Attribute collections are stored and processed in the AttrCollections
object, which is responsible for performing the inclusions signified
by the 0 index. These attribute collections, too, are mutable, via
HTMLModule->attr_collections. You may add new attributes
to a collection or define an entirely new collection for your module's
use. Inclusions can also be cumulative.

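Cumulative inclusion, where one collection pulls in others through its own
index zero, might be expanded along these lines. The function and the
sample collections are hypothetical, and the cycle guard is an addition of
this sketch rather than something the source describes.

```php
<?php
// Hypothetical sketch of cumulative inclusion between collections.

/**
 * Expand a collection, following index-0 inclusions recursively.
 *
 * @param string $name        collection to expand
 * @param array  $collections collection name => attributes (may hold index 0)
 * @param array  $seen        names already on the current inclusion path
 * @return array flat attribute name => definition map
 */
function expand_collection($name, array $collections, array $seen = array())
{
    if (!isset($collections[$name]) || isset($seen[$name])) {
        return array(); // unknown name, or an inclusion cycle
    }
    $seen[$name] = true;
    $attr = $collections[$name];
    $includes = isset($attr[0]) ? $attr[0] : array();
    unset($attr[0]);
    foreach ($includes as $included) {
        $attr = $attr + expand_collection($included, $collections, $seen);
    }
    return $attr;
}

// Sample data for illustration only
$collections = array(
    'Core'   => array('class' => 'NMTOKENS', 'title' => 'Text'),
    'I18N'   => array('xml:lang' => 'LanguageCode'),
    'Common' => array(0 => array('Core', 'I18N')),
);

// Adding an attribute to an existing collection
$collections['Core']['id'] = 'ID';

$common = expand_collection('Common', $collections);
```

Here Common has no attributes of its own, yet its expansion contains
everything from Core (including the newly added id) and from I18N.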
Attribute collections allow us to get rid of so-called "global attributes"
(which actually aren't so global).

=== Content Models and ChildDef ===

A scheme similar to the above-mentioned attributes and attribute
collections was applied to the ChildDef system. HTML Purifier uses
a proprietary system called ChildDef for performance and flexibility
reasons, but this does not line up very well with W3C's notion of
regexps for defining the allowed children of an element.

HTMLModule->elements[$element]->content_model and
HTMLModule->elements[$element]->content_model_type store information
about the final ChildDef that will be stored in
HTMLModule->elements[$element]->child (we use a different variable
because the two forms are sufficiently different).

$content_model is an abstract, string representation of the internal
state of ChildDef, while $content_model_type is a string identifier
of which ChildDef subclass to instantiate. $content_model is processed
by substituting all content set identifiers (capitalized element names)
with their contents. It is then parsed and passed into the appropriate
ChildDef class, as determined by ContentSets->getChildDef() or, for
custom child definitions not in the core, the fallback
HTMLModule->getChildDef().

You'll need to use these facilities if you plan on referencing a content
set like "Inline" or "Block", and using them is recommended even if you're
not, due to their conciseness.

A few notes on $content_model: its structure can be as complicated
as you want, but the pipe symbol (|) is reserved for defining possible
choices, due to the content sets implementation. For example, a content
model that looks like:

"Inline -> Block -> a"

...when the Inline content set is defined as "span | b" and the Block
content set is defined as "div | blockquote", will expand into:

"span | b -> div | blockquote -> a"

The custom HTMLModule->getChildDef() function will then need to be able
to feed this information to ChildDef in a usable manner.

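The substitution step in the example above can be approximated with a
single regular-expression pass. This is a minimal sketch, assuming content
set identifiers are whole capitalized words; the function name is
hypothetical and the parsing that follows substitution is not shown.

```php
<?php
// Hypothetical sketch of content set substitution.

/**
 * Replace every known content set identifier in a content model.
 *
 * @param string $model        e.g. "Inline -> Block -> a"
 * @param array  $content_sets content set name => contents string
 * @return string content model with identifiers substituted
 */
function expand_content_model($model, array $content_sets)
{
    return preg_replace_callback(
        '/\b[A-Z][A-Za-z]*\b/', // capitalized words only
        function ($match) use ($content_sets) {
            $name = $match[0];
            // Leave unknown capitalized words untouched
            return isset($content_sets[$name]) ? $content_sets[$name] : $name;
        },
        $model
    );
}

$content_sets = array(
    'Inline' => 'span | b',
    'Block'  => 'div | blockquote',
);
$expanded = expand_content_model('Inline -> Block -> a', $content_sets);
// $expanded is "span | b -> div | blockquote -> a"
```

Lowercase element names such as a never match the pattern, so they pass
through unchanged.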
=== Content Sets ===

Content sets can be altered using HTMLModule->content_sets, an associative
array of content set names to content set contents. If the content set
already exists, your values are appended to it (great for, say,
registering the font tag as an inline element); otherwise it is
created. Content sets are substituted into content_model.
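
The append-or-create rule could look something like this when several
modules contribute content sets. The function name is hypothetical, and
joining appended values with the pipe symbol is an assumption based on its
role as the choice separator.

```php
<?php
// Hypothetical sketch of combining content sets from several modules.

/**
 * Fold a module's content sets into the sets collected so far.
 *
 * @param array $global      content set name => contents collected so far
 * @param array $module_sets the module's content_sets array
 * @return array updated content sets
 */
function merge_content_sets(array $global, array $module_sets)
{
    foreach ($module_sets as $name => $contents) {
        if (isset($global[$name])) {
            $global[$name] .= ' | ' . $contents; // append to the existing set
        } else {
            $global[$name] = $contents; // create a new set
        }
    }
    return $global;
}

// Sample data for illustration only
$global = array('Inline' => 'span | b');
$legacy = array('Inline' => 'font', 'Heading' => 'h1 | h2');
$merged = merge_content_sets($global, $legacy);
```

With this data, Inline grows to include font while Heading is created
fresh.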