htmlpurifier/docs/ref-xhtml-1.1.txt


XHTML 1.1 and HTML Purifier

Todo for XHTML 1.1 support <http://www.w3.org/TR/xhtml11/changes.html>
1. Scratch lang entirely in favor of xml:lang
2. Scratch name entirely in favor of id (partially-done)
3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>

HTML Purifier uses the modularization of XHTML
<http://www.w3.org/TR/xhtml-modularization/> to organize the internals
of HTMLDefinition into a more manageable and extensible fashion. Rather
than have one super-object, HTMLDefinition is split into HTMLModules,
each of which is responsible for defining elements, their attributes,
and other properties (for more in-depth coverage, see
/library/HTMLPurifier/HTMLModule.php's docblock comments).

The modules that W3C defines and we support are:

    * 5.1. Attribute Collections (technically not a module)
    * 5.2. Core Modules
          o 5.2.2. Text Module
          o 5.2.3. Hypertext Module
          o 5.2.4. List Module
    * 5.4. Text Extension Modules
          o 5.4.1. Presentation Module
          o 5.4.2. Edit Module
          o 5.4.3. Bi-directional Text Module
    * 5.6. Table Modules
          o 5.6.2. Tables Module
    * 5.7. Image Module
    * 5.18. Style Attribute Module

Modules that we don't support but could support are:

    * 5.6. Table Modules
          o 5.6.1. Basic Tables Module [?]
    * 5.8. Client-side Image Map Module [?]
    * 5.9. Server-side Image Map Module [?]
    * 5.12. Target Module [?]
    * 5.21. Name Identification Module [deprecated]
    * 5.22. Legacy Module [deprecated]

These modules will not be implemented, either because they are dangerous
or because they do not apply to an XHTML fragment:

    * 5.2. Core Modules
          o 5.2.1. Structure Module
    * 5.3. Applet Module
    * 5.5. Forms Modules
          o 5.5.1. Basic Forms Module
          o 5.5.2. Forms Module
    * 5.10. Object Module
    * 5.11. Frames Module
    * 5.13. Iframe Module
    * 5.14. Intrinsic Events Module
    * 5.15. Metainformation Module
    * 5.16. Scripting Module
    * 5.17. Style Sheet Module
    * 5.19. Link Module
    * 5.20. Base Module

We will not be using W3C's XML Schemas or DTDs directly due to the lack
of robust tools for handling them (the main problem is that most
current parsers are PHP 5 only and purely validating, not
correcting).

The abstraction of the HTMLDefinition creation process will also
contribute to a need for a caching system. Cache invalidation would be
difficult, but could be done by comparing the HTML and Attr config
namespaces with a copy that was packaged along with the serialized
HTMLDefinition object.
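
The invalidation scheme described above could be sketched roughly as
follows. This is only an illustration under stated assumptions: the
function names (config_snapshot, pack_definition, unpack_definition) are
hypothetical and not part of HTML Purifier, and the configuration is
modelled as a plain nested array keyed by namespace.

```php
<?php
// Hypothetical sketch; none of these functions exist in HTML Purifier.

/** Keep only the config namespaces that influence the definition. */
function config_snapshot(array $config)
{
    $snapshot = array();
    foreach (array('HTML', 'Attr') as $namespace) {
        $snapshot[$namespace] = isset($config[$namespace]) ? $config[$namespace] : array();
        ksort($snapshot[$namespace]); // make the comparison order-independent
    }
    return $snapshot;
}

/** Serialize a definition together with the config copy it was built from. */
function pack_definition($definition, array $config)
{
    return serialize(array(
        'config'     => config_snapshot($config),
        'definition' => $definition,
    ));
}

/** Return the cached definition, or null if the relevant config changed. */
function unpack_definition($blob, array $config)
{
    $cached = unserialize($blob);
    if ($cached['config'] !== config_snapshot($config)) {
        return null; // stale: the caller has to rebuild the definition
    }
    return $cached['definition'];
}
```

Changes outside the HTML and Attr namespaces leave the cached copy valid
in this sketch, which is the point of storing only those two namespaces.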

== General Use-Case ==

The outward API of HTMLDefinition has been largely preserved, not
only for backwards-compatibility but also by design. In addition,
HTMLDefinition can be retrieved "raw", in which case it loads a structure
that closely resembles the modules of XHTML 1.1. This structure is very
dynamic, making it easy to make cascading changes to global content
sets or remove elements in bulk.

However, once HTML Purifier needs the actual definition, it retrieves
a finalized version of HTMLDefinition. Finalizing the definition involves
processing the modules into a form that is optimized for multiple
calls. This final version should be treated as immutable: even where it
can be edited, it is extremely hard to change.

So, some code taking advantage of the XHTML modularization may look
like this:

<?php
    $config = HTMLPurifier_Config::createDefault();
    $def =& $config->getHTMLDefinition(true); // reference to the raw definition
    unset($def->modules['Hypertext']); // removes the a (link) element
    $purifier = new HTMLPurifier($config);
    $purifier->purify($html); // now the definition is finalized
?>

== Inclusions ==

One of the nice features of HTMLDefinition is that piggy-backing off
of global attribute and content sets is extremely easy to do.

=== Attributes ===

HTMLModule->elements[$element]->attr stores attribute information for the
specific attributes of $element. This is quite close to the final
API that HTML Purifier interfaces with, but there's an important
extra feature: attr may also contain an array at index zero.

<?php
    // $module stands for an HTMLModule instance
    $module->elements[$element]->attr[0] = array('AttrSet');
?>

Unlike the other keys, which map an attribute name to an AttrDef,
key 0 lists the attribute collections that should be merged into
this element's attribute array.

Furthermore, the value in an attribute name/value pair need not be a
fully fledged AttrDef object. It can also be a string, which
signifies an AttrDef that is looked up from a centralized registry,
AttrTypes. This allows more concise attribute definitions that look
more like W3C's declarations, as well as offering a centralized point
for modifying the behavior of one attribute type. And, of course, the
old method of manually instantiating an AttrDef still works.
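
As a rough sketch of how such a string lookup might work, consider the
following. The classes SketchAttrDef and SketchAttrTypes and the function
resolve_attr_types are hypothetical stand-ins written for this example,
not the real AttrDef or AttrTypes classes, and the type names 'Text' and
'URI' are used purely for illustration.

```php
<?php
// Hypothetical stand-ins; these are not the real HTML Purifier classes.
class SketchAttrDef
{
    public $type;

    public function __construct($type)
    {
        $this->type = $type;
    }
}

class SketchAttrTypes
{
    private $types = array();

    public function __construct()
    {
        // One shared definition object per type name.
        $this->types['Text'] = new SketchAttrDef('Text');
        $this->types['URI']  = new SketchAttrDef('URI');
    }

    public function get($name)
    {
        if (!isset($this->types[$name])) {
            throw new InvalidArgumentException("Unknown attribute type: $name");
        }
        return $this->types[$name];
    }
}

/**
 * Replace string shorthands with the shared definition objects and leave
 * manually instantiated definitions untouched.
 */
function resolve_attr_types(array $attr, SketchAttrTypes $registry)
{
    foreach ($attr as $name => $def) {
        if ($name === 0) {
            continue; // index 0 is the inclusion list, not an attribute
        }
        if (is_string($def)) {
            $attr[$name] = $registry->get($def);
        }
    }
    return $attr;
}
```

Because every string resolves to the same registry object, changing the
behavior of one attribute type in the registry affects every element that
refers to it by name.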

=== Attribute Collections ===

Attribute collections are stored and processed in the AttrCollections
object, which is responsible for performing the inclusions signified
by the 0 index. These attribute collections are mutable too, through
HTMLModule->attr_collections. You may add new attributes
to a collection or define an entirely new collection for your module's
use. Inclusions can also be cumulative.

Attribute collections allow us to get rid of so-called "global attributes"
(which actually aren't so global).
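
The inclusion step might be sketched like this. The function
expand_inclusions is hypothetical and is not the real AttrCollections
code. It assumes that a collection may itself carry a 0 index naming
further collections (one possible reading of "cumulative") and that
attributes set directly on an element take precedence over included ones.

```php
<?php
// Hypothetical sketch; not the real AttrCollections implementation.

/**
 * Merge the collections named under index 0 into $attr, following
 * nested inclusions. Attributes already present in $attr are kept.
 */
function expand_inclusions(array $attr, array $collections, array $seen = array())
{
    if (!isset($attr[0])) {
        return $attr;
    }
    $includes = $attr[0];
    unset($attr[0]);
    foreach ($includes as $name) {
        if (isset($seen[$name]) || !isset($collections[$name])) {
            continue; // guard against cycles and unknown collections
        }
        $seen[$name] = true;
        $included = expand_inclusions($collections[$name], $collections, $seen);
        $attr += $included; // array union: existing keys are not overwritten
    }
    return $attr;
}
```

After expansion the element's array holds only attribute names, so the
0 index never reaches the finalized definition in this sketch.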

=== Content Models and ChildDef ===

An approach similar to the attributes and attribute collections
described above was applied to the ChildDef system. HTML Purifier uses
a proprietary system called ChildDef for performance and flexibility
reasons, but this does not line up very well with W3C's notion of
regexps for defining the allowed children of an element.

HTMLPurifier->elements[$element]->content_model and 
HTMLPurifier->elements[$element]->content_model_type store information
about the final ChildDef that will be stored in
HTMLPurifier->elements[$element]->child (we use a different variable
because the two forms are sufficiently different).

$content_model is an abstract, string representation of the internal
state of ChildDef, while $content_model_type is a string identifier
of which ChildDef subclass to instantiate. $content_model is processed
by substituting all content set identifiers (capitalized element names)
with their contents. It is then parsed and passed into the appropriate
ChildDef class, as determined by ContentSets->getChildDef() or, for
custom child definitions not in the core, by the fallback
HTMLModule->getChildDef().

You'll need to use these facilities if you plan on referencing a content
set like "Inline" or "Block". Using them is recommended even if you
don't, because they are more concise.

A few notes on $content_model: its structure can be as complicated
as you want, but the pipe symbol (|) is reserved for defining possible
choices, due to the content sets implementation. For example, a content
model that looks like:

"Inline -> Block -> a"

...when the Inline content set is defined as "span | b" and the Block
content set is defined as "div | blockquote", will expand into:

"span | b -> div | blockquote -> a"

The custom HTMLModule->getChildDef() function will need to be able to
then feed this information to ChildDef in a usable manner.
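
The substitution step from the example above can be sketched as follows.
Both functions are hypothetical and are not the real ContentSets or
HTMLModule code. The "->" separator is taken from the example and is
treated here as a custom syntax that only a custom getChildDef() would
understand.

```php
<?php
// Hypothetical sketch; not the real ContentSets implementation.

/** Replace every whole identifier that names a content set with its contents. */
function expand_content_model($model, array $content_sets)
{
    return preg_replace_callback(
        '/[A-Za-z][A-Za-z0-9]*/',
        function ($match) use ($content_sets) {
            $name = $match[0];
            return isset($content_sets[$name]) ? $content_sets[$name] : $name;
        },
        $model
    );
}

/**
 * What a custom getChildDef() might do next: split the expanded string
 * into stages, each holding its list of allowed choices.
 */
function parse_sequence($expanded)
{
    $stages = array();
    foreach (explode('->', $expanded) as $stage) {
        $stages[] = array_map('trim', explode('|', $stage));
    }
    return $stages;
}

$content_sets = array(
    'Inline' => 'span | b',
    'Block'  => 'div | blockquote',
);
$expanded = expand_content_model('Inline -> Block -> a', $content_sets);
// $expanded is now "span | b -> div | blockquote -> a"
$stages = parse_sequence($expanded);
```

Matching whole identifiers rather than raw substrings keeps a set named
"Inline" from being substituted inside a longer name.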

=== Content Sets ===

Content sets can be altered using HTMLModule->content_sets, an associative
array of content set names to content set contents. If the content set
already exists, your values are appended on to it (great for, say,
registering the font tag as an inline element); otherwise, it is
created. Content sets are substituted into content_model.
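
A minimal sketch of the append-or-create behavior, assuming that each
module contributes its own content_sets array and that appended values
are joined with the pipe symbol. The function merge_content_sets is
hypothetical and is not part of HTML Purifier.

```php
<?php
// Hypothetical sketch; not the real ContentSets implementation.

/**
 * Combine the content_sets arrays of several modules, in order.
 * Existing sets are extended with " | "; new sets are created.
 */
function merge_content_sets(array $per_module)
{
    $merged = array();
    foreach ($per_module as $content_sets) {
        foreach ($content_sets as $name => $contents) {
            if (isset($merged[$name])) {
                $merged[$name] .= ' | ' . $contents; // append to existing set
            } else {
                $merged[$name] = $contents; // create a new set
            }
        }
    }
    return $merged;
}

$merged = merge_content_sets(array(
    array('Inline' => 'span | b'),                          // e.g. a core module
    array('Inline' => 'font', 'Flow' => 'Block | Inline'),  // e.g. a custom module
));
```

Here the second module registers font as an inline element without
having to restate the existing contents of Inline.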