htmlpurifier/docs/ref-xhtml-1.1.txt


Getting XHTML 1.1 Working

It's quite simple, according to <http://www.w3.org/TR/xhtml11/changes.html>:

1. Scratch lang entirely in favor of xml:lang
2. Scratch name entirely in favor of id (partially done)
3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>

...but that's only an informative section. The true power of XHTML 1.1
is its modularization, defined at:
<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/>

Modularization may very well be the next-generation implementation
of HTMLDefinition. The current, XHTML 1.0 DTD-based approach is
extremely brittle and doesn't lend itself well to extension by others,
but modularization fixes all that. The modules W3C defines that we
should take a look at are:

    * 5.1. Attribute Collections
    * 5.2. Core Modules
          o 5.2.2. Text Module
          o 5.2.3. Hypertext Module
          o 5.2.4. List Module
    * 5.4. Text Extension Modules
          o 5.4.1. Presentation Module
          o 5.4.2. Edit Module
          o 5.4.3. Bi-directional Text Module
    * 5.6. Table Modules
          o 5.6.1. Basic Tables Module [?]
          o 5.6.2. Tables Module
    * 5.7. Image Module
    * 5.8. Client-side Image Map Module [?]
    * 5.9. Server-side Image Map Module [?]
    * 5.12. Target Module [?]
    * 5.18. Style Attribute Module
    * 5.21. Name Identification Module [deprecated]
    * 5.22. Legacy Module [deprecated]

We exclude these modules because they are dangerous or do not apply
to an XHTML fragment:

    * 5.2. Core Modules
          o 5.2.1. Structure Module
    * 5.3. Applet Module
    * 5.5. Forms Modules
          o 5.5.1. Basic Forms Module
          o 5.5.2. Forms Module
    * 5.10. Object Module
    * 5.11. Frames Module
    * 5.13. Iframe Module
    * 5.14. Intrinsic Events Module
    * 5.15. Metainformation Module
    * 5.16. Scripting Module
    * 5.17. Style Sheet Module
    * 5.19. Link Module
    * 5.20. Base Module

Modularization also defines content sets:

    * Heading
    * Block
    * Inline
    * Flow {Heading | Block | Inline}
    * List
    * Form [x]
    * Formctrl [x]

These sets may have elements dynamically added to them as more modules
are loaded.
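
As a rough sketch of how content sets could grow while modules are
loaded, the snippet below keeps each set as a list of element names and
merges in whatever a module contributes. The function name and the
array layout are assumptions made for illustration, and the element
lists are abbreviated.

```php
<?php
// Hypothetical helper: merge one module's contributions into the
// running table of content sets (set name => list of element names).
function mergeContentSets(array $sets, array $moduleSets): array
{
    foreach ($moduleSets as $name => $elements) {
        $existing = $sets[$name] ?? [];
        // Drop duplicates and reindex so the result is a plain list.
        $sets[$name] = array_values(
            array_unique(array_merge($existing, $elements))
        );
    }
    return $sets;
}

$sets = [];

// Text Module (abbreviated element lists).
$sets = mergeContentSets($sets, [
    'Heading' => ['h1', 'h2', 'h3'],
    'Block'   => ['div', 'p', 'pre'],
    'Inline'  => ['em', 'span', 'strong'],
]);

// Hypertext Module adds <a> to the existing Inline set.
$sets = mergeContentSets($sets, ['Inline' => ['a']]);

// Image Module adds <img> to the existing Inline set.
$sets = mergeContentSets($sets, ['Inline' => ['img']]);
```

Flow is left out here on purpose: since it is defined in terms of the
other sets, it can be derived after all modules have been loaded.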

== Implementation Details ==

We will not be using the XML Schemas or DTDs directly due to the lack
of robust tools in the area.

Since we will be performing a lot of abstraction, caching would be nice.
Cache invalidation could be done by comparing the HTML and Attr config
namespaces with a copy packaged along with the cached definition (we
have no files to mtime).
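
One way this comparison might look is sketched below. It assumes the
configuration can be viewed as a nested array keyed by namespace, and
all function names are hypothetical. The cached blob carries a copy of
the HTML and Attr namespaces, and the definition is only reused when
that copy still matches the live configuration.

```php
<?php
// Hypothetical: pull out only the namespaces that affect the definition.
function definitionNamespaces(array $config): array
{
    $html = $config['HTML'] ?? [];
    $attr = $config['Attr'] ?? [];
    // Sort keys so directive order does not cause spurious mismatches.
    ksort($html);
    ksort($attr);
    return ['HTML' => $html, 'Attr' => $attr];
}

// Package the definition together with the config it was built from.
function packDefinition(array $definition, array $config): string
{
    return serialize([
        'config'     => definitionNamespaces($config),
        'definition' => $definition,
    ]);
}

// Return the cached definition, or null if the config has changed.
function unpackDefinition(string $blob, array $config): ?array
{
    $data = unserialize($blob);
    if (!is_array($data) || !isset($data['config'], $data['definition'])) {
        return null;
    }
    if ($data['config'] !== definitionNamespaces($config)) {
        return null; // stale: HTML or Attr directives differ
    }
    return $data['definition'];
}
```

Changes to other namespaces do not invalidate the cache in this
sketch, which is the point of comparing only HTML and Attr.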

We also have the trouble of preserving the current interface, which is
quite nice in terms of speed but not so good in terms of OO-ness. This
is fine: we may need a two-tiered setup approach that goes like this:

1. When getHTMLDefinition() is initially called, we prepare the default
   environment of content-sets, loaded modules, and allowed content-sets.
   This, while good for developers seeking to customize the tagset, is
   unusable by HTML Purifier internals. It represents the XML schema.
2. When HTML Purifier needs to use the definition, it calls a second setup
   function, which now performs any substitutions needed and instantiates
   all the objects which the internals will use.

In this manner, complicated observers are not necessary; you just
specify a content module like:

    $flow = '%Heading | %Block | %Inline';

And the second setup will perform the substitutions magically for you.
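
A minimal sketch of that substitution step, assuming content sets are
stored as strings and referenced with a leading % as in the line above.
The function name, the depth limit, and the shortened element lists are
illustrative only.

```php
<?php
// Hypothetical second-stage helper: replace every %Name reference in a
// content model with the (recursively expanded) content set it names.
function expandContentSets(string $model, array $sets, int $depth = 0): string
{
    if ($depth > 10) {
        // Guard against sets that refer to each other in a cycle.
        throw new RuntimeException('Content sets nested too deeply');
    }
    return preg_replace_callback(
        '/%(\w+)/',
        function (array $m) use ($sets, $depth): string {
            if (!isset($sets[$m[1]])) {
                throw new InvalidArgumentException("Unknown content set: {$m[1]}");
            }
            return expandContentSets($sets[$m[1]], $sets, $depth + 1);
        },
        $model
    );
}

$sets = [
    'Heading' => 'h1 | h2',
    'Block'   => 'p | div',
    'Inline'  => 'em | strong',
    'Flow'    => '%Heading | %Block | %Inline',
];

echo expandContentSets('%Flow', $sets), "\n";
// prints: h1 | h2 | p | div | em | strong
```

Because expansion happens only in the second setup, a module loaded
late can still add to %Inline and have the change show up everywhere
the set is referenced.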

A module will have certain properties:

 - Elements
    - Attributes
    - Content model
 - Content sets
 - Content set extensions
 - Content model extensions [x] (seen only on structural elements)
 - Attribute collection extensions
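
To make the list above concrete, here is one possible shape for a
module as a plain data container. The class and property names are
hypothetical and are not taken from existing code. The Hypertext Module
is used as the example and is abbreviated to a single attribute.

```php
<?php
// Hypothetical container for everything a single module contributes.
final class ModuleSketch
{
    /** @var string module name */
    public $name;
    /** @var array element name => ['attr' => array, 'content' => string] */
    public $elements = [];
    /** @var array content sets defined here: set name => content model */
    public $contentSets = [];
    /** @var array additions to other sets: set name => extra elements */
    public $contentSetExtensions = [];
    /** @var array additions to collections: collection => attr => type */
    public $attrCollectionExtensions = [];

    public function __construct(string $name)
    {
        $this->name = $name;
    }
}

// Hypertext Module (5.2.3), abbreviated.
$hypertext = new ModuleSketch('Hypertext');
$hypertext->elements['a'] = [
    // Index 0 names the attribute collections to merge in.
    'attr'    => [0 => ['Common'], 'href' => 'URI'],
    // The exclusion of nested <a> elements is not modeled here.
    'content' => '#PCDATA | %Inline',
];
// The module adds <a> to the Inline set defined by the Text Module.
$hypertext->contentSetExtensions['Inline'] = ['a'];
```

Keeping the module as inert data fits the first setup tier: nothing is
instantiated until the second setup resolves the references.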

In our case, the content model does a lot more than just define what
the allowed children are: it also defines exclusions. Also, if we
refrain from directly instantiating objects, we are faced with the
problem of how to signify which ChildDef to use. Remember: our
specialized cases of content models are proprietary optimizations that
allow us to deal with elements that don't belong rather than spit them
out. Possible solutions:

1. Factory method that analyzes the definition and figures out which
   ChildDef to defer to. It would also be responsible for parsing out
   omissions.
2. Don't use their content model syntax, just enumerate items and give
   the class-name of which one to use. If a complex definition is truly
   needed, then use content model syntax. A definition, then, would
   be composed of multiple parts:
    - True content-model definition, OR
    - Simple content-model definition
        - List of items in the definition (may be multiple if dealing
          with Chameleon)
        - Name of the type (optional, required, etc)
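
The second option could be sketched roughly as follows. The classes are
stand-ins for the real ChildDef implementations, and the array keys
('type', 'items', 'model') are assumptions made for this example.
Chameleon, which would need more than one item list, is left out.

```php
<?php
// Stand-ins for ChildDef classes; all names here are hypothetical.
abstract class SimpleChildDefSketch
{
    public $elements;
    public function __construct(array $elements)
    {
        $this->elements = $elements;
    }
}
final class RequiredSketch extends SimpleChildDefSketch {}
final class OptionalSketch extends SimpleChildDefSketch {}
final class CustomSketch
{
    public $model;
    public function __construct(string $model)
    {
        $this->model = $model;
    }
}

function makeChildDef(array $def)
{
    // A true content-model definition bypasses the simple path.
    if (isset($def['model'])) {
        return new CustomSketch($def['model']);
    }
    // Simple definition: the type name selects the class to use.
    $classes = [
        'required' => RequiredSketch::class,
        'optional' => OptionalSketch::class,
    ];
    $type = $def['type'] ?? '';
    if (!isset($classes[$type])) {
        throw new InvalidArgumentException("Unknown type: $type");
    }
    $items = array_map('trim', explode('|', $def['items']));
    return new $classes[$type]($items);
}

$ul    = makeChildDef(['type' => 'required', 'items' => 'li']);
$p     = makeChildDef(['type' => 'optional', 'items' => '#PCDATA | em | strong']);
$table = makeChildDef([
    'model' => 'caption?, (col* | colgroup*), ((thead?, tfoot?, tbody+) | tr+)',
]);
```

Only the table definition needs the full content model syntax here;
the other two are plain enumerations plus a type name.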

Flexibility is absolutely essential, so the API of some of these
ChildDefs may need to change so that they lend themselves better to
uniform treatment.

Attributes are somewhat easier to manage, because we would be using
associative arrays of elements => attributes => AttrDef names, and there
would be an Attribute Types lookup array to get the appropriate AttrDef
(if the objects are stateful, they will need to be cloned). Attributes
for just one element can be specifically overridden through some mechanism
(probably a config lookup array as well as an internal counter, because
of HTML 4.01 semantics concerns). An attribute set will also optionally
have an array at index 0 which defines what attribute collections to
agglutinate on when parsing. This may allow us to get rid of global
attributes, which were also a proprietary implementation detail.
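
The index-0 convention could be resolved along these lines. This is a
sketch under assumptions: the function name is made up, attribute types
are kept as plain strings rather than looked up as AttrDef objects, and
the collections are trimmed down (Common normally pulls in more than
Core and I18N).

```php
<?php
// Hypothetical attribute collections. Index 0 lists the collections
// that should be merged into this one.
$collections = [
    'Core'   => ['class' => 'NMTOKENS', 'id' => 'ID', 'title' => 'CDATA'],
    'I18N'   => ['xml:lang' => 'NMTOKEN'],
    'Common' => [0 => ['Core', 'I18N']],
];

// Expand an attribute set: follow index 0 recursively, then drop it.
function resolveAttributes(array $attrs, array $collections): array
{
    $inherited = [];
    foreach ($attrs[0] ?? [] as $name) {
        if (!isset($collections[$name])) {
            throw new InvalidArgumentException("Unknown collection: $name");
        }
        // += keeps entries already present, so earlier collections win.
        $inherited += resolveAttributes($collections[$name], $collections);
    }
    unset($attrs[0]);
    // Attributes declared directly take precedence over inherited ones.
    return $attrs + $inherited;
}

$a = resolveAttributes([0 => ['Common'], 'href' => 'URI'], $collections);
// $a now maps href, class, id, title and xml:lang to their type names.
```

A real version would also need to guard against collections that refer
to each other in a cycle, and would map each type name to a (possibly
cloned) AttrDef through the Attribute Types lookup array.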

Alright: let's get to work!