[1.5.0] Rewrite XHTML 1.1 document to describe HTMLDefinition's modularization

- Use ElementDef->child to define a literal ChildDef object, rather than ElementDef->content_model.
- Add notes on transforms, HTMLModule will be able to write those too
- Fix some misc typos.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@729 48356398-32a2-884e-a903-53898d9a118a
2024-12-22 16:31:53 +00:00 · 2007-02-08 23:10:49 +00:00 · 2007-02-08 23:10:49 +00:00 · d5491da77f
commit d5491da77f
parent 591fc0ae28
6 changed files with 147 additions and 95 deletions
--- a/docs/proposal-config.txt
+++ b/docs/proposal-config.txt
@@ -7,7 +7,7 @@ value is used for.  This means decentralized configuration declarations that
 are nevertheless error checking and a centralized configuration object.

 Directives are divided into namespaces, indicating the major portion of
-functionality they cover (although there may be overlaps.  Please consult
+functionality they cover (although there may be overlaps).  Please consult
 the documentation in ConfigDef for more information on these namespaces.

 Since configuration is dependant on context, internal classes require a
--- a/docs/ref-xhtml-1.1.txt
+++ b/docs/ref-xhtml-1.1.txt
@@ -1,23 +1,22 @@

-Getting XHTML 1.1 Working
-
-It's quite simple, according to <http://www.w3.org/TR/xhtml11/changes.html>
+XHTML 1.1 and HTML Purifier

+Todo for XHTML 1.1 support <http://www.w3.org/TR/xhtml11/changes.html>
 1. Scratch lang entirely in favor of xml:lang
 2. Scratch name entirely in favor of id (partially-done)
 3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>

-...but that's only an informative section. The true power of XHTML 1.1
-is its modularization, defined at:
-<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/>
+HTML Purifier uses the modularization of XHTML
+<http://www.w3.org/TR/xhtml-modularization/> to organize the internals
+of HTMLDefinition into a more manageable and extensible fashion. Rather
+than have one super-object, HTMLDefinition is split into HTMLModules,
+each of which is responsible for defining elements, their attributes,
+and other properties (for more in-depth coverage, see
+/library/HTMLPurifier/HTMLModule.php's docblock comments).

-Modularization may very well be the next-generation implementation
-of HTMLDefinition. The current, XHTML 1.0 DTD-based approach is
-extremely brittle and doesn't lend well to extension by others, but
-modularization fixes all that. The modules W3C defines that we
-should take a look at are:
+The modules that W3C defines and we support are:

-    * 5.1. Attribute Collections
+    * 5.1. Attribute Collections (technically not a module)
    * 5.2. Core Modules
          o 5.2.2. Text Module
          o 5.2.3. Hypertext Module
@@ -27,18 +26,22 @@ should take a look at are:
          o 5.4.2. Edit Module
          o 5.4.3. Bi-directional Text Module
    * 5.6. Table Modules
-          o 5.6.1. Basic Tables Module [?]
          o 5.6.2. Tables Module
    * 5.7. Image Module
+    * 5.18. Style Attribute Module
+
+Modules that we don't support but could support are:
+
+    * 5.6. Table Modules
+          o 5.6.1. Basic Tables Module [?]
    * 5.8. Client-side Image Map Module [?]
    * 5.9. Server-side Image Map Module [?]
    * 5.12. Target Module [?]
-    * 5.18. Style Attribute Module
    * 5.21. Name Identification Module [deprecated]
    * 5.22. Legacy Module [deprecated]

-We exclude these modules due to their dangerousness or inapplicability
-as a XHTML fragment:
+These modules will not be implemented due to their dangerousness or
+inapplicability as an XHTML fragment:

    * 5.2. Core Modules
          o 5.2.1. Structure Module
@@ -56,89 +59,129 @@ as a XHTML fragment:
    * 5.19. Link Module
    * 5.20. Base Module

-Modularization also defines content sets:
+We will not be using W3C's XML Schemas or DTDs directly due to the lack
+of robust tools for handling them (the main problem is that all the
+current parsers are usually PHP 5 only and solely-validating, not
+correcting).

-    * Heading
-    * Block
-    * Inline
-    * Flow {Heading | Block | Inline}
-    * List
-    * Form [x]
-    * Formctrl [x]
+The abstraction of the HTMLDefinition creation process will also
+contribute to a need for a caching system. Cache invalidation would be
+difficult, but could be done by comparing the HTML and Attr config
+namespaces with a copy that was packaged along with the serialized
+HTMLDefinition object.

-Which may have elements dynamically added to them as more modules get
-added.
+== General Use-Case ==

-== Implementation Details ==
+The outwards API of HTMLDefinition has been largely preserved, not
+only for backwards-compatibility but also by design. Instead,
+HTMLDefinition can be retrieved "raw", in which it loads a structure
+that closely resembles the modules of XHTML 1.1. This structure is very
+dynamic, making it easy to make cascading changes to global content
+sets or remove elements in bulk.

-We will not be using the XML Schemas or DTDs directly due to the lack
-of robust tools in the area.
+However, once HTML Purifier needs the actual definition, it retrieves
+a finalized version of HTMLDefinition. The finalized definition involves
+processing the modules into a form that is optimized for multiple
+calls. This final version is immutable and, even if editable, would
+be extremely hard to change.

-Since we will be performing a lot of abstracting, caching would be nice. Cache
-invalidation could be done by comparing the HTML and Attr config namespaces
-with a copy that was packaged along with this (we have no files to mtime)
-
-We also have the trouble of preserving the current interface, which is
-quite nice in terms of speed but not so good in terms of OO-ness. This
-is fine: we may need to have a two-tiered setup approach, that goes
+So, some code taking advantage of the XHTML modularization may look
 like this:

-1. When getHTMLDefinition() is initially called, we prepare the default
-   environment of content-sets, loaded modules, and allowed content-sets.
-   This, while good for developers seeking to customize the tagset, is
-   unusable by HTML Purifier internals. It represents the XML schema
-2. When HTMLPurifier needs to use the definition, it calls a second setup
-   function, which now performs any substitutions needed and instantiates
-   all the objects which the internals will use.
+<?php
+    $config = HTMLPurifier_Config::createDefault();
+    $def =& $config->getHTMLDefinition(true); // reference to raw
+    unset($def->modules['Hypertext']); // rm ''a'' link
+    $purifier = new HTMLPurifier($config);
+    $purifier->purify($html); // now the definition is finalized
+?>

-In this manner, complicated observers are not necessary, you just
-specify a content module like:
+== Inclusions ==

-    $flow = '%Heading | %Block | %Inline';
+One of the nice features of HTMLDefinition is that piggy-backing off
+of global attribute and content sets is extremely easy to do.

-And the second setup will perform the substitutions magically for you.
+=== Attributes ===

-A module will have certain properties:
+HTMLModule->elements[$element]->attr stores attribute information for the
+specific attributes of $element. This is quite close to the final
+API that HTML Purifier interfaces with, but there's an important
+extra feature: attr may also contain an array at index zero.

- - Elements
-    - Attributes
-    - Content model
- - Content sets
- - Content set extensions
- - Content model extensions [x] (seen only on structural elements)
- - Attribute collection extensions
+<?php
+    HTMLModule->elements[$element]->attr[0] = array('AttrSet');
+?>

-In our case, the content model does a lot more than just define what
-allowed children are: they also define exclusions. Also, if we refrain
-from directly instantiating objects, we are posed with the problem of
-how to signify which ChildDef to use. Remember: our specialized cases of
-content models are proprietary optimizations that allow us to deal with
-elements that don't belong rather than spit them out. Possible solutions:
+Rather than map the attribute key 0 to an array (which should be
+an AttrDef), it defines a number of attribute collections that should
+be merged into this element's attribute array.

-1. Factory method that analyzes the definition and figures out who
-   to defer to. It would also be responsible for parsing out omissions.
-2. Don't use their content model syntax, just enumerate items and give
-   the class-name of which one to use. If a complex definition is truly
-   needed, then use content model syntax. A definition, then, would
-   be composed of multiple parts:
-    - True content-model definition, OR
-    - Simple content-model definition
-        - List of items in the definition (may be multiple if dealing
-          with Chameleon)
-        - Name of the type (optional, required, etc)
+Furthermore, the value of an attribute key, attribute value pair need
+not be a fully fledged AttrDef object. It can also be a string, which
+signifies an AttrDef that is looked up from a centralized registry,
+AttrTypes. This allows more concise attribute definitions that look
+more like W3C's declarations, as well as offering a centralized point
+for modifying the behavior of one attribute type. And, of course, the
+old method of manually instantiating an AttrDef still works.

-Flexibility is absolutely essential, so the API of some of these
-ChildDefs may need to change to lend them better to uniform treatment.
+=== Attribute Collections ===

-Attributes are somewhat easier to manage, because we would be using
-associative arrays of elements => attributes => AttrDef names, and there
-would be an Attribute Types lookup array to get the appropriate AttrDef
-(if the objects are stateful, they will need to be cloned). Attributes
-for just one element can be specifically overridden through some mechanism
-(probably a config lookup array as well as an internal counter, because
-of HTML 4.01 semantics concerns.) An attribute set will also optionally
-have an array at index 0 which defines what attribute collections to
-agglutinate on when parsing. This may allow us to get rid of global
-attributes, which were also a proprietary implementation detail.
+Attribute collections are stored and processed in the AttrCollections
+object, which is responsible for performing the inclusions signified
+by the 0 index. These attribute collections, too, are mutable, by
+using HTMLModule->attr_collections. You may add new attributes
+to a collection or define an entirely new collection for your module's
+use. Inclusions can also be cumulative.

-Alright: let's get to work!
+Attribute collections allow us to get rid of so-called "global attributes"
+(which actually aren't so global).
+
+=== Content Models and ChildDef ===
+
+An implementation of the above-mentioned attributes and attribute
+collections was applied to the ChildDef system. HTML Purifier uses
+a proprietary system called ChildDef for performance and flexibility
+reasons, but this does not line up very well with W3C's notion of
+regexps for defining the allowed children of an element.
+
+HTMLPurifier->elements[$element]->content_model and 
+HTMLPurifier->elements[$element]->content_model_type store information
+about the final ChildDef that will be stored in
+HTMLPurifier->elements[$element]->child (we use a different variable
+because the two forms are sufficiently different).
+
+$content_model is an abstract, string representation of the internal
+state of ChildDef, while $content_model_type is a string identifier
+of which ChildDef subclass to instantiate. $content_model is processed
+by substituting all content set identifiers (capitalized element names)
+with their contents. It is then parsed and passed into the appropriate
+ChildDef class, as defined by the ContentSets->getChildDef() or the
+custom fallback HTMLModule->getChildDef() for custom child definitions
+not in the core.
+
+You'll need to use these facilities if you plan on referencing a content
+set like "Inline" or "Block", and using them is recommended even if you're
+not due to their conciseness.
+
+A few notes on $content_model: its structure can be as complicated
+as you want, but the pipe symbol (|) is reserved for defining possible
+choices, due to the content sets implementation. For example, a content
+model that looks like:
+
+"Inline -> Block -> a"
+
+...when the Inline content set is defined as "span | b" and the Block
+content set is defined as "div | blockquote", will expand into:
+
+"span | b -> div | blockquote -> a"
+
+The custom HTMLModule->getChildDef() function will need to be able to
+then feed this information to ChildDef in a usable manner.
+
+=== Content Sets ===
+
+Content sets can be altered using HTMLModule->content_sets, an associative
+array of content set names to content set contents. If the content set
+already exists, your values are appended on to it (great for, say,
+registering the font tag as an inline element), otherwise it is
+created. They are substituted into content_model.
--- a/library/HTMLPurifier/ContentSets.php
+++ b/library/HTMLPurifier/ContentSets.php
@@ -77,6 +77,7 @@ class HTMLPurifier_ContentSets
     * @param $module Module that defined the ElementDef
     */
    function generateChildDef(&$def, $module) {
+        if (!empty($def->child)) return; // already done!
        $content_model = $def->content_model;
        if (is_string($content_model)) {
            $def->content_model = str_replace(
@@ -95,7 +96,14 @@ class HTMLPurifier_ContentSets
     */
    function getChildDef($def, $module) {
        $value = $def->content_model;
-        if (is_object($value)) return $value; // direct object, return
+        if (is_object($value)) {
+            trigger_error(
+                'Literal object child definitions should be stored in '.
+                'ElementDef->child not ElementDef->content_model',
+                E_USER_NOTICE
+            );
+            return $value;
+        }
        switch ($def->content_model_type) {
            case 'required':
                return new HTMLPurifier_ChildDef_Required($value);
@@ -109,8 +117,10 @@ class HTMLPurifier_ContentSets
                return new HTMLPurifier_ChildDef_Custom($value);
        }
        // defer to its module
-        if (!$module->defines_child_def) continue; // save a func call
+        $return = false;
+        if ($module->defines_child_def) { // save a func call
            $return = $module->getChildDef($def);
+        }
        if ($return !== false) return $return;
        // error-out
        trigger_error(
--- a/library/HTMLPurifier/HTMLDefinition.php
+++ b/library/HTMLPurifier/HTMLDefinition.php
@@ -168,19 +168,19 @@ class HTMLPurifier_HTMLDefinition
    /**
     * Associative array of deprecated tag name to HTMLPurifier_TagTransform
     * @public
-     */
+     */ // use + operator
    var $info_tag_transform = array();
    
    /**
     * List of HTMLPurifier_AttrTransform to be performed before validation.
     * @public
-     */
+     */ // use array_merge or a foreach loop
    var $info_attr_transform_pre = array();
    
    /**
     * List of HTMLPurifier_AttrTransform to be performed after validation.
     * @public
-     */
+     */ // use array_merge or a foreach loop
    var $info_attr_transform_post = array();
    
    /**
--- a/library/HTMLPurifier/HTMLModule/Tables.php
+++ b/library/HTMLPurifier/HTMLModule/Tables.php
@@ -58,8 +58,7 @@ class HTMLPurifier_HTMLModule_Tables extends HTMLPurifier_HTMLModule
        // Is done directly because it doesn't leverage substitution
        // mechanisms. True model is:
        // 'caption?, ( col* | colgroup* ), (( thead?, tfoot?, tbody+ ) | ( tr+ ))'
-        $this->info['table']->content_model = new HTMLPurifier_ChildDef_Table();
-        $this->info['table']->content_model_type = 'table';
+        $this->info['table']->child = new HTMLPurifier_ChildDef_Table();
        
        $this->info['td']->content_model = 
        $this->info['th']->content_model = '#PCDATA | Flow';
--- a/library/HTMLPurifier/URISchemeRegistry.php
+++ b/library/HTMLPurifier/URISchemeRegistry.php
@@ -10,7 +10,7 @@ HTMLPurifier_ConfigSchema::define(
        'irc'   => true, // "Internet Relay Chat", usually needs another app
        // for Usenet, these two are similar, but distinct
        'nntp'  => true, // individual Netnews articles
-        'news'  => true  // newsgroup or individual Netnews articles),
+        'news'  => true  // newsgroup or individual Netnews articles
    ), 'lookup',
    'Whitelist that defines the schemes that a URI is allowed to have.  This '.
    'prevents XSS attacks from using pseudo-schemes like javascript or mocha.'