From d5491da77f0155b32c60b6431809c00515a9168e Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang" <edwardzyang@thewritingpot.com>
Date: Thu, 8 Feb 2007 23:10:49 +0000
Subject: [PATCH] [1.5.0] Rewrite XHTML 1.1 document to describe
 HTMLDefinition's modularization - Use ElementDef->child to define a literal
 ChildDef object, rather than ElementDef->content_model. - Add notes on
 transforms, HTMLModule will be able to write those too - Fix some misc typos.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@729 48356398-32a2-884e-a903-53898d9a118a
---
 docs/proposal-config.txt                   |   2 +-
 docs/ref-xhtml-1.1.txt                     | 213 +++++++++++++--------
 library/HTMLPurifier/ContentSets.php       |  16 +-
 library/HTMLPurifier/HTMLDefinition.php    |   6 +-
 library/HTMLPurifier/HTMLModule/Tables.php |   3 +-
 library/HTMLPurifier/URISchemeRegistry.php |   2 +-
 6 files changed, 147 insertions(+), 95 deletions(-)

diff --git a/docs/proposal-config.txt b/docs/proposal-config.txt
index 265bf8f3..93314122 100644
--- a/docs/proposal-config.txt
+++ b/docs/proposal-config.txt
@@ -7,7 +7,7 @@ value is used for.  This means decentralized configuration declarations that
 are nevertheless error checking and a centralized configuration object.
 
 Directives are divided into namespaces, indicating the major portion of
-functionality they cover (although there may be overlaps.  Please consult
+functionality they cover (although there may be overlaps).  Please consult
 the documentation in ConfigDef for more information on these namespaces.
 
 Since configuration is dependant on context, internal classes require a
diff --git a/docs/ref-xhtml-1.1.txt b/docs/ref-xhtml-1.1.txt
index a195e329..b32db5a8 100644
--- a/docs/ref-xhtml-1.1.txt
+++ b/docs/ref-xhtml-1.1.txt
@@ -1,23 +1,22 @@
 
-Getting XHTML 1.1 Working
-
-It's quite simple, according to <http://www.w3.org/TR/xhtml11/changes.html>
+XHTML 1.1 and HTML Purifier
 
+Todo for XHTML 1.1 support <http://www.w3.org/TR/xhtml11/changes.html>
 1. Scratch lang entirely in favor of xml:lang
 2. Scratch name entirely in favor of id (partially-done)
 3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>
 
-...but that's only an informative section. The true power of XHTML 1.1
-is its modularization, defined at:
-<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/>
+HTML Purifier uses the modularization of XHTML
+<http://www.w3.org/TR/xhtml-modularization/> to organize the internals
+of HTMLDefinition into a more manageable and extensible fashion. Rather
+than have one super-object, HTMLDefinition is split into HTMLModules,
+each of which is responsible for defining elements, their attributes,
+and other properties (for more in-depth coverage, see
+/library/HTMLPurifier/HTMLModule.php's docblock comments).
 
-Modularization may very well be the next-generation implementation
-of HTMLDefinition. The current, XHTML 1.0 DTD-based approach is
-extremely brittle and doesn't lend well to extension by others, but
-modularization fixes all that. The modules W3C defines that we
-should take a look at are:
+The modules that W3C defines and we support are:
 
-    * 5.1. Attribute Collections
+    * 5.1. Attribute Collections (technically not a module)
     * 5.2. Core Modules
           o 5.2.2. Text Module
           o 5.2.3. Hypertext Module
@@ -27,18 +26,22 @@ should take a look at are:
           o 5.4.2. Edit Module
           o 5.4.3. Bi-directional Text Module
     * 5.6. Table Modules
-          o 5.6.1. Basic Tables Module [?]
           o 5.6.2. Tables Module
     * 5.7. Image Module
+    * 5.18. Style Attribute Module
+
+Modules that we don't support but could support are:
+
+    * 5.6. Table Modules
+          o 5.6.1. Basic Tables Module [?]
     * 5.8. Client-side Image Map Module [?]
     * 5.9. Server-side Image Map Module [?]
     * 5.12. Target Module [?]
-    * 5.18. Style Attribute Module
     * 5.21. Name Identification Module [deprecated]
     * 5.22. Legacy Module [deprecated]
 
-We exclude these modules due to their dangerousness or inapplicability
-as a XHTML fragment:
+These modules will not be implemented due to their dangerousness or
+inapplicability as an XHTML fragment:
 
     * 5.2. Core Modules
           o 5.2.1. Structure Module
@@ -56,89 +59,129 @@ as a XHTML fragment:
     * 5.19. Link Module
     * 5.20. Base Module
 
-Modularization also defines content sets:
+We will not be using W3C's XML Schemas or DTDs directly due to the lack
+of robust tools for handling them (the main problem is that most
+current parsers are PHP 5 only and solely validating, not
+correcting).
 
-    * Heading
-    * Block
-    * Inline
-    * Flow {Heading | Block | Inline}
-    * List
-    * Form [x]
-    * Formctrl [x]
+The abstraction of the HTMLDefinition creation process will also
+contribute to a need for a caching system. Cache invalidation would be
+difficult, but could be done by comparing the HTML and Attr config
+namespaces with a copy that was packaged along with the serialized
+HTMLDefinition object.
 
-Which may have elements dynamically added to them as more modules get
-added.
+== General Use-Case ==
 
-== Implementation Details ==
+The outward API of HTMLDefinition has been largely preserved, not
+only for backwards-compatibility but also by design. In addition,
+HTMLDefinition can be retrieved "raw", in which case it loads a structure
+that closely resembles the modules of XHTML 1.1. This structure is very
+dynamic, making it easy to make cascading changes to global content
+sets or remove elements in bulk.
 
-We will not be using the XML Schemas or DTDs directly due to the lack
-of robust tools in the area.
+However, once HTML Purifier needs the actual definition, it retrieves
+a finalized version of HTMLDefinition. The finalized definition involves
+processing the modules into a form that is optimized for multiple
+calls. This final version is immutable and, even if editable, would
+be extremely hard to change.
 
-Since we will be performing a lot of abstracting, caching would be nice. Cache
-invalidation could be done by comparing the HTML and Attr config namespaces
-with a copy that was packaged along with this (we have no files to mtime)
-
-We also have the trouble of preserving the current interface, which is
-quite nice in terms of speed but not so good in terms of OO-ness. This
-is fine: we may need to have a two-tiered setup approach, that goes
+So, some code taking advantage of the XHTML modularization may look
 like this:
 
-1. When getHTMLDefinition() is initially called, we prepare the default
-   environment of content-sets, loaded modules, and allowed content-sets.
-   This, while good for developers seeking to customize the tagset, is
-   unusable by HTML Purifier internals. It represents the XML schema
-2. When HTMLPurifier needs to use the definition, it calls a second setup
-   function, which now performs any substitutions needed and instantiates
-   all the objects which the internals will use.
+<?php
+    $config = HTMLPurifier_Config::createDefault();
+    $def =& $config->getHTMLDefinition(true); // reference to raw
+    unset($def->modules['Hypertext']); // rm ''a'' link
+    $purifier = new HTMLPurifier($config);
+    $purifier->purify($html); // now the definition is finalized
+?>
 
-In this manner, complicated observers are not necessary, you just
-specify a content module like:
+== Inclusions ==
 
-    $flow = '%Heading | %Block | %Inline';
+One of the nice features of HTMLDefinition is that piggy-backing off
+of global attribute and content sets is extremely easy to do.
 
-And the second setup will perform the substitutions magically for you.
+=== Attributes ===
 
-A module will have certain properties:
+HTMLModule->elements[$element]->attr stores attribute information for the
+specific attributes of $element. This is quite close to the final
+API that HTML Purifier interfaces with, but there's an important
+extra feature: attr may also contain an array as a member at index zero.
 
- - Elements
-    - Attributes
-    - Content model
- - Content sets
- - Content set extensions
- - Content model extensions [x] (seen only on structural elements)
- - Attribute collection extensions
+<?php
+    $module->elements[$element]->attr[0] = array('AttrSet'); // $module: an HTMLModule instance
+?>
 
-In our case, the content model does a lot more than just define what
-allowed children are: they also define exclusions. Also, if we refrain
-from directly instantiating objects, we are posed with the problem of
-how to signify which ChildDef to use. Remember: our specialized cases of
-content models are proprietary optimizations that allow us to deal with
-elements that don't belong rather than spit them out. Possible solutions:
+Rather than mapping an attribute name to an AttrDef, index 0 holds
+an array naming the attribute collections that should be merged
+into this element's attribute array.
 
-1. Factory method that analyzes the definition and figures out who
-   to defer to. It would also be responsible for parsing out omissions.
-2. Don't use their content model syntax, just enumerate items and give
-   the class-name of which one to use. If a complex definition is truly
-   needed, then use content model syntax. A definition, then, would
-   be composed of multiple parts:
-    - True content-model definition, OR
-    - Simple content-model definition
-        - List of items in the definition (may be multiple if dealing
-          with Chameleon)
-        - Name of the type (optional, required, etc)
+Furthermore, the value in an attribute name => value pair need not
+be a fully fledged AttrDef object. It can also be a string, which
+signifies an AttrDef that is looked up from a centralized registry,
+AttrTypes. This allows more concise attribute definitions that look
+more like W3C's declarations, as well as offering a centralized point
+for modifying the behavior of one attribute type. And, of course, the
+old method of manually instantiating an AttrDef still works.
 
-Flexibility is absolutely essential, so the API of some of these
-ChildDefs may need to change to lend them better to uniform treatment.
+=== Attribute Collections ===
 
-Attributes are somewhat easier to manage, because we would be using
-associative arrays of elements => attributes => AttrDef names, and there
-would be an Attribute Types lookup array to get the appropriate AttrDef
-(if the objects are stateful, they will need to be cloned). Attributes
-for just one element can be specifically overridden through some mechanism
-(probably a config lookup array as well as an internal counter, because
-of HTML 4.01 semantics concerns.) An attribute set will also optionally
-have an array at index 0 which defines what attribute collections to
-agglutinate on when parsing. This may allow us to get rid of global
-attributes, which were also a proprietary implementation detail.
+Attribute collections are stored and processed in the AttrCollections
+object, which is responsible for performing the inclusions signified
+by the 0 index. These attribute collections, too, are mutable, by
+using HTMLModule->attr_collections. You may add new attributes
+to a collection or define an entirely new collection for your module's
+use. Inclusions can also be cumulative.
 
-Alright: let's get to work!
\ No newline at end of file
+Attribute collections allow us to get rid of so-called "global attributes"
+(which actually aren't so global).
+
+=== Content Models and ChildDef ===
+
+An implementation of the above-mentioned attributes and attribute
+collections was applied to the ChildDef system. HTML Purifier uses
+a proprietary system called ChildDef for performance and flexibility
+reasons, but this does not line up very well with W3C's notion of
+regexps for defining the allowed children of an element.
+
+HTMLModule->elements[$element]->content_model and
+HTMLModule->elements[$element]->content_model_type store information
+about the final ChildDef that will be stored in
+HTMLModule->elements[$element]->child (we use a different variable
+because the two forms are sufficiently different).
+
+$content_model is an abstract, string representation of the internal
+state of ChildDef, while $content_model_type is a string identifier
+of which ChildDef subclass to instantiate. $content_model is processed
+by substituting all content set identifiers (capitalized element names)
+with their contents. It is then parsed and passed into the appropriate
+ChildDef class, as defined by ContentSets->getChildDef() or the
+custom fallback HTMLModule->getChildDef() for custom child definitions
+not in the core.
+
+You'll need to use these facilities if you plan on referencing a content
+set like "Inline" or "Block", and using them is recommended even if you're
+not, because they are concise.
+
+A few notes on $content_model: its structure can be as complicated
+as you want, but the pipe symbol (|) is reserved for defining possible
+choices, due to the content sets implementation. For example, a content
+model that looks like:
+
+"Inline -> Block -> a"
+
+...when the Inline content set is defined as "span | b" and the Block
+content set is defined as "div | blockquote", will expand into:
+
+"span | b -> div | blockquote -> a"
+
+The custom HTMLModule->getChildDef() function will need to be able to
+then feed this information to ChildDef in a usable manner.
+
+=== Content Sets ===
+
+Content sets can be altered using HTMLModule->content_sets, an associative
+array of content set names to content set contents. If the content set
+already exists, your values are appended on to it (great for, say,
+registering the font tag as an inline element); otherwise it is
+created. They are substituted into content_model.
\ No newline at end of file
diff --git a/library/HTMLPurifier/ContentSets.php b/library/HTMLPurifier/ContentSets.php
index 0897d08e..97b6a43e 100644
--- a/library/HTMLPurifier/ContentSets.php
+++ b/library/HTMLPurifier/ContentSets.php
@@ -77,6 +77,7 @@ class HTMLPurifier_ContentSets
      * @param $module Module that defined the ElementDef
      */
     function generateChildDef(&$def, $module) {
+        if (!empty($def->child)) return; // already done!
         $content_model = $def->content_model;
         if (is_string($content_model)) {
             $def->content_model = str_replace(
@@ -95,7 +96,14 @@ class HTMLPurifier_ContentSets
      */
     function getChildDef($def, $module) {
         $value = $def->content_model;
-        if (is_object($value)) return $value; // direct object, return
+        if (is_object($value)) {
+            trigger_error(
+                'Literal object child definitions should be stored in '.
+                'ElementDef->child not ElementDef->content_model',
+                E_USER_NOTICE
+            );
+            return $value;
+        }
         switch ($def->content_model_type) {
             case 'required':
                 return new HTMLPurifier_ChildDef_Required($value);
@@ -109,8 +117,10 @@ class HTMLPurifier_ContentSets
                 return new HTMLPurifier_ChildDef_Custom($value);
         }
         // defer to its module
-        if (!$module->defines_child_def) continue; // save a func call
-        $return = $module->getChildDef($def);
+        $return = false;
+        if ($module->defines_child_def) { // save a func call
+            $return = $module->getChildDef($def);
+        }
         if ($return !== false) return $return;
         // error-out
         trigger_error(
diff --git a/library/HTMLPurifier/HTMLDefinition.php b/library/HTMLPurifier/HTMLDefinition.php
index 97a1d5c4..4bc4e9cf 100644
--- a/library/HTMLPurifier/HTMLDefinition.php
+++ b/library/HTMLPurifier/HTMLDefinition.php
@@ -168,19 +168,19 @@ class HTMLPurifier_HTMLDefinition
     /**
      * Associative array of deprecated tag name to HTMLPurifier_TagTransform
      * @public
-     */
+     */ // use + operator
     var $info_tag_transform = array();
     
     /**
      * List of HTMLPurifier_AttrTransform to be performed before validation.
      * @public
-     */
+     */ // use array_merge or a foreach loop
     var $info_attr_transform_pre = array();
     
     /**
      * List of HTMLPurifier_AttrTransform to be performed after validation.
      * @public
-     */
+     */ // use array_merge or a foreach loop
     var $info_attr_transform_post = array();
     
     /**
diff --git a/library/HTMLPurifier/HTMLModule/Tables.php b/library/HTMLPurifier/HTMLModule/Tables.php
index 2865b4f4..ffe90ded 100644
--- a/library/HTMLPurifier/HTMLModule/Tables.php
+++ b/library/HTMLPurifier/HTMLModule/Tables.php
@@ -58,8 +58,7 @@ class HTMLPurifier_HTMLModule_Tables extends HTMLPurifier_HTMLModule
         // Is done directly because it doesn't leverage substitution
         // mechanisms. True model is:
         // 'caption?, ( col* | colgroup* ), (( thead?, tfoot?, tbody+ ) | ( tr+ ))'
-        $this->info['table']->content_model = new HTMLPurifier_ChildDef_Table();
-        $this->info['table']->content_model_type = 'table';
+        $this->info['table']->child = new HTMLPurifier_ChildDef_Table();
         
         $this->info['td']->content_model = 
         $this->info['th']->content_model = '#PCDATA | Flow';
diff --git a/library/HTMLPurifier/URISchemeRegistry.php b/library/HTMLPurifier/URISchemeRegistry.php
index 82fd9601..d840068a 100644
--- a/library/HTMLPurifier/URISchemeRegistry.php
+++ b/library/HTMLPurifier/URISchemeRegistry.php
@@ -10,7 +10,7 @@ HTMLPurifier_ConfigSchema::define(
         'irc'   => true, // "Internet Relay Chat", usually needs another app
         // for Usenet, these two are similar, but distinct
         'nntp'  => true, // individual Netnews articles
-        'news'  => true  // newsgroup or individual Netnews articles),
+        'news'  => true  // newsgroup or individual Netnews articles
     ), 'lookup',
     'Whitelist that defines the schemes that a URI is allowed to have.  This '.
     'prevents XSS attacks from using pseudo-schemes like javascript or mocha.'