From d5491da77f0155b32c60b6431809c00515a9168e Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
Date: Thu, 8 Feb 2007 23:10:49 +0000
Subject: [PATCH] [1.5.0] Rewrite XHTML 1.1 document to describe HTMLDefinition's modularization

- Use ElementDef->child to define a literal ChildDef object, rather than ElementDef->content_model.
- Add notes on transforms, HTMLModule will be able to write those too
- Fix some misc typos.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@729 48356398-32a2-884e-a903-53898d9a118a
---
 docs/proposal-config.txt                   |   2 +-
 docs/ref-xhtml-1.1.txt                     | 213 +++++++++++++--------
 library/HTMLPurifier/ContentSets.php       |  16 +-
 library/HTMLPurifier/HTMLDefinition.php    |   6 +-
 library/HTMLPurifier/HTMLModule/Tables.php |   3 +-
 library/HTMLPurifier/URISchemeRegistry.php |   2 +-
 6 files changed, 147 insertions(+), 95 deletions(-)

diff --git a/docs/proposal-config.txt b/docs/proposal-config.txt
index 265bf8f3..93314122 100644
--- a/docs/proposal-config.txt
+++ b/docs/proposal-config.txt
@@ -7,7 +7,7 @@ value is used for. This means decentralized configuration declarations
 that are nevertheless error checking and a centralized configuration object.

 Directives are divided into namespaces, indicating the major portion of
-functionality they cover (although there may be overlaps. Please consult
+functionality they cover (although there may be overlaps). Please consult
 the documentation in ConfigDef for more information on these namespaces.

 Since configuration is dependant on context, internal classes require a
diff --git a/docs/ref-xhtml-1.1.txt b/docs/ref-xhtml-1.1.txt
index a195e329..b32db5a8 100644
--- a/docs/ref-xhtml-1.1.txt
+++ b/docs/ref-xhtml-1.1.txt
@@ -1,23 +1,22 @@
-Getting XHTML 1.1 Working
-
-It's quite simple, according to
+XHTML 1.1 and HTML Purifier
+Todo for XHTML 1.1 support
 1. Scratch lang entirely in favor of xml:lang
 2. Scratch name entirely in favor of id (partially-done)
 3. Support Ruby
-...but that's only an informative section. The true power of XHTML 1.1
-is its modularization, defined at:
-
+HTML Purifier uses the modularization of XHTML
+ to organize the internals
+of HTMLDefinition into a more manageable and extensible fashion. Rather
+than have one super-object, HTMLDefinition is split into HTMLModules,
+each of which are responsible for defining elements, their attributes,
+and other properties (for a more indepth coverage, see
+/library/HTMLPurifier/HTMLModule.php's docblock comments).
-Modularization may very well be the next-generation implementation
-of HTMLDefinition. The current, XHTML 1.0 DTD-based approach is
-extremely brittle and doesn't lend well to extension by others, but
-modularization fixes all that. The modules W3C defines that we
-should take a look at are:
+The modules that W3C defines and we support are:
- * 5.1. Attribute Collections
+ * 5.1. Attribute Collections (technically not a module
  * 5.2. Core Modules
      o 5.2.2. Text Module
      o 5.2.3. Hypertext Module
@@ -27,18 +26,22 @@ should take a look at are:
      o 5.4.2. Edit Module
      o 5.4.3. Bi-directional Text Module
  * 5.6. Table Modules
-     o 5.6.1. Basic Tables Module [?]
      o 5.6.2. Tables Module
  * 5.7. Image Module
+ * 5.18. Style Attribute Module
+
+Modules that we don't support but coul support are:
+
+ * 5.6. Table Modules
+     o 5.6.1. Basic Tables Module [?]
  * 5.8. Client-side Image Map Module [?]
  * 5.9. Server-side Image Map Module [?]
  * 5.12. Target Module [?]
- * 5.18. Style Attribute Module
  * 5.21. Name Identification Module [deprecated]
  * 5.22. Legacy Module [deprecated]
-We exclude these modules due to their dangerousness or inapplicability
-as a XHTML fragment:
+These modules will not be implemented due to their dangerousness or
+inapplicability as an XHTML fragment:
  * 5.2. Core Modules
      o 5.2.1. Structure Module
@@ -56,89 +59,129 @@ as a XHTML fragment:
  * 5.19. Link Module
  * 5.20. Base Module
-Modularization also defines content sets:
+We will not be using W3C's XML Schemas or DTDs directly due to the lack
+of robust tools for handling them (the main problem is that all the
+current parsers are usually PHP 5 only and solely-validating, not
+correcting).
- * Heading
- * Block
- * Inline
- * Flow {Heading | Block | Inline}
- * List
- * Form [x]
- * Formctrl [x]
+The abstraction of the HTMLDefinition creation process will also
+contribute to a need for a caching system. Cache invalidation would be
+difficult, but could be done by comparing the HTML and Attr config
+namespaces with a copy that was packaged along with the serialized
+HTMLDefinition object.
-Which may have elements dynamically added to them as more modules get
-added.
+== General Use-Case ==
-== Implementation Details ==
+The outwards API of HTMLDefinition has been largely preserved, not
+only for backwards-compatibility but also by design. Instead,
+HTMLDefinition can be retrieved "raw", in which it loads a structure
+that closely resembles the modules of XHTML 1.1. This structure is very
+dynamic, making it easy to make cascading changes to global content
+sets or remove elements in bulk.
-We will not be using the XML Schemas or DTDs directly due to the lack
-of robust tools in the area.
+However, once HTML Purifier needs the actual definition, it retrieves
+a finalized version of HTMLDefinition. The finalized definition involves
+processing the modules into a form that it is optimized for multiple
+calls. This final version is immutable and, even if editable, would
+be extremely hard to change.
-Since we will be performing a lot of abstracting, caching would be nice. Cache
-invalidation could be done by comparing the HTML and Attr config namespaces
-with a copy that was packaged along with this (we have no files to mtime)
-
-We also have the trouble of preserving the current interface, which is
-quite nice in terms of speed but not so good in terms of OO-ness. This
-is fine: we may need to have a two-tiered setup approach, that goes
+So, some code taking advantage of the XHTML modularization may look like this:
-1. When getHTMLDefinition() is initially called, we prepare the default
-   environment of content-sets, loaded modules, and allowed content-sets.
-   This, while good for developers seeking to customize the tagset, is
-   unusable by HTML Purifier internals. It represents the XML schema
-2. When HTMLPurifier needs to use the definition, it calls a second setup
-   function, which now performs any substitutions needed and instantiates
-   all the objects which the internals will use.
+getHTMLDefinition(true); // reference to raw
+    unset($def->modules['Hypertext']); // rm ''a'' link
+    $purifier = new HTMLPurifier($config);
+    $purifier->purify($html); // now the definition is finalized
+?>
-In this manner, complicated observers are not necessary, you just
-specify a content module like:
+== Inclusions ==
-    $flow = '%Heading | %Block | %Inline';
+One of the nice features of HTMLDefinition is that piggy-backing off
+of global attribute and content sets is extremely easy to do.
-And the second setup will perform the substitutions magically for you.
+=== Attributes ===
-A module will have certain properties:
+HTMLModule->elements[$element]->attr stores attribute information for the
+specific attributes of $element. This is quite close to the final
+API that HTML Purifier interfaces with, but there's an important
+extra feature: attr may also contain a array with a member index zero.
-
- - Elements
- - Attributes
- - Content model
- - Content sets
- - Content set extensions
- - Content model extensions [x] (seen only on structural elements)
- - Attribute collection extensions
+elements[$element]->attr[0] = array('AttrSet');
+?>
-In our case, the content model does a lot more than just define what
-allowed children are: they also define exclusions. Also, if we refrain
-from directly instantiating objects, we are posed with the problem of
-how to signify which ChildDef to use. Remember: our specialized cases of
-content models are proprietary optimizations that allow us to deal with
-elements that don't belong rather than spit them out. Possible solutions:
+Rather than map the attribute key 0 to an array (which should be
+an AttrDef), it defines a number of attribute collections that should
+be merged into this elements attribute array.
-1. Factory method that analyzes the definition and figures out who
-   to defer to. It would also be responsible for parsing out omissions.
-2. Don't use their content model syntax, just enumerate items and give
-   the class-name of which one to use. If a complex definition is truly
-   needed, then use content model syntax. A definition, then, would
-   be composed of multiple parts:
-    - True content-model definition, OR
-    - Simple content-model definition
-    - List of items in the definition (may be multiple if dealing
-      with Chameleon)
-    - Name of the type (optional, required, etc)
+Furthermore, the value of an attribute key, attribute value pair need
+not be a fully fledged AttrDef object. They can also be a string, which
+signifies a AttrDef that is looked up from a centralized registry
+AttrTypes. This allows more concise attribute definitions that look
+more like W3C's declarations, as well as offering a centralized point
+for modifying the behavior of one attribute type. And, of course, the
+old method of manually instantiating an AttrDef still works.
-Flexibility is absolutely essential, so the API of some of these
-ChildDefs may need to change to lend them better to uniform treatment.
+=== Attribute Collections ===
-Attributes are somewhat easier to manage, because we would be using
-associative arrays of elements => attributes => AttrDef names, and there
-would be an Attribute Types lookup array to get the appropriate AttrDef
-(if the objects are stateful, they will need to be cloned). Attributes
-for just one element can be specifically overridden through some mechanism
-(probably a config lookup array as well as an internal counter, because
-of HTML 4.01 semantics concerns.) An attribute set will also optionally
-have an array at index 0 which defines what attribute collections to
-agglutinate on when parsing. This may allow us to get rid of global
-attributes, which were also a proprietary implementation detail.
+Attribute collections are stored and processed in the AttrCollections
+object, which is responsible for performing the inclusions signified
+by the 0 index. These attribute collections, too, are mutable, by
+using HTMLModule->attr_collections. You may add new attributes
+to a collection or define an entirely new collection for your module's
+use. Inclusions can also be cumulative.
-Alright: let's get to work!
\ No newline at end of file
+Attribute collections allow us to get rid of so called "global attributes"
+(which actually aren't so global).
+
+=== Content Models and ChildDef ===
+
+An implementation of the above-mentioned attributes and attribute
+collections was applied to the ChildDef system. HTML Purifier uses
+a proprietary system called ChildDef for performance and flexibility
+reasons, but this does not line up very well with W3C's notion of
+regexps for defining the allowed children of an element.
+
+HTMLPurifier->elements[$element]->content_model and
+HTMLPurifier->elements[$element]->content_model_type store information
+about the final ChildDef that will be stored in
+HTMLPurifier->elements[$element]->child (we use a different variable
+because the two forms are sufficiently different).
+
+$content_model is an abstract, string representation of the internal
+state of ChildDef, while $content_model_type is a string identifier
+of which ChildDef subclass to instantiate. $content_model is processed
+by substituting all content set identifiers (capitalized element names)
+with their contents. It is then parsed and passed into the appropriate
+ChildDef class, as defined by the ContentSets->getChildDef() or the
+custom fallback HTMLModule->getChildDef() for custom child definitions
+not in the core.
+
+You'll need to use these facilities if you plan on referencing a content
+set like "Inline" or "Block", and using them is recommended even if you're
+not due to their conciseness.
+
+A few notes on $content_model: it's structure can be as complicated
+as you want, but the pipe symbol (|) is reserved for defining possible
+choices, due to the content sets implementation. For example, a content
+model that looks like:
+
+"Inline -> Block -> a"
+
+...when the Inline content set is defined as "span | b" and the Block
+content set is defined as "div | blockquote", will expand into:
+
+"span | b -> div | blockquote -> a"
+
+The custom HTMLModule->getChildDef() function will need to be able to
+then feed this information to ChildDef in a usable manner.
+
+=== Content Sets ===
+
+Content sets can be altered using HTMLModule->content_sets, an associative
+array of content set names to content set contents. If the content set
+already exists, your values are appended on to it (great for, say,
+registering the font tag as an inline element), otherwise it is
+created. They are substituted into content_model.
\ No newline at end of file
diff --git a/library/HTMLPurifier/ContentSets.php b/library/HTMLPurifier/ContentSets.php
index 0897d08e..97b6a43e 100644
--- a/library/HTMLPurifier/ContentSets.php
+++ b/library/HTMLPurifier/ContentSets.php
@@ -77,6 +77,7 @@ class HTMLPurifier_ContentSets
      * @param $module Module that defined the ElementDef
      */
     function generateChildDef(&$def, $module) {
+        if (!empty($def->child)) return; // already done!
         $content_model = $def->content_model;
         if (is_string($content_model)) {
             $def->content_model = str_replace(
@@ -95,7 +96,14 @@ class HTMLPurifier_ContentSets
      */
     function getChildDef($def, $module) {
         $value = $def->content_model;
-        if (is_object($value)) return $value; // direct object, return
+        if (is_object($value)) {
+            trigger_error(
+                'Literal object child definitions should be stored in '.
+                'ElementDef->child not ElementDef->content_model',
+                E_USER_NOTICE
+            );
+            return $value;
+        }
         switch ($def->content_model_type) {
             case 'required':
                 return new HTMLPurifier_ChildDef_Required($value);
@@ -109,8 +117,10 @@ class HTMLPurifier_ContentSets
                 return new HTMLPurifier_ChildDef_Custom($value);
         }
         // defer to its module
-        if (!$module->defines_child_def) continue; // save a func call
-        $return = $module->getChildDef($def);
+        $return = false;
+        if ($module->defines_child_def) { // save a func call
+            $return = $module->getChildDef($def);
+        }
         if ($return !== false) return $return;
         // error-out
         trigger_error(
diff --git a/library/HTMLPurifier/HTMLDefinition.php b/library/HTMLPurifier/HTMLDefinition.php
index 97a1d5c4..4bc4e9cf 100644
--- a/library/HTMLPurifier/HTMLDefinition.php
+++ b/library/HTMLPurifier/HTMLDefinition.php
@@ -168,19 +168,19 @@ class HTMLPurifier_HTMLDefinition
     /**
      * Associative array of deprecated tag name to HTMLPurifier_TagTransform
      * @public
-     */
+     */ // use + operator
     var $info_tag_transform = array();

     /**
      * List of HTMLPurifier_AttrTransform to be performed before validation.
      * @public
-     */
+     */ // use array_merge or a foreach loop
     var $info_attr_transform_pre = array();

     /**
      * List of HTMLPurifier_AttrTransform to be performed after validation.
      * @public
-     */
+     */ // use array_merge or a foreach loop
     var $info_attr_transform_post = array();

     /**
diff --git a/library/HTMLPurifier/HTMLModule/Tables.php b/library/HTMLPurifier/HTMLModule/Tables.php
index 2865b4f4..ffe90ded 100644
--- a/library/HTMLPurifier/HTMLModule/Tables.php
+++ b/library/HTMLPurifier/HTMLModule/Tables.php
@@ -58,8 +58,7 @@ class HTMLPurifier_HTMLModule_Tables extends HTMLPurifier_HTMLModule
         // Is done directly because it doesn't leverage substitution
         // mechanisms. True model is:
         // 'caption?, ( col* | colgroup* ), (( thead?, tfoot?, tbody+ ) | ( tr+ ))'
-        $this->info['table']->content_model = new HTMLPurifier_ChildDef_Table();
-        $this->info['table']->content_model_type = 'table';
+        $this->info['table']->child = new HTMLPurifier_ChildDef_Table();

         $this->info['td']->content_model =
         $this->info['th']->content_model = '#PCDATA | Flow';
diff --git a/library/HTMLPurifier/URISchemeRegistry.php b/library/HTMLPurifier/URISchemeRegistry.php
index 82fd9601..d840068a 100644
--- a/library/HTMLPurifier/URISchemeRegistry.php
+++ b/library/HTMLPurifier/URISchemeRegistry.php
@@ -10,7 +10,7 @@ HTMLPurifier_ConfigSchema::define(
         'irc' => true, // "Internet Relay Chat", usually needs another app
         // for Usenet, these two are similar, but distinct
         'nntp' => true, // individual Netnews articles
-        'news' => true // newsgroup or individual Netnews articles),
+        'news' => true // newsgroup or individual Netnews articles
     ), 'lookup',
     'Whitelist that defines the schemes that a URI is allowed to have. This '.
     'prevents XSS attacks from using pseudo-schemes like javascript or mocha.'