mirror of https://github.com/ezyang/htmlpurifier.git synced 2025-01-08 15:11:51 +00:00

[1.5.0] Rewrite XHTML 1.1 document to describe HTMLDefinition's modularization

- Use ElementDef->child to define a literal ChildDef object, rather than ElementDef->content_model.
- Add notes on transforms, HTMLModule will be able to write those too
- Fix some misc typos.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@729 48356398-32a2-884e-a903-53898d9a118a
Edward Z. Yang 2007-02-08 23:10:49 +00:00
parent 591fc0ae28
commit d5491da77f
6 changed files with 147 additions and 95 deletions


@ -7,7 +7,7 @@ value is used for. This means decentralized configuration declarations that
are nevertheless error checking and a centralized configuration object. are nevertheless error checking and a centralized configuration object.
Directives are divided into namespaces, indicating the major portion of Directives are divided into namespaces, indicating the major portion of
functionality they cover (although there may be overlaps. Please consult functionality they cover (although there may be overlaps). Please consult
the documentation in ConfigDef for more information on these namespaces. the documentation in ConfigDef for more information on these namespaces.
Since configuration is dependant on context, internal classes require a Since configuration is dependant on context, internal classes require a


@ -1,23 +1,22 @@
Getting XHTML 1.1 Working XHTML 1.1 and HTML Purifier
It's quite simple, according to <http://www.w3.org/TR/xhtml11/changes.html>
Todo for XHTML 1.1 support <http://www.w3.org/TR/xhtml11/changes.html>
1. Scratch lang entirely in favor of xml:lang 1. Scratch lang entirely in favor of xml:lang
2. Scratch name entirely in favor of id (partially-done) 2. Scratch name entirely in favor of id (partially-done)
3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/> 3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>
...but that's only an informative section. The true power of XHTML 1.1 HTML Purifier uses the modularization of XHTML
is its modularization, defined at: <http://www.w3.org/TR/xhtml-modularization/> to organize the internals
<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/> of HTMLDefinition into a more manageable and extensible fashion. Rather
than have one super-object, HTMLDefinition is split into HTMLModules,
each of which are responsible for defining elements, their attributes,
and other properties (for a more indepth coverage, see
/library/HTMLPurifier/HTMLModule.php's docblock comments).
Modularization may very well be the next-generation implementation The modules that W3C defines and we support are:
of HTMLDefinition. The current, XHTML 1.0 DTD-based approach is
extremely brittle and doesn't lend well to extension by others, but
modularization fixes all that. The modules W3C defines that we
should take a look at are:
* 5.1. Attribute Collections * 5.1. Attribute Collections (technically not a module
* 5.2. Core Modules * 5.2. Core Modules
o 5.2.2. Text Module o 5.2.2. Text Module
o 5.2.3. Hypertext Module o 5.2.3. Hypertext Module
@ -27,18 +26,22 @@ should take a look at are:
o 5.4.2. Edit Module o 5.4.2. Edit Module
o 5.4.3. Bi-directional Text Module o 5.4.3. Bi-directional Text Module
* 5.6. Table Modules * 5.6. Table Modules
o 5.6.1. Basic Tables Module [?]
o 5.6.2. Tables Module o 5.6.2. Tables Module
* 5.7. Image Module * 5.7. Image Module
* 5.18. Style Attribute Module
Modules that we don't support but coul support are:
* 5.6. Table Modules
o 5.6.1. Basic Tables Module [?]
* 5.8. Client-side Image Map Module [?] * 5.8. Client-side Image Map Module [?]
* 5.9. Server-side Image Map Module [?] * 5.9. Server-side Image Map Module [?]
* 5.12. Target Module [?] * 5.12. Target Module [?]
* 5.18. Style Attribute Module
* 5.21. Name Identification Module [deprecated] * 5.21. Name Identification Module [deprecated]
* 5.22. Legacy Module [deprecated] * 5.22. Legacy Module [deprecated]
We exclude these modules due to their dangerousness or inapplicability These modules will not be implemented due to their dangerousness or
as a XHTML fragment: inapplicability as an XHTML fragment:
* 5.2. Core Modules * 5.2. Core Modules
o 5.2.1. Structure Module o 5.2.1. Structure Module
@ -56,89 +59,129 @@ as a XHTML fragment:
* 5.19. Link Module * 5.19. Link Module
* 5.20. Base Module * 5.20. Base Module
Modularization also defines content sets: We will not be using W3C's XML Schemas or DTDs directly due to the lack
of robust tools for handling them (the main problem is that all the
current parsers are usually PHP 5 only and solely-validating, not
correcting).
* Heading The abstraction of the HTMLDefinition creation process will also
* Block contribute to a need for a caching system. Cache invalidation would be
* Inline difficult, but could be done by comparing the HTML and Attr config
* Flow {Heading | Block | Inline} namespaces with a copy that was packaged along with the serialized
* List HTMLDefinition object.
* Form [x]
* Formctrl [x]
Which may have elements dynamically added to them as more modules get == General Use-Case ==
added.
== Implementation Details == The outwards API of HTMLDefinition has been largely preserved, not
only for backwards-compatibility but also by design. Instead,
HTMLDefinition can be retrieved "raw", in which it loads a structure
that closely resembles the modules of XHTML 1.1. This structure is very
dynamic, making it easy to make cascading changes to global content
sets or remove elements in bulk.
We will not be using the XML Schemas or DTDs directly due to the lack However, once HTML Purifier needs the actual definition, it retrieves
of robust tools in the area. a finalized version of HTMLDefinition. The finalized definition involves
processing the modules into a form that it is optimized for multiple
calls. This final version is immutable and, even if editable, would
be extremely hard to change.
Since we will be performing a lot of abstracting, caching would be nice. Cache So, some code taking advantage of the XHTML modularization may look
invalidation could be done by comparing the HTML and Attr config namespaces
with a copy that was packaged along with this (we have no files to mtime)
We also have the trouble of preserving the current interface, which is
quite nice in terms of speed but not so good in terms of OO-ness. This
is fine: we may need to have a two-tiered setup approach, that goes
like this: like this:
1. When getHTMLDefinition() is initially called, we prepare the default <?php
environment of content-sets, loaded modules, and allowed content-sets. $config = HTMLPurifier_Config::createDefault();
This, while good for developers seeking to customize the tagset, is $def =& $config->getHTMLDefinition(true); // reference to raw
unusable by HTML Purifier internals. It represents the XML schema unset($def->modules['Hypertext']); // rm ''a'' link
2. When HTMLPurifier needs to use the definition, it calls a second setup $purifier = new HTMLPurifier($config);
function, which now performs any substitutions needed and instantiates $purifier->purify($html); // now the definition is finalized
all the objects which the internals will use. ?>
In this manner, complicated observers are not necessary, you just == Inclusions ==
specify a content module like:
$flow = '%Heading | %Block | %Inline'; One of the nice features of HTMLDefinition is that piggy-backing off
of global attribute and content sets is extremely easy to do.
And the second setup will perform the substitutions magically for you. === Attributes ===
A module will have certain properties: HTMLModule->elements[$element]->attr stores attribute information for the
specific attributes of $element. This is quite close to the final
API that HTML Purifier interfaces with, but there's an important
extra feature: attr may also contain a array with a member index zero.
- Elements <?php
- Attributes HTMLModule->elements[$element]->attr[0] = array('AttrSet');
- Content model ?>
- Content sets
- Content set extensions
- Content model extensions [x] (seen only on structural elements)
- Attribute collection extensions
In our case, the content model does a lot more than just define what Rather than map the attribute key 0 to an array (which should be
allowed children are: they also define exclusions. Also, if we refrain an AttrDef), it defines a number of attribute collections that should
from directly instantiating objects, we are posed with the problem of be merged into this elements attribute array.
how to signify which ChildDef to use. Remember: our specialized cases of
content models are proprietary optimizations that allow us to deal with
elements that don't belong rather than spit them out. Possible solutions:
1. Factory method that analyzes the definition and figures out who Furthermore, the value of an attribute key, attribute value pair need
to defer to. It would also be responsible for parsing out omissions. not be a fully fledged AttrDef object. They can also be a string, which
2. Don't use their content model syntax, just enumerate items and give signifies a AttrDef that is looked up from a centralized registry
the class-name of which one to use. If a complex definition is truly AttrTypes. This allows more concise attribute definitions that look
needed, then use content model syntax. A definition, then, would more like W3C's declarations, as well as offering a centralized point
be composed of multiple parts: for modifying the behavior of one attribute type. And, of course, the
- True content-model definition, OR old method of manually instantiating an AttrDef still works.
- Simple content-model definition
- List of items in the definition (may be multiple if dealing
with Chameleon)
- Name of the type (optional, required, etc)
Flexibility is absolutely essential, so the API of some of these === Attribute Collections ===
ChildDefs may need to change to lend them better to uniform treatment.
Attributes are somewhat easier to manage, because we would be using Attribute collections are stored and processed in the AttrCollections
associative arrays of elements => attributes => AttrDef names, and there object, which is responsible for performing the inclusions signified
would be an Attribute Types lookup array to get the appropriate AttrDef by the 0 index. These attribute collections, too, are mutable, by
(if the objects are stateful, they will need to be cloned). Attributes using HTMLModule->attr_collections. You may add new attributes
for just one element can be specifically overridden through some mechanism to a collection or define an entirely new collection for your module's
(probably a config lookup array as well as an internal counter, because use. Inclusions can also be cumulative.
of HTML 4.01 semantics concerns.) An attribute set will also optionally
have an array at index 0 which defines what attribute collections to
agglutinate on when parsing. This may allow us to get rid of global
attributes, which were also a proprietary implementation detail.
Alright: let's get to work! Attribute collections allow us to get rid of so called "global attributes"
(which actually aren't so global).
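
The index-0 inclusion mechanism described in the Attributes and Attribute Collections notes can be pictured with a small standalone sketch. It is not HTML Purifier's AttrCollections code: the function name expand_attr_includes, the collection names, and the string type names are all hypothetical, chosen only to show how inclusions might be merged and how they can be cumulative (a collection including other collections).

```php
<?php
// Hypothetical sketch, NOT the library's AttrCollections implementation.
// $attr:        attribute array for one element; index 0 may hold a list of
//               collection names to merge in.
// $collections: collection name => attribute array (may itself use index 0).
// $seen:        collections already visited, to guard against cycles.
function expand_attr_includes(array $attr, array $collections, array $seen = array())
{
    if (!isset($attr[0])) {
        return $attr; // nothing to include
    }
    $includes = $attr[0];
    unset($attr[0]); // index 0 is a directive, not a real attribute
    foreach ($includes as $name) {
        if (!isset($collections[$name]) || isset($seen[$name])) {
            continue; // unknown collection, or already merged
        }
        $seen[$name] = true;
        $merged = expand_attr_includes($collections[$name], $collections, $seen);
        // The + operator keeps the element's own definition on key clashes.
        $attr = $attr + $merged;
    }
    return $attr;
}

// Illustrative data only; the names and type strings are placeholders.
$collections = array(
    'Core'   => array('id' => 'ID', 'class' => 'NMTOKENS', 'title' => 'Text'),
    'I18N'   => array('xml:lang' => 'LanguageCode'),
    'Common' => array(0 => array('Core', 'I18N')), // cumulative inclusion
);
$attr   = array(0 => array('Common'), 'cite' => 'URI');
$result = expand_attr_includes($attr, $collections);
// $result has the keys: cite, id, class, title, xml:lang
```

Using the array union operator means an attribute defined directly on the element is never overwritten by an included collection, which is one plausible reading of "merged into this element's attribute array"; the real implementation may resolve clashes differently.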
=== Content Models and ChildDef ===
An implementation of the above-mentioned attributes and attribute
collections was applied to the ChildDef system. HTML Purifier uses
a proprietary system called ChildDef for performance and flexibility
reasons, but this does not line up very well with W3C's notion of
regexps for defining the allowed children of an element.
HTMLPurifier->elements[$element]->content_model and
HTMLPurifier->elements[$element]->content_model_type store information
about the final ChildDef that will be stored in
HTMLPurifier->elements[$element]->child (we use a different variable
because the two forms are sufficiently different).
$content_model is an abstract, string representation of the internal
state of ChildDef, while $content_model_type is a string identifier
of which ChildDef subclass to instantiate. $content_model is processed
by substituting all content set identifiers (capitalized element names)
with their contents. It is then parsed and passed into the appropriate
ChildDef class, as defined by the ContentSets->getChildDef() or the
custom fallback HTMLModule->getChildDef() for custom child definitions
not in the core.
You'll need to use these facilities if you plan on referencing a content
set like "Inline" or "Block", and using them is recommended even if you're
not due to their conciseness.
A few notes on $content_model: it's structure can be as complicated
as you want, but the pipe symbol (|) is reserved for defining possible
choices, due to the content sets implementation. For example, a content
model that looks like:
"Inline -> Block -> a"
...when the Inline content set is defined as "span | b" and the Block
content set is defined as "div | blockquote", will expand into:
"span | b -> div | blockquote -> a"
The custom HTMLModule->getChildDef() function will need to be able to
then feed this information to ChildDef in a usable manner.
=== Content Sets ===
Content sets can be altered using HTMLModule->content_sets, an associative
array of content set names to content set contents. If the content set
already exists, your values are appended on to it (great for, say,
registering the font tag as an inline element), otherwise it is
created. They are substituted into content_model.
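
As a rough illustration of the two behaviours described above (appending to an existing content set, then substituting set names into a content model), here is a minimal standalone sketch. The function names merge_content_sets and expand_content_model are hypothetical and are not part of HTML Purifier; the sketch also uses strtr(), whereas the library code in this commit uses str_replace().

```php
<?php
// Hypothetical sketch, NOT the library's ContentSets implementation.

// Merge a module's content sets into the global ones: append to an existing
// set using the pipe separator, otherwise create the set.
function merge_content_sets(array $global, array $module)
{
    foreach ($module as $name => $contents) {
        if (isset($global[$name])) {
            $global[$name] .= ' | ' . $contents;
        } else {
            $global[$name] = $contents;
        }
    }
    return $global;
}

// Substitute content set names in a content model string. strtr() does not
// rescan replaced text, so an element name inside a set is never re-expanded.
function expand_content_model($model, array $sets)
{
    return strtr($model, $sets);
}

// The example sets from the text, plus a module registering the font tag
// as an inline element.
$sets = array('Inline' => 'span | b', 'Block' => 'div | blockquote');
$sets = merge_content_sets($sets, array('Inline' => 'font'));
$expanded = expand_content_model('Inline -> Block -> a', $sets);
// $expanded is "span | b | font -> div | blockquote -> a"
```

Because the substitution is purely textual, the expanded string still has to be parsed by whichever ChildDef consumes it, which is why the pipe symbol is reserved in content models.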


@ -77,6 +77,7 @@ class HTMLPurifier_ContentSets
* @param $module Module that defined the ElementDef * @param $module Module that defined the ElementDef
*/ */
function generateChildDef(&$def, $module) { function generateChildDef(&$def, $module) {
if (!empty($def->child)) return; // already done!
$content_model = $def->content_model; $content_model = $def->content_model;
if (is_string($content_model)) { if (is_string($content_model)) {
$def->content_model = str_replace( $def->content_model = str_replace(
@ -95,7 +96,14 @@ class HTMLPurifier_ContentSets
*/ */
function getChildDef($def, $module) { function getChildDef($def, $module) {
$value = $def->content_model; $value = $def->content_model;
if (is_object($value)) return $value; // direct object, return if (is_object($value)) {
trigger_error(
'Literal object child definitions should be stored in '.
'ElementDef->child not ElementDef->content_model',
E_USER_NOTICE
);
return $value;
}
switch ($def->content_model_type) { switch ($def->content_model_type) {
case 'required': case 'required':
return new HTMLPurifier_ChildDef_Required($value); return new HTMLPurifier_ChildDef_Required($value);
@ -109,8 +117,10 @@ class HTMLPurifier_ContentSets
return new HTMLPurifier_ChildDef_Custom($value); return new HTMLPurifier_ChildDef_Custom($value);
} }
// defer to its module // defer to its module
if (!$module->defines_child_def) continue; // save a func call $return = false;
$return = $module->getChildDef($def); if ($module->defines_child_def) { // save a func call
$return = $module->getChildDef($def);
}
if ($return !== false) return $return; if ($return !== false) return $return;
// error-out // error-out
trigger_error( trigger_error(


@ -168,19 +168,19 @@ class HTMLPurifier_HTMLDefinition
/** /**
* Associative array of deprecated tag name to HTMLPurifier_TagTransform * Associative array of deprecated tag name to HTMLPurifier_TagTransform
* @public * @public
*/ */ // use + operator
var $info_tag_transform = array(); var $info_tag_transform = array();
/** /**
* List of HTMLPurifier_AttrTransform to be performed before validation. * List of HTMLPurifier_AttrTransform to be performed before validation.
* @public * @public
*/ */ // use array_merge or a foreach loop
var $info_attr_transform_pre = array(); var $info_attr_transform_pre = array();
/** /**
* List of HTMLPurifier_AttrTransform to be performed after validation. * List of HTMLPurifier_AttrTransform to be performed after validation.
* @public * @public
*/ */ // use array_merge or a foreach loop
var $info_attr_transform_post = array(); var $info_attr_transform_post = array();
/** /**


@ -58,8 +58,7 @@ class HTMLPurifier_HTMLModule_Tables extends HTMLPurifier_HTMLModule
// Is done directly because it doesn't leverage substitution // Is done directly because it doesn't leverage substitution
// mechanisms. True model is: // mechanisms. True model is:
// 'caption?, ( col* | colgroup* ), (( thead?, tfoot?, tbody+ ) | ( tr+ ))' // 'caption?, ( col* | colgroup* ), (( thead?, tfoot?, tbody+ ) | ( tr+ ))'
$this->info['table']->content_model = new HTMLPurifier_ChildDef_Table(); $this->info['table']->child = new HTMLPurifier_ChildDef_Table();
$this->info['table']->content_model_type = 'table';
$this->info['td']->content_model = $this->info['td']->content_model =
$this->info['th']->content_model = '#PCDATA | Flow'; $this->info['th']->content_model = '#PCDATA | Flow';


@ -10,7 +10,7 @@ HTMLPurifier_ConfigSchema::define(
'irc' => true, // "Internet Relay Chat", usually needs another app 'irc' => true, // "Internet Relay Chat", usually needs another app
// for Usenet, these two are similar, but distinct // for Usenet, these two are similar, but distinct
'nntp' => true, // individual Netnews articles 'nntp' => true, // individual Netnews articles
'news' => true // newsgroup or individual Netnews articles), 'news' => true // newsgroup or individual Netnews articles
), 'lookup', ), 'lookup',
'Whitelist that defines the schemes that a URI is allowed to have. This '. 'Whitelist that defines the schemes that a URI is allowed to have. This '.
'prevents XSS attacks from using pseudo-schemes like javascript or mocha.' 'prevents XSS attacks from using pseudo-schemes like javascript or mocha.'