mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2025-01-08 15:11:51 +00:00
[1.5.0] Rewrite XHTML 1.1 document to describe HTMLDefinition's modularization
- Use ElementDef->child to define a literal ChildDef object, rather than ElementDef->content_model.
- Add notes on transforms, HTMLModule will be able to write those too
- Fix some misc typos.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@729 48356398-32a2-884e-a903-53898d9a118a
parent
591fc0ae28
commit
d5491da77f
@ -7,7 +7,7 @@ value is used for. This means decentralized configuration declarations that
|
|||||||
are nevertheless error checking and a centralized configuration object.
|
are nevertheless error checking and a centralized configuration object.
|
||||||
|
|
||||||
Directives are divided into namespaces, indicating the major portion of
|
Directives are divided into namespaces, indicating the major portion of
|
||||||
functionality they cover (although there may be overlaps. Please consult
|
functionality they cover (although there may be overlaps). Please consult
|
||||||
the documentation in ConfigDef for more information on these namespaces.
|
the documentation in ConfigDef for more information on these namespaces.
|
||||||
|
|
||||||
Since configuration is dependant on context, internal classes require a
|
Since configuration is dependant on context, internal classes require a
|
||||||
|
@ -1,23 +1,22 @@
|
|||||||
|
|
||||||
Getting XHTML 1.1 Working
|
XHTML 1.1 and HTML Purifier
|
||||||
|
|
||||||
It's quite simple, according to <http://www.w3.org/TR/xhtml11/changes.html>
|
|
||||||
|
|
||||||
|
Todo for XHTML 1.1 support <http://www.w3.org/TR/xhtml11/changes.html>
|
||||||
1. Scratch lang entirely in favor of xml:lang
|
1. Scratch lang entirely in favor of xml:lang
|
||||||
2. Scratch name entirely in favor of id (partially-done)
|
2. Scratch name entirely in favor of id (partially-done)
|
||||||
3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>
|
3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>
|
||||||
|
|
||||||
...but that's only an informative section. The true power of XHTML 1.1
|
HTML Purifier uses the modularization of XHTML
|
||||||
is its modularization, defined at:
|
<http://www.w3.org/TR/xhtml-modularization/> to organize the internals
|
||||||
<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/>
|
of HTMLDefinition into a more manageable and extensible fashion. Rather
|
||||||
|
than have one super-object, HTMLDefinition is split into HTMLModules,
|
||||||
|
each of which is responsible for defining elements, their attributes,
|
||||||
|
and other properties (for more in-depth coverage, see
|
||||||
|
/library/HTMLPurifier/HTMLModule.php's docblock comments).
|
||||||
|
|
||||||
Modularization may very well be the next-generation implementation
|
The modules that W3C defines and we support are:
|
||||||
of HTMLDefinition. The current, XHTML 1.0 DTD-based approach is
|
|
||||||
extremely brittle and doesn't lend well to extension by others, but
|
|
||||||
modularization fixes all that. The modules W3C defines that we
|
|
||||||
should take a look at are:
|
|
||||||
|
|
||||||
* 5.1. Attribute Collections
|
* 5.1. Attribute Collections (technically not a module)
|
||||||
* 5.2. Core Modules
|
* 5.2. Core Modules
|
||||||
o 5.2.2. Text Module
|
o 5.2.2. Text Module
|
||||||
o 5.2.3. Hypertext Module
|
o 5.2.3. Hypertext Module
|
||||||
@ -27,18 +26,22 @@ should take a look at are:
|
|||||||
o 5.4.2. Edit Module
|
o 5.4.2. Edit Module
|
||||||
o 5.4.3. Bi-directional Text Module
|
o 5.4.3. Bi-directional Text Module
|
||||||
* 5.6. Table Modules
|
* 5.6. Table Modules
|
||||||
o 5.6.1. Basic Tables Module [?]
|
|
||||||
o 5.6.2. Tables Module
|
o 5.6.2. Tables Module
|
||||||
* 5.7. Image Module
|
* 5.7. Image Module
|
||||||
|
* 5.18. Style Attribute Module
|
||||||
|
|
||||||
|
Modules that we don't support but could support are:
|
||||||
|
|
||||||
|
* 5.6. Table Modules
|
||||||
|
o 5.6.1. Basic Tables Module [?]
|
||||||
* 5.8. Client-side Image Map Module [?]
|
* 5.8. Client-side Image Map Module [?]
|
||||||
* 5.9. Server-side Image Map Module [?]
|
* 5.9. Server-side Image Map Module [?]
|
||||||
* 5.12. Target Module [?]
|
* 5.12. Target Module [?]
|
||||||
* 5.18. Style Attribute Module
|
|
||||||
* 5.21. Name Identification Module [deprecated]
|
* 5.21. Name Identification Module [deprecated]
|
||||||
* 5.22. Legacy Module [deprecated]
|
* 5.22. Legacy Module [deprecated]
|
||||||
|
|
||||||
We exclude these modules due to their dangerousness or inapplicability
|
These modules will not be implemented due to their dangerousness or
|
||||||
as a XHTML fragment:
|
inapplicability as an XHTML fragment:
|
||||||
|
|
||||||
* 5.2. Core Modules
|
* 5.2. Core Modules
|
||||||
o 5.2.1. Structure Module
|
o 5.2.1. Structure Module
|
||||||
@ -56,89 +59,129 @@ as a XHTML fragment:
|
|||||||
* 5.19. Link Module
|
* 5.19. Link Module
|
||||||
* 5.20. Base Module
|
* 5.20. Base Module
|
||||||
|
|
||||||
Modularization also defines content sets:
|
We will not be using W3C's XML Schemas or DTDs directly due to the lack
|
||||||
|
of robust tools for handling them (the main problem is that all the
|
||||||
|
current parsers are usually PHP 5 only and solely-validating, not
|
||||||
|
correcting).
|
||||||
|
|
||||||
* Heading
|
The abstraction of the HTMLDefinition creation process will also
|
||||||
* Block
|
contribute to a need for a caching system. Cache invalidation would be
|
||||||
* Inline
|
difficult, but could be done by comparing the HTML and Attr config
|
||||||
* Flow {Heading | Block | Inline}
|
namespaces with a copy that was packaged along with the serialized
|
||||||
* List
|
HTMLDefinition object.
|
||||||
* Form [x]
|
|
||||||
* Formctrl [x]
|
|
||||||
|
|
||||||
Which may have elements dynamically added to them as more modules get
|
== General Use-Case ==
|
||||||
added.
|
|
||||||
|
|
||||||
== Implementation Details ==
|
The outwards API of HTMLDefinition has been largely preserved, not
|
||||||
|
only for backwards-compatibility but also by design. Instead,
|
||||||
|
HTMLDefinition can be retrieved "raw", in which it loads a structure
|
||||||
|
that closely resembles the modules of XHTML 1.1. This structure is very
|
||||||
|
dynamic, making it easy to make cascading changes to global content
|
||||||
|
sets or remove elements in bulk.
|
||||||
|
|
||||||
We will not be using the XML Schemas or DTDs directly due to the lack
|
However, once HTML Purifier needs the actual definition, it retrieves
|
||||||
of robust tools in the area.
|
a finalized version of HTMLDefinition. The finalized definition involves
|
||||||
|
processing the modules into a form that is optimized for multiple
|
||||||
|
calls. This final version is immutable and, even if editable, would
|
||||||
|
be extremely hard to change.
|
||||||
|
|
||||||
Since we will be performing a lot of abstracting, caching would be nice. Cache
|
So, some code taking advantage of the XHTML modularization may look
|
||||||
invalidation could be done by comparing the HTML and Attr config namespaces
|
|
||||||
with a copy that was packaged along with this (we have no files to mtime)
|
|
||||||
|
|
||||||
We also have the trouble of preserving the current interface, which is
|
|
||||||
quite nice in terms of speed but not so good in terms of OO-ness. This
|
|
||||||
is fine: we may need to have a two-tiered setup approach, that goes
|
|
||||||
like this:
|
like this:
|
||||||
|
|
||||||
1. When getHTMLDefinition() is initially called, we prepare the default
|
<?php
|
||||||
environment of content-sets, loaded modules, and allowed content-sets.
|
$config = HTMLPurifier_Config::createDefault();
|
||||||
This, while good for developers seeking to customize the tagset, is
|
$def =& $config->getHTMLDefinition(true); // reference to raw
|
||||||
unusable by HTML Purifier internals. It represents the XML schema
|
unset($def->modules['Hypertext']); // remove the Hypertext module (the a element)
|
||||||
2. When HTMLPurifier needs to use the definition, it calls a second setup
|
$purifier = new HTMLPurifier($config);
|
||||||
function, which now performs any substitutions needed and instantiates
|
$purifier->purify($html); // now the definition is finalized
|
||||||
all the objects which the internals will use.
|
?>
|
||||||
|
|
||||||
In this manner, complicated observers are not necessary, you just
|
== Inclusions ==
|
||||||
specify a content module like:
|
|
||||||
|
|
||||||
$flow = '%Heading | %Block | %Inline';
|
One of the nice features of HTMLDefinition is that piggy-backing off
|
||||||
|
of global attribute and content sets is extremely easy to do.
|
||||||
|
|
||||||
And the second setup will perform the substitutions magically for you.
|
=== Attributes ===
|
||||||
|
|
||||||
A module will have certain properties:
|
HTMLModule->elements[$element]->attr stores attribute information for the
|
||||||
|
specific attributes of $element. This is quite close to the final
|
||||||
|
API that HTML Purifier interfaces with, but there's an important
|
||||||
|
extra feature: attr may also contain an array at index zero.
|
||||||
|
|
||||||
- Elements
|
<?php
|
||||||
- Attributes
|
$module->elements[$element]->attr[0] = array('AttrSet'); // $module: an HTMLModule instance
|
||||||
- Content model
|
?>
|
||||||
- Content sets
|
|
||||||
- Content set extensions
|
|
||||||
- Content model extensions [x] (seen only on structural elements)
|
|
||||||
- Attribute collection extensions
|
|
||||||
|
|
||||||
In our case, the content model does a lot more than just define what
|
Rather than map the attribute key 0 to an array (which should be
|
||||||
allowed children are: they also define exclusions. Also, if we refrain
|
an AttrDef), it defines a number of attribute collections that should
|
||||||
from directly instantiating objects, we are posed with the problem of
|
be merged into this element's attribute array.
|
||||||
how to signify which ChildDef to use. Remember: our specialized cases of
|
|
||||||
content models are proprietary optimizations that allow us to deal with
|
|
||||||
elements that don't belong rather than spit them out. Possible solutions:
|
|
||||||
|
|
||||||
1. Factory method that analyzes the definition and figures out who
|
Furthermore, the value of an attribute key, attribute value pair need
|
||||||
to defer to. It would also be responsible for parsing out omissions.
|
not be a fully fledged AttrDef object. It can also be a string, which
|
||||||
2. Don't use their content model syntax, just enumerate items and give
|
signifies an AttrDef that is looked up from a centralized registry
|
||||||
the class-name of which one to use. If a complex definition is truly
|
AttrTypes. This allows more concise attribute definitions that look
|
||||||
needed, then use content model syntax. A definition, then, would
|
more like W3C's declarations, as well as offering a centralized point
|
||||||
be composed of multiple parts:
|
for modifying the behavior of one attribute type. And, of course, the
|
||||||
- True content-model definition, OR
|
old method of manually instantiating an AttrDef still works.
|
||||||
- Simple content-model definition
|
|
||||||
- List of items in the definition (may be multiple if dealing
|
|
||||||
with Chameleon)
|
|
||||||
- Name of the type (optional, required, etc)
|
|
||||||
|
|
||||||
Flexibility is absolutely essential, so the API of some of these
|
=== Attribute Collections ===
|
||||||
ChildDefs may need to change to lend them better to uniform treatment.
|
|
||||||
|
|
||||||
Attributes are somewhat easier to manage, because we would be using
|
Attribute collections are stored and processed in the AttrCollections
|
||||||
associative arrays of elements => attributes => AttrDef names, and there
|
object, which is responsible for performing the inclusions signified
|
||||||
would be an Attribute Types lookup array to get the appropriate AttrDef
|
by the 0 index. These attribute collections, too, are mutable, by
|
||||||
(if the objects are stateful, they will need to be cloned). Attributes
|
using HTMLModule->attr_collections. You may add new attributes
|
||||||
for just one element can be specifically overridden through some mechanism
|
to a collection or define an entirely new collection for your module's
|
||||||
(probably a config lookup array as well as an internal counter, because
|
use. Inclusions can also be cumulative.
|
||||||
of HTML 4.01 semantics concerns.) An attribute set will also optionally
|
|
||||||
have an array at index 0 which defines what attribute collections to
|
|
||||||
agglutinate on when parsing. This may allow us to get rid of global
|
|
||||||
attributes, which were also a proprietary implementation detail.
|
|
||||||
|
|
||||||
Alright: let's get to work!
|
Attribute collections allow us to get rid of so called "global attributes"
|
||||||
|
(which actually aren't so global).
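
The inclusion step described above can be sketched roughly as follows.
This is only an illustration, not HTML Purifier's actual AttrCollections
code: the function name expand_attr_includes(), the collection names and
the attribute type strings are all assumptions made for the example. It
assumes that an element's own attributes take precedence over included
ones, and it follows nested index-0 arrays to model cumulative inclusions.

```php
<?php
// Hypothetical sketch: names and data are illustrative, not the library's API.
// Expands the collections named at index 0 into the attribute array.
function expand_attr_includes(array $attr, array $collections, array $seen = array()) {
    if (!isset($attr[0])) {
        return $attr; // nothing to include
    }
    $includes = $attr[0];
    unset($attr[0]);
    foreach ($includes as $name) {
        // Skip unknown collections and guard against circular inclusions.
        if (isset($seen[$name]) || !isset($collections[$name])) {
            continue;
        }
        $seen[$name] = true;
        // Collections may themselves include other collections (cumulative).
        $included = expand_attr_includes($collections[$name], $collections, $seen);
        // Array union: attributes already defined on the element win.
        $attr += $included;
    }
    return $attr;
}

// Assumed example data: 'Common' includes 'Core'.
$collections = array(
    'Core'   => array('id' => 'ID', 'class' => 'NMTOKENS'),
    'Common' => array(0 => array('Core'), 'title' => 'Text'),
);
$attr = array(0 => array('Common'), 'href' => 'URI');

$result = expand_attr_includes($attr, $collections);
print_r($result);
```

After expansion the element carries href, title, id and class, and the
index-0 entry is gone, so later stages only see attribute name to
definition pairs.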
|
||||||
|
|
||||||
|
=== Content Models and ChildDef ===
|
||||||
|
|
||||||
|
An implementation of the above-mentioned attributes and attribute
|
||||||
|
collections was applied to the ChildDef system. HTML Purifier uses
|
||||||
|
a proprietary system called ChildDef for performance and flexibility
|
||||||
|
reasons, but this does not line up very well with W3C's notion of
|
||||||
|
regexps for defining the allowed children of an element.
|
||||||
|
|
||||||
|
HTMLPurifier->elements[$element]->content_model and
|
||||||
|
HTMLPurifier->elements[$element]->content_model_type store information
|
||||||
|
about the final ChildDef that will be stored in
|
||||||
|
HTMLPurifier->elements[$element]->child (we use a different variable
|
||||||
|
because the two forms are sufficiently different).
|
||||||
|
|
||||||
|
$content_model is an abstract, string representation of the internal
|
||||||
|
state of ChildDef, while $content_model_type is a string identifier
|
||||||
|
of which ChildDef subclass to instantiate. $content_model is processed
|
||||||
|
by substituting all content set identifiers (capitalized element names)
|
||||||
|
with their contents. It is then parsed and passed into the appropriate
|
||||||
|
ChildDef class, as defined by the ContentSets->getChildDef() or the
|
||||||
|
custom fallback HTMLModule->getChildDef() for custom child definitions
|
||||||
|
not in the core.
|
||||||
|
|
||||||
|
You'll need to use these facilities if you plan on referencing a content
|
||||||
|
set like "Inline" or "Block", and using them is recommended even if you're
|
||||||
|
not due to their conciseness.
|
||||||
|
|
||||||
|
A few notes on $content_model: its structure can be as complicated
|
||||||
|
as you want, but the pipe symbol (|) is reserved for defining possible
|
||||||
|
choices, due to the content sets implementation. For example, a content
|
||||||
|
model that looks like:
|
||||||
|
|
||||||
|
"Inline -> Block -> a"
|
||||||
|
|
||||||
|
...when the Inline content set is defined as "span | b" and the Block
|
||||||
|
content set is defined as "div | blockquote", will expand into:
|
||||||
|
|
||||||
|
"span | b -> div | blockquote -> a"
|
||||||
|
|
||||||
|
The custom HTMLModule->getChildDef() function will need to be able to
|
||||||
|
then feed this information to ChildDef in a usable manner.
|
||||||
|
|
||||||
|
=== Content Sets ===
|
||||||
|
|
||||||
|
Content sets can be altered using HTMLModule->content_sets, an associative
|
||||||
|
array of content set names to content set contents. If the content set
|
||||||
|
already exists, your values are appended on to it (great for, say,
|
||||||
|
registering the font tag as an inline element), otherwise it is
|
||||||
|
created. They are substituted into content_model.
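
As a rough illustration of the substitution described in this document,
the "Inline -> Block -> a" example can be reproduced with a plain string
replacement. This is a minimal sketch under stated assumptions, not the
ContentSets implementation: the $content_sets array below is taken from
the example in the text, and strtr() is used here (rather than whatever
the library does internally) because it replaces all keys in one pass
and never rescans the replacement text.

```php
<?php
// Assumed content sets, copied from the example in the text above.
$content_sets = array(
    'Inline' => 'span | b',
    'Block'  => 'div | blockquote',
);

$content_model = 'Inline -> Block -> a';

// strtr() with an array substitutes every key in a single pass,
// so text inserted for one set is not expanded a second time.
$expanded = strtr($content_model, $content_sets);

echo $expanded, "\n"; // span | b -> div | blockquote -> a
```

Because the expansion is purely textual, the pipe symbol inside each set
ends up inline in the model, which is why it has to stay reserved for
choices.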
|
@ -77,6 +77,7 @@ class HTMLPurifier_ContentSets
|
|||||||
* @param $module Module that defined the ElementDef
|
* @param $module Module that defined the ElementDef
|
||||||
*/
|
*/
|
||||||
function generateChildDef(&$def, $module) {
|
function generateChildDef(&$def, $module) {
|
||||||
|
if (!empty($def->child)) return; // already done!
|
||||||
$content_model = $def->content_model;
|
$content_model = $def->content_model;
|
||||||
if (is_string($content_model)) {
|
if (is_string($content_model)) {
|
||||||
$def->content_model = str_replace(
|
$def->content_model = str_replace(
|
||||||
@ -95,7 +96,14 @@ class HTMLPurifier_ContentSets
|
|||||||
*/
|
*/
|
||||||
function getChildDef($def, $module) {
|
function getChildDef($def, $module) {
|
||||||
$value = $def->content_model;
|
$value = $def->content_model;
|
||||||
if (is_object($value)) return $value; // direct object, return
|
if (is_object($value)) {
|
||||||
|
trigger_error(
|
||||||
|
'Literal object child definitions should be stored in '.
|
||||||
|
'ElementDef->child not ElementDef->content_model',
|
||||||
|
E_USER_NOTICE
|
||||||
|
);
|
||||||
|
return $value;
|
||||||
|
}
|
||||||
switch ($def->content_model_type) {
|
switch ($def->content_model_type) {
|
||||||
case 'required':
|
case 'required':
|
||||||
return new HTMLPurifier_ChildDef_Required($value);
|
return new HTMLPurifier_ChildDef_Required($value);
|
||||||
@ -109,8 +117,10 @@ class HTMLPurifier_ContentSets
|
|||||||
return new HTMLPurifier_ChildDef_Custom($value);
|
return new HTMLPurifier_ChildDef_Custom($value);
|
||||||
}
|
}
|
||||||
// defer to its module
|
// defer to its module
|
||||||
if (!$module->defines_child_def) continue; // save a func call
|
$return = false;
|
||||||
$return = $module->getChildDef($def);
|
if ($module->defines_child_def) { // save a func call
|
||||||
|
$return = $module->getChildDef($def);
|
||||||
|
}
|
||||||
if ($return !== false) return $return;
|
if ($return !== false) return $return;
|
||||||
// error-out
|
// error-out
|
||||||
trigger_error(
|
trigger_error(
|
||||||
|
@ -168,19 +168,19 @@ class HTMLPurifier_HTMLDefinition
|
|||||||
/**
|
/**
|
||||||
* Associative array of deprecated tag name to HTMLPurifier_TagTransform
|
* Associative array of deprecated tag name to HTMLPurifier_TagTransform
|
||||||
* @public
|
* @public
|
||||||
*/
|
*/ // use + operator
|
||||||
var $info_tag_transform = array();
|
var $info_tag_transform = array();
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* List of HTMLPurifier_AttrTransform to be performed before validation.
|
* List of HTMLPurifier_AttrTransform to be performed before validation.
|
||||||
* @public
|
* @public
|
||||||
*/
|
*/ // use array_merge or a foreach loop
|
||||||
var $info_attr_transform_pre = array();
|
var $info_attr_transform_pre = array();
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* List of HTMLPurifier_AttrTransform to be performed after validation.
|
* List of HTMLPurifier_AttrTransform to be performed after validation.
|
||||||
* @public
|
* @public
|
||||||
*/
|
*/ // use array_merge or a foreach loop
|
||||||
var $info_attr_transform_post = array();
|
var $info_attr_transform_post = array();
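
// Note on the merge hints above ("use + operator" versus "use array_merge"):
// the difference matters because tag transforms are keyed by tag name while
// attribute transforms are plain lists. The sketch below only demonstrates
// the two PHP operations; the strings are placeholders standing in for
// transform objects and are not real class names.

```php
<?php
// Associative arrays (tag name => transform): the + operator is a union
// that keeps the left-hand value when a key exists on both sides.
$tag_a = array('font' => 'TransformA', 'center' => 'TransformB');
$tag_b = array('font' => 'TransformC', 'menu' => 'TransformD');
$tags  = $tag_a + $tag_b;
// $tags: font => TransformA, center => TransformB, menu => TransformD

// Numerically indexed lists: + would silently drop entries whose index
// already exists on the left, so array_merge() is needed to append.
$pre_a  = array('X', 'Y');
$pre_b  = array('Z');
$union  = $pre_a + $pre_b;             // array('X', 'Y'); 'Z' is lost
$merged = array_merge($pre_a, $pre_b); // array('X', 'Y', 'Z')

print_r($tags);
print_r($union);
print_r($merged);
```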
|
||||||
|
|
||||||
/**
|
/**
|
||||||
|
@ -58,8 +58,7 @@ class HTMLPurifier_HTMLModule_Tables extends HTMLPurifier_HTMLModule
|
|||||||
// Is done directly because it doesn't leverage substitution
|
// Is done directly because it doesn't leverage substitution
|
||||||
// mechanisms. True model is:
|
// mechanisms. True model is:
|
||||||
// 'caption?, ( col* | colgroup* ), (( thead?, tfoot?, tbody+ ) | ( tr+ ))'
|
// 'caption?, ( col* | colgroup* ), (( thead?, tfoot?, tbody+ ) | ( tr+ ))'
|
||||||
$this->info['table']->content_model = new HTMLPurifier_ChildDef_Table();
|
$this->info['table']->child = new HTMLPurifier_ChildDef_Table();
|
||||||
$this->info['table']->content_model_type = 'table';
|
|
||||||
|
|
||||||
$this->info['td']->content_model =
|
$this->info['td']->content_model =
|
||||||
$this->info['th']->content_model = '#PCDATA | Flow';
|
$this->info['th']->content_model = '#PCDATA | Flow';
|
||||||
|
@ -10,7 +10,7 @@ HTMLPurifier_ConfigSchema::define(
|
|||||||
'irc' => true, // "Internet Relay Chat", usually needs another app
|
'irc' => true, // "Internet Relay Chat", usually needs another app
|
||||||
// for Usenet, these two are similar, but distinct
|
// for Usenet, these two are similar, but distinct
|
||||||
'nntp' => true, // individual Netnews articles
|
'nntp' => true, // individual Netnews articles
|
||||||
'news' => true // newsgroup or individual Netnews articles),
|
'news' => true // newsgroup or individual Netnews articles
|
||||||
), 'lookup',
|
), 'lookup',
|
||||||
'Whitelist that defines the schemes that a URI is allowed to have. This '.
|
'Whitelist that defines the schemes that a URI is allowed to have. This '.
|
||||||
'prevents XSS attacks from using pseudo-schemes like javascript or mocha.'
|
'prevents XSS attacks from using pseudo-schemes like javascript or mocha.'
|
||||||
|