mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-12-22 16:31:53 +00:00
[1.5.0] Rewrite XHTML 1.1 document to describe HTMLDefinition's modularization
- Use ElementDef->child to define a literal ChildDef object, rather than ElementDef->content_model.
- Add notes on transforms, HTMLModule will be able to write those too
- Fix some misc typos.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@729 48356398-32a2-884e-a903-53898d9a118a
parent 591fc0ae28
commit d5491da77f
@@ -7,7 +7,7 @@ value is used for. This means decentralized configuration declarations that
 are nevertheless error checking and a centralized configuration object.
 
 Directives are divided into namespaces, indicating the major portion of
-functionality they cover (although there may be overlaps. Please consult
+functionality they cover (although there may be overlaps). Please consult
 the documentation in ConfigDef for more information on these namespaces.
 
 Since configuration is dependant on context, internal classes require a
@@ -1,23 +1,22 @@
 
-Getting XHTML 1.1 Working
-
-It's quite simple, according to <http://www.w3.org/TR/xhtml11/changes.html>
+XHTML 1.1 and HTML Purifier
 
+Todo for XHTML 1.1 support <http://www.w3.org/TR/xhtml11/changes.html>
 1. Scratch lang entirely in favor of xml:lang
 2. Scratch name entirely in favor of id (partially-done)
 3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>
 
-...but that's only an informative section. The true power of XHTML 1.1
-is its modularization, defined at:
-<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/>
+HTML Purifier uses the modularization of XHTML
+<http://www.w3.org/TR/xhtml-modularization/> to organize the internals
+of HTMLDefinition into a more manageable and extensible fashion. Rather
+than have one super-object, HTMLDefinition is split into HTMLModules,
+each of which are responsible for defining elements, their attributes,
+and other properties (for a more indepth coverage, see
+/library/HTMLPurifier/HTMLModule.php's docblock comments).
 
-Modularization may very well be the next-generation implementation
-of HTMLDefinition. The current, XHTML 1.0 DTD-based approach is
-extremely brittle and doesn't lend well to extension by others, but
-modularization fixes all that. The modules W3C defines that we
-should take a look at are:
+The modules that W3C defines and we support are:
 
-* 5.1. Attribute Collections
+* 5.1. Attribute Collections (technically not a module
 * 5.2. Core Modules
   o 5.2.2. Text Module
   o 5.2.3. Hypertext Module
@@ -27,18 +26,22 @@ should take a look at are:
   o 5.4.2. Edit Module
   o 5.4.3. Bi-directional Text Module
 * 5.6. Table Modules
-  o 5.6.1. Basic Tables Module [?]
   o 5.6.2. Tables Module
 * 5.7. Image Module
+* 5.18. Style Attribute Module
+
+Modules that we don't support but coul support are:
+
+* 5.6. Table Modules
+  o 5.6.1. Basic Tables Module [?]
 * 5.8. Client-side Image Map Module [?]
 * 5.9. Server-side Image Map Module [?]
 * 5.12. Target Module [?]
-* 5.18. Style Attribute Module
 * 5.21. Name Identification Module [deprecated]
 * 5.22. Legacy Module [deprecated]
 
-We exclude these modules due to their dangerousness or inapplicability
-as a XHTML fragment:
+These modules will not be implemented due to their dangerousness or
+inapplicability as an XHTML fragment:
 
 * 5.2. Core Modules
   o 5.2.1. Structure Module
@@ -56,89 +59,129 @@ as a XHTML fragment:
 * 5.19. Link Module
 * 5.20. Base Module
 
-Modularization also defines content sets:
+We will not be using W3C's XML Schemas or DTDs directly due to the lack
+of robust tools for handling them (the main problem is that all the
+current parsers are usually PHP 5 only and solely-validating, not
+correcting).
 
-* Heading
-* Block
-* Inline
-* Flow {Heading | Block | Inline}
-* List
-* Form [x]
-* Formctrl [x]
+The abstraction of the HTMLDefinition creation process will also
+contribute to a need for a caching system. Cache invalidation would be
+difficult, but could be done by comparing the HTML and Attr config
+namespaces with a copy that was packaged along with the serialized
+HTMLDefinition object.
 
-Which may have elements dynamically added to them as more modules get
-added.
+== General Use-Case ==
 
-== Implementation Details ==
+The outwards API of HTMLDefinition has been largely preserved, not
+only for backwards-compatibility but also by design. Instead,
+HTMLDefinition can be retrieved "raw", in which it loads a structure
+that closely resembles the modules of XHTML 1.1. This structure is very
+dynamic, making it easy to make cascading changes to global content
+sets or remove elements in bulk.
 
-We will not be using the XML Schemas or DTDs directly due to the lack
-of robust tools in the area.
+However, once HTML Purifier needs the actual definition, it retrieves
+a finalized version of HTMLDefinition. The finalized definition involves
+processing the modules into a form that it is optimized for multiple
+calls. This final version is immutable and, even if editable, would
+be extremely hard to change.
 
-Since we will be performing a lot of abstracting, caching would be nice. Cache
-invalidation could be done by comparing the HTML and Attr config namespaces
-with a copy that was packaged along with this (we have no files to mtime)
-
-We also have the trouble of preserving the current interface, which is
-quite nice in terms of speed but not so good in terms of OO-ness. This
-is fine: we may need to have a two-tiered setup approach, that goes
+So, some code taking advantage of the XHTML modularization may look
+like this:
 
-1. When getHTMLDefinition() is initially called, we prepare the default
-environment of content-sets, loaded modules, and allowed content-sets.
-This, while good for developers seeking to customize the tagset, is
-unusable by HTML Purifier internals. It represents the XML schema
-2. When HTMLPurifier needs to use the definition, it calls a second setup
-function, which now performs any substitutions needed and instantiates
-all the objects which the internals will use.
+<?php
+$config = HTMLPurifier_Config::createDefault();
+$def =& $config->getHTMLDefinition(true); // reference to raw
+unset($def->modules['Hypertext']); // rm ''a'' link
+$purifier = new HTMLPurifier($config);
+$purifier->purify($html); // now the definition is finalized
+?>
 
-In this manner, complicated observers are not necessary, you just
-specify a content module like:
+== Inclusions ==
 
-$flow = '%Heading | %Block | %Inline';
+One of the nice features of HTMLDefinition is that piggy-backing off
+of global attribute and content sets is extremely easy to do.
 
-And the second setup will perform the substitutions magically for you.
+=== Attributes ===
 
-A module will have certain properties:
+HTMLModule->elements[$element]->attr stores attribute information for the
+specific attributes of $element. This is quite close to the final
+API that HTML Purifier interfaces with, but there's an important
+extra feature: attr may also contain a array with a member index zero.
 
-- Elements
-- Attributes
-- Content model
-- Content sets
-- Content set extensions
-- Content model extensions [x] (seen only on structural elements)
-- Attribute collection extensions
+<?php
+HTMLModule->elements[$element]->attr[0] = array('AttrSet');
+?>
 
-In our case, the content model does a lot more than just define what
-allowed children are: they also define exclusions. Also, if we refrain
-from directly instantiating objects, we are posed with the problem of
-how to signify which ChildDef to use. Remember: our specialized cases of
-content models are proprietary optimizations that allow us to deal with
-elements that don't belong rather than spit them out. Possible solutions:
+Rather than map the attribute key 0 to an array (which should be
+an AttrDef), it defines a number of attribute collections that should
+be merged into this elements attribute array.
 
-1. Factory method that analyzes the definition and figures out who
-to defer to. It would also be responsible for parsing out omissions.
-2. Don't use their content model syntax, just enumerate items and give
-the class-name of which one to use. If a complex definition is truly
-needed, then use content model syntax. A definition, then, would
-be composed of multiple parts:
-- True content-model definition, OR
-- Simple content-model definition
-- List of items in the definition (may be multiple if dealing
-with Chameleon)
-- Name of the type (optional, required, etc)
+Furthermore, the value of an attribute key, attribute value pair need
+not be a fully fledged AttrDef object. They can also be a string, which
+signifies a AttrDef that is looked up from a centralized registry
+AttrTypes. This allows more concise attribute definitions that look
+more like W3C's declarations, as well as offering a centralized point
+for modifying the behavior of one attribute type. And, of course, the
+old method of manually instantiating an AttrDef still works.
 
-Flexibility is absolutely essential, so the API of some of these
-ChildDefs may need to change to lend them better to uniform treatment.
+=== Attribute Collections ===
 
-Attributes are somewhat easier to manage, because we would be using
-associative arrays of elements => attributes => AttrDef names, and there
-would be an Attribute Types lookup array to get the appropriate AttrDef
-(if the objects are stateful, they will need to be cloned). Attributes
-for just one element can be specifically overridden through some mechanism
-(probably a config lookup array as well as an internal counter, because
-of HTML 4.01 semantics concerns.) An attribute set will also optionally
-have an array at index 0 which defines what attribute collections to
-agglutinate on when parsing. This may allow us to get rid of global
-attributes, which were also a proprietary implementation detail.
+Attribute collections are stored and processed in the AttrCollections
+object, which is responsible for performing the inclusions signified
+by the 0 index. These attribute collections, too, are mutable, by
+using HTMLModule->attr_collections. You may add new attributes
+to a collection or define an entirely new collection for your module's
+use. Inclusions can also be cumulative.
 
-Alright: let's get to work!
+Attribute collections allow us to get rid of so called "global attributes"
+(which actually aren't so global).
 
+=== Content Models and ChildDef ===
+
+An implementation of the above-mentioned attributes and attribute
+collections was applied to the ChildDef system. HTML Purifier uses
+a proprietary system called ChildDef for performance and flexibility
+reasons, but this does not line up very well with W3C's notion of
+regexps for defining the allowed children of an element.
+
+HTMLPurifier->elements[$element]->content_model and
+HTMLPurifier->elements[$element]->content_model_type store information
+about the final ChildDef that will be stored in
+HTMLPurifier->elements[$element]->child (we use a different variable
+because the two forms are sufficiently different).
+
+$content_model is an abstract, string representation of the internal
+state of ChildDef, while $content_model_type is a string identifier
+of which ChildDef subclass to instantiate. $content_model is processed
+by substituting all content set identifiers (capitalized element names)
+with their contents. It is then parsed and passed into the appropriate
+ChildDef class, as defined by the ContentSets->getChildDef() or the
+custom fallback HTMLModule->getChildDef() for custom child definitions
+not in the core.
+
+You'll need to use these facilities if you plan on referencing a content
+set like "Inline" or "Block", and using them is recommended even if you're
+not due to their conciseness.
+
+A few notes on $content_model: it's structure can be as complicated
+as you want, but the pipe symbol (|) is reserved for defining possible
+choices, due to the content sets implementation. For example, a content
+model that looks like:
+
+"Inline -> Block -> a"
+
+...when the Inline content set is defined as "span | b" and the Block
+content set is defined as "div | blockquote", will expand into:
+
+"span | b -> div | blockquote -> a"
+
+The custom HTMLModule->getChildDef() function will need to be able to
+then feed this information to ChildDef in a usable manner.
+
+=== Content Sets ===
+
+Content sets can be altered using HTMLModule->content_sets, an associative
+array of content set names to content set contents. If the content set
+already exists, your values are appended on to it (great for, say,
+registering the font tag as an inline element), otherwise it is
+created. They are substituted into content_model.
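The content set expansion described in the document above ("Inline -> Block -> a") can be sketched with plain string substitution. This is a minimal illustration only: `expand_content_model()` is a hypothetical helper, not a function in HTML Purifier, and it assumes that no content set's contents contain the name of another set, since `str_replace` applies the replacements one after another.

```php
<?php
// Hypothetical helper for illustration; not part of HTML Purifier.
// Replaces each content set name in the model string with its contents.
function expand_content_model($content_model, $content_sets) {
    return str_replace(
        array_keys($content_sets),   // names such as 'Inline', 'Block'
        array_values($content_sets), // their pipe-separated contents
        $content_model
    );
}

// The two content sets used in the document's own example.
$content_sets = array(
    'Inline' => 'span | b',
    'Block'  => 'div | blockquote',
);

$expanded = expand_content_model('Inline -> Block -> a', $content_sets);
echo $expanded, "\n"; // prints: span | b -> div | blockquote -> a
```

The match is case-sensitive, so the capitalized set name `Block` does not collide with the lowercase element name `blockquote`.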
@@ -77,6 +77,7 @@ class HTMLPurifier_ContentSets
      * @param $module Module that defined the ElementDef
      */
     function generateChildDef(&$def, $module) {
+        if (!empty($def->child)) return; // already done!
         $content_model = $def->content_model;
         if (is_string($content_model)) {
             $def->content_model = str_replace(
@@ -95,7 +96,14 @@ class HTMLPurifier_ContentSets
      */
     function getChildDef($def, $module) {
         $value = $def->content_model;
-        if (is_object($value)) return $value; // direct object, return
+        if (is_object($value)) {
+            trigger_error(
+                'Literal object child definitions should be stored in '.
+                'ElementDef->child not ElementDef->content_model',
+                E_USER_NOTICE
+            );
+            return $value;
+        }
         switch ($def->content_model_type) {
             case 'required':
                 return new HTMLPurifier_ChildDef_Required($value);
@@ -109,8 +117,10 @@ class HTMLPurifier_ContentSets
                 return new HTMLPurifier_ChildDef_Custom($value);
         }
         // defer to its module
-        if (!$module->defines_child_def) continue; // save a func call
+        $return = false;
+        if ($module->defines_child_def) { // save a func call
         $return = $module->getChildDef($def);
+        }
         if ($return !== false) return $return;
         // error-out
         trigger_error(
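The order of checks introduced above (a literal object in ->child wins, otherwise a ChildDef is built from the string content model) can be sketched in isolation. The `Sketch_*` classes and `sketch_generate_child_def()` below are stand-ins invented for this illustration; the real ElementDef and ChildDef classes carry more state, and the real code dispatches on content_model_type rather than splitting on the pipe symbol.

```php
<?php
// Stand-in classes for illustration only; not the HTML Purifier classes.
class Sketch_ElementDef {
    public $child = null;              // literal ChildDef object, if any
    public $content_model = null;      // string description, e.g. 'span | b'
    public $content_model_type = null; // e.g. 'required'
}

class Sketch_ChildDef {
    public $elements;
    public function __construct($elements) {
        $this->elements = $elements;
    }
}

// A literal child short-circuits generation; otherwise the string
// content model is split into element names (a simplifying assumption).
function sketch_generate_child_def($def) {
    if (!empty($def->child)) return; // already done
    $names = array_map('trim', explode('|', $def->content_model));
    $def->child = new Sketch_ChildDef($names);
}

// Case 1: literal object stored directly in ->child, as the Tables module does.
$table = new Sketch_ElementDef();
$table->child = new Sketch_ChildDef(array('tr'));
$literal = $table->child;
sketch_generate_child_def($table);

// Case 2: string content model, resolved during generation.
$td = new Sketch_ElementDef();
$td->content_model = 'span | b';
$td->content_model_type = 'required';
sketch_generate_child_def($td);
```

After both calls, `$table->child` is still the object assigned by hand, while `$td->child` was built from the string.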
@@ -168,19 +168,19 @@ class HTMLPurifier_HTMLDefinition
     /**
      * Associative array of deprecated tag name to HTMLPurifier_TagTransform
      * @public
-     */
+     */ // use + operator
     var $info_tag_transform = array();
 
     /**
      * List of HTMLPurifier_AttrTransform to be performed before validation.
      * @public
-     */
+     */ // use array_merge or a foreach loop
     var $info_attr_transform_pre = array();
 
     /**
      * List of HTMLPurifier_AttrTransform to be performed after validation.
      * @public
-     */
+     */ // use array_merge or a foreach loop
     var $info_attr_transform_post = array();
 
     /**
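The comments added above distinguish two ways of combining arrays. A short sketch of why the distinction matters, using placeholder strings where the real arrays would hold transform objects (the names below are made up for illustration):

```php
<?php
// Associative array keyed by tag name: the + operator keeps the
// left-hand value on a key collision and adds keys that are missing.
$tag_transform        = array('font' => 'FontTransform');
$module_tag_transform = array('center' => 'CenterTransform', 'font' => 'OtherTransform');
$merged_tags = $tag_transform + $module_tag_transform;
// array('font' => 'FontTransform', 'center' => 'CenterTransform')

// Numerically indexed lists: both lists use index 0, so + would
// silently drop the right-hand entry, while array_merge renumbers
// the keys and keeps every entry.
$pre        = array('LangTransform');
$module_pre = array('BdoDirTransform');
$with_plus  = $pre + $module_pre;            // array('LangTransform')
$with_merge = array_merge($pre, $module_pre); // array('LangTransform', 'BdoDirTransform')
```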
@@ -58,8 +58,7 @@ class HTMLPurifier_HTMLModule_Tables extends HTMLPurifier_HTMLModule
         // Is done directly because it doesn't leverage substitution
         // mechanisms. True model is:
         // 'caption?, ( col* | colgroup* ), (( thead?, tfoot?, tbody+ ) | ( tr+ ))'
-        $this->info['table']->content_model = new HTMLPurifier_ChildDef_Table();
-        $this->info['table']->content_model_type = 'table';
+        $this->info['table']->child = new HTMLPurifier_ChildDef_Table();
 
         $this->info['td']->content_model =
         $this->info['th']->content_model = '#PCDATA | Flow';
@@ -10,7 +10,7 @@ HTMLPurifier_ConfigSchema::define(
         'irc' => true, // "Internet Relay Chat", usually needs another app
         // for Usenet, these two are similar, but distinct
         'nntp' => true, // individual Netnews articles
-        'news' => true // newsgroup or individual Netnews articles),
+        'news' => true // newsgroup or individual Netnews articles
     ), 'lookup',
     'Whitelist that defines the schemes that a URI is allowed to have. This '.
     'prevents XSS attacks from using pseudo-schemes like javascript or mocha.'