From 826a57a04ad785df07994a40dcf2ec739a903037 Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang" It makes no sense to adopt a HTML Purifier currently natively supports only a subset of HTML's
+allowed elements, attributes, and behavior. This is by design,
+but as the user is always right, they'll need some method to overload
+these behaviors. Our goals are to let the user: For basic use, the user will have to specify some basic parameters. This
+is not strictly necessary, as HTML Purifier's default setting will always
+output safe code, but is required for standards-compliant output. By default, users will use a doctype-based, permissive but secure
-whitelist. They must define a doctype, and this serves
-as the first method of determining a filterset. The first thing to select is the doctype. This
+is essential for standards-compliant output. This identifier is based
on the name the W3C has given to the document type and not
@@ -61,114 +63,106 @@ the DTD identifier. Due to legacy, the default option is XHTML 1.0 Transitional, however, we
-really shouldn't be guessing what the user's doctype is. Fortunantely,
-people who can't be bothered to set this won't be bothered when their
-pages stop validating. However, selecting this doctype doesn't mean much, because if we
-adhered exactly to the definition we would be letting XSS and other
-nasties through. HTML Purifier must, in its filterset, allow a subset
-of the doctype, which we shall call a filterset. By default, HTML Purifier will use the Rich
-filterset, which allows as many elements as possible with untrusted
-sources. Other possible filtersets could be: Extension-authors would be able to define custom filtersets for
-other users to use. A possible call to select a filterset would be: Due to historical reasons, the default doctype is XHTML 1.0
+Transitional; however, we really shouldn't be guessing what the user's
+doctype is. Fortunately, people who can't be bothered to set this won't
+be bothered when their pages stop validating. Within filtersets, there are various modes of operation.
+ Within doctypes, there are various modes of operation.
These indicate variant behaviors that, while not strictly changing the
-allowed set of elements and attributes, will definitely affect the output.
+allowed set of elements and attributes, definitely affect the output.
Currently, we have two modes, which may be used together: Deprecated elements and attributes will be transformed into
+ standards-compliant alternatives when explicitly disallowed. For example, in the XHTML 1.0 Strict doctype, a one-size-fits-all
approach to
-filtersets: therefore, users must be able to define their own sets of
-allowed
elements, as well as switch in-between doctypes of HTML.
-
-
Select
+Selecting a Doctype
-$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
-Selecting a Filterset
-
-
-
-
-$config->set('HTML', 'Filterset', 'Rich');
+Selecting Mode
-
center
- tag would be turned into a div
with the CSS property
+ center
+ element would be turned into a div
with the CSS property
text-align:center;
, but in XHTML 1.0 Transitional
- the tag would be preserved. This mode is on by default.center
tag would
- be transformed in both cases. However, tags without a
+ the element would be preserved.
This mode is on by default.
+ +Deprecated elements and attributes will be transformed into + standards-compliant alternatives whenever possible. + It may have various levels of operation.
+Referring back to the previous example, the center
element would
+ be transformed in both cases. However, elements without a
reasonable standards-compliant alternative will be preserved
- in their form. This mode is on by default. It may have
- various levels of operation.
A user may want to correct certain deprecated attributes, but
+ not others. For example, the bgcolor
attribute may be
+ acceptable, but the center
element not; also, possibly,
+ an HTML Purifier transformation may be buggy, so the user wants
+ to forgo it. Thus, correctional accepts an array defining which
+ elements and attributes to clean up, or no parameter at all, which
+ means everything gets corrected. This also means that each
+ correction needs to be given a unique ID that can be referenced
+ in this manner. (We may also allow globbing, like *.name or a.*
+ for mass-enabling correction, and subtractive mode, where things
+ specified stop correction.) This array gets passed into the
+ constructor of the mode's module.
This mode is on by default.
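The selection rule described above (an array of correction IDs, optional globbing such as *.name or a.*, and "no parameter means everything") could be sketched as follows. The function name `correctionSelected` is hypothetical and not part of HTML Purifier. Treating `*` as "any run of characters other than a dot" is also an assumption, since the text leaves the globbing syntax open.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: decides whether a
// correction ID such as "a.name" is selected by a list of patterns.
function correctionSelected(string $id, ?array $patterns): bool
{
    // No parameter at all means everything gets corrected.
    if ($patterns === null) {
        return true;
    }
    foreach ($patterns as $pattern) {
        // Quote the pattern literally, then let "*" match any run of
        // characters that does not contain a dot (assumed glob semantics).
        $regex = '/^' . str_replace('\*', '[^.]*', preg_quote($pattern, '/')) . '$/';
        if (preg_match($regex, $id) === 1) {
            return true;
        }
    }
    return false;
}
```

Subtractive mode, where listed IDs stop correction, would simply invert the result of the pattern loop.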
+A possible call to select modes would be:
$config->set('HTML', 'Mode', array('correctional', 'lenient'));-
If modes have extra parameters, a hash might work well:
+If modes have extra parameters, a hash is necessary:
$config->set('HTML', 'Mode', array( - 'correctional' => 9, // strongest level + 'correctional' => 'center,a.name', 'lenient' => true // this one's just boolean ));-
Modes may possibly be wrapped up with the filterset declaration:
+Modes may be specified along with the doctype declaration (we may want +to get a better set of separator characters):
-$config->set('HTML', 'Filterset', 'Rich: correctional, lenient');+
$config->setDoctype('XHTML 1.0 Transitional', '+correctional[center,a.name] -lenient');-
Further investigation in this field is necessary.
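As one way to picture the combined syntax, the sketch below turns a mode string such as '+correctional[center,a.name] -lenient' into the hash form shown earlier. The function `parseModeString` is hypothetical. The grammar is an assumption based on that single example: whitespace-separated tokens, a leading + or -, and optional bracketed parameters. The separator characters are explicitly still undecided.

```php
<?php
// Hypothetical parser for the proposed mode string; the grammar is an
// assumption ("+name[params]" or "-name", separated by whitespace).
function parseModeString(string $modes): array
{
    $result = [];
    foreach (preg_split('/\s+/', trim($modes), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        if (!preg_match('/^([+-])([a-z]+)(?:\[([^\]]*)\])?$/', $token, $m)) {
            throw new InvalidArgumentException("Cannot parse mode token: $token");
        }
        if ($m[1] === '-') {
            $result[$m[2]] = false;               // mode switched off
        } elseif (isset($m[3]) && $m[3] !== '') {
            $result[$m[2]] = explode(',', $m[3]); // mode on, with parameters
        } else {
            $result[$m[2]] = true;                // mode on, no parameters
        }
    }
    return $result;
}
```

Parameters that contain whitespace would need a different tokenizer. That is one reason a better set of separator characters may be wanted.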
- -With regards to the various levels of operation conjectured in the +
+With regard to the various levels of operation conjectured in the
Correctional mode, this is prompted by the fact that a user may want to
correct certain problems but not others, for example, fix the center
-tag but not the u
tag, both of which are deprecated.
+element but not the u
element, both of which are deprecated.
Having an integer level
will not work very well for such fine
grained tweaking, but an array of specific settings might.
If this cookie cutter approach doesn't appeal to a user, they may -decide to roll their own filterset by selecting modules, tags and +decide to roll their own filterset by selecting modules, elements and attributes to allow.
This would make use of the same facilities
as a filterset author would use, except that it would go under an
anonymous
filterset that would be auto-selected if any of the
-relevant module/tag/attribute selection configuration directives were
+relevant module/element/attribute selection configuration directives were
non-null.
In practice, this is the most commonly demanded feature. Most users are perfectly happy defining a filterset that looks like:
-$config->setAllowedHTML('a[href,title],em,p,blockquote');- -
We currently support a separated interface, which also must be preserved:
- -$config->set('HTML', 'AllowedTags', 'a,em,p,blockquote'); -$config->set('HTML', 'AllowedAttributes', 'a.href,a.title');+
$config->setAllowedHTML('a[href,title];em;p;blockquote');
The directive %HTML.Allowed is a convenience function that may be fully expressed with the legacy interface, and thus is given its own setter.
+We currently support a separated interface, which also must be preserved:
+ +$config->set('HTML', 'AllowedElements', 'a,em,p,blockquote'); +$config->set('HTML', 'AllowedAttributes', 'a.href,a.title');+
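To illustrate how the compact form can be expressed fully in terms of the separated interface, here is a minimal sketch that splits a string like 'a[href,title];em;p;blockquote' into the two comma-separated lists. The function `splitAllowedHTML` is hypothetical, and the accepted grammar is inferred only from the example above.

```php
<?php
// Hypothetical helper: converts the compact allowed-HTML string into the
// values expected by the separated AllowedElements / AllowedAttributes
// directives. Grammar assumed: "element[attr,attr];element;...".
function splitAllowedHTML(string $allowed): array
{
    $elements = [];
    $attributes = [];
    foreach (explode(';', $allowed) as $chunk) {
        $chunk = trim($chunk);
        if ($chunk === '') {
            continue; // tolerate a trailing semicolon
        }
        if (!preg_match('/^([A-Za-z0-9]+)(?:\[([^\]]*)\])?$/', $chunk, $m)) {
            throw new InvalidArgumentException("Cannot parse: $chunk");
        }
        $elements[] = $m[1];
        if (isset($m[2]) && $m[2] !== '') {
            foreach (explode(',', $m[2]) as $attr) {
                $attributes[] = $m[1] . '.' . trim($attr); // e.g. "a.href"
            }
        }
    }
    return [implode(',', $elements), implode(',', $attributes)];
}
```

For the example string, the two returned values are 'a,em,p,blockquote' and 'a.href,a.title', matching the separated calls.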
A user may also choose to allow modules:
$config->set('HTML', 'AllowedModules', 'Hypertext,Text,Lists'); // or @@ -178,13 +172,16 @@ $config->setAllowedHTML('Hypertext,Text,Lists');
The granularity of these modules is too coarse for
the average user (for example, the core module loads everything from
-the essential p
tag to the not-so-safe h1
-tag). How do we make this still a viable solution?
p
element to the not-so-safe h1
+element). How do we make this still a viable solution? Possible answers
+may be sub-modules or module parameters. This may not even be a problem,
+considering that most people won't be selecting modules.
-Modules are distinguished from regular tags by the
-case of their first letter. While XML distinguishes between lower and uppercase
-letters, in practice, most well-known XML languages use only lower-case
-tag names for sake of consistency.
+Modules are distinguished from regular elements by the
+case of their first letter. While XML distinguishes between and allows
+lower and uppercase letters in element names, most well-known XML
+languages use only lower-case
+element names for the sake of consistency.
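The first-letter rule could be applied as in the sketch below, which sorts a comma-separated allow-list into module names and element names. The function `partitionAllowed` is hypothetical and only demonstrates the case check.

```php
<?php
// Hypothetical helper: split a comma-separated allow-list into module names
// (leading uppercase ASCII letter) and element names (everything else).
function partitionAllowed(string $list): array
{
    $modules = [];
    $elements = [];
    foreach (explode(',', $list) as $name) {
        $name = trim($name);
        if ($name === '') {
            continue;
        }
        if (preg_match('/^[A-Z]/', $name) === 1) {
            $modules[] = $name;
        } else {
            $elements[] = $name;
        }
    }
    return ['modules' => $modules, 'elements' => $elements];
}
```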
Considering that, internally speaking, as mandated by the XHTML 1.1 Modularization specification, we have organized our @@ -202,6 +199,89 @@ for selecting a filterset. Possibility:
...which is simply a light wrapper over the individual configuration calls. A custom config file format or text format could also be adopted.
+By reviewing topic posts in the support forum, we determined that
+the two most commonly demanded customization features were
+adding an attribute to an existing element and adding a new element.
+Thus, we'll want to create convenience functions for these common
+use-cases.
+ +Note that the functions described here are only available if
+a raw copy of HTMLPurifier_HTMLDefinition
was retrieved.
+addAttribute
may work on a processed copy, but for
+consistency's sake we will mandate this for everything.
An attribute is bound to an element by a name and has a specific
+AttrDef
that validates it. Thus, the interface should
+be:
function addAttribute($element, $attribute, $attribute_def);+ +
With a use-case that looks like:
+ +$def->addAttribute('a', 'rel', new HTMLPurifier_AttrDef_Enum(array('nofollow')));+ +
The $attribute_def
value can be a little flexible,
+to make things simpler. We'll let it also be:
HTMLPurifier_AttrDef_Anonymous
+ class with that function registered as a callback.HTMLPurifier_AttrTypes
+ enum(
: We'll explode it and stuff it in an
+ HTMLPurifier_AttrDef_Enum
for you.Making the previous example written as:
+ +$def->addAttribute('a', 'rel', 'enum(nofollow)');+ +
An element requires certain information as specified by
+HTMLPurifier_ElementDef
. However, not all of it is necessary;
+the usual things required are:
This suggests an API like this:
+ +function addElement($element, $type, $content_model, $attributes = array());+ +
Each parameter explained in depth:
+ +$element
$type
$content_model
HTMLPurifier_ElementDef
's member variables
+ $content_model
and $content_model_type
,
+ where the form is Type: Model, ex. 'Optional: Inline'.
$attributes
A possible usage:
+ +$def->addElement('font', 'Inline', 'Optional: Inline', + array(0 => array('Common'), 'color' => 'Color'));+ +
We may want the Common attribute collection to be included +by default.
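Since $content_model folds two member variables into one 'Type: Model' string, addElement would need to split it again. A minimal sketch, assuming a single colon separator as in 'Optional: Inline'; the function `splitContentModel` and the returned array keys are hypothetical:

```php
<?php
// Hypothetical helper: splits the convenience form "Type: Model" into its
// two parts. The key names here are illustrative, not HTML Purifier's.
function splitContentModel(string $contentModel): array
{
    // Limit to two parts so any further colons stay inside the model.
    $parts = explode(':', $contentModel, 2);
    if (count($parts) !== 2) {
        throw new InvalidArgumentException("Expected 'Type: Model', got: $contentModel");
    }
    return ['type' => trim($parts[0]), 'model' => trim($parts[1])];
}
```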
+