HTML Purifier natively supports only a subset of HTML's allowed elements, attributes, and behavior. This is by design, but as the user is always right, they'll need some way to override these behaviors.
Our goals are to let the user:
For basic use, the user will have to specify some basic parameters. This is not strictly necessary, as HTML Purifier's default setting will always output safe code, but is required for standards-compliant output.
The first thing to select is the doctype. This is essential for standards-compliant output.
This identifier is based on the name the W3C has given to the document type and not the DTD identifier.
This parameter is set via the configuration object:
$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
For historical reasons, the default doctype is XHTML 1.0 Transitional. However, we really shouldn't be guessing what the user's doctype is. Fortunately, people who can't be bothered to set this won't be bothered when their pages stop validating.
Within doctypes, there are various modes of operation. These indicate variant behaviors that, while not strictly changing the allowed set of elements and attributes, definitely affect the output. Currently, we have two modes, which may be used together:
Lenient: Deprecated elements and attributes will be transformed into standards-compliant alternatives when explicitly disallowed.
For example, in the XHTML 1.0 Strict doctype, a center element would be turned into a div with the CSS property text-align:center;, but in XHTML 1.0 Transitional the element would be preserved.
This mode is on by default.
Correctional: Deprecated elements and attributes will be transformed into standards-compliant alternatives whenever possible. It may have various levels of operation.
Referring back to the previous example, the center element would be transformed in both cases. However, elements without a reasonable standards-compliant alternative will be preserved in their form.
A user may want to correct certain deprecated attributes, but not others. For example, the bgcolor attribute may be acceptable, but the center element not. It is also possible that an HTML Purifier transformation is buggy, so the user wants to forgo it. Thus, correctional accepts an array defining which elements and attributes to clean up, or no parameter at all, which means everything gets corrected. This also means that each correction needs to be given a unique ID that can be referenced in this manner. (We may also allow globbing, like *.name or a.* for mass-enabling correction, and a subtractive mode, where the things specified are excluded from correction.) This array gets passed into the constructor of the mode's module.
This mode is on by default.
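The selection rule described for correctional (an array of correction IDs, optional globbing, or no parameter meaning "correct everything") could be sketched as follows. This is only an illustration: the function name isCorrectionEnabled is hypothetical and not part of HTML Purifier, and it assumes shell-style globbing via PHP's built-in fnmatch().

```php
<?php
// Hypothetical helper, not an HTML Purifier API.
// Decides whether the correction with the given unique ID should run.
// $patterns === null models "no parameter at all": everything is corrected.
function isCorrectionEnabled(string $id, ?array $patterns): bool
{
    if ($patterns === null) {
        return true;
    }
    foreach ($patterns as $pattern) {
        // fnmatch() performs shell-style matching, so 'a.*' matches 'a.name'
        // and '*.name' matches any element's name attribute correction.
        if (fnmatch($pattern, $id)) {
            return true;
        }
    }
    return false;
}

var_dump(isCorrectionEnabled('a.name', array('center', '*.name')));
```

A subtractive mode would simply invert the result of the loop; it is left out here because the text only mentions it as a possibility.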
A possible call to select modes would be:
$config->set('HTML', 'Mode', array('correctional', 'lenient'));
If modes have extra parameters, a hash is necessary:
$config->set('HTML', 'Mode', array(
    'correctional' => 'center,a.name',
    'lenient' => true // this one's just boolean
));
Modes may be specified along with the doctype declaration (we may want to get a better set of separator characters):
$config->setDoctype('XHTML 1.0 Transitional', '+correctional[center,a.name] -lenient');
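To make the proposed mode string concrete, here is a minimal sketch of how it might be turned into the hash form shown earlier. The function name parseModeString is hypothetical, and the grammar (whitespace-separated tokens, a leading + or -, an optional bracketed comma list) is an assumption read off the single example above; the text itself notes that the separator characters may still change.

```php
<?php
// Hypothetical parser, not an HTML Purifier API.
// Turns '+correctional[center,a.name] -lenient' into
// array('correctional' => array('center', 'a.name'), 'lenient' => false).
function parseModeString(string $modes): array
{
    $result = array();
    $tokens = preg_split('/\s+/', trim($modes), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        // sign, mode name, then an optional [comma,separated] parameter list
        if (!preg_match('/^([+-])([a-z]+)(?:\[([^\]]*)\])?$/', $token, $m)) {
            throw new InvalidArgumentException("Unrecognized mode token: $token");
        }
        if ($m[1] === '-') {
            $result[$m[2]] = false;                 // mode switched off
        } elseif (isset($m[3]) && $m[3] !== '') {
            $result[$m[2]] = explode(',', $m[3]);   // mode with parameters
        } else {
            $result[$m[2]] = true;                  // plain boolean mode
        }
    }
    return $result;
}

print_r(parseModeString('+correctional[center,a.name] -lenient'));
```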
With regard to the various levels of operation conjectured in the Correctional mode, this is prompted by the fact that a user may want to correct certain problems but not others, for example, fix the center element but not the u element, both of which are deprecated. Having an integer level will not work very well for such fine-grained tweaking, but an array of specific settings might.
If this cookie cutter approach doesn't appeal to a user, they may decide to roll their own filterset by selecting modules, elements and attributes to allow.
This would make use of the same facilities as a filterset author would use, except that it would go under an anonymous filterset that would be auto-selected if any of the relevant module/element/attribute selection configuration directives were non-null.
In practice, this is the most commonly demanded feature. Most users are perfectly happy defining a filterset that looks like:
$config->setAllowedHTML('a[href,title];em;p;blockquote');
The directive %HTML.Allowed is a convenience function that may be fully expressed with the legacy interface, and thus is given its own setter.
We currently support a separated interface, which also must be preserved:
$config->set('HTML', 'AllowedElements', 'a,em,p,blockquote');
$config->set('HTML', 'AllowedAttributes', 'a.href,a.title');
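The relationship between the combined string and the separated interface can be illustrated with a small sketch. The function name splitAllowedHtml is hypothetical, and the syntax it accepts (semicolon-separated elements, each with an optional bracketed attribute list) is inferred only from the example string above, not from an actual HTML Purifier parser.

```php
<?php
// Hypothetical helper, not an HTML Purifier API.
// Splits 'a[href,title];em;p;blockquote' into the two strings used by
// the separated interface: an element list and an element.attribute list.
function splitAllowedHtml(string $allowed): array
{
    $elements = array();
    $attributes = array();
    foreach (explode(';', $allowed) as $chunk) {
        $chunk = trim($chunk);
        if ($chunk === '') {
            continue;
        }
        // 'a[href,title]' -> element 'a', attribute list 'href,title'
        if (!preg_match('/^([^\[\]]+)(?:\[([^\]]*)\])?$/', $chunk, $m)) {
            throw new InvalidArgumentException("Cannot parse: $chunk");
        }
        $elements[] = $m[1];
        if (isset($m[2]) && $m[2] !== '') {
            foreach (explode(',', $m[2]) as $attr) {
                $attributes[] = $m[1] . '.' . trim($attr);
            }
        }
    }
    return array(implode(',', $elements), implode(',', $attributes));
}

list($allowedElements, $allowedAttributes) = splitAllowedHtml('a[href,title];em;p;blockquote');
echo $allowedElements, "\n", $allowedAttributes, "\n";
```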
A user may also choose to allow modules:
$config->set('HTML', 'AllowedModules', 'Hypertext,Text,Lists');
// or
$config->setAllowedHTML('Hypertext,Text,Lists');
But it is not expected that this feature will be widely used.
The granularity of these modules is too coarse for the average user (for example, the core module loads everything from the essential p element to the not-so-safe h1 element). How do we make this still a viable solution? Possible answers may be sub-modules or module parameters. This may not even be a problem, considering that most people won't be selecting modules.
Modules are distinguished from regular elements by the case of their first letter. While XML is case-sensitive and allows both lowercase and uppercase letters in element names, most well-known XML languages use only lowercase element names for the sake of consistency.
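The first-letter rule can be sketched as below. The function name partitionAllowedNames is hypothetical and the comma-separated input format is an assumption based on the examples above; the sketch only shows the case test itself.

```php
<?php
// Hypothetical helper, not an HTML Purifier API.
// Separates module names (uppercase first letter) from element names
// (lowercase first letter) in a comma-separated list.
function partitionAllowedNames(string $list): array
{
    $modules = array();
    $elements = array();
    foreach (explode(',', $list) as $name) {
        $name = trim($name);
        if ($name === '') {
            continue;
        }
        if (ctype_upper($name[0])) {
            $modules[] = $name;
        } else {
            $elements[] = $name;
        }
    }
    return array('modules' => $modules, 'elements' => $elements);
}

print_r(partitionAllowedNames('Hypertext,em,Lists,p'));
```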
Considering that, internally speaking, as mandated by the XHTML 1.1 Modularization specification, we have organized our elements around modules, considerable gymnastics will be needed to get this sort of functionality working.
Because selecting each and every one of these configuration options is a chore, we may wish to offer a specialized configuration method for selecting a filterset. Possibility:
function selectFilter($doctype, $filterset, $mode)
...which is simply a light wrapper over the individual configuration calls. A custom config file format or text format could also be adopted.
By reviewing topic posts in the support forum, we determined that people primarily wanted two customization features: adding an attribute to an existing element, and adding an element. Thus, we'll want to create convenience functions for these common use cases.
Note that the functions described here are only available if a raw copy of HTMLPurifier_HTMLDefinition was retrieved. addAttribute may work on a processed copy, but for consistency's sake we will mandate this for everything.
An attribute is bound to an element by a name and has a specific AttrDef that validates it. Thus, the interface should be:
function addAttribute($element, $attribute, $attribute_def);
With a use-case that looks like:
$def->addAttribute('a', 'rel', new HTMLPurifier_AttrDef_Enum(array('nofollow')));
The $attribute_def value can be a little flexible, to make things simpler. We'll let it also be:
A function name: we'll create an HTMLPurifier_AttrDef_Anonymous class with that function registered as a callback.
A string attribute type: we'll resolve it through HTMLPurifier_AttrTypes.
A string starting with enum(: we'll explode it and stuff it in an HTMLPurifier_AttrDef_Enum for you.
This lets the previous example be written as:
$def->addAttribute('a', 'rel', 'enum(nofollow)');
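The enum( shorthand could be unpacked roughly as follows. The function name parseEnumShorthand is hypothetical, and the comma separator is an assumption: the text only says the string is exploded. The resulting array is the kind of value list passed to HTMLPurifier_AttrDef_Enum in the earlier example.

```php
<?php
// Hypothetical helper, not an HTML Purifier API.
// Returns the list of allowed values for a string like 'enum(a,b)',
// or null when the string does not use the enum( shorthand.
function parseEnumShorthand(string $spec): ?array
{
    if (strncmp($spec, 'enum(', 5) !== 0 || substr($spec, -1) !== ')') {
        return null;
    }
    $inner = substr($spec, 5, -1);   // text between 'enum(' and ')'
    if (trim($inner) === '') {
        return array();
    }
    // Assumption: values are comma-separated.
    return array_map('trim', explode(',', $inner));
}

print_r(parseEnumShorthand('enum(nofollow)'));
```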
An element requires certain information as specified by HTMLPurifier_ElementDef. However, not all of it is necessary; the things usually required are:
This suggests an API like this:
function addElement($element, $type, $content_model, $attributes = array());
Each parameter explained in depth:
$element
$type
$content_model: corresponds to HTMLPurifier_ElementDef's member variables $content_model and $content_model_type, where the form is Type: Model, e.g. 'Optional: Inline'.
$attributes
A possible usage:
$def->addElement('font', 'Inline', 'Optional: Inline', array(0 => array('Common'), 'color' => 'Color'));
We may want the Common attribute collection to be included by default.
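The 'Type: Model' form of $content_model described above could be split into its two parts along these lines. The function name splitContentModel is hypothetical; the array keys mirror the two HTMLPurifier_ElementDef member variables named in the text, and how the real implementation normalizes the values is not assumed here.

```php
<?php
// Hypothetical helper, not an HTML Purifier API.
// Splits 'Optional: Inline' into the content model type and the model.
function splitContentModel(string $spec): array
{
    // Limit of 2 so that any further colons stay inside the model part.
    $parts = explode(':', $spec, 2);
    if (count($parts) !== 2) {
        throw new InvalidArgumentException("Expected 'Type: Model', got: $spec");
    }
    return array(
        'content_model_type' => trim($parts[0]),
        'content_model'      => trim($parts[1]),
    );
}

print_r(splitContentModel('Optional: Inline'));
```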