Advanced API

HTML Purifier End-User Documentation

HTML Purifier natively supports only a subset of HTML's allowed elements, attributes, and behavior. This is by design, but as the user is always right, they need some way to override these behaviors.

Our goals are to let the user:

Select
Customize
Internals

Select

For basic use, the user only needs to specify a few parameters. This is not strictly necessary, since HTML Purifier's default settings always produce safe output, but it is required for standards-compliant output.

Selecting a Doctype

The first thing to select is the doctype. This is essential for standards-compliant output.

The doctype is identified by the name the W3C has given to the document type, not by the DTD identifier.

This parameter is set via the configuration object:

$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');

For historical reasons, the default doctype is XHTML 1.0 Transitional; however, we really shouldn't be guessing what the user's doctype is. Fortunately, people who can't be bothered to set this won't be bothered when their pages stop validating.

Selecting Mode

Within doctypes, there are various modes of operation. These indicate variant behaviors that, while not strictly changing the allowed set of elements and attributes, definitely affect the output. Currently, we have two modes, which may be used together:

Lenient

Deprecated elements and attributes will be transformed into standards-compliant alternatives when explicitly disallowed.

For example, in the XHTML 1.0 Strict doctype, a center element would be turned into a div with the CSS property text-align:center;, but in XHTML 1.0 Transitional the element would be preserved.

This mode is on by default.

Correctional[items to correct]

Deprecated elements and attributes will be transformed into standards-compliant alternatives whenever possible. It may have various levels of operation.

Referring back to the previous example, the center element would be transformed in both cases. However, elements without a reasonable standards-compliant alternative will be preserved in their form.

A user may want to correct certain deprecated attributes, but not others. For example, the bgcolor attribute may be acceptable while the center element is not; or an HTML Purifier transformation may be buggy, so the user wants to forgo it. Thus, correctional accepts an array defining which elements and attributes to clean up, or no parameter at all, which means everything gets corrected. This also means that each correction needs a unique ID that can be referenced in this manner. (We may also allow globbing, like *.name or a.* for mass-enabling correction, and a subtractive mode, where the things specified are exempted from correction.) This array gets passed into the constructor of the mode's module.

This mode is on by default.
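
The interaction between the doctype and the two modes can be sketched as a small decision function. This is a minimal, self-contained illustration rather than HTML Purifier's implementation: resolve_deprecated() and its parameters are hypothetical, and the 'remove' outcome for a disallowed element with no applicable transformation is an assumption.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: decides what happens to a
// deprecated element under the lenient and correctional modes.
function resolve_deprecated(
    bool $allowedByDoctype, // e.g. center: true in Transitional, false in Strict
    bool $hasAlternative,   // is a standards-compliant replacement known?
    bool $lenient,
    bool $correctional
): string {
    if ($hasAlternative && $correctional) {
        return 'transform'; // correctional: transform whenever possible
    }
    if ($allowedByDoctype) {
        return 'preserve';  // still legal in this doctype, leave it alone
    }
    if ($hasAlternative && $lenient) {
        return 'transform'; // lenient: transform only when explicitly disallowed
    }
    return 'remove';        // assumption: disallowed with no fallback is dropped
}

// The center element with only lenient mode enabled:
echo resolve_deprecated(false, true, true, false), "\n"; // XHTML 1.0 Strict: transform
echo resolve_deprecated(true, true, true, false), "\n";  // XHTML 1.0 Transitional: preserve
```

With correctional mode enabled, the same element is transformed under both doctypes, which matches the center example given for each mode.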

A possible call to select modes would be:

$config->set('HTML', 'Mode', array('correctional', 'lenient'));

If modes have extra parameters, a hash is necessary:

$config->set('HTML', 'Mode', array(
    'correctional' => 'center,a.name',
    'lenient' => true // this one's just boolean
));

Modes may be specified along with the doctype declaration (we may want to get a better set of separator characters):

$config->setDoctype('XHTML 1.0 Transitional', '+correctional[center,a.name] -lenient');
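
To make the proposed mode string concrete, here is a minimal sketch of how it might be turned into the hash form shown earlier. The function parse_mode_string() is hypothetical and not part of HTML Purifier; it assumes a leading - disables a mode, a leading + (or no sign) enables it, and bracketed parameters are comma-separated. It returns parameters as an array rather than a comma-separated string.

```php
<?php
// Hypothetical helper, not part of HTML Purifier.
function parse_mode_string(string $spec): array
{
    $modes = [];
    // Each token: optional sign, mode name, optional [comma,separated,params].
    preg_match_all('/([+-]?)([a-z]+)(?:\[([^\]]*)\])?/i', $spec, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $name = strtolower($m[2]);
        if ($m[1] === '-') {
            $modes[$name] = false;               // explicitly switched off
        } elseif (isset($m[3]) && $m[3] !== '') {
            $modes[$name] = array_map('trim', explode(',', $m[3]));
        } else {
            $modes[$name] = true;                // enabled without parameters
        }
    }
    return $modes;
}

$modes = parse_mode_string('+correctional[center,a.name] -lenient');
// $modes is array('correctional' => array('center', 'a.name'), 'lenient' => false)
```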

With regard to the various levels of operation conjectured for the Correctional mode: this is prompted by the fact that a user may want to correct certain problems but not others, for example, fixing the center element but not the u element, both of which are deprecated. An integer level would not work well for such fine-grained tweaking, but an array of specific settings might.

Selecting Elements / Attributes / Modules

If this cookie cutter approach doesn't appeal to a user, they may decide to roll their own filterset by selecting modules, elements and attributes to allow.

This would make use of the same facilities a filterset author would use, except that it would go under an anonymous filterset that would be auto-selected if any of the relevant module, element, or attribute selection configuration directives were non-null.

In practice, this is the most commonly demanded feature. Most users are perfectly happy defining a filterset that looks like:

$config->setAllowedHTML('a[href,title];em;p;blockquote');

The directive %HTML.Allowed is a convenience function that may be fully expressed with the legacy interface, and thus is given its own setter.

We currently support a separated interface, which also must be preserved:

$config->set('HTML', 'AllowedElements', 'a,em,p,blockquote');
$config->set('HTML', 'AllowedAttributes', 'a.href,a.title');
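
The relationship between the combined form and the separated interface can be illustrated with a small converter. This is only a sketch under assumptions: parse_allowed_html() is a hypothetical name, not a library function, and it handles just the element[attr,attr];element syntax shown above.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: splits the combined
// 'element[attr,...];element' form into the two separated lists.
function parse_allowed_html(string $spec): array
{
    $elements = [];
    $attributes = [];
    foreach (explode(';', $spec) as $chunk) {
        $chunk = trim($chunk);
        if ($chunk === '') {
            continue;
        }
        if (preg_match('/^([^\[]+)\[([^\]]*)\]$/', $chunk, $m)) {
            $element = trim($m[1]);
            foreach (explode(',', $m[2]) as $attr) {
                $attr = trim($attr);
                if ($attr !== '') {
                    $attributes[] = "{$element}.{$attr}"; // e.g. a.href
                }
            }
        } else {
            $element = $chunk; // element with no attributes listed
        }
        $elements[] = $element;
    }
    return [implode(',', $elements), implode(',', $attributes)];
}

list($elements, $attributes) = parse_allowed_html('a[href,title];em;p;blockquote');
// $elements is 'a,em,p,blockquote' and $attributes is 'a.href,a.title'
```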

A user may also choose to allow modules:

$config->set('HTML', 'AllowedModules', 'Hypertext,Text,Lists'); // or
$config->setAllowedHTML('Hypertext,Text,Lists');

But it is not expected that this feature will be widely used.

The granularity of these modules is too coarse for the average user (for example, the core module loads everything from the essential p element to the not-so-safe h1 element). How do we make this still a viable solution? Possible answers may be sub-modules or module parameters. This may not even be a problem, considering that most people won't be selecting modules.

Modules are distinguished from regular elements by the case of their first letter: module names begin with an uppercase letter, element names with a lowercase one. While XML is case-sensitive and allows both lowercase and uppercase letters in element names, most well-known XML languages use only lowercase element names for the sake of consistency.
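
A minimal sketch of that first-letter rule follows, assuming a plain comma-separated list of names. The function split_modules_and_elements() is hypothetical and only illustrates the distinction; it is not how HTML Purifier resolves names internally.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: names starting with an
// uppercase ASCII letter are treated as modules, everything else as elements.
function split_modules_and_elements(string $list): array
{
    $result = ['modules' => [], 'elements' => []];
    foreach (explode(',', $list) as $name) {
        $name = trim($name);
        if ($name === '') {
            continue;
        }
        $key = preg_match('/^[A-Z]/', $name) ? 'modules' : 'elements';
        $result[$key][] = $name;
    }
    return $result;
}

$selection = split_modules_and_elements('Hypertext,Text,Lists,a,em');
// $selection['modules'] is array('Hypertext', 'Text', 'Lists')
// $selection['elements'] is array('a', 'em')
```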

Considering that, internally speaking, as mandated by the XHTML 1.1 Modularization specification, we have organized our elements around modules, considerable gymnastics will be needed to get this sort of functionality working.

Unified selector

Because selecting each and every one of these configuration options is a chore, we may wish to offer a specialized configuration method for selecting a filterset. Possibility:

function selectFilter($doctype, $filterset, $mode)

...which is simply a light wrapper over the individual configuration calls. A custom config file format or text format could also be adopted.
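
As a rough illustration of such a wrapper, the sketch below forwards its arguments to individual set() calls. Several things here are assumptions made so the example runs on its own: SketchConfig is a stand-in that merely records values and is not HTML Purifier's configuration class, the extra $config parameter is added for self-containment, the 'Filterset' directive name is a guess, and 'example' is a placeholder filterset name.

```php
<?php
// Stand-in for the configuration object so the sketch is runnable by itself.
// It only records what was set; it is NOT HTML Purifier's real class.
class SketchConfig
{
    public $values = [];

    public function set($namespace, $directive, $value)
    {
        $this->values["{$namespace}.{$directive}"] = $value;
    }
}

// Hypothetical wrapper over the individual configuration calls.
function selectFilter($config, $doctype, $filterset, $mode)
{
    $config->set('HTML', 'Doctype', $doctype);
    $config->set('HTML', 'Filterset', $filterset); // assumed directive name
    $config->set('HTML', 'Mode', $mode);
    return $config;
}

$config = selectFilter(new SketchConfig(), 'XHTML 1.0 Strict', 'example', array('lenient'));
```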

Customize

By reviewing topic posts in the support forum, we determined that two customization features were in primary demand: adding an attribute to an existing element, and adding an element. Thus, we'll want to create convenience functions for these common use-cases.

Note that the functions described here are only available if a raw copy of HTMLPurifier_HTMLDefinition was retrieved. addAttribute may work on a processed copy, but for consistency's sake we will mandate this for everything.

Attributes

An attribute is bound to an element by a name and has a specific AttrDef that validates it. Thus, the interface should be:

function addAttribute($element, $attribute, $attribute_def);

With a use-case that looks like:

$def->addAttribute('a', 'rel', new HTMLPurifier_AttrDef_Enum(array('nofollow')));

The $attribute_def value can be a little flexible, to make things simpler. We'll also let it be given in a shorthand string form, so that the previous example could be written as:

$def->addAttribute('a', 'rel', 'enum(nofollow)');
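
One way such a string could be interpreted is to split it into a type name and a parameter list before looking up the matching AttrDef. The sketch below shows only that splitting step; parse_attr_shorthand() is a hypothetical helper, and the type(param,param) grammar is an assumption inferred from the single example above.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: splits 'type(p1,p2)'
// into a type name and a list of parameters.
function parse_attr_shorthand(string $spec): array
{
    if (!preg_match('/^\s*([A-Za-z]\w*)\s*(?:\((.*)\))?\s*$/', $spec, $m)) {
        throw new InvalidArgumentException("Unrecognized attribute shorthand: {$spec}");
    }
    $params = [];
    if (isset($m[2]) && trim($m[2]) !== '') {
        $params = array_map('trim', explode(',', $m[2]));
    }
    return ['type' => $m[1], 'params' => $params];
}

$parsed = parse_attr_shorthand('enum(nofollow)');
// $parsed is array('type' => 'enum', 'params' => array('nofollow'))
```

A type given without parentheses, such as 'Color', yields an empty parameter list.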

Elements

An element requires certain information as specified by HTMLPurifier_ElementDef. However, not all of it is necessary; usually only the element's name, the content set it registers in, its content model, and its attributes are required.

This suggests an API like this:

function addElement($element, $type, $content_model, $attributes = array());

Each parameter explained in depth:

$element
Element name, e.g. 'label'
$type
Content set to register in, e.g. 'Inline' or 'Flow'
$content_model
Description of allowed children. This is a merged form of HTMLPurifier_ElementDef's member variables $content_model and $content_model_type, where the form is Type: Model, e.g. 'Optional: Inline'.
$attributes
Array of attribute names to attribute definitions, much like the above-described attribute customization.

A possible usage:

$def->addElement('font', 'Inline', 'Optional: Inline',
    array(0 => array('Common'), 'color' => 'Color'));

We may want the Common attribute collection to be included by default.
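
The merged Type: Model form of $content_model described above has to be split back into its two parts at some point. A minimal sketch of that step follows; parse_content_model() is a hypothetical name, and the result keys simply mirror the two member variables mentioned earlier.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: splits 'Type: Model'
// at the first colon into its two components.
function parse_content_model(string $spec): array
{
    $parts = explode(':', $spec, 2);
    if (count($parts) !== 2) {
        throw new InvalidArgumentException("Expected 'Type: Model', got: {$spec}");
    }
    return [
        'content_model_type' => trim($parts[0]),
        'content_model'      => trim($parts[1]),
    ];
}

$model = parse_content_model('Optional: Inline');
// $model is array('content_model_type' => 'Optional', 'content_model' => 'Inline')
```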
