Advanced API

Filed under Development
HTML Purifier End-User Documentation

HTML Purifier currently natively supports only a subset of HTML's allowed elements, attributes, and behavior; specifically, this subset is the set of elements that are safe for untrusted users to use. However, HTML Purifier is often used to ensure standards compliance for trusted input (making it a sort of Tidy substitute), and users often need to define new elements or attributes. The advanced API is designed specifically for these use-cases.

Our goals are to let the user:

Select
Customize

Select

For basic use, the user will have to specify some basic parameters. This is not strictly necessary, as HTML Purifier's default setting will always output safe code, but is required for standards-compliant output.

Selecting a Doctype

The first thing to select is the doctype. This is essential for standards-compliant output.

The doctype identifier is based on the name the W3C has given to the document type and not the DTD identifier, although the latter may be included as an alias.

This parameter is set via the configuration object:

$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');

For historical reasons, the default doctype is XHTML 1.0 Transitional; however, we really shouldn't be guessing what the user's doctype is. Fortunately, people who can't be bothered to set this won't be bothered when their pages stop validating.

Selecting Elements / Attributes / Modules

HTML Purifier will, by default, allow as many elements and attributes as possible. However, a user may decide to roll their own filterset by selecting modules, elements and attributes to allow for their own specific use-case.

The currently undocumented Filterset interface will offer a way of encapsulating the following declarations, so that a user can pick a recipe of commonly used tags.

In practice, this is the most commonly demanded feature. Most users are perfectly happy defining a filterset that looks like:

$config->setAllowedHTML('a[href,title];em;p;blockquote');

The directive %HTML.Allowed is a convenience that can be fully expressed with the legacy interface. It is therefore either given its own setter, or implemented by intercepting the set() function call, parsing the value, and assigning to the finer-grained directives accordingly.

We currently support a separated interface, which also must be preserved:

$config->set('HTML', 'AllowedElements', 'a,em,p,blockquote');
$config->set('HTML', 'AllowedAttributes', 'a.href,a.title');
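The mapping from the compact form to the separated directives can be sketched as follows. This is only an illustration under assumptions: parseAllowedHtml() is a hypothetical helper, not an HTML Purifier function, and it handles only the element form shown above (element names separated by semicolons, with an optional bracketed attribute list), not module names.

```php
<?php
// Hypothetical helper, NOT part of HTML Purifier: splits a compact
// allowed-HTML string into values for the two separated directives.
function parseAllowedHtml($spec)
{
    $elements = array();
    $attributes = array();
    foreach (explode(';', $spec) as $chunk) {
        $chunk = trim($chunk);
        if ($chunk === '') {
            continue; // tolerate empty chunks such as a trailing semicolon
        }
        $bracket = strpos($chunk, '[');
        if ($bracket === false) {
            $elements[] = $chunk; // bare element, no attributes
            continue;
        }
        // "a[href,title]" becomes element "a" plus its attribute list
        $element = trim(substr($chunk, 0, $bracket));
        $elements[] = $element;
        $list = rtrim(substr($chunk, $bracket + 1), ']');
        foreach (explode(',', $list) as $attr) {
            $attr = trim($attr);
            if ($attr !== '') {
                $attributes[] = $element . '.' . $attr;
            }
        }
    }
    return array(
        'AllowedElements'   => implode(',', $elements),
        'AllowedAttributes' => implode(',', $attributes),
    );
}
```

Given 'a[href,title];em;p;blockquote', this sketch returns 'a,em,p,blockquote' for AllowedElements and 'a.href,a.title' for AllowedAttributes, matching the separated calls above.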

A user may also choose to allow modules:

$config->set('HTML', 'AllowedModules', 'Hypertext,Text,Lists'); // or
$config->setAllowedHTML('Hypertext,Text,Lists');

But it is not expected that this feature will be widely used.

Module selection will work slightly differently from the other AllowedElements and AllowedAttributes directives by directly modifying the doctype you are operating in. You cannot, however, add modules: there is a separate interface for that.

Modules are distinguished from regular elements by the case of their first letter. While XML is case-sensitive and allows both lowercase and uppercase letters in element names, XHTML uses only lowercase element names for the sake of consistency, so a name that starts with an uppercase letter cannot be an element.
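A minimal sketch of that first-letter rule, assuming a comma-separated list of names. splitModulesAndElements() is a hypothetical name chosen for illustration and is not part of the library.

```php
<?php
// Hypothetical helper, NOT part of HTML Purifier: sorts names into modules
// (first letter uppercase) and elements (anything else).
function splitModulesAndElements($list)
{
    $modules = array();
    $elements = array();
    foreach (explode(',', $list) as $name) {
        $name = trim($name);
        if ($name === '') {
            continue; // skip empty entries
        }
        if (preg_match('/^[A-Z]/', $name) === 1) {
            $modules[] = $name;  // e.g. 'Hypertext'
        } else {
            $elements[] = $name; // e.g. 'em'
        }
    }
    return array('modules' => $modules, 'elements' => $elements);
}
```

This is what would let a single setter accept either kind of name without an extra flag.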

Selecting Tidy

The name of this segment of functionality is inspired by Dave Raggett's program HTML Tidy, which purported to help clean up HTML. In HTML Purifier, Tidy functionality involves turning unsupported and deprecated elements into standards-compliant ones, maintaining backwards compatibility, and enforcing best practices.

Tidy is optional. When on, it has several coarse levels of operation, as well as directives that can be used to fine-tune the output. The coarse levels, set at %HTML.TidyLevel, are:

Lenient
Preserve any non-standards-compliant aspects by transforming them into standards-compliant equivalents.
Correctional
Default: Be lenient and enforce good practices.
Aggressive
Be correctional and transform as many deprecated elements as possible to CSS forms.

The distinction between correctional and aggressive is fuzzy, so the user will also have %HTML.TidyAdd and %HTML.TidyRemove, in which they may list the names of transforms they want and don't want, using the broad level as a starting point. The naming convention has not been established yet, but it will be something along the lines of 'element.attribute', with globs and special cases supported.
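One way the broad level and the add/remove lists could combine is sketched below. Everything here is an assumption made for illustration: resolveTidyTransforms() is a hypothetical function, the transform names in the usage note are invented (the naming convention is not yet established), and glob matching is left out.

```php
<?php
// Hypothetical sketch, NOT part of HTML Purifier. $byLevel maps each level
// name to the transforms it introduces; levels are treated as cumulative,
// following the descriptions above.
function resolveTidyTransforms(array $byLevel, $level, array $add = array(), array $remove = array())
{
    $order = array('Lenient', 'Correctional', 'Aggressive');
    $index = array_search($level, $order, true);
    if ($index === false) {
        throw new InvalidArgumentException("Unknown Tidy level: $level");
    }
    $active = array();
    // Start from the chosen level plus every level below it.
    for ($i = 0; $i <= $index; $i++) {
        $name = $order[$i];
        if (isset($byLevel[$name])) {
            foreach ($byLevel[$name] as $transform) {
                $active[$transform] = true;
            }
        }
    }
    // Then apply the fine-tuning lists: additions first, removals last.
    foreach ($add as $transform) {
        $active[$transform] = true;
    }
    foreach ($remove as $transform) {
        unset($active[$transform]);
    }
    return array_keys($active);
}
```

For example, with invented transform names 'font' at Lenient, 'center' at Correctional and 'u' at Aggressive, selecting 'Correctional' while adding 'u' and removing 'font' yields 'center' and 'u'. Applying removals last means an explicit removal always wins, which is one possible design choice among several.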

Unified selector

Because selecting each and every one of these configuration options is a chore, we may wish to offer a specialized configuration method for selecting a filterset. Possibility:

function selectFilter($doctype, $filterset, $tidy)

...which is simply a light wrapper over the individual configuration calls. A custom config file format or text format could also be adopted.
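As a rough sketch of such a wrapper: the version below takes the configuration object as an explicit first argument, which differs from the signature proposed above, and it assumes the filterset is passed through the %HTML.Allowed directive. RecordingConfig is a stand-in written for this example so the code runs on its own; it is not the real configuration class.

```php
<?php
// Stand-in for the configuration object (an assumption for this example):
// it only records what set() was called with.
class RecordingConfig
{
    public $values = array();

    public function set($namespace, $directive, $value)
    {
        $this->values[$namespace . '.' . $directive] = $value;
    }
}

// Hypothetical wrapper: forwards each argument to its own directive.
function selectFilter($config, $doctype, $filterset, $tidy)
{
    $config->set('HTML', 'Doctype', $doctype);
    $config->set('HTML', 'Allowed', $filterset);
    $config->set('HTML', 'TidyLevel', $tidy);
}

$config = new RecordingConfig();
selectFilter($config, 'XHTML 1.0 Transitional', 'a[href,title];em;p;blockquote', 'Correctional');
```

After the call, the stand-in holds one entry each for HTML.Doctype, HTML.Allowed and HTML.TidyLevel.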

Customize

By reviewing topic posts in the support forum, we determined that two customization features were in greatest demand: adding an attribute to an existing element, and adding an element. Thus, we'll want to create convenience functions for these common use-cases.

Note that the functions described here are only available if a raw copy of HTMLPurifier_HTMLDefinition was retrieved. addAttribute may work on a processed copy, but for consistency's sake we will mandate this for everything.

Attributes

An attribute is bound to an element by a name and has a specific AttrDef that validates it. Thus, the interface should be:

function addAttribute($element, $attribute, $attribute_def);

With a use-case that looks like:

$def->addAttribute('a', 'rel', new HTMLPurifier_AttrDef_Enum(array('nofollow')));

The $attribute_def value can be a little flexible, to make things simpler. We'll let it also be:

This lets the previous example be written as:

$def->addAttribute('a', 'rel', 'enum(nofollow)');
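How a string such as 'enum(nofollow)' might be broken down before being resolved to an AttrDef is sketched below. The grammar is an assumption inferred from that single example (a type name, optionally followed by comma-separated arguments in parentheses), and parseAttrDefShorthand() is a hypothetical helper, not a library function.

```php
<?php
// Hypothetical helper, NOT part of HTML Purifier: splits a shorthand string
// into a type name and a list of arguments.
function parseAttrDefShorthand($spec)
{
    $pattern = '/^\s*([A-Za-z_][A-Za-z0-9_]*)\s*(?:\((.*)\))?\s*$/';
    if (!preg_match($pattern, $spec, $m)) {
        throw new InvalidArgumentException("Unrecognized attribute definition: $spec");
    }
    $args = array();
    // Group 2 is absent when the string has no parenthesized part.
    if (isset($m[2]) && trim($m[2]) !== '') {
        $args = array_map('trim', explode(',', $m[2]));
    }
    return array('type' => $m[1], 'args' => $args);
}
```

For 'enum(nofollow)' this returns the type 'enum' with the single argument 'nofollow'; a bare name with no parentheses returns that name with an empty argument list.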

Elements

An element requires certain information as specified by HTMLPurifier_ElementDef. However, not all of it is necessary; the usual things required are:

This suggests an API like this:

function addElement($element, $type, $contents,
    $attr_collections = array(), $attributes = array());

Each parameter explained in depth:

$element
Element name, ex. 'label'
$type
Content set to register in, ex. 'Inline' or 'Flow'
$contents
Description of allowed children. This is a merged form of HTMLPurifier_ElementDef's member variables $content_model and $content_model_type, where the form is Type: Model, ex. 'Optional: Inline'. There are also a number of predefined templates one may use.
$attr_collections
Array (or string if only one) of attribute collection(s) to merge into the attributes array.
$attributes
Array of attribute names to attribute definitions, much like the above-described attribute customization.

A possible usage:

$def->addElement('font', 'Inline', 'Optional: Inline', 'Common',
    array('color' => 'Color'));

See HTMLPurifier/HTMLModule.php for details.
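To illustrate the merged $contents form described above, here is a minimal sketch that splits a 'Type: Model' string back into the two member variables. parseContents() is a hypothetical helper, and the predefined templates mentioned above are not handled by it.

```php
<?php
// Hypothetical helper, NOT part of HTML Purifier: splits 'Type: Model' into
// the two values that HTMLPurifier_ElementDef stores separately.
function parseContents($contents)
{
    // Limit of 2 so that any colon inside the model itself is preserved.
    $parts = explode(':', $contents, 2);
    if (count($parts) !== 2) {
        throw new InvalidArgumentException("Expected 'Type: Model', got: $contents");
    }
    return array(
        'content_model_type' => trim($parts[0]),
        'content_model'      => trim($parts[1]),
    );
}
```

For 'Optional: Inline' this returns 'Optional' as the content model type and 'Inline' as the content model.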
