It makes no sense to adopt a one-size-fits-all approach to filtersets. Users must therefore be able to define their own sets of allowed elements, as well as switch between HTML doctypes.
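Our goals are to let the user select a doctype, a filterset, and modes of operation, and to allow specific modules, tags, and attributes.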
By default, users will use a doctype-based, permissive but secure whitelist. They must define a doctype, and this serves as the first method of determining a filterset.
This identifier is based on the name the W3C has given to the document type and not the DTD identifier.
This parameter is set via the configuration object:
$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
For legacy reasons, the default option is XHTML 1.0 Transitional. However, we really shouldn't be guessing what the user's doctype is. Fortunately, people who can't be bothered to set this won't be bothered when their pages stop validating.
However, selecting this doctype doesn't mean much, because if we adhered exactly to its definition we would be letting XSS and other nasties through. HTML Purifier must therefore allow only a subset of the doctype, which we shall call a filterset.
By default, HTML Purifier will use the Rich filterset, which allows as many elements as possible for untrusted sources. Other filtersets could be defined as well.
Extension authors would be able to define custom filtersets for other users to use.
A possible call to select a filterset would be:
$config->set('HTML', 'Filterset', 'Rich');
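To make the relationship between doctype and filterset concrete, here is a minimal sketch in which the effective element set is the intersection of the two. The element lists are heavily abbreviated and purely illustrative, and resolveAllowedElements() is a hypothetical helper, not part of HTML Purifier.

```php
<?php
// Illustrative, abbreviated element lists (assumptions, not real definitions).
$doctypeElements = array(
    'XHTML 1.0 Transitional' => array('a', 'p', 'em', 'center', 'script', 'blockquote'),
    'XHTML 1.0 Strict'       => array('a', 'p', 'em', 'script', 'blockquote'),
);
$filtersetElements = array(
    'Rich' => array('a', 'p', 'em', 'center', 'blockquote'),
);

/**
 * Hypothetical helper: a filterset can only ever allow a subset of the
 * doctype, so the result is the intersection of both lists.
 */
function resolveAllowedElements(array $doctypes, array $filtersets, $doctype, $filterset)
{
    if (!isset($doctypes[$doctype])) {
        throw new InvalidArgumentException("Unknown doctype: $doctype");
    }
    if (!isset($filtersets[$filterset])) {
        throw new InvalidArgumentException("Unknown filterset: $filterset");
    }
    // array_values() reindexes the result, since array_intersect() keeps keys.
    return array_values(array_intersect($filtersets[$filterset], $doctypes[$doctype]));
}
```

Under these assumed lists, script never survives because the filterset omits it, and center only survives when the doctype also permits it.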
Within filtersets, there are various modes of operation. These indicate variant behaviors that, while not strictly changing the allowed set of elements and attributes, will definitely affect the output. Currently, we have two modes, which may be used together:
Lenient: in a doctype that does not allow it, a center tag would be turned into a div with the CSS property text-align:center;, but in XHTML 1.0 Transitional the tag would be preserved. This mode is on by default.
Correctional: a center tag would be transformed in both cases. However, tags without a reasonable standards-compliant alternative will be preserved in their form. This mode is on by default. It may have various levels of operation.
A possible call to select modes would be:
$config->set('HTML', 'Mode', array('correctional', 'lenient'));
If modes have extra parameters, a hash might work well:
$config->set('HTML', 'Mode', array(
    'correctional' => 9, // strongest level
    'lenient' => true // this one's just boolean
));
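Since both a plain list and a hash are being considered for the mode directive, a consumer would probably want to normalize them to one shape. The following is a minimal sketch of that idea; normalizeModes() is a hypothetical helper, and treating a bare mode name as `true` is an assumption.

```php
<?php
/**
 * Hypothetical helper: accepts array('correctional', 'lenient') or
 * array('correctional' => 9, 'lenient' => true) and returns a hash
 * of mode name => parameter in both cases.
 */
function normalizeModes(array $modes)
{
    $normalized = array();
    foreach ($modes as $key => $value) {
        if (is_int($key)) {
            // List form: the value is the mode name, with no extra parameter.
            $normalized[$value] = true;
        } else {
            // Hash form: the key is the mode name, the value its parameter.
            $normalized[$key] = $value;
        }
    }
    return $normalized;
}
```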
Modes may possibly be wrapped up with the filterset declaration:
$config->set('HTML', 'Filterset', 'Rich: correctional, lenient');
Further investigation in this field is necessary.
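If the combined declaration were adopted, it would need to be split back into a filterset name and a list of modes. A minimal sketch, assuming the colon and comma syntax shown above; parseFiltersetDeclaration() and the returned key names are hypothetical.

```php
<?php
/**
 * Hypothetical helper: splits 'Rich: correctional, lenient' into a
 * filterset name and a list of mode names.
 */
function parseFiltersetDeclaration($declaration)
{
    // Split on the first colon only; anything after it lists the modes.
    $parts = explode(':', $declaration, 2);
    $filterset = trim($parts[0]);
    $modes = array();
    if (isset($parts[1])) {
        foreach (explode(',', $parts[1]) as $mode) {
            $mode = trim($mode);
            if ($mode !== '') {
                $modes[] = $mode;
            }
        }
    }
    return array('filterset' => $filterset, 'modes' => $modes);
}
```

A declaration without a colon, such as 'Rich', simply yields an empty mode list, which leaves the question of default modes to the caller.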
With regard to the various levels of operation conjectured for the Correctional mode: this is prompted by the fact that a user may want to correct certain problems but not others, for example, fix the center tag but not the u tag, both of which are deprecated. An integer level will not work very well for such fine-grained tweaking, but an array of specific settings might.
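One way such an array of specific settings could look is a set of per-tag switches. This is only a sketch: the key names and the shouldCorrect() helper are assumptions made for illustration.

```php
<?php
// Hypothetical per-fix switches replacing a single integer level.
$correctional = array(
    'center' => true,  // rewrite the deprecated center tag
    'u'      => false, // leave the deprecated u tag alone
);

/**
 * Hypothetical helper: a tag is corrected only when explicitly enabled.
 * Tags that are not listed are left untouched.
 */
function shouldCorrect(array $settings, $tag)
{
    return isset($settings[$tag]) && $settings[$tag] === true;
}
```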
If this cookie-cutter approach doesn't appeal to a user, they may decide to roll their own filterset by selecting modules, tags and attributes to allow. This would make use of the same facilities that a filterset author would use, except that the selection would go under an anonymous filterset, which would be auto-selected if any of the relevant module/tag/attribute selection configuration directives were non-null.
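The auto-selection rule can be sketched as a simple check over the selection directives. The directive names below follow the ones used elsewhere in this document; the 'anonymous' marker, the flat array of directives, and chooseFilterset() itself are assumptions.

```php
<?php
/**
 * Hypothetical helper: returns 'anonymous' as soon as any selection
 * directive is non-null, otherwise the configured (or default) filterset.
 */
function chooseFilterset(array $directives)
{
    $selectionDirectives = array('AllowedModules', 'AllowedTags', 'AllowedAttributes');
    foreach ($selectionDirectives as $name) {
        // isset() is false for both missing keys and null values.
        if (isset($directives[$name])) {
            return 'anonymous';
        }
    }
    return isset($directives['Filterset']) ? $directives['Filterset'] : 'Rich';
}
```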
In practice, this is the most commonly demanded feature. Most users are perfectly happy defining a filterset that looks like:
$config->setAllowedHTML('a[href,title],em,p,blockquote');
We currently support a separated interface, which also must be preserved:
$config->set('HTML', 'AllowedTags', 'a,em,p,blockquote');
$config->set('HTML', 'AllowedAttributes', 'a.href,a.title');
The directive %HTML.Allowed is a convenience function that may be fully expressed with the legacy interface, and thus is given its own setter.
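To illustrate how the combined form maps onto the separated interface, here is a minimal sketch that expands a string like 'a[href,title],em,p,blockquote' into the two legacy values. parseAllowedHTML() is a hypothetical helper; it assumes well-formed input and does no validation.

```php
<?php
/**
 * Hypothetical helper: expands 'a[href,title],em,p,blockquote' into the
 * comma-separated values used by AllowedTags and AllowedAttributes.
 */
function parseAllowedHTML($spec)
{
    $tags = array();
    $attributes = array();
    // Each match is a tag name, optionally followed by [attr,attr,...].
    preg_match_all('/([A-Za-z0-9]+)(?:\[([^\]]*)\])?/', $spec, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $tags[] = $match[1];
        if (isset($match[2]) && $match[2] !== '') {
            foreach (explode(',', $match[2]) as $attribute) {
                $attribute = trim($attribute);
                if ($attribute !== '') {
                    // The legacy interface names attributes as tag.attribute.
                    $attributes[] = $match[1] . '.' . $attribute;
                }
            }
        }
    }
    return array(
        'AllowedTags'       => implode(',', $tags),
        'AllowedAttributes' => implode(',', $attributes),
    );
}
```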
A user may also choose to allow modules:
$config->set('HTML', 'AllowedModules', 'Hypertext,Text,Lists');
// or
$config->setAllowedHTML('Hypertext,Text,Lists');
But it is not expected that this feature will be widely used. The granularity of these modules is too coarse for the average user (for example, the core module loads everything from the essential p tag to the not-so-safe h1 tag). How do we make this still a viable solution?
Modules are distinguished from regular tags by the case of their first letter. While XML distinguishes between lowercase and uppercase letters, in practice most well-known XML languages use only lowercase tag names for the sake of consistency.
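The first-letter rule is simple enough to sketch directly. Both functions below are hypothetical helpers, and the mixed input string in the usage note is an assumption about how tags and modules might be combined in one value.

```php
<?php
/**
 * Hypothetical helper: a name is treated as a module when its first
 * letter is uppercase, and as a tag otherwise.
 */
function isModuleName($name)
{
    return $name !== '' && ctype_upper($name[0]);
}

/**
 * Hypothetical helper: splits a comma-separated list into modules and tags.
 */
function partitionAllowed($list)
{
    $result = array('modules' => array(), 'tags' => array());
    foreach (explode(',', $list) as $name) {
        $name = trim($name);
        if ($name === '') {
            continue;
        }
        $result[isModuleName($name) ? 'modules' : 'tags'][] = $name;
    }
    return $result;
}
```

For example, partitionAllowed('Hypertext,p,em,Lists') would place Hypertext and Lists under modules, and p and em under tags.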
Considering that, internally speaking, as mandated by the XHTML 1.1 Modularization specification, we have organized our elements around modules, considerable gymnastics will be needed to get this sort of functionality working.
Because selecting each and every one of these configuration options is a chore, we may wish to offer a specialized configuration method for selecting a filterset. Possibility:
function selectFilter($doctype, $filterset, $mode)
...which is simply a light wrapper over the individual configuration calls. A custom config file format or text format could also be adopted.
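As a sketch of what such a light wrapper might look like, the class below stands in for the configuration object and only mirrors the set() call shape used in this document. SketchConfig is an assumption made so the example runs on its own; it is not the real configuration class.

```php
<?php
/**
 * Stand-in for the configuration object (hypothetical). It only records
 * values so that the wrapper below can be exercised in isolation.
 */
class SketchConfig
{
    public $values = array();

    public function set($namespace, $directive, $value)
    {
        $this->values[$namespace . '.' . $directive] = $value;
    }

    // Light wrapper over the three individual configuration calls.
    public function selectFilter($doctype, $filterset, $mode)
    {
        $this->set('HTML', 'Doctype', $doctype);
        $this->set('HTML', 'Filterset', $filterset);
        $this->set('HTML', 'Mode', $mode);
    }
}

$config = new SketchConfig();
$config->selectFilter('XHTML 1.0 Transitional', 'Rich', array('correctional', 'lenient'));
```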