From 826a57a04ad785df07994a40dcf2ec739a903037 Mon Sep 17 00:00:00 2001 From: "Edward Z. Yang" Date: Sun, 1 Apr 2007 18:21:43 +0000 Subject: [PATCH] Update Advanced API with various edits and Customization section. git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@928 48356398-32a2-884e-a903-53898d9a118a --- docs/dev-advanced-api.html | 246 ++++++++++++++++++++++++------------- 1 file changed, 163 insertions(+), 83 deletions(-) diff --git a/docs/dev-advanced-api.html b/docs/dev-advanced-api.html index c19a84cc..abc83025 100644 --- a/docs/dev-advanced-api.html +++ b/docs/dev-advanced-api.html @@ -16,9 +16,10 @@
Return to the index.
HTML Purifier End-User Documentation
-

It makes no sense to adopt a one-size-fits-all approach to -filtersets: therefore, users must be able to define their own sets of -allowed elements, as well as switch in-between doctypes of HTML.

+

HTML Purifier currently natively supports only a subset of HTML's +allowed elements, attributes, and behavior. This is by design, +but as the user is always right, they'll need some method to overload +these behaviors.

Our goals are to let the user:

@@ -26,20 +27,18 @@ filtersets: therefore, users must be able to define their own sets of
Select
Customize
-
Create
+
Internals
@@ -47,11 +46,14 @@ filtersets: therefore, users must be able to define their own sets of

Select

+

For basic use, the user will have to specify some basic parameters. This +is not strictly necessary, as HTML Purifier's default setting will always +output safe code, but is required for standards-compliant output.

+

Selecting a Doctype

-

By default, users will use a doctype-based, permissive but secure -whitelist. They must define a doctype, and this serves -as the first method of determining a filterset.

+

The first thing to select is the doctype. This +is essential for standards-compliant output.

This identifier is based on the name the W3C has given to the document type and not @@ -61,114 +63,106 @@ the DTD identifier.

$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
-

Due to legacy, the default option is XHTML 1.0 Transitional, however, we -really shouldn't be guessing what the user's doctype is. Fortunantely, -people who can't be bothered to set this won't be bothered when their -pages stop validating.

- -

Selecting a Filterset

- -

However, selecting this doctype doesn't mean much, because if we -adhered exactly to the definition we would be letting XSS and other -nasties through. HTML Purifier must, in its filterset, allow a subset -of the doctype, which we shall call a filterset.

- -

By default, HTML Purifier will use the Rich -filterset, which allows as many elements as possible with untrusted -sources. Other possible filtersets could be:

- -
-
Full
-
Allows the full span of elements in the doctype, good if you want - HTML Purifier to work as a Tidy substitute but not to strip - anything out.
-
Plain
-
Provides a minimum set of tags for semantic markup of things - like blog comments.
-
- -

Extension-authors would be able to define custom filtersets for -other users to use.

- -

A possible call to select a filterset would be:

- -
$config->set('HTML', 'Filterset', 'Rich');
+

For historical reasons, the default doctype is XHTML 1.0 +Transitional; however, we really shouldn't be guessing what the user's +doctype is. Fortunately, people who can't be bothered to set this won't +be bothered when their pages stop validating.

Selecting Mode

-

Within filtersets, there are various modes of operation. +

Within doctypes, there are various modes of operation. These indicate variant behaviors that, while not strictly changing the -allowed set of elements and attributes, will definitely affect the output. +allowed set of elements and attributes, definitely affect the output. Currently, we have two modes, which may be used together:

Lenient
-
Deprecated elements and attributes will be transformed into - standards-compliant alternatives when explicitly disallowed. For - example, in the XHTML 1.0 Strict doctype, a center - tag would be turned into a div with the CSS property +
+

Deprecated elements and attributes will be transformed into + standards-compliant alternatives when explicitly disallowed.

+

For example, in the XHTML 1.0 Strict doctype, a center + element would be turned into a div with the CSS property text-align:center;, but in XHTML 1.0 Transitional - the tag would be preserved. This mode is on by default.

-
Correctional
-
Deprecated elements and attributes will be transformed into - standards-compliant alternatives whenever possible. Referring - back to the previous example, the center tag would - be transformed in both cases. However, tags without a + the element would be preserved.

+

This mode is on by default.

+
+
Correctional[items to correct]
+
+

Deprecated elements and attributes will be transformed into + standards-compliant alternatives whenever possible. + It may have various levels of operation.

+

Referring back to the previous example, the center element would + be transformed in both cases. However, elements without a reasonable standards-compliant alternative will be preserved - in their form. This mode is on by default. It may have - various levels of operation.

+ in their form.

+

A user may want to correct certain deprecated attributes, but + not others. For example, the bgcolor attribute may be + acceptable, but the center element may not be; also, possibly, + an HTML Purifier transformation may be buggy, so the user wants + to forgo it. Thus, correctional accepts an array defining which + elements and attributes to clean up, or no parameter at all, which + means everything gets corrected. This also means that each + correction needs to be given a unique ID that can be referenced + in this manner. (We may also allow globbing, like *.name or a.* + for mass-enabling correction, and subtractive mode, where things + specified stop correction.) This array gets passed into the + constructor of the mode's module.

+

This mode is on by default.

+

A possible call to select modes would be:

$config->set('HTML', 'Mode', array('correctional', 'lenient'));
-

If modes have extra parameters, a hash might work well:

+

If modes have extra parameters, a hash is necessary:

$config->set('HTML', 'Mode', array(
-    'correctional' => 9, // strongest level
+    'correctional' => 'center,a.name',
     'lenient' => true // this one's just boolean
 ));
-

Modes may possibly be wrapped up with the filterset declaration:

+

Modes may be specified along with the doctype declaration (we may want +to get a better set of separator characters):

-
$config->set('HTML', 'Filterset', 'Rich: correctional, lenient');
+
$config->setDoctype('XHTML 1.0 Transitional', '+correctional[center,a.name] -lenient');
-

Further investigation in this field is necessary.

- -

With regards to the various levels of operation conjectured in the +

+With regards to the various levels of operation conjectured in the Correctional mode, this is prompted by the fact that a user may want to correct certain problems but not others, for example, fix the center -tag but not the u tag, both of which are deprecated. +element but not the u element, both of which are deprecated. Having an integer level will not work very well for such fine grained tweaking, but an array of specific settings might.

-

Selecting Tags / Attributes / Modules

+

Selecting Elements / Attributes / Modules

+ +

If this cookie cutter approach doesn't appeal to a user, they may -decide to roll their own filterset by selecting modules, tags and +decide to roll their own filterset by selecting modules, elements and attributes to allow.

This would make use of the same facilities as a filterset author would use, except that it would go under an anonymous filterset that would be auto-selected if any of the -relevant module/tag/attribute selection configuration directives were +relevant module/element/attribute selection configuration directives were non-null.

In practice, this is the most commonly demanded feature. Most users are perfectly happy defining a filterset that looks like:

-
$config->setAllowedHTML('a[href,title],em,p,blockquote');
- -

We currently support a separated interface, which also must be preserved:

- -
$config->set('HTML', 'AllowedTags', 'a,em,p,blockquote');
-$config->set('HTML', 'AllowedAttributes', 'a.href,a.title');
+
$config->setAllowedHTML('a[href,title];em;p;blockquote');

The directive %HTML.Allowed is a convenience function that may be fully expressed with the legacy interface, and thus is given its own setter.

+

We currently support a separated interface, which also must be preserved:

+ +
$config->set('HTML', 'AllowedElements', 'a,em,p,blockquote');
+$config->set('HTML', 'AllowedAttributes', 'a.href,a.title');
+

A user may also choose to allow modules:

$config->set('HTML', 'AllowedModules', 'Hypertext,Text,Lists'); // or
@@ -178,13 +172,16 @@ $config->setAllowedHTML('Hypertext,Text,Lists');

The granularity of these modules is too coarse for the average user (for example, the core module loads everything from -the essential p tag to the not-so-safe h1 -tag). How do we make this still a viable solution?

+the essential p element to the not-so-safe h1 +element). How do we make this still a viable solution? Possible answers +may be sub-modules or module parameters. This may not even be a problem, +considering that most people won't be selecting modules.

-

Modules are distinguished from regular tags by the -case of their first letter. While XML distinguishes between lower and uppercase -letters, in practice, most well-known XML languages use only lower-case -tag names for sake of consistency.

+

Modules are distinguished from regular elements by the +case of their first letter. While XML distinguishes between and allows +lowercase and uppercase letters in element names, most well-known XML +languages use only lowercase +element names for the sake of consistency.

Considering that, internally speaking, as mandated by the XHTML 1.1 Modularization specification, we have organized our @@ -202,6 +199,89 @@ for selecting a filterset. Possibility:

...which is simply a light wrapper over the individual configuration calls. A custom config file format or text format could also be adopted.

+

Customize

+ +

By reviewing topic posts in the support forum, we determined that +two customization features were in greatest demand: +adding an attribute to an existing element, and adding an element. +Thus, we'll want to create convenience functions for these common +use-cases.

+ +

Note that the functions described here are only available if +a raw copy of HTMLPurifier_HTMLDefinition was retrieved. +addAttribute may work on a processed copy, but for +consistency's sake we will require a raw copy for everything.

+ +

Attributes

+ +

An attribute is bound to an element by a name and has a specific +AttrDef that validates it. Thus, the interface should +be:

+ +
function addAttribute($element, $attribute, $attribute_def);
+ +

With a use-case that looks like:

+ +
$def->addAttribute('a', 'rel', new HTMLPurifier_AttrDef_Enum(array('nofollow')));
+ +

The $attribute_def value can be a little flexible, +to make things simpler. We'll let it also be:

+ + + +

This lets the previous example be written as:

+ +
$def->addAttribute('a', 'rel', 'enum(nofollow)');
+ +

Elements

+ +

An element requires certain information as specified by +HTMLPurifier_ElementDef. However, not all of it is necessary; +the usual things required are:

+ + + +

This suggests an API like this:

+ +
function addElement($element, $type, $content_model, $attributes = array());
+ +

Each parameter explained in depth:

+ +
+
$element
+
Element name, ex. 'label'
+
$type
+
Content set to register in, ex. 'Inline' or 'Flow'
+
$content_model
+
Description of allowed children. This is a merged form of + HTMLPurifier_ElementDef's member variables + $content_model and $content_model_type, + where the form is Type: Model, ex. 'Optional: Inline'.
+
$attributes
+
Array of attribute names to attribute definitions, much like + the above-described attribute customization.
+
+ +

A possible usage:

+ +
$def->addElement('font', 'Inline', 'Optional: Inline',
+    array(0 => array('Common'), 'color' => 'Color'));
+ +

We may want inclusion of the Common attribute collection to be added +by default.

+
$Id$
\ No newline at end of file