0
0
mirror of https://github.com/ezyang/htmlpurifier.git synced 2025-01-18 11:41:52 +00:00

[1.7.0] Update Advanced API documentation to reflect new changes.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1066 48356398-32a2-884e-a903-53898d9a118a
This commit is contained in:
Edward Z. Yang 2007-05-16 03:35:57 +00:00
parent a5136b65e4
commit a846f4e70b

View File

@ -17,9 +17,12 @@
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
<p>HTML Purifier currently natively supports only a subset of HTML's
allowed elements, attributes, and behavior. This is by design,
but as the user is always right, they'll need some method to overload
these behaviors.</p>
allowed elements, attributes, and behavior; specifically, this subset
is the set of elements that are safe for untrusted users to use.
However, HTML Purifier is often utilized to ensure standards-compliance
from input that is trusted (making it a sort of Tidy substitute),
and often users need to define new elements or attributes. The
advanced API is oriented specifically for these use-cases.</p>
<p>Our goals are to let the user:</p>
@ -27,20 +30,15 @@ these behaviors.</p>
<dt>Select</dt>
<dd><ul>
<li>Doctype</li>
<li>Mode: Lenient / Correctional</li>
<li><em>Filterset</em></li>
<li>Elements / Attributes / Modules</li>
<li>Filterset</li>
<li>Tidy</li>
</ul></dd>
<dt>Customize</dt>
<dd><ul>
<li>Attributes</li>
<li>Elements</li>
</ul></dd>
<dt>Internals</dt>
<dd><ul>
<li>Modules / Elements / Attributes / Attribute Types</li>
<li>Filtersets</li>
<li>Doctype</li>
<li>Doctypes</li>
</ul></dd>
</dl>
@ -57,7 +55,7 @@ is essential for standards-compliant output.</p>
<p class="technical">This identifier is based
on the name the W3C has given to the document type and <em>not</em>
the DTD identifier.</p>
the DTD identifier, although that may be included as an alias.</p>
<p>This parameter is set via the configuration object:</p>
@ -68,86 +66,16 @@ Transitional, however, we really shouldn't be guessing what the user's
doctype is. Fortunantely, people who can't be bothered to set this won't
be bothered when their pages stop validating.</p>
<h3>Selecting Mode</h3>
<p>Within doctypes, there are various <strong>modes</strong> of operation.
These indicate variant behaviors that, while not strictly changing the
allowed set of elements and attributes, definitely affect the output.
Currently, we have two modes, which may be used together:</p>
<dl>
<dt>Lenient</dt>
<dd>
<p>Deprecated elements and attributes will be transformed into
standards-compliant alternatives when explicitly disallowed.</p>
<p>For example, in the XHTML 1.0 Strict doctype, a <code>center</code>
element would be turned into a <code>div</code> with the CSS property
<code>text-align:center;</code>, but in XHTML 1.0 Transitional
the element would be preserved.</p>
<p>This mode is on by default.</p>
</dd>
<dt>Correctional[items to correct]</dt>
<dd>
<p>Deprecated elements and attributes will be transformed into
standards-compliant alternatives whenever possible.
It may have various levels of operation.</p>
<p>Referring back to the previous example, the <code>center</code> element would
be transformed in both cases. However, elements without a
reasonable standards-compliant alternative will be preserved
in their form.</p>
<p>A user may want to correct certain deprecated attributes, but
not others. For example, the <code>bgcolor</code> attribute may be
acceptable, but the <code>center</code> element not; also, possibly,
an HTML Purifier transformation may be buggy, so the user wants
to forgo it. Thus, correctional accepts an array defining which
elements and attributes to cleanup, or no parameter at all, which
means everything gets corrected. This also means that each
correction needs to be given a unique ID that can be referenced
in this manner. (We may also allow globbing, like *.name or a.*
for mass-enabling correction, and subtractive mode, where things
specified stop correction.) This array gets passed into the
constructor of the mode's module.</p>
<p>This mode is on by default.</p>
</dd>
</dl>
<p>A possible call to select modes would be:</p>
<pre>$config->set('HTML', 'Mode', array('correctional', 'lenient'));</pre>
<p>If modes have extra parameters, a hash is necessary:</p>
<pre>$config->set('HTML', 'Mode', array(
'correctional' => 'center,a.name',
'lenient' => true // this one's just boolean
));</pre>
<p>Modes may be specified along with the doctype declaration (we may want
to get a better set of separator characters):</p>
<pre>$config->setDoctype('XHTML Transitional 1.0', '+correctional[center,a.name] -lenient');</pre>
<p>
With regards to the various levels of operation conjectured in the
Correctional mode, this is prompted by the fact that a user may want to
correct certain problems but not others, for example, fix the <code>center</code>
element but not the <code>u</code> element, both of which are deprecated.
Having an integer <q>level</q> will not work very well for such fine
grained tweaking, but an array of specific settings might.</p>
<h3>Selecting Elements / Attributes / Modules</h3>
<p></p>
<p>HTML Purifier will, by default, allow as many elements and attributes
as possible. However, a user may decide to roll their own filterset by
selecting modules, elements and attributes to allow for their own
specific use-case.</p>
<p>If this cookie cutter approach doesn't appeal to a user, they may
decide to roll their own filterset by selecting modules, elements and
attributes to allow.</p>
<p class="technical">This would make use of the same facilities
as a filterset author would use, except that it would go under an
<q>anonymous</q> filterset that would be auto-selected if any of the
relevant module/elements/attribute selection configuration directives were
non-null.</p>
<p class="technical">The currently un-documented Filterset interface
will offer a way of encapsulating the following declarations, so that
a user can pick a recipe of tags that is thought to be commonly used.</p>
<p>In practice, this is the most commonly demanded feature. Most users are
perfectly happy defining a filterset that looks like:</p>
@ -156,7 +84,8 @@ perfectly happy defining a filterset that looks like:</p>
<p class="technical">The directive %HTML.Allowed is a convenience function
that may be fully expressed with the legacy interface, and thus is
given its own setter.</p>
given its own setter, or implemented by intercepting the set() function
call, parsing, and assigning to the finer grained directives accordingly.</p>
<p>We currently support a separated interface, which also must be preserved:</p>
@ -170,23 +99,45 @@ $config->setAllowedHTML('Hypertext,Text,Lists');</pre>
<p>But it is not expected that this feature will be widely used.</p>
<p class="fixme">The granularity of these modules is too coarse for
the average user (for example, the core module loads everything from
the essential <code>p</code> element to the not-so-safe <code>h1</code>
element). How do we make this still a viable solution? Possible answers
may be sub-modules or module parameters. This may not even be a problem,
considering that most people won't be selecting modules.</p>
<p class="technical">Module selection will work slightly differently
from the other AllowedElements and AllowedAttributes directives by
directly modifying the doctype you are operating in. You cannot,
however, add modules: there is a separate interface for that.</p>
<p class="technical">Modules are distinguished from regular elements by the
case of their first letter. While XML distinguishes between and allows
lower and uppercase letters in element names, most well-known XML
languages use only lower-case
lower and uppercase letters in element names, XHTML uses only lower-case
element names for sake of consistency.</p>
<p class="technical">Considering that, internally speaking, as mandated by
the XHTML 1.1 Modularization specification, we have organized our
elements around modules, considerable gymnastics will be needed to
get this sort of functionality working.</p>
<h3>Selecting Tidy</h3>
<p>The name of this segment of functionality is inspired off of Dave
Ragget's program HTML Tidy, which purported to help clean up HTML. In
HTML Purifier, Tidy functionality involves turning unsupported and
deprecated elements into standards-compliant ones, maintaining
backwards compatibility, and enforcing best practices.</p>
<p>Tidy is optional, when on, it has several coarse
levels of operations, as well as directives that can be used to fine-tune
the output. The coarse levels, set at %HTML.TidyLevel, are:</p>
<dl>
<dt>Lenient</dt>
<dd>Preserve any non standards-compliant aspects by transforming
them into standards-compliant equivalents.</dd>
<dt>Correctional</dt>
<dd>Default: Be lenient and enforce good practices.</dd>
<dt>Aggressive</dt>
<dd>Be correctional and transform as many deprecated elements as
possible to CSS forms</dd>
</dl>
<p>The distinction between correctional and aggressive is fuzzy,
so the user will also have %HTML.TidyAdd and %HTML.TidyRemove, in
which they may list the names of transforms they want and don't want,
using the broad level as a starting point. The naming convention
has not been established yet, but it will be something along the lines
of 'element.attribute', with globs and special cases supported.</p>
<h3>Unified selector</h3>
@ -194,7 +145,7 @@ get this sort of functionality working.</p>
is a chore, we may wish to offer a specialized configuration method
for selecting a filterset. Possibility:</p>
<pre>function selectFilter($doctype, $filterset, $mode)</pre>
<pre>function selectFilter($doctype, $filterset, $tidy)</pre>
<p>...which is simply a light wrapper over the individual configuration
calls. A custom config file format or text format could also be adopted.</p>
@ -255,7 +206,8 @@ the usual things required are:</p>
<p>This suggests an API like this:</p>
<pre>function addElement($element, $type, $content_model, $attributes = array());</pre>
<pre>function addElement($element, $type, $contents,
$attr_collections = array(); $attributes = array());</pre>
<p>Each parameter explained in depth:</p>
@ -264,11 +216,15 @@ the usual things required are:</p>
<dd>Element name, ex. 'label'</dd>
<dt><code>$type</code></dt>
<dd>Content set to register in, ex. 'Inline' or 'Flow'</dd>
<dt><code>$content_model</code></dt>
<dt><code>$contents</code></dt>
<dd>Description of allowed children. This is a merged form of
<code>HTMLPurifier_ElementDef</code>'s member variables
<code>$content_model</code> and <code>$content_model_type</code>,
where the form is <q>Type: Model</q>, ex. 'Optional: Inline'.</dd>
where the form is <q>Type: Model</q>, ex. 'Optional: Inline'.
There are also a number of predefined templates one may use.</dd>
<dt><code>$attr_collections</code></dt>
<dd>Array (or string if only one) of attribute collection(s) to
merge into the attributes array.</dd>
<dt><code>$attributes</code></dt>
<dd>Array of attribute names to attribute definitions, much like
the above-described attribute customization.</dd>
@ -276,11 +232,10 @@ the usual things required are:</p>
<p>A possible usage:</p>
<pre>$def->addElement('font', 'Inline', 'Optional: Inline',
array(0 => array('Common'), 'color' => 'Color'));</pre>
<pre>$def->addElement('font', 'Inline', 'Optional: Inline', 'Common',
array('color' => 'Color'));</pre>
<p>We may want to Common attribute collection inclusion to be added
by default.</p>
<p>See <code>HTMLPurifier/HTMLModule.php</code> for details.</p>
<div id="version">$Id$</div>