From b829e76bbf65edb8220bb03031ada1b2e6206c9b Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
It makes no sense to adopt a
+HTML Purifier currently natively supports only a subset of HTML's
+allowed elements, attributes, and behavior. This is by design,
+but as the user is always right, they'll need some method to overload
+these behaviors. Our goals are to let the user:
+For basic use, the user will have to specify some basic parameters. This
+is not strictly necessary, as HTML Purifier's default setting will always
+output safe code, but is required for standards-compliant output.
-By default, users will use a doctype-based, permissive but secure
-whitelist. They must define a doctype, and this serves
-as the first method of determining a filterset.
+The first thing to select is the doctype. This
+is essential for standards-compliant output. This identifier is based
on the name the W3C has given to the document type and not
@@ -61,117 +63,131 @@
the DTD identifier.
-However, selecting this doctype doesn't mean much, because if we
-adhered exactly to the definition we would be letting XSS and other
-nasties through. HTML Purifier must, in its filterset, allow a subset
-of the doctype, which we shall call a filterset.
-By default, HTML Purifier will use the Rich
-filterset, which allows as many elements as possible with untrusted
-sources. Other possible filtersets could be:
-Extension-authors would be able to define custom filtersets for
-other users to use.
-A possible call to select a filterset would be:
+Due to historical reasons, the default doctype is XHTML 1.0
+Transitional; however, we really shouldn't be guessing what the user's
+doctype is. Fortunately, people who can't be bothered to set this won't
+be bothered when their pages stop validating.
-Within filtersets, there are various modes of operation.
+Within doctypes, there are various modes of operation.
These indicate variant behaviors that, while not strictly changing the
-allowed set of elements and attributes, will definitely affect the output.
+allowed set of elements and attributes, definitely affect the output.
Currently, we have two modes, which may be used together:
+Deprecated elements and attributes will be transformed into
+ standards-compliant alternatives when explicitly disallowed.
For example, in the XHTML 1.0 Strict doctype, a
)
@@ -45,7 +47,7 @@ TODO List
- Append something to duplicate IDs so they're still usable (impl. note: the
dupe detector would also need to detect the suffix as well)
-2.0 release
+2.0 release [Beyond HTML]
# Legit token based CSS parsing (will require revamping almost every
AttrDef class)
# Formatters for plaintext (COMPLEX)
@@ -54,31 +56,31 @@ TODO List
- Linkify URLs
- Smileys
- Linkification for HTML Purifier docs: notably configuration and classes
-
-3.0 release
- - Extended HTML capabilities based on namespacing and tag transforms (COMPLEX)
- - Hooks for adding custom processors to custom namespaced tags and
- attributes, offer default implementation
- - Lots of documentation and samples
- Allow tags to be "armored", an internal flag that protects them
from validation and passes them out unharmed
- - XHTML 1.1 support
- Fixes for Firefox's inability to handle COL alignment props (Bug 915)
- Automatically add non-breaking spaces to empty table cells when
empty-cells:show is applied to have compatibility with Internet Explorer
- Convert RTL/LTR override characters to tags, or vice versa on demand.
Also, enable disabling of directionality
+3.0 release [To XML and Beyond]
+ - Extended HTML capabilities based on namespacing and tag transforms (COMPLEX)
+ - Hooks for adding custom processors to custom namespaced tags and
+ attributes, offer default implementation
+ - Lots of documentation and samples
+ - XHTML 1.1 support
+
Ongoing
- Lots of profiling, make it faster!
- Plugins for major CMSes (COMPLEX)
- - WordPress
+ - WordPress (mostly written, needs beta-testing)
- eFiction
- more! (look for ones that use WYSIWYGs)
Unknown release (on a scratch-an-itch basis)
- - Have 'lang' attribute be checked against official lists
- ? Semi-lossy dumb alternate character encoding transformations, achieved by
+ ? Have 'lang' attribute be checked against official lists
+ ? Semi-lossy dumb alternate character encoding transformations, achieved by
encoding all characters that have string entity equivalents
Requested
diff --git a/docs/dev-advanced-api.html b/docs/dev-advanced-api.html
index 731397f2..abc83025 100644
--- a/docs/dev-advanced-api.html
+++ b/docs/dev-advanced-api.html
@@ -16,9 +16,10 @@
one-size-fits-all
approach to
-filtersets: therefore, users must be able to define their own sets of
-allowed
elements, as well as switch in-between doctypes of HTML.
-
-
Select
+Selecting a Doctype
-$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
-Selecting a Filterset
-
-
-
-
-$config->set('HTML', 'Filterset', 'Rich');
+Selecting Mode
-
center
- tag would be turned into a div
with the CSS property
+ center
+ element would be turned into a div
with the CSS property
text-align:center;
, but in XHTML 1.0 Transitional
- the tag would be preserved. This mode is on by default.center
tag would
- be transformed in both cases. However, tags without a
+ the element would be preserved.
This mode is on by default.
+
+Deprecated elements and attributes will be transformed into
+ standards-compliant alternatives whenever possible.
+ It may have various levels of operation.
+Referring back to the previous example, the center element would
+ be transformed in both cases. However, elements without a
reasonable standards-compliant alternative will be preserved
- in their form. This mode is on by default. It may have
- various levels of operation.
+A user may want to correct certain deprecated attributes, but
+ not others. For example, the bgcolor attribute may be
+ acceptable, but the center element not; also, possibly,
+ an HTML Purifier transformation may be buggy, so the user wants
+ to forgo it. Thus, correctional accepts an array defining which
+ elements and attributes to clean up, or no parameter at all, which
+ means everything gets corrected. This also means that each
+ correction needs to be given a unique ID that can be referenced
+ in this manner. (We may also allow globbing, like *.name or a.*
+ for mass-enabling correction, and subtractive mode, where things
+ specified stop correction.) This array gets passed into the
+ constructor of the mode's module.
This mode is on by default.
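The globbing idea for correction IDs can be sketched with PHP's built-in fnmatch(). This is only an illustration under assumptions: the function name correctionEnabled and the convention that null means "no parameter at all" are hypothetical and not part of HTML Purifier.

```php
<?php
// Hypothetical helper (not an HTML Purifier API): decide whether a correction
// ID such as 'center' or 'a.name' is enabled by a list of patterns, where a
// pattern may be a glob like '*.name' or 'a.*'.
function correctionEnabled(string $id, ?array $patterns): bool
{
    if ($patterns === null) {
        return true; // no parameter at all: everything gets corrected
    }
    foreach ($patterns as $pattern) {
        // fnmatch() applies shell-style wildcard matching to the whole ID.
        if (fnmatch($pattern, $id)) {
            return true;
        }
    }
    return false;
}

$patterns = array('center', '*.name');
$centerOn = correctionEnabled('center', $patterns);   // exact ID match
$nameOn   = correctionEnabled('a.name', $patterns);   // matched by '*.name'
$bgOn     = correctionEnabled('td.bgcolor', $patterns); // no pattern matches
```

A subtractive mode, where listed patterns stop correction, would simply invert the result for a non-null list.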
+A possible call to select modes would be:
$config->set('HTML', 'Mode', array('correctional', 'lenient'));
-If modes have extra parameters, a hash might work well:
+If modes have extra parameters, a hash is necessary:
$config->set('HTML', 'Mode', array(
-    'correctional' => 9, // strongest level
+    'correctional' => 'center,a.name',
    'lenient' => true // this one's just boolean
));
-Modes may possibly be wrapped up with the filterset declaration:
+Modes may be specified along with the doctype declaration (we may want
+to get a better set of separator characters):
-$config->set('HTML', 'Filterset', 'Rich: correctional, lenient');
+$config->setDoctype('XHTML Transitional 1.0', '+correctional[center,a.name] -lenient');
-Further investigation in this field is necessary.
+
+With regard to the various levels of operation conjectured in the
+Correctional mode, this is prompted by the fact that a user may want to
+correct certain problems but not others, for example, fix the center
+element but not the u element, both of which are deprecated.
+Having an integer level will not work very well for such fine
+grained tweaking, but an array of specific settings might.
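As a rough sketch of how a mode string like '+correctional[center,a.name] -lenient' might map onto the hash form, the following parser turns each token into a mode => setting pair. The function name parseModeSpec and the exact token grammar are assumptions made for illustration; the proposal itself notes that the separator characters are not settled.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: turns a mode string such as
// '+correctional[center,a.name] -lenient' into a hash of mode => setting.
function parseModeSpec(string $spec): array
{
    $modes = array();
    // Tokens are separated by whitespace; commas only appear inside [...].
    foreach (preg_split('/\s+/', trim($spec), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        if (!preg_match('/^([+-]?)([a-z]+)(?:\[([^\]]*)\])?$/', $token, $m)) {
            throw new InvalidArgumentException("Malformed mode token: $token");
        }
        $name = $m[2];
        if ($m[1] === '-') {
            $modes[$name] = false;               // mode switched off
        } elseif (isset($m[3]) && $m[3] !== '') {
            $modes[$name] = explode(',', $m[3]); // list of correction IDs
        } else {
            $modes[$name] = true;                // on, no parameters
        }
    }
    return $modes;
}

$modes = parseModeSpec('+correctional[center,a.name] -lenient');
// $modes['correctional'] is array('center', 'a.name'); $modes['lenient'] is false.
```

Returning an array of IDs rather than an integer level matches the observation above that a single level is too blunt for per-correction tweaking.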
If this cookie cutter approach doesn't appeal to a user, they may
-decide to roll their own filterset by selecting modules, tags and
+decide to roll their own filterset by selecting modules, elements and
attributes to allow. This would make use of the same facilities
as a filterset author would use, except that it would go under an
anonymous filterset that would be auto-selected if any of the
-relevant module/tag/attribute selection configuration directives were
+relevant module/elements/attribute selection configuration directives were
non-null.
-On the highest level, a user will usually be most interested in
-directly specifying which elements and attributes are desired. For
-example:
+In practice, this is the most commonly demanded feature. Most users are
+perfectly happy defining a filterset that looks like:
-$config->set('HTML', 'AllowedElements', 'a,b,em,p,blockquote,code,i');
+$config->setAllowedHTML('a[href,title];em;p;blockquote');
-Attribute declarations could be merged into this declaration as such:
+The directive %HTML.Allowed is a convenience function
+that may be fully expressed with the legacy interface, and thus is
+given its own setter.
-$config->set('HTML', 'Allowed', 'a[href,title],b,em,p[class],blockquote[cite],code,i');
+We currently support a separated interface, which also must be preserved:
-...or be kept separate:
+$config->set('HTML', 'AllowedElements', 'a,em,p,blockquote');
+$config->set('HTML', 'AllowedAttributes', 'a.href,a.title');
-$config->set('HTML', 'AllowedAttributes', 'a.href,a.title,p.class,blockquote.cite');
+A user may also choose to allow modules:
+
+$config->set('HTML', 'AllowedModules', 'Hypertext,Text,Lists'); // or
+$config->setAllowedHTML('Hypertext,Text,Lists');
+
+But it is not expected that this feature will be widely used.
+
+The granularity of these modules is too coarse for
+the average user (for example, the core module loads everything from
+the essential p element to the not-so-safe h1
+element). How do we make this still a viable solution? Possible answers
+may be sub-modules or module parameters. This may not even be a problem,
+considering that most people won't be selecting modules.
+
+Modules are distinguished from regular elements by the
+case of their first letter. While XML distinguishes between and allows
+lower and uppercase letters in element names, most well-known XML
+languages use only lower-case
+element names for sake of consistency.
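To make the relationship between the compact form and the separated interface concrete, here is a minimal sketch that expands a string like 'a[href,title];em;p;blockquote' into element and "element.attribute" lists, and uses the first-letter case rule to recognize modules. The function name parseAllowedList and the accepted grammar are assumptions for illustration only, not the library's actual parser.

```php
<?php
// Hypothetical sketch (not HTML Purifier code): split a compact allowed-list
// string into modules, elements, and "element.attribute" pairs. Names that
// start with an uppercase letter are treated as modules.
function parseAllowedList(string $list): array
{
    $result = array('modules' => array(), 'elements' => array(), 'attributes' => array());
    // Each item is a name with an optional [attr,attr] suffix; items are
    // separated by ';' or ',' (commas inside brackets belong to the item).
    preg_match_all('/([A-Za-z][A-Za-z0-9]*)(?:\[([^\]]*)\])?/', $list, $items, PREG_SET_ORDER);
    foreach ($items as $item) {
        $name = $item[1];
        if (ctype_upper($name[0])) {
            $result['modules'][] = $name;
            continue;
        }
        $result['elements'][] = $name;
        if (isset($item[2]) && $item[2] !== '') {
            foreach (explode(',', $item[2]) as $attr) {
                $result['attributes'][] = $name . '.' . trim($attr);
            }
        }
    }
    return $result;
}

$allowed = parseAllowedList('a[href,title];em;p;blockquote');
// implode(',', $allowed['elements'])   gives 'a,em,p,blockquote'
// implode(',', $allowed['attributes']) gives 'a.href,a.title'
$modules = parseAllowedList('Hypertext,Text,Lists');
```

The two imploded strings are exactly the values passed to AllowedElements and AllowedAttributes in the separated interface shown earlier. Malformed input is not rejected here; a real parser would need to validate it.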
Considering that, internally speaking, as mandated by the XHTML 1.1 Modularization specification, we have organized our elements around modules, considerable gymnastics will be needed to get this sort of functionality working.
-A user may also specify a module to load a class of elements and attributes
-into their filterest:
-
-$config->set('HTML', 'Allowed', 'Hypertext,Core');
-
-The granularity of these modules is too coarse for
-the average user (for example, the core module loads everything from
-the essential p tag to the not-so-safe h1
-tag). How do we make this still a viable solution?
Because selecting each and every one of these configuration options
@@ -183,6 +199,89 @@
for selecting a filterset. Possibility:
...which is simply a light wrapper over the individual configuration calls. A custom config file format or text format could also be adopted.
+By reviewing topic posts in the support forum, we determined that
+there were two primarily demanded customization features people wanted:
+to add an attribute to an existing element, and to add an element.
+Thus, we'll want to create convenience functions for these common
+use-cases.
+
+Note that the functions described here are only available if
+a raw copy of HTMLPurifier_HTMLDefinition was retrieved.
+addAttribute may work on a processed copy, but for
+consistency's sake we will mandate this for everything.
+An attribute is bound to an element by a name and has a specific
+AttrDef that validates it. Thus, the interface should
+be:
+
+function addAttribute($element, $attribute, $attribute_def);
+
+With a use-case that looks like:
+
+$def->addAttribute('a', 'rel', new HTMLPurifier_AttrDef_Enum(array('nofollow')));
+
+The $attribute_def value can be a little flexible,
+to make things simpler. We'll let it also be:
+HTMLPurifier_AttrDef_Anonymous
+ class with that function registered as a callback.
+HTMLPurifier_AttrTypes
+ enum(: We'll explode it and stuff it in an
+ HTMLPurifier_AttrDef_Enum for you.
+Making the previous example written as:
+
+$def->addAttribute('a', 'rel', 'enum(nofollow)');
+
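The flexible $attribute_def handling could look roughly like the resolver below. It is a self-contained sketch under assumptions: plain closures stand in for the real AttrDef classes, the $namedTypes lookup table stands in for HTMLPurifier_AttrTypes, and the function name resolveAttributeDef is hypothetical.

```php
<?php
// Hypothetical resolver (not an HTML Purifier API). Closures returning bool
// stand in for AttrDef objects so that the sketch runs on its own.
function resolveAttributeDef($def, array $namedTypes = array()): callable
{
    // 'enum(a,b)' form: explode the values and build an enum check.
    if (is_string($def) && preg_match('/^enum\((.*)\)$/', $def, $m)) {
        $allowed = array_map('trim', explode(',', $m[1]));
        return function (string $value) use ($allowed): bool {
            return in_array($value, $allowed, true);
        };
    }
    // Named type, looked up in a caller-supplied table (assumed stand-in
    // for the attribute type registry).
    if (is_string($def) && isset($namedTypes[$def])) {
        return $namedTypes[$def];
    }
    // Anything callable is used directly as the validator.
    if (is_callable($def)) {
        return $def;
    }
    throw new InvalidArgumentException('Unrecognized attribute definition');
}

$rel = resolveAttributeDef('enum(nofollow)');
$ok  = $rel('nofollow'); // true: listed in the enum
$bad = $rel('follow');   // false: not listed
```

The enum check runs before the callable check on purpose, so that a string is never mistaken for a function name when it matches the enum(...) form.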
+An element requires certain information as specified by
+HTMLPurifier_ElementDef. However, not all of it is necessary;
+the usual things required are:
+This suggests an API like this:
+
+function addElement($element, $type, $content_model, $attributes = array());
+
+Each parameter explained in depth:
+
+$element
+$type
+$content_model
+HTMLPurifier_ElementDef's member variables
+ $content_model and $content_model_type,
+ where the form is Type: Model, ex. 'Optional: Inline'.
+$attributes
+A possible usage:
+
+$def->addElement('font', 'Inline', 'Optional: Inline',
+    array(0 => array('Common'), 'color' => 'Color'));
+
+We may want Common attribute collection inclusion to be added
+by default.
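As an illustration of the 'Type: Model' convention and of the $attributes array shape used in the addElement() call, the helpers below split each argument into its parts. Both function names are hypothetical, and treating numeric key 0 as the list of attribute collections is an assumption read off the usage example.

```php
<?php
// Hypothetical helpers for illustration; they are not HTML Purifier functions.

// Split a 'Type: Model' string such as 'Optional: Inline' into its two parts.
function splitContentModel(string $spec): array
{
    $parts = explode(':', $spec, 2);
    if (count($parts) !== 2) {
        throw new InvalidArgumentException("Expected 'Type: Model', got: $spec");
    }
    return array('type' => trim($parts[0]), 'model' => trim($parts[1]));
}

// Separate attribute collections (numeric key 0) from per-attribute definitions.
function splitAttributes(array $attributes): array
{
    $collections = isset($attributes[0]) ? (array) $attributes[0] : array();
    unset($attributes[0]);
    return array('collections' => $collections, 'attributes' => $attributes);
}

$model = splitContentModel('Optional: Inline');
// $model['type'] is 'Optional', $model['model'] is 'Inline'
$attrs = splitAttributes(array(0 => array('Common'), 'color' => 'Color'));
// $attrs['collections'] is array('Common'); $attrs['attributes'] is array('color' => 'Color')
```

If the Common collection were included by default, splitAttributes() would be the natural place to add it when key 0 is absent.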
+