2006-07-23 00:11:03 +00:00
2006-07-30 19:11:18 +00:00
HTML Purifier
by Edward Z. Yang

There are a number of ad hoc HTML filtering solutions out there on the web
(some examples including HTML_Safe, kses and SafeHtmlChecker.class.php) that
claim to filter HTML properly, preventing malicious JavaScript and
layout-breaking HTML from getting through the parser. None of them, however,
demonstrates a thorough knowledge of either the DTD that defines HTML or the
caveats of HTML that cannot be expressed by a DTD. Configurable
filters (such as kses or PHP's built-in strip_tags() function) have trouble
validating the contents of attributes and can be subject to security attacks
due to poor configuration. Other filters take the naive approach of
blacklisting known threats and tags, failing to account for the introduction
of new technologies, new tags, new attributes or quirky browser behavior.

HTML Purifier takes a different approach, one that doesn't use
specification-ignorant regexes or narrow blacklists. HTML Purifier will
decompose the whole document into tokens, and rigorously process the tokens by:
removing non-whitelisted elements, transforming bad practice tags like <font>
into <span>, properly checking the nesting of tags and their children and
validating all attributes according to their RFCs.

To my knowledge, there is nothing like this on the web yet. Not even MediaWiki,
which allows an amazingly diverse mix of HTML and wikitext in its documents,
gets all the nesting quirks right. Existing solutions hope that no JavaScript
will slip through, but either do not attempt to ensure that the resulting
output is valid XHTML or send the HTML through a draconian XML parser (and yet
still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a>
tags from being nested within each other).

This document seeks to detail the inner workings of HTML Purifier. The first
draft was drawn up after two rough code sketches and the implementation of a
forgiving lexer. You may also be interested in the unit tests located in the
tests/ folder, which provide a living document on how exactly the filter deals
with malformed input.

In summary:

1. Parse document into an array of tag and text tokens (Lexer)
2. Remove all elements not on whitelist and transform certain other elements
   into acceptable forms (e.g. <font>)
3. Make document well formed while helpfully taking into account certain quirks,
   such as the fact that <p> tags traditionally are closed by other block-level
   elements.
4. Run through all nodes and check children for proper order (especially
   important for tables).
5. Validate attributes according to more restrictive definitions based on the
   RFCs.
6. Translate back into a string. (Generator)

HTML Purifier is best suited for documents that require a rich array of
HTML tags. Things like blog comments are, in all likelihood, most appropriately
written in an extremely restrictive set of markup that doesn't require
all this functionality (or not written in HTML at all).

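To make step 2 concrete, here is a minimal sketch of whitelist filtering. The
token format (arrays with a 'type' of 'start', 'end', 'empty' or 'text', a
'name' for tags and 'data' for text) and the function name are assumptions
made for illustration, not the actual classes.

```php
<?php
// Hypothetical token format: ['type' => 'start'|'end'|'empty'|'text', ...].
// Tags carry a 'name'; text tokens carry 'data'.
function removeForeignElements(array $tokens, array $whitelist): array
{
    $result = [];
    foreach ($tokens as $token) {
        // Text always survives; tags survive only if whitelisted.
        if ($token['type'] === 'text'
            || in_array($token['name'], $whitelist, true)) {
            $result[] = $token;
        }
    }
    return $result;
}

$tokens = [
    ['type' => 'start', 'name' => 'b'],
    ['type' => 'text',  'data' => 'bold'],
    ['type' => 'end',   'name' => 'b'],
    ['type' => 'start', 'name' => 'script'],
    ['type' => 'text',  'data' => 'alert(1)'],
    ['type' => 'end',   'name' => 'script'],
];
$clean = removeForeignElements($tokens, ['b', 'i', 'p']);
```

Note that in this sketch the contents of a removed element stay behind as
plain text; whether that is the right policy for every element is a separate
decision.
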
2006-08-02 02:26:01 +00:00
The rest of this document is pending a move into the associated classes.

== STAGE 3 - make well formed ==

Status: A- (not as good as possible)

Now we step through the whole thing and correct nesting issues. Most of the
time, it's making sure the tags match up, but there's some trickery going on
for HTML's quirks. They are:

* Set of tags that close P
    'address', 'blockquote', 'dd', 'dir', 'div',
    'dl', 'dt', 'h1', 'h2', 'h3', 'h4',
    'h5', 'h6', 'hr',
    'ol', 'p', 'pre',
    'table', 'ul'
* Li closes li
* more?

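The two quirks above can be sketched with a stack of open elements. This is
only an illustration under assumptions: the token format (arrays with 'type'
and 'name') and the function name makeWellFormed() are hypothetical.

```php
<?php
// Tags that implicitly close an open <p> (the list from this document).
const CLOSES_P = ['address', 'blockquote', 'dd', 'dir', 'div', 'dl', 'dt',
    'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'ol', 'p', 'pre', 'table', 'ul'];

function makeWellFormed(array $tokens): array
{
    $result = [];
    $stack = []; // names of the currently open elements
    foreach ($tokens as $token) {
        $type = $token['type'];
        if ($type === 'start' || $type === 'empty') {
            $open = end($stack); // false when nothing is open
            $autoClose =
                ($open === 'p' && in_array($token['name'], CLOSES_P, true))
                || ($open === 'li' && $token['name'] === 'li');
            if ($autoClose) {
                $result[] = ['type' => 'end', 'name' => array_pop($stack)];
            }
            if ($type === 'start') {
                $stack[] = $token['name'];
            }
            $result[] = $token;
        } elseif ($type === 'end') {
            if (!in_array($token['name'], $stack, true)) {
                continue; // stray end tag: drop it
            }
            // Close everything opened after the matching start tag.
            do {
                $name = array_pop($stack);
                $result[] = ['type' => 'end', 'name' => $name];
            } while ($name !== $token['name']);
        } else {
            $result[] = $token; // text passes through untouched
        }
    }
    // Close whatever is still open at the end of the input.
    while ($stack) {
        $result[] = ['type' => 'end', 'name' => array_pop($stack)];
    }
    return $result;
}
```

Only the innermost open element is checked, so an <li> inside a nested <ul>
does not close the outer <li>.
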
We also want to do translations, like from FONT to SPAN with STYLE.

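A FONT to SPAN translation might look roughly like the following. The token
format, the function name and especially the size-to-CSS table are
assumptions for the sake of the example.

```php
<?php
// Hypothetical sketch: rewrite a <font> token as a <span> with a style
// attribute. The mapping of size values to CSS is an assumption.
function transformFont(array $token): array
{
    if (($token['name'] ?? null) !== 'font') {
        return $token; // not a font tag (or a text token): leave it alone
    }
    $token['name'] = 'span';
    if ($token['type'] !== 'start') {
        return $token; // end tags only need the new name
    }
    $sizes = [1 => 'xx-small', 2 => 'small', 3 => 'medium', 4 => 'large',
              5 => 'x-large', 6 => 'xx-large'];
    $attr = $token['attr'] ?? [];
    $css = '';
    if (isset($attr['color'])) {
        $css .= 'color:' . $attr['color'] . ';';
    }
    if (isset($attr['face'])) {
        $css .= 'font-family:' . $attr['face'] . ';';
    }
    if (isset($attr['size']) && isset($sizes[$attr['size']])) {
        $css .= 'font-size:' . $sizes[$attr['size']] . ';';
    }
    // The old presentational attributes are dropped.
    $token['attr'] = $css === '' ? [] : ['style' => $css];
    return $token;
}
```

The attribute values are copied as they are here; in the real pipeline they
would still have to pass attribute validation afterwards.
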
== STAGE 4 - check nesting ==

Status: B (table custom definition needs to be implemented)

We know that the document is now well formed. The tokenizer should now take
things in nodes: when you hit a start tag, keep on going until you get its
ending tag, and then handle everything inside there. Fortunately, no
fancy recursion is necessary as going to the next node is as simple as
scrolling to the next start tag.

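Here is a rough sketch of that traversal on a well formed token array. It
finds a node's end tag by counting depth and collects the names of its direct
children. The token format and both function names are hypothetical.

```php
<?php
// Return the index of the end tag that matches the start tag at $start.
// Assumes the token array is already well formed.
function findNodeEnd(array $tokens, int $start): int
{
    $depth = 0;
    for ($i = $start, $n = count($tokens); $i < $n; $i++) {
        if ($tokens[$i]['type'] === 'start') {
            $depth++;
        } elseif ($tokens[$i]['type'] === 'end') {
            $depth--;
            if ($depth === 0) {
                return $i;
            }
        }
    }
    throw new InvalidArgumentException('Input is not well formed');
}

// Names of the direct child elements of the node that starts at $start.
function childNames(array $tokens, int $start): array
{
    $names = [];
    $depth = 0;
    $end = findNodeEnd($tokens, $start);
    for ($i = $start + 1; $i < $end; $i++) {
        $type = $tokens[$i]['type'];
        if ($depth === 0 && ($type === 'start' || $type === 'empty')) {
            $names[] = $tokens[$i]['name'];
        }
        if ($type === 'start') {
            $depth++;
        } elseif ($type === 'end') {
            $depth--;
        }
    }
    return $names;
}
```

Visiting every node is then a flat loop over the array that calls
childNames() at each start tag, with no recursion involved.
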
Suppose we have a node and encounter a problem with one of its children.
Depending on the complexity of the rule, we will either delete the children,
or delete the entire node itself.

The simplest type of rule is zero or more valid elements, denoted like:

    ( el1 | el2 | el3 )*

The next simplest is with one or more valid elements:

    ( li )+

And then you have complex cases:

    table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))
    map   ((%block; | form | %misc;)+ | area+)
    html  (head, body)
    head  (%head.misc;,
           ((title, %head.misc;, (base, %head.misc;)?) |
            (base, %head.misc;, (title, %head.misc;))))

Each of these has to be dealt with. Case 1 is a joy, because you can zap
as many as you want, but you'll never actually have to kill the node. Two
and three need the entire node to be killed if you have a problem. This
can be problematic, as the missing node might cause its parent node to now
be incorrect. Granted, it's unlikely, and I'm fairly certain that neither
HTML nor the simplified set I'm allowing will have this problem, but it's
worth checking for.

The way, I suppose, one would check for it is, whenever a node is removed,
to scroll to its parent start and re-evaluate it. Make sure you're able to do
that with minimal code repetition.

EDITOR'S NOTE: this behavior is not implemented by default, because the
default configuration has a setup that ensures that cascading node removals
will never happen. However, there will be warning signs in case someone tries
to hack it further.

The most complex case can probably be done by using some fancy regular
expressions and transformations. However, it doesn't seem right that, say,
a stray <b> in a <table> can cause the entire table to be removed. Fixing it,
however, may be too difficult (or not, see below).

This code was excerpted from the PEAR class XML_DTD. It implements regexp
checking.

--

// # This actually does the validation

// Validate the order of the children
if (!$was_error && count($dtd_children)) {
    $children_list = implode(',', $children);
    $regex = $this->dtd->getPcreRegex($name);
    if (!preg_match('/^'.$regex.'$/', $children_list)) {
        $dtd_regex = $this->dtd->getDTDRegex($name);
        $this->_errors("In element <$name> the children list found:\n'$children_list', ".
                       "does not conform the DTD definition: '$dtd_regex'", $lineno);
    }
}

--

// # This figures out the PcreRegex

//$ch is a string of the allowed childs
$children = preg_split('/([^#a-zA-Z0-9_.-]+)/', $ch, -1, PREG_SPLIT_NO_EMPTY);
// check for parsed character data special case
if (in_array('#PCDATA', $children)) {
    $content = '#PCDATA';
    if (count($children) == 1) {
        $children = array();
        break;
    }
}
// $children is not used after this

$this->dtd['elements'][$elem_name]['child_validation_dtd_regex'] = $ch;
// Convert the DTD regex language into PCRE regex format
$reg = str_replace(',', ',?', $ch);
$reg = preg_replace('/([#a-zA-Z0-9_.-]+)/', '(,?\\0)', $reg);
$this->dtd['elements'][$elem_name]['child_validation_pcre_regex'] = $reg;

--

We can probably loot and steal all of this. The brilliance of this code is
amazing. I'm lovin' it!

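To see what the conversion in the excerpt produces, here is a small sketch
that applies the same two replacement steps to a content model and matches a
comma-separated children list against the result. The function names are
made up, and stripping whitespace first is an assumption on my part so that
the content models quoted in this document can be used directly.

```php
<?php
// Convert a DTD content model into a PCRE pattern, using the same
// str_replace/preg_replace steps as the XML_DTD excerpt.
function dtdToPcre(string $contentModel): string
{
    // Assumption: whitespace is removed before conversion.
    $ch = preg_replace('/\s+/', '', $contentModel);
    $reg = str_replace(',', ',?', $ch);
    return preg_replace('/([#a-zA-Z0-9_.-]+)/', '(,?\\0)', $reg);
}

// Check an ordered list of child element names against a content model.
function childrenMatch(string $contentModel, array $children): bool
{
    $regex = dtdToPcre($contentModel);
    return preg_match('/^' . $regex . '$/', implode(',', $children)) === 1;
}

$table = '(caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))';
$ok  = childrenMatch($table, ['caption', 'tr', 'tr']); // valid order
$bad = childrenMatch($table, ['tr', 'caption']);       // caption too late
```

For '(head, body)' the generated pattern is '((,?head),?(,?body))': every
element name becomes a group with an optional leading comma, which is what
lets the pattern run over the imploded children list.
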
So, the way we define these cases should work like this:

    class ChildDef with validateChildren($children_tags)

The function needs to parse into nodes, then into the regex array.
It can result in one of three actions: the removal of the entire parent node,
replacement of all of the original child tags with a new set of child
tags which it returns, or no changes. They shall be denoted as, respectively:

    Remove entire parent node    = false
    Replace child tags with this = array of tags
    No changes                   = true

If we remove the entire parent node, we must scroll back to the parent of the
parent.

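A minimal sketch of that contract for the two simple rule types follows.
ChildDef and validateChildren() are the names proposed above; the two
concrete class names are hypothetical, and for brevity the children are
plain tag names rather than tokens.

```php
<?php
// validateChildren() returns true (no changes), false (remove the whole
// parent node) or an array of child tags to use instead.
interface ChildDef
{
    public function validateChildren(array $childTags);
}

// Rule of the form ( el1 | el2 | el3 )*
class ChildDefOptional implements ChildDef
{
    protected $allowed;

    public function __construct(array $allowed)
    {
        $this->allowed = $allowed;
    }

    public function validateChildren(array $childTags)
    {
        // Zap whatever is not allowed; the node itself always survives.
        $kept = array_values(array_intersect($childTags, $this->allowed));
        return $kept === $childTags ? true : $kept;
    }
}

// Rule of the form ( el )+
class ChildDefRequired extends ChildDefOptional
{
    public function validateChildren(array $childTags)
    {
        // Following the text: any problem kills the entire node.
        return count($childTags) > 0
            && parent::validateChildren($childTags) === true;
    }
}
```

A gentler variant of the second class could zap the bad children and only
return false when nothing is left, but that goes beyond what is described
here.
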
--

Another few problems: EXCLUSIONS!

a
    must not contain other a elements.
pre
    must not contain the img, object, big, small, sub, or sup elements.
button
    must not contain the input, select, textarea, label, button, form, fieldset,
    iframe or isindex elements.
label
    must not contain other label elements.
form
    must not contain other form elements.

Normative exclusions straight from the horse's mouth. These are SGML style,
not XML style, so we need to modify the ruleset slightly. However, the DTD
may have done this for us already.

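Since SGML-style exclusions apply at any depth, not just to direct children,
one way to check them is to walk the tokens with a stack of open ancestors.
The sketch below only reports violations; the token format and function name
are assumptions.

```php
<?php
// Return the indexes of tags that sit inside an ancestor which excludes
// them. Assumes a well formed token array.
function findExcluded(array $tokens): array
{
    // The exclusion table from this document.
    $exclusions = [
        'a'      => ['a'],
        'pre'    => ['img', 'object', 'big', 'small', 'sub', 'sup'],
        'button' => ['input', 'select', 'textarea', 'label', 'button',
                     'form', 'fieldset', 'iframe', 'isindex'],
        'label'  => ['label'],
        'form'   => ['form'],
    ];
    $violations = [];
    $stack = []; // names of the currently open ancestors
    foreach ($tokens as $i => $token) {
        $type = $token['type'];
        if ($type === 'start' || $type === 'empty') {
            foreach ($stack as $ancestor) {
                $banned = $exclusions[$ancestor] ?? [];
                if (in_array($token['name'], $banned, true)) {
                    $violations[] = $i;
                    break;
                }
            }
        }
        if ($type === 'start') {
            $stack[] = $token['name'];
        } elseif ($type === 'end') {
            array_pop($stack);
        }
    }
    return $violations;
}
```

What to do with an offending element (delete it, or turn it into text) is a
separate policy question.
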
--

Also, what do we do with elements if they're not allowed somewhere? We need
some sort of default behavior. I reckon that we should be allowed to:

1. Delete the node
2. Translate it into text (not okay for areas that don't allow #PCDATA)
3. Move the node to somewhere where it is okay

What complicates the matter is that Firefox has the ability to construct
DOMs and render invalid nestings of elements (like <b><div>asdf</div></b>).
This means that behavior for stray pcdata in ul/ol is undefined. Stray
data in a table gets bubbled to the start of the table (assuming
that we actually custom-make the table child validation class).

So... I say delete the node when PCDATA isn't allowed (or the regex is too
complicated to determine where PCDATA could be inserted), and translate the node
to text when PCDATA is allowed.

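That default could be sketched as a small helper that either drops a
disallowed tag or turns it into a text token. The function name and token
format are hypothetical, and attributes are left out of the generated text
to keep the example short.

```php
<?php
// Decide what happens to a tag that is not allowed in its context.
// Returns null to delete the tag, or a text token that replaces it.
function handleDisallowedTag(array $token, bool $pcdataAllowed): ?array
{
    if (!$pcdataAllowed) {
        return null; // nowhere to put text, so delete
    }
    $text = $token['type'] === 'end'
        ? '</' . $token['name'] . '>'
        : '<' . $token['name'] . '>';
    // The generator escapes text on output, so this shows up literally.
    return ['type' => 'text', 'data' => $text];
}
```
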
--

Note that generic child definitions are not usually desirable: we should
implement custom handlers for each one that specify the stuff correctly.

2006-07-30 00:29:26 +00:00
--

<!--
    ins/del are allowed in block and inline content, but it is
    inappropriate to include block content within an ins element
    occurring in inline content.
-->

== STAGE 4 - check attributes ==

STATUS: N (not started)

While we're doing all this nesting hocus-pocus, attributes are also being
checked. The reason we need to do this together with the nesting stuff
is that if a REQUIRED attribute is not there, we might need to kill the tag
(or replace it with data). Fortunately, this is rare enough that we only have
to worry about it for certain things:

* ! bdo - dir > replace with span, preserve attributes
* basefont - size
* param - name
* applet - width, height
* ! img - src, alt > if only alt is missing, insert filename, else remove img
* map - id
* area - alt
* form - action
* optgroup - label
* textarea - rows, cols

As you can see, we would only remotely consider two of them for our simplified
tag set. But each has a different set of challenges. For the img tag, we'd
have to be careful about deleting it. If we do hit a snag, we can supply
a default "blank" image.

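The two marked cases (bdo and img) could be handled along these lines. Both
function names and the token format are assumptions; the img rule follows the
note in the list above (missing alt gets the filename, missing src removes
the tag).

```php
<?php
// img: src and alt are required. Returns null when the tag must go.
function fixImg(array $token): ?array
{
    $attr = $token['attr'] ?? [];
    if (!isset($attr['src']) || $attr['src'] === '') {
        return null; // no usable src: remove the img
    }
    if (!isset($attr['alt'])) {
        // Only alt is missing: fall back to the filename.
        $path = (string) parse_url($attr['src'], PHP_URL_PATH);
        $attr['alt'] = basename($path);
    }
    $token['attr'] = $attr;
    return $token;
}

// bdo: dir is required. Without it, demote the tag to a span and keep
// its other attributes.
function fixBdo(array $token): array
{
    if ($token['type'] === 'start' && !isset($token['attr']['dir'])) {
        $token['name'] = 'span';
    }
    return $token;
}
```

A real implementation would also have to rename the matching end tag when a
bdo is demoted, which this sketch leaves out.
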
So after that's all said and done, each of the different types of content
inside the attributes needs to be handled differently.

    ContentType(s)  [RFC2045]
    Charset(s)      [RFC2045]
    LanguageCode    [RFC3066] (NMTOKEN)
    Character       [XML][2.2] (a single character)
    Number          /^\d+$/
    LinkTypes       [HTML][6.12] <space>
    MediaDesc       [HTML][6.13] <comma>
    URI/UriList     [RFC2396] <space>
    Datetime        (ISO date format)
    Script          ...
    StyleSheet      [CSS] (complex)
    Text            CDATA
    FrameTarget     NMTOKEN
    Length          (pixel, percentage) (?:px suffix allowed?)
    MultiLength     (pixel, percentage, or relative)
    Pixels          (integer)
    // map attributes omitted
    ImgAlign        (top|middle|bottom|left|right)
    Color           #NNNNNN, #NNN or color name (translate it)
        Black   = #000000    Green  = #008000
        Silver  = #C0C0C0    Lime   = #00FF00
        Gray    = #808080    Olive  = #808000
        White   = #FFFFFF    Yellow = #FFFF00
        Maroon  = #800000    Navy   = #000080
        Red     = #FF0000    Blue   = #0000FF
        Purple  = #800080    Teal   = #008080
        Fuchsia = #FF00FF    Aqua   = #00FFFF
        // plus some directly in the spec

Everything else is either ID, or defined as a certain set of values.

Unless we use reflection (in which case we have to make sure the attribute
exists), we probably want to have a function like...

    validate($type, $value) where $type is like ContentType or Number

and then pass it to a switch.

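A sketch of such a switch for a handful of the simpler types is below. It
returns the (possibly translated) value, or false when the value is invalid;
that return convention is an assumption, as is rejecting the px suffix for
Length, which the table above leaves as an open question.

```php
<?php
// Validate an attribute value against a named type. Returns the value
// to use, or false if it is invalid. Only a few types are covered.
function validate(string $type, string $value)
{
    static $colors = [
        'black'  => '#000000', 'green'  => '#008000',
        'silver' => '#C0C0C0', 'lime'   => '#00FF00',
        'gray'   => '#808080', 'olive'  => '#808000',
        'white'  => '#FFFFFF', 'yellow' => '#FFFF00',
        'maroon' => '#800000', 'navy'   => '#000080',
        'red'    => '#FF0000', 'blue'   => '#0000FF',
        'purple' => '#800080', 'teal'   => '#008080',
        'fuchsia' => '#FF00FF', 'aqua'  => '#00FFFF',
    ];
    switch ($type) {
        case 'Number':
        case 'Pixels':
            // \z instead of $ so that a trailing newline is rejected too
            return preg_match('/^\d+\z/', $value) ? $value : false;
        case 'Length':
            // pixels or a percentage; "px" is rejected (assumption)
            return preg_match('/^\d+%?\z/', $value) ? $value : false;
        case 'Color':
            if (preg_match('/^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})\z/', $value)) {
                return $value;
            }
            // Translate color names to their hex form.
            return $colors[strtolower($value)] ?? false;
        default:
            return false; // unknown types are rejected outright
    }
}
```

Rejecting unknown types by default keeps the whitelist spirit: a type nobody
wrote a rule for never slips through.
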
The final problem is CSS. Get intimate with the syntax here:
http://www.w3.org/TR/CSS21/syndata.html and also note the "bad" CSS elements
that HTML_Safe defines to help determine a whitelist.

----

<!ENTITY % coreattrs
 "id          ID             #IMPLIED
  class       CDATA          #IMPLIED
  style       %StyleSheet;   #IMPLIED
  title       %Text;         #IMPLIED"
  >

<!ENTITY % i18n
 "lang        %LanguageCode; #IMPLIED
  xml:lang    %LanguageCode; #IMPLIED
  dir         (ltr|rtl)      #IMPLIED"
  >

<!ENTITY % attrs "%coreattrs; %i18n;">

----

These are the elements that only have %attrs:
ul, dl, dt, dd, address, span, em, strong, dfn, code, samp, kbd, var,
cite, abbr, acronym, sub, sup, tt, i, b, big, small, u, s, strike

These are the elements that only have %attrs and need an alignment transform:
div, p, h1, h2, h3, h4, h5, h6

2006-07-30 15:29:22 +00:00
----

Prepend style transformations, as CSS takes precedence.

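For the alignment transform this could look like the sketch below: the
generated declaration is put in front of any existing style, so that a
text-align the author wrote in CSS comes later and wins. The function name
and the set of accepted values are assumptions.

```php
<?php
// Turn a legacy align attribute into a text-align declaration that is
// prepended to the style attribute.
function transformAlign(array $attr): array
{
    if (!isset($attr['align'])) {
        return $attr;
    }
    $align = strtolower($attr['align']);
    unset($attr['align']); // the old attribute goes away either way
    if (in_array($align, ['left', 'center', 'right', 'justify'], true)) {
        // Prepended, so a declaration already in style takes precedence.
        $attr['style'] = 'text-align:' . $align . ';' . ($attr['style'] ?? '');
    }
    return $attr;
}
```
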
== PART 5 - stringify ==

Status: A+ (done completely!)

Should be fairly simple as long as we delegate to appropriate functions.
It's probably too much trouble to indent the stuff properly, so just output
stuff raw.

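As an illustration only (this is not the actual Generator, and the token
format is an assumption), raw output without any indentation can be as short
as this:

```php
<?php
// Turn a token array back into a string, escaping text and attribute
// values on the way out. No pretty-printing is attempted.
function generate(array $tokens): string
{
    $html = '';
    foreach ($tokens as $token) {
        switch ($token['type']) {
            case 'text':
                $html .= htmlspecialchars($token['data'], ENT_COMPAT, 'UTF-8');
                break;
            case 'end':
                $html .= '</' . $token['name'] . '>';
                break;
            default: // 'start' or 'empty'
                $html .= '<' . $token['name'];
                foreach ($token['attr'] ?? [] as $key => $value) {
                    $html .= ' ' . $key . '="'
                        . htmlspecialchars($value, ENT_COMPAT, 'UTF-8') . '"';
                }
                $html .= $token['type'] === 'empty' ? ' />' : '>';
        }
    }
    return $html;
}
```
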