HTML Purifier
by Edward Z. Yang

There are a number of ad hoc HTML filtering solutions out there on the web
(some examples including HTML_Safe, kses and SafeHtmlChecker.class.php) that
claim to filter HTML properly, preventing malicious JavaScript and
layout-breaking HTML from getting through the parser. None of them, however,
demonstrates a thorough knowledge of either the DTD that defines HTML or the
caveats of HTML that cannot be expressed by a DTD. Configurable filters (such
as kses or PHP's built-in strip_tags() function) have trouble validating the
contents of attributes and can be subject to security attacks due to poor
configuration. Other filters take the naive approach of blacklisting known
threats and tags, failing to account for the introduction of new technologies,
new tags, new attributes or quirky browser behavior.

HTML Purifier takes a different approach, one that doesn't use
specification-ignorant regexes or narrow blacklists. It decomposes the whole
document into tokens and rigorously processes them by removing non-whitelisted
elements, transforming bad-practice tags like <font> into <span>, properly
checking the nesting of tags and their children, and validating all attributes
according to their RFCs.

To my knowledge, there is nothing like this on the web yet. Not even MediaWiki,
which allows an amazingly diverse mix of HTML and wikitext in its documents,
gets all the nesting quirks right. Existing solutions hope that no JavaScript
will slip through, but either do not attempt to ensure that the resulting
output is valid XHTML or send the HTML through a draconian XML parser (and yet
still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a>
tags from being nested within each other).

This document seeks to detail the inner workings of HTML Purifier. The first
draft was drawn up after two rough code sketches and the implementation of a
forgiving lexer. You may also be interested in the unit tests located in the
tests/ folder, which serve as a living document of exactly how the filter
deals with malformed input.

In summary:

1. Parse document into an array of tag and text tokens (Lexer)
2. Remove all elements not on whitelist and transform certain other elements
   into acceptable forms (e.g. <font>)
3. Make document well formed while helpfully taking into account certain quirks,
   such as the fact that <p> tags traditionally are closed by other block-level
   elements.
4. Run through all nodes and check children for proper order (especially
   important for tables).
5. Validate attributes according to more restrictive definitions based on the
   RFCs.
6. Translate back into a string. (Generator)

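Steps 1, 2 and 6 of the summary can be sketched in a few lines. This is only
an illustration in Python (HTML Purifier itself is PHP): the whitelist, the
transform table and every function name here are hypothetical, the lexer is
borrowed from Python's standard library, and steps 3 to 5 are skipped, so all
attributes are simply dropped rather than validated.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist and transform table, chosen only for illustration.
ALLOWED = {"p", "b", "i", "em", "strong", "span"}
TRANSFORMS = {"font": "span"}


class Lexer(HTMLParser):
    """Step 1: turn a document into a flat list of (kind, value) tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are discarded: validating them is step 5, not shown here.
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def lex(html):
    lexer = Lexer()
    lexer.feed(html)
    lexer.close()  # flush any text still buffered by the parser
    return lexer.tokens


def filter_tokens(tokens):
    """Step 2: rename transformable tags, drop tags not on the whitelist."""
    kept = []
    for kind, value in tokens:
        if kind != "text":
            value = TRANSFORMS.get(value, value)
            if value not in ALLOWED:
                continue  # the tag goes away; its text content is kept
        kept.append((kind, value))
    return kept


def generate(tokens):
    """Step 6: serialize the tokens back into a string."""
    out = []
    for kind, value in tokens:
        if kind == "start":
            out.append("<%s>" % value)
        elif kind == "end":
            out.append("</%s>" % value)
        else:
            out.append(escape(value, quote=False))
    return "".join(out)


def purify(html):
    return generate(filter_tokens(lex(html)))
```

With these tables, purify('<font color="red">Hi</font><script>alert(1)</script>')
gives '<span>Hi</span>alert(1)': the font tag becomes a span, the script tag is
removed, and what used to be script source is left behind as harmless text.
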
HTML Purifier is best suited for documents that require a rich array of
HTML tags. Things like blog comments are, in all likelihood, most appropriately
written in an extremely restrictive set of markup that doesn't require
all this functionality (or not written in HTML at all).

The remaining sections of this document are waiting to be moved into their
associated classes.

== STAGE 4 - check attributes ==

STATUS: F (currently implementing core/i18n)

While we're doing all this nesting hocus-pocus, attributes are also being
checked. This has to happen together with the nesting checks because, if a
REQUIRED attribute is missing, we might need to kill the tag (or replace it
with data). Fortunately, this is rare enough that we only have to worry about
it for certain things:

* ! bdo - dir > replace with span, preserve attributes
* ! img - src, alt > if only alt is missing, insert filename, else remove img
* basefont - size
* param - name
* applet - width, height
* map - id
* area - alt
* form - action
* optgroup - label
* textarea - rows, cols

As you can see, there are only two of them (marked with !) that we would
remotely consider for our simplified tag set. But each has a different set of
challenges. For the img tag, we'd have to be careful about deleting it. If we
do hit a snag, we can supply a default "blank" image.

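The two flagged cases could be handled roughly as below. This is a Python
sketch under assumptions, not HTML Purifier's code: the function name and its
return convention (None meaning "remove the tag") are made up, "filename" is
taken to mean the last path segment of src, and the default "blank" image
fallback is not implemented.

```python
import posixpath
from urllib.parse import urlsplit


def fix_required_attributes(tag, attrs):
    """Return (tag, attrs) after repair, or None if the tag must be removed."""
    attrs = dict(attrs)  # work on a copy, leave the caller's dict alone
    if tag == "bdo" and "dir" not in attrs:
        # bdo without dir: replace with span, preserve the other attributes.
        return "span", attrs
    if tag == "img":
        if not attrs.get("src"):
            return None  # src is missing, so the img is removed
        if "alt" not in attrs:
            # Only alt is missing: insert the filename taken from src.
            attrs["alt"] = posixpath.basename(urlsplit(attrs["src"]).path)
    return tag, attrs
```

For example, an img whose only attribute is src="http://example.com/pics/cat.png"
comes back with alt="cat.png" added, while an img with no src returns None.
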
Once that's all said and done, each of the different types of content inside
the attributes needs to be handled differently.

ContentType(s)  [RFC2045]
Charset(s)      [RFC2045]
LanguageCode    [RFC3066] (NMTOKEN)
Character       [XML][2.2] (a single character)
Number          /^\d+$/
LinkTypes       [HTML][6.12] <space>
MediaDesc       [HTML][6.13] <comma>
URI/UriList     [RFC2396] <space>
Datetime        (ISO date format)
Script          ...
StyleSheet      [CSS] (complex)
Text            CDATA
FrameTarget     NMTOKEN
Length          (pixel, percentage) (?:px suffix allowed?)
MultiLength     (pixel, percentage, or relative)
Pixels          (integer)
// map attributes omitted
ImgAlign        (top|middle|bottom|left|right)
Color           #NNNNNN, #NNN or color name (translate it)
    Black   = #000000    Green  = #008000
    Silver  = #C0C0C0    Lime   = #00FF00
    Gray    = #808080    Olive  = #808000
    White   = #FFFFFF    Yellow = #FFFF00
    Maroon  = #800000    Navy   = #000080
    Red     = #FF0000    Blue   = #0000FF
    Purple  = #800080    Teal   = #008080
    Fuchsia = #FF00FF    Aqua   = #00FFFF
    // plus some directly in the spec

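The Color entry is the one type in this list that needs a translation step, so
here is how it might look. A Python sketch only: the function name is
hypothetical, and upper-casing hex values and returning None for rejects are
my own choices, not something the list above prescribes.

```python
import re

# The sixteen color names from the table above, keyed in lower case.
COLOR_NAMES = {
    "black": "#000000", "green": "#008000",
    "silver": "#C0C0C0", "lime": "#00FF00",
    "gray": "#808080", "olive": "#808000",
    "white": "#FFFFFF", "yellow": "#FFFF00",
    "maroon": "#800000", "navy": "#000080",
    "red": "#FF0000", "blue": "#0000FF",
    "purple": "#800080", "teal": "#008080",
    "fuchsia": "#FF00FF", "aqua": "#00FFFF",
}

# "#NNN" or "#NNNNNN", where N is a hexadecimal digit.
HEX_COLOR = re.compile(r"#(?:[0-9A-Fa-f]{3}|[0-9A-Fa-f]{6})")


def validate_color(value):
    """Return the color as a hex string, or None if it is not acceptable."""
    value = value.strip()
    named = COLOR_NAMES.get(value.lower())
    if named is not None:
        return named  # translate the name into its hex form
    if HEX_COLOR.fullmatch(value):
        return value.upper()
    return None
```

So "Fuchsia" becomes "#FF00FF", "#0f0" becomes "#0F0", and anything else
(including five-digit hex values) is rejected.
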
Everything else is either ID, or defined as a certain set of values.

Unless we use reflection (in which case we have to make sure the attribute
exists), we probably want to have a function like...

validate($type, $value) where $type is like ContentType or Number

and then pass it to a switch.

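A minimal sketch of that validate() idea follows, in Python for illustration.
A dictionary of functions stands in for the switch, only four of the types are
covered, and the helper names and the None-for-invalid convention are
assumptions. Since the px suffix question is still open above, the Length
check rejects it here.

```python
import re


def _number(value):
    # Same intent as /^\d+$/; fullmatch also refuses a trailing newline,
    # which "$" in a Python regex would let through.
    return value if re.fullmatch(r"[0-9]+", value) else None


def _img_align(value):
    value = value.lower()
    return value if value in {"top", "middle", "bottom", "left", "right"} else None


def _length(value):
    # A pixel count or a percentage; no unit suffixes are accepted.
    return value if re.fullmatch(r"[0-9]+%?", value) else None


# One entry per attribute type; this dict plays the role of the switch.
VALIDATORS = {
    "Number": _number,
    "Pixels": _number,
    "ImgAlign": _img_align,
    "Length": _length,
}


def validate(type_name, value):
    """Return the cleaned value, or None if invalid or the type is unknown."""
    validator = VALIDATORS.get(type_name)
    if validator is None:
        return None  # unknown types fail closed
    return validator(value.strip())
```

Failing closed on unknown types means a type nobody has written a check for
(Script, say) can never let a value through by accident.
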
The final problem is CSS. Get intimate with the syntax here:
http://www.w3.org/TR/CSS21/syndata.html and also note the "bad" CSS elements
that HTML_Safe defines to help determine a whitelist.

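To make the whitelist idea concrete, here is a deliberately crude Python
sketch. The property list is invented for the example (it is not derived from
HTML_Safe), and splitting on ";" is not a real CSS parser: strings, url(),
comments and escapes are not understood, so any value containing characters
outside a small safe set is simply thrown away.

```python
import re

# Hypothetical property whitelist, for illustration only.
ALLOWED_PROPERTIES = {"color", "font-weight", "text-align", "text-decoration"}

# Letters, digits and a few harmless symbols: no parentheses, quotes,
# backslashes, slashes or colons can appear in an accepted value.
SAFE_VALUE = re.compile(r"[A-Za-z0-9#%. -]+")


def filter_style(style):
    """Keep only whitelisted declarations from a style attribute value."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name, value = name.strip().lower(), value.strip()
        if not sep or name not in ALLOWED_PROPERTIES:
            continue  # not a declaration, or property not on the whitelist
        if not SAFE_VALUE.fullmatch(value):
            continue  # empty or suspicious value
        kept.append("%s: %s" % (name, value))
    return "; ".join(kept)
```

Given "color: red; behavior: url(evil.htc); text-align:center" this returns
"color: red; text-align: center". A production filter would need to tokenize
according to the CSS 2.1 grammar linked above instead of guessing.
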
----

<!ENTITY % coreattrs
 "id          ID             #IMPLIED
  class       CDATA          #IMPLIED
  style       %StyleSheet;   #IMPLIED
  title       %Text;         #IMPLIED"
  >

<!ENTITY % i18n
 "lang        %LanguageCode; #IMPLIED
  xml:lang    %LanguageCode; #IMPLIED
  dir         (ltr|rtl)      #IMPLIED"
  >

<!ENTITY % attrs "%coreattrs; %i18n;">

----

These are the elements that only have %attrs:
ul, dl, dt, dd, address, span, em, strong, dfn, code, samp, kbd, var,
cite, abbr, acronym, sub, sup, tt, i, b, big, small, u, s, strike

These are the elements that only have %attrs and need an alignment transform:
div, p, h1, h2, h3, h4, h5, h6

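The entity definitions and the two element lists translate fairly directly
into lookup tables. The Python sketch below is one possible shape for them;
the names are hypothetical, accepting align on the second group (so that it
can be transformed later) is my reading of the notes, and only dir has its
value checked because it is the one attribute with an enumerated set.

```python
CORE_ATTRS = ("id", "class", "style", "title")   # %coreattrs
I18N_ATTRS = ("lang", "xml:lang", "dir")         # %i18n
ATTRS = CORE_ATTRS + I18N_ATTRS                  # %attrs

ATTRS_ONLY = (
    "ul dl dt dd address span em strong dfn code samp kbd var "
    "cite abbr acronym sub sup tt i b big small u s strike"
).split()
ALIGNABLE = "div p h1 h2 h3 h4 h5 h6".split()

# Element name -> set of attribute names it may carry.
ALLOWED_ATTRIBUTES = {name: frozenset(ATTRS) for name in ATTRS_ONLY}
ALLOWED_ATTRIBUTES.update(
    {name: frozenset(ATTRS + ("align",)) for name in ALIGNABLE}
)


def filter_attributes(tag, attrs):
    """Drop attributes the element may not have; unknown elements get none."""
    allowed = ALLOWED_ATTRIBUTES.get(tag, frozenset())
    kept = {name: value for name, value in attrs.items() if name in allowed}
    # dir is declared as (ltr|rtl), so anything else is discarded.
    if "dir" in kept and kept["dir"].lower() not in ("ltr", "rtl"):
        del kept["dir"]
    return kept
```

For instance, filter_attributes("em", {"class": "x", "onclick": "evil()"})
keeps only the class attribute.
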
----

When an attribute is transformed into CSS, prepend the generated declarations
to the existing style attribute, as CSS written by the author takes precedence.

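The prepending rule can be illustrated with the alignment transform. This is a
Python sketch with a hypothetical function name; mapping align to text-align
is an approximation chosen for the example, and the set of accepted values is
the HTML 4 one for block elements.

```python
def align_to_style(attrs):
    """Move a deprecated align attribute into the style attribute."""
    attrs = dict(attrs)  # work on a copy
    align = attrs.pop("align", "").strip().lower()
    if align in ("left", "right", "center", "justify"):
        generated = "text-align: %s;" % align
        existing = attrs.get("style", "").strip()
        # Prepend: a later declaration of the same property wins in CSS, so
        # a text-align already present in style still overrides ours.
        attrs["style"] = (generated + " " + existing).strip()
    return attrs
```

Applied to {"align": "right", "style": "text-align: left"}, this yields a style
of "text-align: right; text-align: left", which a browser resolves to left,
exactly as it would have with the original markup.
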