mirror of https://github.com/ezyang/htmlpurifier.git synced 2025-01-24 06:11:52 +00:00
htmlpurifier/docs/spec.txt
Edward Z. Yang f0deae1fc0 Update documentation.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@147 48356398-32a2-884e-a903-53898d9a118a
2006-08-03 01:37:28 +00:00


HTML Purifier
by Edward Z. Yang
There are a number of ad hoc HTML filtering solutions out there on the web
(some examples including HTML_Safe, kses and SafeHtmlChecker.class.php) that
claim to filter HTML properly, preventing malicious JavaScript and
layout-breaking HTML from getting through the parser. None of them, however,
demonstrates a thorough knowledge of either the DTD that defines the HTML
or the caveats of HTML that cannot be expressed by a DTD. Configurable
filters (such as kses or PHP's built-in strip_tags() function) have trouble
validating the contents of attributes and can be subject to security attacks
due to poor configuration. Other filters take the naive approach of
blacklisting known threats and tags, failing to account for the introduction
of new technologies, new tags, new attributes or quirky browser behavior.
However, HTML Purifier takes a different approach, one that doesn't use
specification-ignorant regexes or narrow blacklists. HTML Purifier will
decompose the whole document into tokens, and rigorously process the tokens by:
removing non-whitelisted elements, transforming bad practice tags like <font>
into <span>, properly checking the nesting of tags and their children and
validating all attributes according to their RFCs.
To my knowledge, there is nothing like this on the web yet. Not even MediaWiki,
which allows an amazingly diverse mix of HTML and wikitext in its documents,
gets all the nesting quirks right. Existing solutions hope that no JavaScript
will slip through, but either do not attempt to ensure that the resulting
output is valid XHTML or send the HTML through a draconian XML parser (and yet
still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a>
tags from being nested within each other).
This document seeks to detail the inner workings of HTML Purifier. The first
draft was drawn up after two rough code sketches and the implementation of a
forgiving lexer. You may also be interested in the unit tests located in the
tests/ folder, which provide a living document on how exactly the filter deals
with malformed input.
In summary:
1. Parse document into an array of tag and text tokens (Lexer)
2. Remove all elements not on whitelist and transform certain other elements
into acceptable forms (e.g. <font>)
3. Make document well-formed while helpfully taking into account certain quirks,
such as the fact that <p> tags traditionally are closed by other block-level
elements.
4. Run through all nodes and check children for proper order (especially
important for tables).
5. Validate attributes according to more restrictive definitions based on the
RFCs.
6. Translate back into a string. (Generator)
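As a rough illustration of how stages 1, 2 and 6 fit together, here is a
minimal sketch in Python (not the library's PHP). The whitelist, the transform
table and every function name are assumptions made up for this example, and
Python's standard html.parser stands in for the forgiving lexer. Stages 3 to 5
are left out, so all attributes are simply discarded.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist and transform table, for illustration only.
ALLOWED = frozenset({"p", "b", "i", "em", "strong", "span"})
TRANSFORMS = {"font": "span"}


class Lexer(HTMLParser):
    """Stage 1: turn markup into a flat list of (kind, value) tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped: attribute validation is not sketched here.
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def tokenize(markup):
    lexer = Lexer()
    lexer.feed(markup)
    lexer.close()  # flush any buffered trailing text
    return lexer.tokens


def filter_elements(tokens):
    """Stage 2: rename transformable tags, drop tags not on the whitelist."""
    kept = []
    for kind, value in tokens:
        if kind != "text":
            value = TRANSFORMS.get(value, value)
            if value not in ALLOWED:
                continue  # the tag goes away, its text content stays
        kept.append((kind, value))
    return kept


def generate(tokens):
    """Stage 6: serialize the tokens back into a string."""
    parts = []
    for kind, value in tokens:
        if kind == "text":
            parts.append(escape(value, quote=False))
        elif kind == "start":
            parts.append(f"<{value}>")
        else:
            parts.append(f"</{value}>")
    return "".join(parts)


def purify_sketch(markup):
    return generate(filter_elements(tokenize(markup)))
```

Calling purify_sketch() on a fragment containing <font> and an unknown tag
renames the former to <span> and strips the latter while keeping its text.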
HTML Purifier is best suited for documents that require a rich array of
HTML tags. Things like blog comments are, in all likelihood, most appropriately
written in an extremely restrictive set of markup that doesn't require
all this functionality (or not written in HTML at all).
The rest of this document is pending a move into the associated classes.
== STAGE 4 - check attributes ==
STATUS: F (currently implementing core/i18n)
While we're doing all this nesting hocus-pocus, attributes are also being
checked. This needs to be done together with the nesting work because,
if a REQUIRED attribute is missing, we might need to kill the tag (or
replace it with data). Fortunately, this is rare enough that we only have
to worry about it for certain things:
* ! bdo - dir > replace with span, preserve attributes
* ! img - src, alt > if only alt is missing, insert filename, else remove img
* basefont - size
* param - name
* applet - width, height
* map - id
* area - alt
* form - action
* optgroup - label
* textarea - rows, cols
As you can see, we would only remotely consider two of them (marked with !) for our simplified
tag set. But each has a different set of challenges. For the img tag, we'd
have to be careful about deleting it. If we do hit a snag, we can supply
a default "blank" image.
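The two marked cases can be sketched as follows. This is a Python illustration
under assumptions, not the library's code: the function name and its return
convention (None meaning "remove the tag") are invented here, and "filename"
is taken to mean the last path segment of the src URL.

```python
import posixpath
from urllib.parse import urlsplit


def fix_required_attributes(tag, attrs):
    """Return (tag, attrs) after repairs, or None if the tag must be removed."""
    attrs = dict(attrs)  # work on a copy so the caller's dict is untouched
    if tag == "bdo" and "dir" not in attrs:
        # bdo without dir: replace with span, preserve the other attributes
        return "span", attrs
    if tag == "img":
        if not attrs.get("src"):
            # src is missing: nothing sensible to show, remove the img
            return None
        if "alt" not in attrs:
            # only alt is missing: insert the filename taken from src
            attrs["alt"] = posixpath.basename(urlsplit(attrs["src"]).path)
    return tag, attrs
```

A caller that prefers the default "blank" image over removal could substitute
it wherever this sketch returns None.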
So after that's all said and done, each of the different types of content
inside the attributes needs to be handled differently.
    ContentType(s)  [RFC2045]
    Charset(s)      [RFC2045]
    LanguageCode    [RFC3066] (NMTOKEN)
    Character       [XML][2.2] (a single character)
    Number          /^\d+$/
    LinkTypes       [HTML][6.12] <space>
    MediaDesc       [HTML][6.13] <comma>
    URI/UriList     [RFC2396] <space>
    Datetime        (ISO date format)
    Script          ...
    StyleSheet      [CSS] (complex)
    Text            CDATA
    FrameTarget     NMTOKEN
    Length          (pixel, percentage) (?:px suffix allowed?)
    MultiLength     (pixel, percentage, or relative)
    Pixels          (integer)
    // map attributes omitted
    ImgAlign        (top|middle|bottom|left|right)
    Color           #NNNNNN, #NNN or color name (translate it to hex)
        Black   = #000000    Green   = #008000
        Silver  = #C0C0C0    Lime    = #00FF00
        Gray    = #808080    Olive   = #808000
        White   = #FFFFFF    Yellow  = #FFFF00
        Maroon  = #800000    Navy    = #000080
        Red     = #FF0000    Blue    = #0000FF
        Purple  = #800080    Teal    = #008080
        Fuchsia = #FF00FF    Aqua    = #00FFFF
        // plus some directly in the spec
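The Color type could be handled along these lines. This is a Python sketch
with an invented function name; it assumes names are matched
case-insensitively and that hex values are passed through unchanged, neither
of which is stated above.

```python
import re

# The sixteen color names from the table above, keyed in lowercase.
COLOR_NAMES = {
    "black": "#000000", "green": "#008000",
    "silver": "#C0C0C0", "lime": "#00FF00",
    "gray": "#808080", "olive": "#808000",
    "white": "#FFFFFF", "yellow": "#FFFF00",
    "maroon": "#800000", "navy": "#000080",
    "red": "#FF0000", "blue": "#0000FF",
    "purple": "#800080", "teal": "#008080",
    "fuchsia": "#FF00FF", "aqua": "#00FFFF",
}

# Matches #NNN or #NNNNNN, where each N is a hexadecimal digit.
HEX_COLOR = re.compile(r"#(?:[0-9A-Fa-f]{3}|[0-9A-Fa-f]{6})")


def normalize_color(value):
    """Return a hex color for a valid Color value, or None if it is invalid."""
    value = value.strip()
    named = COLOR_NAMES.get(value.lower())
    if named is not None:
        return named  # translate the name into its hex form
    if HEX_COLOR.fullmatch(value):
        return value
    return None
```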
Everything else is either ID, or defined as a certain set of values.
Unless we use reflection (in which case we have to make sure the attribute
exists), we probably want to have a function like...
    validate($type, $value) where $type is something like ContentType or Number
and then pass it to a switch.
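One possible shape for such a function, sketched in Python rather than PHP. A
dictionary of validator callables stands in for the switch, only a handful of
the types listed above are covered, and the choices to reject unknown types
and to return None for invalid values are assumptions of this sketch.

```python
import re

IMG_ALIGN = frozenset({"top", "middle", "bottom", "left", "right"})


def _number(value):
    # [0-9] rather than \d, because \d also matches non-ASCII digits in Python
    return value if re.fullmatch(r"[0-9]+", value) else None


def _length(value):
    # a pixel count or a percentage
    return value if re.fullmatch(r"[0-9]+%?", value) else None


def _img_align(value):
    value = value.lower()
    return value if value in IMG_ALIGN else None


# One entry per attribute type; this table plays the role of the switch.
VALIDATORS = {
    "Number": _number,
    "Pixels": _number,
    "Length": _length,
    "ImgAlign": _img_align,
}


def validate(type_name, value):
    """Return the cleaned value, or None if it is invalid for type_name."""
    validator = VALIDATORS.get(type_name)
    if validator is None:
        return None  # unknown type: reject rather than pass it through
    return validator(value.strip())
```

Keeping the validators in a table means a new type is one more entry rather
than one more branch.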
The final problem is CSS. Get intimate with the syntax here:
http://www.w3.org/TR/CSS21/syndata.html and also note the "bad" CSS elements
that HTML_Safe defines to help determine a whitelist.
----
<!ENTITY % coreattrs
"id ID #IMPLIED
class CDATA #IMPLIED
style %StyleSheet; #IMPLIED
title %Text; #IMPLIED"
>
<!ENTITY % i18n
"lang %LanguageCode; #IMPLIED
xml:lang %LanguageCode; #IMPLIED
dir (ltr|rtl) #IMPLIED"
>
<!ENTITY % attrs "%coreattrs; %i18n;">
----
These are the elements that only have %attrs:
ul, dl, dt, dd, address, span, em, strong, dfn, code, samp, kbd, var,
cite, abbr, acronym, sub, sup, tt, i, b, big, small, u, s, strike
These are the elements that only have %attrs and need an alignment transform:
div, p, h1, h2, h3, h4, h5, h6
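The entity declarations and the two element lists above map naturally onto
plain lookup tables. The following Python sketch shows one way to do it. The
variable and function names are invented for illustration, and the attribute
types are stored as plain strings naming the types from the earlier list.

```python
# %coreattrs and %i18n, as attribute name -> type
COREATTRS = {"id": "ID", "class": "CDATA", "style": "StyleSheet", "title": "Text"}
I18N = {"lang": "LanguageCode", "xml:lang": "LanguageCode", "dir": ("ltr", "rtl")}

# %attrs is simply both collections combined
ATTRS = {**COREATTRS, **I18N}

# Elements that only have %attrs
ATTRS_ONLY = frozenset(
    "ul dl dt dd address span em strong dfn code samp kbd var "
    "cite abbr acronym sub sup tt i b big small u s strike".split()
)

# Elements that only have %attrs and need an alignment transform
ATTRS_WITH_ALIGN = frozenset("div p h1 h2 h3 h4 h5 h6".split())


def attribute_types(element):
    """Return {attribute: type} for a covered element, else None."""
    if element in ATTRS_ONLY or element in ATTRS_WITH_ALIGN:
        return dict(ATTRS)  # a copy, so callers cannot mutate the table
    return None


def needs_align_transform(element):
    return element in ATTRS_WITH_ALIGN
```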
----
Prepend style transformations, as CSS takes precedence.
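A minimal Python sketch of that rule for the align attribute follows. The
function name is invented, and mapping align onto the CSS text-align property
is an assumption of this example. The generated declaration goes in front of
any existing style value, so a later declaration in the author's own CSS still
wins.

```python
# Values that align may take on div, p and h1-h6 in HTML 4
ALIGN_VALUES = frozenset({"left", "center", "right", "justify"})


def align_to_style(attrs):
    """Return a copy of attrs with align folded into the style attribute."""
    attrs = dict(attrs)
    align = attrs.pop("align", None)  # align never survives the transform
    if align is None:
        return attrs
    align = align.strip().lower()
    if align not in ALIGN_VALUES:
        return attrs  # invalid value: the attribute is simply dropped
    declaration = f"text-align:{align};"
    # Prepend, so declarations already present in style come later and win.
    attrs["style"] = declaration + attrs.get("style", "")
    return attrs
```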