HTML Purifier
by Edward Z. Yang

== Introduction ==

There are a number of ad hoc HTML filtering solutions out there on the web (some examples including HTML_Safe, kses and SafeHtmlChecker.class.php) that claim to filter HTML properly, preventing malicious JavaScript and layout-breaking HTML from getting through the parser. None of them, however, demonstrates a thorough knowledge of either the DTD that defines the HTML or the caveats of HTML that cannot be expressed by a DTD. Configurable filters (such as kses or PHP's built-in strip_tags() function) have trouble validating the contents of attributes and can be subject to security attacks due to poor configuration. Other filters take the naive approach of blacklisting known threats and tags, failing to account for the introduction of new technologies, new tags, new attributes or quirky browser behavior.

HTML Purifier takes a different approach, one that doesn't use specification-ignorant regexes or narrow blacklists. HTML Purifier will decompose the whole document into tokens, and rigorously process the tokens by: removing non-whitelisted elements, transforming bad practice tags like <font> into <span>, properly checking the nesting of tags and their children, and validating all attributes according to their RFCs.

To my knowledge, there is nothing like this on the web yet. Not even MediaWiki, which allows an amazingly diverse mix of HTML and wikitext in its documents, gets all the nesting quirks right. Existing solutions hope that no JavaScript will slip through, but either do not attempt to ensure that the resulting output is valid XHTML or send the HTML through a draconian XML parser (and yet still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a> tags from being nested within each other).

This document seeks to detail the inner workings of HTML Purifier. The first draft was drawn up after two rough code sketches and the implementation of a forgiving lexer.
You may also be interested in the unit tests located in the tests/ folder, which provide a living document on exactly how the filter deals with malformed input.

In summary:

1. Parse the document into an array of tag and text tokens.
2. Remove all elements not on the whitelist and transform certain other elements into acceptable forms (e.g. <font>).
3. Make the document well formed while taking into account certain quirks, such as the fact that <p> tags are traditionally closed by other block-level elements.
4. Run through all nodes and check children for proper order (especially important for tables).
5. Validate attributes according to more restrictive definitions based on the RFCs.
6. Translate back into a string.

== STAGE 1 - parsing ==

: Status - largely FINISHED with a few quirks to work out

We've got two options for this: HTMLSax or my MarkupLexer. Hopefully, we can make the two interfaces compatible. This means that we need a lot of little classes:

* StartTag(name, attributes) is openHandler
* EndTag(name) is closeHandler
* EmptyTag(name, attributes) is openHandler (is in array of empties)
* Data(text) is dataHandler
* Comment(text) is escapeHandler (has leading -)
* CharacterData(text) is escapeHandler (has leading [)

Ignorable/not being implemented (although we probably want to output them raw):

* ProcessingInstructions(text) is piHandler
* JavaOrASPInstructions(text) is jaspHandler

All are prefixed with MF (Markup Fragment). We'll make 'em all immutable value objects.

== STAGE 2 - remove foreign elements ==

At this point, the parser needs to start knowing about the DTD. Since we hold everything in an associative $info array, if an element's key is set, the element is valid and we can include it. Otherwise zap it, or attempt to figure out what the author meant: an unrecognized tag may just be a misspelling of a valid one. This feature may be too sugary, though.

While we're at it, we can change the Processing Instructions and Java/ASP Instructions into data blocks, scratch comment blocks, and change CharacterData into Data (although I don't see why we can't do that at the start).

One last thing: the remove-foreign-elements stage also has to do the element transformations, from FONT to SPAN, etc.

== STAGE 3 - make well formed ==

Now we step through the whole thing and correct nesting issues. Most of the time, it's making sure the tags match up, but there's some trickery going on for HTML's quirks.
They are:

* Set of tags that close P
    'address', 'blockquote', 'dd', 'dir', 'div', 'dl', 'dt', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'ol', 'p', 'pre', 'table', 'ul'
* Li closes li
* more?

We also want to do translations, like from FONT to SPAN with STYLE.

== STAGE 4 - check nesting ==

We know that the document is now well formed. The tokenizer should now take things in nodes: when you hit a start tag, keep on going until you get its ending tag, and then handle everything inside there. Fortunately, no fancy recursion is necessary, as going to the next node is as simple as scrolling to the next start tag.

Suppose we have a node and encounter a problem with one of its children. Depending on the complexity of the rule, we will either delete the children, or delete the entire node itself.

The simplest type of rule is zero or more valid elements, denoted like:

    ( el1 | el2 | el3 )*

The next simplest is with one or more valid elements:

    ( li )+

And then you have complex cases:

    table   (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))
    map     ((%block; | form | %misc;)+ | area+)
    html    (head, body)
    head    (%head.misc;, ((title, %head.misc;, (base, %head.misc;)?) | (base, %head.misc;, (title, %head.misc;))))

Each of these has to be dealt with. Case 1 is a joy, because you can zap as many children as you want, but you'll never actually have to kill the node. Cases 2 and 3 need the entire node to be killed if you have a problem. This can be problematic, as the missing node might cause its parent node to now be incorrect. Granted, it's unlikely, and I'm fairly certain that HTML, let alone the simplified set I'm allowing, won't have this problem, but it's worth checking for.

The way, I suppose, one would check for it is: whenever a node is removed, scroll to its parent's start tag and re-evaluate the parent. Make sure you're able to do that with minimal code repetition.

The most complex case can probably be done by using some fancy regular expressions and transformations.
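The two simple rule types above can be sketched roughly as follows. This is only an illustration: the function names validateOptional() and validateRequired() are hypothetical, children are represented as plain tag-name strings, and the sketch assumes a lenient reading of the "one or more" rule in which invalid children are dropped first and the parent is removed only when nothing valid remains (the stricter alternative described above would remove the parent on any problem).

```php
<?php
// Hypothetical helpers for illustration; not HTML Purifier's actual API.

/**
 * Rule "( el1 | el2 | el3 )*": drop children that are not allowed.
 * The parent node always survives, even if no children remain.
 */
function validateOptional(array $children, array $allowed): array
{
    $kept = array_filter($children, function ($name) use ($allowed) {
        return in_array($name, $allowed, true);
    });
    // Re-index so the result is a plain list again.
    return array_values($kept);
}

/**
 * Rule "( li )+": same filtering, but the parent node must be removed
 * (signalled by returning false) when no valid child is left.
 *
 * @return array|false
 */
function validateRequired(array $children, array $allowed)
{
    $kept = validateOptional($children, $allowed);
    return count($kept) > 0 ? $kept : false;
}

// Usage: tag names of the children found inside a node.
$inline = validateOptional(['b', 'blink', 'i'], ['b', 'i', 'em']); // ['b', 'i']
$items  = validateRequired(['li', 'p', 'li'], ['li']);             // ['li', 'li']
$empty  = validateRequired(['p'], ['li']);                         // false
```

When the second function returns false, the caller would remove the node and then re-evaluate its parent, as discussed above.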
However, it doesn't seem right that, say, a stray element in a table can cause the entire table to be removed. Fixing it, however, may be too difficult.

So... here's the interesting code:

--

    // Validate the order of the children
    if (!$was_error && count($dtd_children)) {
        $children_list = implode(',', $children);
        $regex = $this->dtd->getPcreRegex($name);
        if (!preg_match('/^'.$regex.'$/', $children_list)) {
            $dtd_regex = $this->dtd->getDTDRegex($name);
            $this->_errors("In element <$name> the children list found:\n'$children_list', ".
                           "does not conform the DTD definition: '$dtd_regex'", $lineno);
        }
    }

--

    //$ch is a string of the allowed childs
    $children = preg_split('/([^#a-zA-Z0-9_.-]+)/', $ch, -1, PREG_SPLIT_NO_EMPTY);
    // check for parsed character data special case
    if (in_array('#PCDATA', $children)) {
        $content = '#PCDATA';
        if (count($children) == 1) {
            $children = array();
            break;
        }
    }
    // $children is not used after this
    $this->dtd['elements'][$elem_name]['child_validation_dtd_regex'] = $ch;
    // Convert the DTD regex language into PCRE regex format
    $reg = str_replace(',', ',?', $ch);
    $reg = preg_replace('/([#a-zA-Z0-9_.-]+)/', '(,?\\0)', $reg);
    $this->dtd['elements'][$elem_name]['child_validation_pcre_regex'] = $reg;

--

We can probably loot and steal all of this. The brilliance of this code is amazing. I'm lovin' it!

So, the way we define these cases should work like this:

    class ChildDef with validateChildren($children_tags)

The function needs to parse the tags into nodes, then into the regex array. It can result in one of three actions: the removal of the entire parent node, replacement of all of the original child tags with a new set of child tags which it returns, or no changes. They shall be denoted as, respectively:

    Remove entire parent node       = false
    Replace child tags with this    = array of tags
    No changes                      = true

If we remove the entire parent node, we must scroll back to the parent of the parent.

--

Another few problems: EXCLUSIONS!

a must not contain other a elements.
pre must not contain the img, object, big, small, sub, or sup elements.
button must not contain the input, select, textarea, label, button, form, fieldset, iframe or isindex elements.
label must not contain other label elements.
form must not contain other form elements.

Normative exclusions straight from the horse's mouth. These are SGML style, not XML style, so we need to modify the ruleset slightly.

--

Also, what do we do with elements if they're not allowed somewhere? We need some sort of default behavior. I reckon that we should be allowed to:

1. Delete the node
2. Translate it into text (not okay for areas that don't allow #PCDATA)
3. Move the node to somewhere where it is okay

What complicates the matter is that Firefox has the ability to construct DOMs and render invalid nestings of elements (for example, the text "asdf" wrapped in improperly nested tags). This means that behavior for stray PCDATA in ul/ol is undefined. Stray data in a table gets bubbled to the start of the table.

So... I say delete the node when PCDATA isn't allowed (or the regex is too complicated to determine where PCDATA could be inserted), and translate the node to text when PCDATA is allowed.

== STAGE 4 - check attributes ==

While we're doing all this nesting hocus-pocus, attributes are also being checked. The reason this needs to be done together with the nesting stuff is that if a REQUIRED attribute is not there, we might need to kill the tag (or replace it with data). Fortunately, this is rare enough that we only have to worry about it for certain things:

* ! bdo - dir > replace with span, preserve attributes
* basefont - size
* param - name
* applet - width, height
* ! img - src, alt > if only alt is missing, insert filename, else remove img
* map - id
* area - alt
* form - action
* optgroup - label
* textarea - rows, cols

As you can see, there are only two of them (marked with !) that we would remotely consider for our simplified tag set. But each has a different set of challenges.

So after that's all said and done, each of the different types of content inside the attributes needs to be handled differently.

    ContentType(s)  [RFC2045]
    Charset(s)      [RFC2045]
    LanguageCode    [RFC3066] (NMTOKEN)
    Character       [XML][2.2] (a single character)
    Number          /^\d+$/
    LinkTypes       [HTML][6.12]
    MediaDesc       [HTML][6.13]
    URI/UriList     [RFC2396]
    Datetime        (ISO date format)
    Script          ...
    StyleSheet      [CSS] (complex)
    Text            CDATA
    FrameTarget     NMTOKEN
    Length          (pixel, percentage) (?:px suffix allowed?)
    MultiLength     (pixel, percentage, or relative)
    Pixels          (integer)
    // map attributes omitted
    ImgAlign        (top|middle|bottom|left|right)
    Color           #NNNNNN, #NNN or color name (translate it)
        Black   = #000000    Green   = #008000
        Silver  = #C0C0C0    Lime    = #00FF00
        Gray    = #808080    Olive   = #808000
        White   = #FFFFFF    Yellow  = #FFFF00
        Maroon  = #800000    Navy    = #000080
        Red     = #FF0000    Blue    = #0000FF
        Purple  = #800080    Teal    = #008080
        Fuchsia = #FF00FF    Aqua    = #00FFFF
        // plus some directly defined in the spec

Everything else is either ID, or defined as a certain set of values.

Unless we use reflection (in which case we have to make sure the attribute exists), we probably want to have a function like...

    validate($type, $value)

where $type is like ContentType or Number, and then pass it to a switch.

The final problem is CSS. Get intimate with the syntax here: http://www.w3.org/TR/CSS21/syndata.html and also note the "bad" CSS elements that HTML_Safe defines to help determine a whitelist.

== PART 5 - stringify ==

Should be fairly simple as long as we delegate to appropriate functions. It's probably too much trouble to indent the stuff properly, so just output stuff raw.
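
A minimal sketch of the stringify step might look like the following. It is an illustration under stated assumptions, not the actual implementation: tokens are modelled as plain arrays with a hypothetical 'type' key instead of the MF value objects from Stage 1, the function names are invented for this example, and text and attribute values are assumed to be stored unescaped so that they are escaped once on output.

```php
<?php
// Hypothetical token shape: ['type' => 'start'|'end'|'empty'|'text', ...].
// The design above uses immutable value objects (StartTag, EndTag, ...) instead.

/** Render an attribute map as ' name="value"' pairs, escaping the values. */
function stringifyAttributes(array $attributes): string
{
    $out = '';
    foreach ($attributes as $name => $value) {
        $out .= ' ' . $name . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
    }
    return $out;
}

/** Turn one token back into markup; each token type gets its own branch. */
function stringifyToken(array $token): string
{
    switch ($token['type']) {
        case 'start':
            return '<' . $token['name'] . stringifyAttributes($token['attributes']) . '>';
        case 'end':
            return '</' . $token['name'] . '>';
        case 'empty':
            return '<' . $token['name'] . stringifyAttributes($token['attributes']) . ' />';
        case 'text':
            return htmlspecialchars($token['text'], ENT_NOQUOTES, 'UTF-8');
        default:
            throw new InvalidArgumentException('Unknown token type: ' . $token['type']);
    }
}

/** Concatenate all tokens raw, with no added indentation or whitespace. */
function stringify(array $tokens): string
{
    return implode('', array_map('stringifyToken', $tokens));
}

$tokens = [
    ['type' => 'start', 'name' => 'p', 'attributes' => ['class' => 'note']],
    ['type' => 'text', 'text' => 'a < b'],
    ['type' => 'empty', 'name' => 'br', 'attributes' => []],
    ['type' => 'end', 'name' => 'p'],
];
echo stringify($tokens), "\n";
// prints: <p class="note">a &lt; b<br /></p>
```

Delegating each token type to its own branch keeps the top-level function a plain concatenation, which matches the "output stuff raw" approach.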