tags traditionally are closed by other block-level
elements.
4. Run through all nodes and check children for proper order (especially
important for tables).
5. Validate attributes according to more restrictive definitions based on the
RFCs.
6. Translate back into a string. (Generator)
HTML Purifier is best suited for documents that require a rich array of
HTML tags. Things like blog comments are, in all likelihood, most appropriately
written in an extremely restrictive set of markup that doesn't require
all this functionality (or not written in HTML at all).
The rest of this document is pending moving into their associated classes.
== STAGE 4 - check attributes ==
STATUS: F (currently implementing core/i18n)
While we're doing all this nesting hocus-pocus, attributes are also being
checked. The reason why we need this to be done with the nesting stuff
is if a REQUIRED attribute is not there, we might need to kill the tag (or
replace it with data). Fortunantely, this is rare enough that we only have
to worry about it for certain things:
* ! bdo - dir > replace with span, preserve attributes
* ! img - src, alt > if only alt is missing, insert filename, else remove img
* basefont - size
* param - name
* applet - width, height
* map - id
* area - alt
* form - action
* optgroup - label
* textarea - rows, cols
As you can see, only two of them we would remotely consider for our simplified
tag set. But each has a different set of challenges. For the img tag, we'd
have to be careful about deleting it. If we do hit a snag, we can supply
a default "blank" image.
So after that's all said and done, each of the different types of content
inside the attributes needs to be handled differently.
ContentType(s) [RFC2045]
Charset(s) [RFC2045]
LanguageCode [RFC3066] (NMTOKEN)
Character [XML][2.2] (a single character)
Number /^\d+$/
LinkTypes [HTML][6.12]