tags traditionally are closed by other block-level
elements.
4. Run through all nodes and check children for proper order (especially
important for tables).
5. Validate attributes according to more restrictive definitions based on the
RFCs.
6. Translate back into a string. (Generator)
HTML Purifier is best suited for documents that require a rich array of
HTML tags. Things like blog comments are, in all likelihood, most appropriately
written in an extremely restrictive set of markup that doesn't require
all this functionality (or not written in HTML at all).
The rest of this document is pending moving into their associated classes.
== STAGE 3 - make well formed ==
Status: A- (not as good as possible)
Now we step through the whole thing and correct nesting issues. Most of the
time, it's making sure the tags match up, but there's some trickery going on
for HTML's quirks. They are:
* Set of tags that close P
'address', 'blockquote', 'dd', 'dir', 'div',
'dl', 'dt', 'h1', 'h2', 'h3', 'h4',
'h5', 'h6', 'hr',
'ol', 'p', 'pre',
'table', 'ul'
* Li closes li
* more?
We also want to do translations, like from FONT to SPAN with STYLE.
== STAGE 4 - check nesting ==
Status: B (table custom definition needs to be implemented)
We know that the document is now well formed. The tokenizer should now take
things in nodes: when you hit a start tag, keep on going until you get its
ending tag, and then handle everything inside there. Fortunantely, no
fancy recursion is necessary as going to the next node is as simple as
scrolling to the next start tag.
Suppose we have a node and encounter a problem with one of its children.
Depending on the complexity of the rule, we will either delete the children,
or delete the entire node itself.
The simplest type of rule is zero or more valid elements, denoted like:
( el1 | el2 | el3 )*
The next simplest is with one or more valid elements:
( li )+
And then you have complex cases:
table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))
map ((%block; | form | %misc;)+ | area+)
html (head, body)
head (%head.misc;,
((title, %head.misc;, (base, %head.misc;)?) |
(base, %head.misc;, (title, %head.misc;))))
Each of these has to be dealt with. Case 1 is a joy, because you can zap
as many as you want, but you'll never actually have to kill the node. Two
and three need the entire node to be killed if you have a problem. This
can be problematic, as the missing node might cause its parent node to now
be incorrect. Granted, it's unlikely, and I'm fairly certain that HTML, let
alone the simplified set I'm allowing will have this problem, but it's worth
checking for.
The way, I suppose, one would check for it, is whenever a node is removed,
scroll to it's parent start, and re-evaluate it. Make sure you're able to do
that with minimal code repetition.
EDITOR'S NOTE: this behavior is not implemented by default, because the
default configuration has a setup that ensures that cascading node removals
will never happen. However, there will be warning signs in case someone tries
to hack it further.
The most complex case can probably be done by using some fancy regexp
expressions and transformations. However, it doesn't seem right that, say,
a stray in a can cause the entire table to be removed. Fixing it,
however, may be too difficult (or not, see below).
This code was excerpted from the PEAR class XML_DTD. It implements regexp
checking.
--
// # This actually does the validation
// Validate the order of the children
if (!$was_error && count($dtd_children)) {
$children_list = implode(',', $children);
$regex = $this->dtd->getPcreRegex($name);
if (!preg_match('/^'.$regex.'$/', $children_list)) {
$dtd_regex = $this->dtd->getDTDRegex($name);
$this->_errors("In element <$name> the children list found:\n'$children_list', ".
"does not conform the DTD definition: '$dtd_regex'", $lineno);
}
}
--
// # This figures out the PcreRegex
//$ch is a string of the allowed childs
$children = preg_split('/([^#a-zA-Z0-9_.-]+)/', $ch, -1, PREG_SPLIT_NO_EMPTY);
// check for parsed character data special case
if (in_array('#PCDATA', $children)) {
$content = '#PCDATA';
if (count($children) == 1) {
$children = array();
break;
}
}
// $children is not used after this
$this->dtd['elements'][$elem_name]['child_validation_dtd_regex'] = $ch;
// Convert the DTD regex language into PCRE regex format
$reg = str_replace(',', ',?', $ch);
$reg = preg_replace('/([#a-zA-Z0-9_.-]+)/', '(,?\\0)', $reg);
$this->dtd['elements'][$elem_name]['child_validation_pcre_regex'] = $reg;
--
We can probably loot and steal all of this. This brilliance of this code is
amazing. I'm lovin' it!
So, the way we define these cases should work like this:
class ChildDef with validateChildren($children_tags)
The function needs to parse into nodes, then into the regex array.
It can result in one of three actions: the removal of the entire parent node,
replacement of all of the original child tags with a new set of child
tags which it returns, or no changes. They shall be denoted as, respectively,
Remove entire parent node = false
Replace child tags with this = array of tags
No changes = true
If we remove the entire parent node, we must scroll back to the parent of the
parent.
--
Another few problems: EXCLUSIONS!
a
must not contain other a elements.
pre
must not contain the img, object, big, small, sub, or sup elements.
button
must not contain the input, select, textarea, label, button, form, fieldset,
iframe or isindex elements.
label
must not contain other label elements.
form
must not contain other form elements.
Normative exclusions straight from the horses mouth. These are SGML style,
not XML style, so we need to modify the ruleset slightly. However, the DTD
may have done this for us already.
--
Also, what do we do with elements if they're not allowed somewhere? We need
some sort of default behavior. I reckon that we should be allowed to:
1. Delete the node
2. Translate it into text (not okay for areas that don't allow #PCDATA)
3. Move the node to somewhere where it is okay
What complicates the matter is that Firefox has the ability to construct
DOMs and render invalid nestings of elements (like