0
0
mirror of https://github.com/ezyang/htmlpurifier.git synced 2024-11-09 15:28:40 +00:00

Update spec.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@144 48356398-32a2-884e-a903-53898d9a118a
This commit is contained in:
Edward Z. Yang 2006-08-03 00:21:27 +00:00
parent 064fd603d3
commit 19081ffdf2

View File

@ -64,24 +64,7 @@ The rest of this document is pending moving into their associated classes.
== STAGE 3 - make well formed ==
Status: A- (not as good as possible)
Now we step through the whole thing and correct nesting issues. Most of the
time, it's making sure the tags match up, but there's some trickery going on
for HTML's quirks. They are:
* Set of tags that close P
'address', 'blockquote', 'dd', 'dir', 'div',
'dl', 'dt', 'h1', 'h2', 'h3', 'h4',
'h5', 'h6', 'hr',
'ol', 'p', 'pre',
'table', 'ul'
* Li closes li
* more?
We also want to do translations, like from FONT to SPAN with STYLE.
@ -128,61 +111,15 @@ The way, I suppose, one would check for it, is whenever a node is removed,
scroll to it's parent start, and re-evaluate it. Make sure you're able to do
that with minimal code repetition.
EDITOR'S NOTE: this behavior is not implemented by default, because the
default configuration has a setup that ensures that cascading node removals
will never happen. However, there will be warning signs in case someone tries
to hack it further.
The most complex case can probably be done by using some fancy regexp
expressions and transformations. However, it doesn't seem right that, say,
a stray <b> in a <table> can cause the entire table to be removed. Fixing it,
however, may be too difficult (or not, see below).
This code was excerpted from the PEAR class XML_DTD. It implements regexp
checking.
a stray <b> in a <table> can cause the entire table to be removed. Depending
on how much work we want to do, this will at least need a custom child
definition, and at most require extra element bubbling capabilities to be
added.
--
// # This actually does the validation
// Validate the order of the children
if (!$was_error && count($dtd_children)) {
$children_list = implode(',', $children);
$regex = $this->dtd->getPcreRegex($name);
if (!preg_match('/^'.$regex.'$/', $children_list)) {
$dtd_regex = $this->dtd->getDTDRegex($name);
$this->_errors("In element <$name> the children list found:\n'$children_list', ".
"does not conform the DTD definition: '$dtd_regex'", $lineno);
}
}
--
// # This figures out the PcreRegex
//$ch is a string of the allowed childs
$children = preg_split('/([^#a-zA-Z0-9_.-]+)/', $ch, -1, PREG_SPLIT_NO_EMPTY);
// check for parsed character data special case
if (in_array('#PCDATA', $children)) {
$content = '#PCDATA';
if (count($children) == 1) {
$children = array();
break;
}
}
// $children is not used after this
$this->dtd['elements'][$elem_name]['child_validation_dtd_regex'] = $ch;
// Convert the DTD regex language into PCRE regex format
$reg = str_replace(',', ',?', $ch);
$reg = preg_replace('/([#a-zA-Z0-9_.-]+)/', '(,?\\0)', $reg);
$this->dtd['elements'][$elem_name]['child_validation_pcre_regex'] = $reg;
--
We can probably loot and steal all of this. This brilliance of this code is
amazing. I'm lovin' it!
So, the way we define these cases should work like this:
class ChildDef with validateChildren($children_tags)
@ -201,26 +138,6 @@ parent.
--
Another few problems: EXCLUSIONS!
a
must not contain other a elements.
pre
must not contain the img, object, big, small, sub, or sup elements.
button
must not contain the input, select, textarea, label, button, form, fieldset,
iframe or isindex elements.
label
must not contain other label elements.
form
must not contain other form elements.
Normative exclusions straight from the horses mouth. These are SGML style,
not XML style, so we need to modify the ruleset slightly. However, the DTD
may have done this for us already.
--
Also, what do we do with elements if they're not allowed somewhere? We need
some sort of default behavior. I reckon that we should be allowed to:
@ -240,20 +157,13 @@ to text when PCDATA is allowed.
--
Note that generic child definitions are not usually desirable: we should
implement custom handlers for each one that specify the stuff correctly.
--
<!--
ins/del are allowed in block and inline content, but it is
inappropriate to include block content within an ins element
occurring in inline content.
-->
ins/del are allowed in block and inline content, but it is
inappropriate to include block content within an ins element
occurring in inline content. How would we fix this?
== STAGE 4 - check attributes ==
STATUS: N (not started)
STATUS: F (currently implementing core/i18n)
While we're doing all this nesting hocus-pocus, attributes are also being
checked. The reason why we need this to be done with the nesting stuff
@ -262,10 +172,10 @@ replace it with data). Fortunantely, this is rare enough that we only have
to worry about it for certain things:
* ! bdo - dir > replace with span, preserve attributes
* ! img - src, alt > if only alt is missing, insert filename, else remove img
* basefont - size
* param - name
* applet - width, height
* ! img - src, alt > if only alt is missing, insert filename, else remove img
* map - id
* area - alt
* form - action