mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2025-01-18 11:41:52 +00:00
Update spec.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@144 48356398-32a2-884e-a903-53898d9a118a
parent
064fd603d3
commit
19081ffdf2
108
docs/spec.txt
@@ -64,24 +64,7 @@ The rest of this document is pending moving into their associated classes.
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
== STAGE 3 - make well formed ==
|
|
||||||
|
|
||||||
Status: A- (not as good as possible)
|
|
||||||
|
|
||||||
Now we step through the whole thing and correct nesting issues. Most of the
|
|
||||||
time, it's making sure the tags match up, but there's some trickery going on
|
|
||||||
for HTML's quirks. They are:
|
|
||||||
|
|
||||||
* Set of tags that close P
|
|
||||||
'address', 'blockquote', 'dd', 'dir', 'div',
|
|
||||||
'dl', 'dt', 'h1', 'h2', 'h3', 'h4',
|
|
||||||
'h5', 'h6', 'hr',
|
|
||||||
'ol', 'p', 'pre',
|
|
||||||
'table', 'ul'
|
|
||||||
* Li closes li
|
|
||||||
* more?
|
|
||||||
|
|
||||||
We also want to do translations, like from FONT to SPAN with STYLE.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@@ -128,61 +111,15 @@ The way, I suppose, one would check for it, is whenever a node is removed,
|
|||||||
scroll to it's parent start, and re-evaluate it. Make sure you're able to do
|
scroll to it's parent start, and re-evaluate it. Make sure you're able to do
|
||||||
that with minimal code repetition.
|
that with minimal code repetition.
|
||||||
|
|
||||||
EDITOR'S NOTE: this behavior is not implemented by default, because the
|
|
||||||
default configuration has a setup that ensures that cascading node removals
|
|
||||||
will never happen. However, there will be warning signs in case someone tries
|
|
||||||
to hack it further.
|
|
||||||
|
|
||||||
The most complex case can probably be done by using some fancy regexp
|
The most complex case can probably be done by using some fancy regexp
|
||||||
expressions and transformations. However, it doesn't seem right that, say,
|
expressions and transformations. However, it doesn't seem right that, say,
|
||||||
a stray <b> in a <table> can cause the entire table to be removed. Fixing it,
|
a stray <b> in a <table> can cause the entire table to be removed. Depending
|
||||||
however, may be too difficult (or not, see below).
|
on how much work we want to do, this will at least need a custom child
|
||||||
|
definition, and at most require extra element bubbling capabilities to be
|
||||||
This code was excerpted from the PEAR class XML_DTD. It implements regexp
|
added.
|
||||||
checking.
|
|
||||||
|
|
||||||
--
|
--
|
||||||
|
|
||||||
// # This actually does the validation
|
|
||||||
|
|
||||||
// Validate the order of the children
|
|
||||||
if (!$was_error && count($dtd_children)) {
|
|
||||||
$children_list = implode(',', $children);
|
|
||||||
$regex = $this->dtd->getPcreRegex($name);
|
|
||||||
if (!preg_match('/^'.$regex.'$/', $children_list)) {
|
|
||||||
$dtd_regex = $this->dtd->getDTDRegex($name);
|
|
||||||
$this->_errors("In element <$name> the children list found:\n'$children_list', ".
|
|
||||||
"does not conform the DTD definition: '$dtd_regex'", $lineno);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
--
|
|
||||||
|
|
||||||
// # This figures out the PcreRegex
|
|
||||||
|
|
||||||
//$ch is a string of the allowed childs
|
|
||||||
$children = preg_split('/([^#a-zA-Z0-9_.-]+)/', $ch, -1, PREG_SPLIT_NO_EMPTY);
|
|
||||||
// check for parsed character data special case
|
|
||||||
if (in_array('#PCDATA', $children)) {
|
|
||||||
$content = '#PCDATA';
|
|
||||||
if (count($children) == 1) {
|
|
||||||
$children = array();
|
|
||||||
break;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
// $children is not used after this
|
|
||||||
|
|
||||||
$this->dtd['elements'][$elem_name]['child_validation_dtd_regex'] = $ch;
|
|
||||||
// Convert the DTD regex language into PCRE regex format
|
|
||||||
$reg = str_replace(',', ',?', $ch);
|
|
||||||
$reg = preg_replace('/([#a-zA-Z0-9_.-]+)/', '(,?\\0)', $reg);
|
|
||||||
$this->dtd['elements'][$elem_name]['child_validation_pcre_regex'] = $reg;
|
|
||||||
|
|
||||||
--
|
|
||||||
|
|
||||||
We can probably loot and steal all of this. This brilliance of this code is
|
|
||||||
amazing. I'm lovin' it!
|
|
||||||
|
|
||||||
So, the way we define these cases should work like this:
|
So, the way we define these cases should work like this:
|
||||||
|
|
||||||
class ChildDef with validateChildren($children_tags)
|
class ChildDef with validateChildren($children_tags)
|
||||||
@@ -201,26 +138,6 @@ parent.
|
|||||||
|
|
||||||
--
|
--
|
||||||
|
|
||||||
Another few problems: EXCLUSIONS!
|
|
||||||
|
|
||||||
a
|
|
||||||
must not contain other a elements.
|
|
||||||
pre
|
|
||||||
must not contain the img, object, big, small, sub, or sup elements.
|
|
||||||
button
|
|
||||||
must not contain the input, select, textarea, label, button, form, fieldset,
|
|
||||||
iframe or isindex elements.
|
|
||||||
label
|
|
||||||
must not contain other label elements.
|
|
||||||
form
|
|
||||||
must not contain other form elements.
|
|
||||||
|
|
||||||
Normative exclusions straight from the horses mouth. These are SGML style,
|
|
||||||
not XML style, so we need to modify the ruleset slightly. However, the DTD
|
|
||||||
may have done this for us already.
|
|
||||||
|
|
||||||
--
|
|
||||||
|
|
||||||
Also, what do we do with elements if they're not allowed somewhere? We need
|
Also, what do we do with elements if they're not allowed somewhere? We need
|
||||||
some sort of default behavior. I reckon that we should be allowed to:
|
some sort of default behavior. I reckon that we should be allowed to:
|
||||||
|
|
||||||
@@ -240,20 +157,13 @@ to text when PCDATA is allowed.
|
|||||||
|
|
||||||
--
|
--
|
||||||
|
|
||||||
Note that generic child definitions are not usually desirable: we should
|
ins/del are allowed in block and inline content, but it is
|
||||||
implement custom handlers for each one that specify the stuff correctly.
|
inappropriate to include block content within an ins element
|
||||||
|
occurring in inline content. How would we fix this?
|
||||||
--
|
|
||||||
|
|
||||||
<!--
|
|
||||||
ins/del are allowed in block and inline content, but it is
|
|
||||||
inappropriate to include block content within an ins element
|
|
||||||
occurring in inline content.
|
|
||||||
-->
|
|
||||||
|
|
||||||
== STAGE 4 - check attributes ==
|
== STAGE 4 - check attributes ==
|
||||||
|
|
||||||
STATUS: N (not started)
|
STATUS: F (currently implementing core/i18n)
|
||||||
|
|
||||||
While we're doing all this nesting hocus-pocus, attributes are also being
|
While we're doing all this nesting hocus-pocus, attributes are also being
|
||||||
checked. The reason why we need this to be done with the nesting stuff
|
checked. The reason why we need this to be done with the nesting stuff
|
||||||
@@ -262,10 +172,10 @@ replace it with data). Fortunantely, this is rare enough that we only have
|
|||||||
to worry about it for certain things:
|
to worry about it for certain things:
|
||||||
|
|
||||||
* ! bdo - dir > replace with span, preserve attributes
|
* ! bdo - dir > replace with span, preserve attributes
|
||||||
|
* ! img - src, alt > if only alt is missing, insert filename, else remove img
|
||||||
* basefont - size
|
* basefont - size
|
||||||
* param - name
|
* param - name
|
||||||
* applet - width, height
|
* applet - width, height
|
||||||
* ! img - src, alt > if only alt is missing, insert filename, else remove img
|
|
||||||
* map - id
|
* map - id
|
||||||
* area - alt
|
* area - alt
|
||||||
* form - action
|
* form - action
|
||||||
|
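The img rule in the list above (if only alt is missing, insert the filename, else remove the img) could be sketched roughly as follows. This is an illustration only, not code from this commit: the function name fix_img_attributes, the plain attribute-array shape, and the use of parse_url() and basename() to derive the filename are all assumptions.

```php
<?php
/**
 * Hypothetical helper (not part of the spec or the library): repair or
 * reject the attributes of an <img> tag.
 *
 * Returns the repaired attribute array, or null when the img should be
 * removed altogether.
 */
function fix_img_attributes(array $attr): ?array
{
    // src cannot be guessed, so a missing or empty src means "remove img".
    if (!isset($attr['src']) || $attr['src'] === '') {
        return null;
    }

    // Only alt is missing: fall back to the file name taken from src.
    if (!isset($attr['alt'])) {
        $path = parse_url($attr['src'], PHP_URL_PATH);
        $attr['alt'] = is_string($path) ? basename($path) : '';
    }

    return $attr;
}
```

Returning null rather than an empty array keeps "remove the element" distinct from "keep the element with no attributes", which matters for the other entries in the list that are handled differently (bdo, for instance, is replaced rather than removed).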