mirror of https://github.com/ezyang/htmlpurifier.git synced 2025-01-08 15:11:51 +00:00

Update spec:

* Info about applicability
* Add more status reports
* Add source of XML_DTD code
* Remark about table child definitions

git-svn-id: http://htmlpurifier.org/svnroot/html_purifier/trunk@51 48356398-32a2-884e-a903-53898d9a118a
Edward Z. Yang 2006-06-05 01:08:20 +00:00
parent 4935c69904
commit e1464fa2f6

@ -51,6 +51,11 @@ In summary:
 RFCs.
 6. Translate back into a string.
+HTML Purifier is best suited for documents that require a rich array of
+HTML tags. Things like blog comments are, in all likelihood, most appropriately
+written in an extremely restrictive set of markup that doesn't require
+all this functionality (or not written in HTML at all).
 == STAGE 1 - parsing ==
 : Status - largely FINISHED with a few quirks to work out
@ -74,6 +79,8 @@ Prefixed with MF (Markup Fragment). We'll make 'em all immutable value objects.
 == STAGE 2 - remove foreign elements ==
+: Status - Core functionality finished, transformations not started
 At this point, the parser needs to start knowing about the DTD. Since we
 hold everything in an associative $info array, if it's set, it's valid, and
 we can include. Otherwise zap it, or attempt to figure out what they meant.
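The "if it's set, it's valid" lookup described in this hunk can be sketched roughly as follows. This is only an illustration in Python rather than the project's PHP: the `(kind, value)` token tuples, the `INFO` table, and the function name are all hypothetical stand-ins, not the spec's MF token classes or its real `$info` array.

```python
# Assumed whitelist keyed by tag name, standing in for the spec's $info array.
INFO = {"p": True, "b": True, "i": True, "a": True}


def remove_foreign_elements(tokens, info=INFO):
    """Keep text tokens and any tag whose name is set in info; zap the rest.

    Tokens are hypothetical (kind, value) tuples where kind is
    "start", "end", or "text".
    """
    kept = []
    for kind, value in tokens:
        if kind == "text" or value.lower() in info:
            kept.append((kind, value))
    return kept


tokens = [
    ("start", "p"),
    ("start", "script"),  # not in INFO, so the tag is dropped
    ("text", "hi"),
    ("end", "script"),
    ("end", "p"),
]
print(remove_foreign_elements(tokens))
```

This sketch only covers the "zap it" branch. Guessing what the author meant (for example mapping FONT to SPAN) would need a separate transformation table.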
@ -88,6 +95,8 @@ transformations, from FONT to SPAN, etc.
 == STAGE 3 - make well formed ==
+: Finished, but could have better well-formedness fixing
 Now we step through the whole thing and correct nesting issues. Most of the
 time, it's making sure the tags match up, but there's some trickery going on
 for HTML's quirks. They are:
@ -105,6 +114,8 @@ We also want to do translations, like from FONT to SPAN with STYLE.
 == STAGE 4 - check nesting ==
+: Child definitions finished, actual function body not started
 We know that the document is now well formed. The tokenizer should now take
 things in nodes: when you hit a start tag, keep on going until you get its
 ending tag, and then handle everything inside there. Fortunately, no
@ -149,7 +160,7 @@ expressions and transformations. However, it doesn't seem right that, say,
 a stray <b> in a <table> can cause the entire table to be removed. Fixing it,
 however, may be too difficult.
-So... here's the interesting code:
+This code was ripped from the PEAR class XML_DTD. It implements regexp checking.
 --
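As a rough illustration of the regexp-checking idea (not the XML_DTD code itself, which this hunk elides), a DTD-style content model can be turned into a regular expression and matched against a comma-terminated list of child element names. The sketch below is Python rather than PHP. The function names and the simplified table model are assumptions made for the example, and `#PCDATA` is not handled.

```python
import re


def content_model_to_regex(model):
    """Compile a DTD-style content model into a regex over 'name,name,' strings."""

    def convert(match):
        token = match.group(0)
        if token == "," or token.isspace():
            return ""  # sequence commas and whitespace carry no meaning here
        # Each element name becomes a group that consumes "name,".
        return "(?:" + re.escape(token) + ",)"

    # Parentheses and the ?, *, +, | operators pass through unchanged.
    return re.compile(re.sub(r"[A-Za-z][\w.-]*|,|\s+", convert, model))


def children_valid(children, model):
    """Return True if the ordered child names satisfy the content model."""
    subject = "".join(name + "," for name in children)
    return content_model_to_regex(model).fullmatch(subject) is not None


# Simplified stand-in for the HTML table content model (assumption).
TABLE_MODEL = "(caption?, (tbody+|tr+))"

print(children_valid(["caption", "tr", "tr"], TABLE_MODEL))
print(children_valid(["tr", "b"], TABLE_MODEL))
```

A check like this is all-or-nothing: a single stray child makes the whole match fail, which is the table-removal concern raised just above.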
@ -241,7 +252,8 @@ some sort of default behavior. I reckon that we should be allowed to:
 What complicates the matter is that Firefox has the ability to construct
 DOMs and render invalid nestings of elements (like <b><div>asdf</div></b>).
 This means that behavior for stray pcdata in ul/ol is undefined. Behavior
-with data in a table gets bubbled to the start of the table.
+with data in a table gets bubbled to the start of the table (assuming
+that we actually custom-make the table child validation class).
 So... I say delete the node when PCDATA isn't allowed (or the regex is too
 complicated to determine where PCDATA could be inserted), and translate the node
@ -267,7 +279,9 @@ to worry about it for certain things:
 * textarea - rows, cols
 As you can see, only two of them we would remotely consider for our simplified
-tag set. But each has a different set of challenges.
+tag set. But each has a different set of challenges. For the img tag, we'd
+have to be careful about deleting it. If we do hit a snag, we can supply
+a default "blank" image.
 So after that's all said and done, each of the different types of content
 inside the attributes needs to be handled differently.
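The fallback described for img (supplying a default instead of deleting the tag) might look something like the sketch below. It is Python rather than PHP, and the `REQUIRED_DEFAULTS` table, the `blank.png` path, the empty `alt` text, and the function name are all hypothetical choices made for illustration. The spec does not say what the defaults should be.

```python
# Hypothetical defaults for required attributes (img requires src and alt in HTML 4).
REQUIRED_DEFAULTS = {
    "img": {"src": "blank.png", "alt": ""},
}


def fill_required_attributes(tag, attrs):
    """Return a copy of attrs with any missing required attribute filled in."""
    filled = dict(attrs)
    for name, default in REQUIRED_DEFAULTS.get(tag, {}).items():
        filled.setdefault(name, default)  # never overwrite what the author supplied
    return filled


print(fill_required_attributes("img", {"alt": "logo"}))
```

Whether to patch a tag up like this or drop it entirely is a policy decision. The sketch only shows the "supply a default" branch.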