mirror of https://github.com/ezyang/htmlpurifier.git (synced 2024-12-23 00:41:52 +00:00)
Update spec:

* Info about applicability
* Add more status reports
* Add source of XML_DTD code
* Remark about table child definitions

git-svn-id: http://htmlpurifier.org/svnroot/html_purifier/trunk@51 48356398-32a2-884e-a903-53898d9a118a
parent 4935c69904
commit e1464fa2f6
@@ -51,6 +51,11 @@ In summary:
 RFCs.
 6. Translate back into a string.
 
+HTML Purifier is best suited for documents that require a rich array of
+HTML tags. Things like blog comments are, in all likelihood, most appropriately
+written in an extremely restrictive set of markup that doesn't require
+all this functionality (or not written in HTML at all).
+
 == STAGE 1 - parsing ==
 
 : Status - largely FINISHED with a few quirks to work out
@@ -74,6 +79,8 @@ Prefixed with MF (Markup Fragment). We'll make 'em all immutable value objects.
 
 == STAGE 2 - remove foreign elements ==
 
+: Status - Core functionality finished, transformations not started
+
 At this point, the parser needs to start knowing about the DTD. Since we
 hold everything in an associative $info array, if it's set, it's valid, and
 we can include. Otherwise zap it, or attempt to figure out what they meant.
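The Stage 2 paragraph in the hunk above describes a plain membership test: an element is valid exactly when its name is a key of the $info array. A minimal sketch of that check follows, written in Python rather than the project's PHP. The token tuples, the INFO table, and the function name are hypothetical, and keeping the text inside a removed element is an assumption; the spec only says to "zap it".

```python
# Hypothetical stand-in for the spec's $info array: the keys are the element
# names we accept; the values would hold per-element definitions.
INFO = {"b": {}, "i": {}, "p": {}, "span": {}}


def remove_foreign_elements(tokens):
    """Drop start/end tags whose element name is not a key of INFO."""
    kept = []
    for kind, value in tokens:
        if kind in ("start", "end") and value not in INFO:
            # Unknown element: remove the tag itself. Keeping the text
            # between the tags is an assumption of this sketch.
            continue
        kept.append((kind, value))
    return kept


tokens = [
    ("start", "p"),
    ("start", "blink"),
    ("text", "hi"),
    ("end", "blink"),
    ("end", "p"),
]
cleaned = remove_foreign_elements(tokens)
```

The "attempt to figure out what they meant" branch (for example FONT to SPAN) would replace the `continue` with a lookup in a transformation table; it is left out here because the status line marks transformations as not started.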
@@ -88,6 +95,8 @@ transformations, from FONT to SPAN, etc.
 
 == STAGE 3 - make well formed ==
 
+: Finished, but could have better well-formedness fixing
+
 Now we step through the whole thing and correct nesting issues. Most of the
 time, it's making sure the tags match up, but there's some trickery going on
 for HTML's quirks. They are:
@@ -105,6 +114,8 @@ We also want to do translations, like from FONT to SPAN with STYLE.
 
 == STAGE 4 - check nesting ==
 
+: Child definitions finished, actual function body not started
+
 We know that the document is now well formed. The tokenizer should now take
 things in nodes: when you hit a start tag, keep on going until you get its
 ending tag, and then handle everything inside there. Fortunantely, no
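The Stage 4 text in the hunk above walks the token stream node by node: from a start tag, scan forward to its matching end tag, then handle what lies between. One way to sketch that scan, in Python rather than the project's PHP, is a depth counter. The token tuples and both function names are hypothetical, and the sketch assumes Stage 3 has already made the stream well formed.

```python
def matching_end(tokens, start):
    """Return the index of the end tag that closes tokens[start].

    Assumes a well-formed stream, so counting depth is enough and tag
    names do not need to be compared.
    """
    if tokens[start][0] != "start":
        raise ValueError("tokens[start] must be a start tag")
    depth = 0
    for i in range(start, len(tokens)):
        kind = tokens[i][0]
        if kind == "start":
            depth += 1
        elif kind == "end":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced token stream")


def child_names(tokens, start):
    """Names of the direct child elements of the node opened at tokens[start]."""
    end = matching_end(tokens, start)
    names = []
    i = start + 1
    while i < end:
        kind, value = tokens[i]
        if kind == "start":
            names.append(value)
            i = matching_end(tokens, i)  # jump over the child's subtree
        i += 1
    return names


tokens = [
    ("start", "ul"),
    ("start", "li"), ("start", "b"), ("text", "x"), ("end", "b"), ("end", "li"),
    ("start", "li"), ("end", "li"),
    ("end", "ul"),
]
```

Here `child_names(tokens, 0)` yields the list of direct children of the `ul` node, which is the input a child definition would then validate.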
@@ -149,7 +160,7 @@ expressions and transformations. However, it doesn't seem right that, say,
 a stray <b> in a <table> can cause the entire table to be removed. Fixing it,
 however, may be too difficult.
 
-So... here's the interesting code:
+This code was ripped from the PEAR class XML_DTD. It implements regexp checking.
 
 --
 
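The code this hunk refers to is not part of the diff. As a rough illustration of what "regexp checking" of child elements can look like, the Python sketch below turns a DTD-style content model into a regular expression and matches the serialized list of child names against it. It is not the XML_DTD code and does not claim to mirror it: the function names and the `name;` serialization are assumptions, and the model string is the XHTML 1.0 content model for table, used only as sample input.

```python
import re

# XHTML 1.0 content model for <table>; sample input for this sketch only.
TABLE_MODEL = "(caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))"


def compile_content_model(model):
    """Turn a DTD-style content model into a regex over 'name;' sequences."""
    # Wrap each element name so it only ever matches one whole 'name;' unit.
    pattern = re.sub(
        r"[A-Za-z][A-Za-z0-9]*",
        lambda m: "(?:%s;)" % m.group(0),
        model,
    )
    # Sequence commas and whitespace mean nothing once names are delimited.
    pattern = re.sub(r"[\s,]+", "", pattern)
    return re.compile(pattern)


def children_valid(regex, names):
    """True if the sequence of child element names satisfies the model."""
    serialized = "".join(name + ";" for name in names)
    return regex.fullmatch(serialized) is not None


table_re = compile_content_model(TABLE_MODEL)
ok = children_valid(table_re, ["caption", "thead", "tbody"])
stray = children_valid(table_re, ["b", "tr"])
```

With this all-or-nothing match, the `["b", "tr"]` case fails as a whole, which is the behavior the context lines complain about: one stray `<b>` invalidates the entire table rather than just the offending child.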
@@ -241,7 +252,8 @@ some sort of default behavior. I reckon that we should be allowed to:
 What complicates the matter is that Firefox has the ability to construct
 DOMs and render invalid nestings of elements (like <b><div>asdf</div></b>).
 This means that behavior for stray pcdata in ul/ol is undefined. Behavior
-with data in a table gets bubbled to the start of the table.
+with data in a table gets bubbled to the start of the table (assuming
+that we actually custom-make the table child validation class).
 
 So... I say delete the node when PCDATA isn't allowed (or the regex is too
 complicated to determine where PCDATA could be inserted), and translate the node
@@ -267,7 +279,9 @@ to worry about it for certain things:
 * textarea - rows, cols
 
 As you can see, only two of them we would remotely consider for our simplified
-tag set. But each has a different set of challenges.
+tag set. But each has a different set of challenges. For the img tag, we'd
+have to be careful about deleting it. If we do hit a snag, we can supply
+a default "blank" image.
 
 So after that's all said and done, each of the different types of content
 inside the attributes needs to be handled differently.