mirror of https://github.com/ezyang/htmlpurifier.git
synced 2024-12-23 00:41:52 +00:00
Update docs, add lexer.txt
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@83 48356398-32a2-884e-a903-53898d9a118a
parent d22140b9a6
commit 5bcb3c60cd
docs/lexer.txt (Normal file, 28 lines)
@@ -0,0 +1,28 @@
+
+Lexer
+
+The lexer parses a string of SGML-style markup and converts it into
+corresponding tokens. It doesn't check for correctness, although its
+internal mechanism may make this automatic (such as the case of DOMLex).
+
+We have several implementations of the Lexer:
+
+DirectLex - our in-house implementation
+DirectLex has absolutely no dependencies, making it a reasonably good
+default for PHP4. Written with efficiency in mind, it is generally
+faster than the PEAR parser, although the two are very close and usually
+overlap a bit. It will support UTF-8 completely eventually.
+
+PEARSax3 - uses the PEAR package XML_HTMLSax3 to parse
+PEAR, not surprisingly, also has a SAX parser for HTML. I don't know
+very much about its implementation, but it's fairly well written. You need
+to have PEAR added to your path to use it though. Not sure whether or
+not it's UTF-8 aware.
+
+DOMLex - uses the PHP5 core extension DOM to parse
+In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
+It gives us a forgiving HTML parser, which we use to transform the HTML
+into a DOM, and then into the tokens. It is extremely fast, and is the
+default choice for PHP 5. However, entity resolution may be troublesome,
+though its UTF-8 support is excellent.
+
@@ -1,4 +1,5 @@
-== Possible Security Issues ==
+
+Security
 
 Like anything that claims to afford security, HTML_Purifier can be circumvented
 through negligence of people. This class will do its job: no more, no less,
@@ -14,10 +15,11 @@ can do). Make sure any input is properly converted to UTF-8, or the parser
 will mangle it badly (though it won't be a security risk if you're outputting
 it as UTF-8).
 
-2. XHTML 1.0. This is what the parser is outputting. For the most part, it's
-compatible with HTML 4.01, but XHTML enforces some very nice things that all
-web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode has
-waaaay too many quirks for a little parser to handle.
+2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
+part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
+that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
+has waaaay too many quirks for a little parser to handle. We did not select
+strict in order to prevent ourselves from being too draconic on users.
 
 3. [PROJECTED] IDs. They need to be unique, but without some knowledge of the
 rest of the document, it's difficult to know what's unique. I project default
@@ -1,5 +1,5 @@
 
-HTML Purifier
+HTML Purifier Specification
 by Edward Z. Yang
 
 == Introduction ==
@@ -39,7 +39,7 @@ with malformed input.
 
 In summary:
 
-1. Parse document into an array of tag and text tokens
+1. Parse document into an array of tag and text tokens (Lexer)
 2. Remove all elements not on whitelist and transform certain other elements
 into acceptable forms (i.e. <font>)
 3. Make document well formed while helpfully taking into account certain quirks,
@@ -49,7 +49,7 @@ In summary:
 important for tables).
 5. Validate attributes according to more restrictive definitions based on the
 RFCs.
-6. Translate back into a string.
+6. Translate back into a string. (Generator)
 
 HTML Purifier is best suited for documents that require a rich array of
 HTML tags. Things like blog comments are, in all likelihood, most appropriately
@@ -60,25 +60,23 @@ all this functionality (or not written in HTML at all).
 
 == STAGE 1 - parsing ==
 
-Status: A (see source, mainly internal raw)
+Status: A (see source, mainly internals and UTF-8)
 
-We've got two options for this: HTMLSax or my MarkupLexer. Hopefully, we
-can make the two interfaces compatible. This means that we need a lot
-of little classes:
+The Lexer (currently we have three choices) handles parsing into Tokens.
 
-* StartTag(name, attributes) is openHandler
-* EndTag(name) is closeHandler
-* EmptyTag(name, attributes) is openHandler (is in array of empties)
-* Data(text) is dataHandler
+Here are the mappings for Lexer_PEARSax3
+
+* Start(name, attributes) is openHandler
+* End(name) is closeHandler
+* Empty(name, attributes) is openHandler (is in array of empties)
+* Data(parse(text)) is dataHandler
 * Comment(text) is escapeHandler (has leading -)
-* CharacterData(text) is escapeHandler (has leading [)
+* Data(text) is escapeHandler (has leading [, CDATA)
 
 Ignorable/not being implemented (although we probably want to output them raw):
 * ProcessingInstructions(text) is piHandler
 * JavaOrASPInstructions(text) is jaspHandler
 
-Prefixed with MF (Markup Fragment). We'll make 'em all immutable value objects.
-
 
 
 == STAGE 2 - remove foreign elements ==
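
The handler-to-token mapping listed for Lexer_PEARSax3 in the STAGE 1 hunk can be illustrated with a rough sketch. This is not HTML Purifier's PHP code: it uses Python's standard html.parser.HTMLParser as a stand-in for a SAX-style parser, and every name in it (Start, End, Empty, Data, Comment, SketchLexer, EMPTY_ELEMENTS, tokenize) is hypothetical, chosen only to mirror the list. CDATA sections, processing instructions and JSP/ASP blocks are left out.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Hypothetical immutable token classes, named after the mapping list.
@dataclass(frozen=True)
class Start:
    name: str
    attributes: tuple = ()

@dataclass(frozen=True)
class End:
    name: str

@dataclass(frozen=True)
class Empty:
    name: str
    attributes: tuple = ()

@dataclass(frozen=True)
class Data:
    text: str

@dataclass(frozen=True)
class Comment:
    text: str

# Illustrative subset of elements that never take a closing tag
# (the "array of empties" mentioned in the list).
EMPTY_ELEMENTS = frozenset({"br", "hr", "img", "input"})

class SketchLexer(HTMLParser):
    """Collects tokens from SAX-style callbacks (assumed analogue of PEARSax3)."""

    def __init__(self):
        # convert_charrefs=True resolves entities before handle_data is called,
        # which plays the role of parse(text) in the list.
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, name, attrs):
        # The open handler produces Empty for known empties, Start otherwise.
        cls = Empty if name in EMPTY_ELEMENTS else Start
        self.tokens.append(cls(name, tuple(attrs)))

    def handle_startendtag(self, name, attrs):
        # Self-closing syntax such as <br /> is always an Empty token.
        self.tokens.append(Empty(name, tuple(attrs)))

    def handle_endtag(self, name):
        # A stray closing tag for an empty element is dropped.
        if name not in EMPTY_ELEMENTS:
            self.tokens.append(End(name))

    def handle_data(self, text):
        self.tokens.append(Data(text))

    def handle_comment(self, text):
        self.tokens.append(Comment(text))

def tokenize(html):
    """Return the list of tokens for a markup string."""
    lexer = SketchLexer()
    lexer.feed(html)
    lexer.close()
    return lexer.tokens
```

For instance, tokenize('<p class="a">Hi<br /></p>') yields a Start, a Data, an Empty and an End token, in that order. Like the lexer described in lexer.txt, the sketch does not check the markup for correctness; it only converts it into tokens.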