mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-12-22 08:21:52 +00:00
Update docs, add lexer.txt
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@83 48356398-32a2-884e-a903-53898d9a118a
parent d22140b9a6
commit 5bcb3c60cd
28
docs/lexer.txt
Normal file
@@ -0,0 +1,28 @@
+
+Lexer
+
+The lexer parses a string of SGML-style markup and converts it into
+corresponding tokens. It doesn't check for correctness, although its
+internal mechanism may make this automatic (as is the case with DOMLex).
+
+We have several implementations of the Lexer:
+
+DirectLex - our in-house implementation
+DirectLex has absolutely no dependencies, making it a reasonably good
+default for PHP4. Written with efficiency in mind, it is generally
+faster than the PEAR parser, although the two are very close and their
+timings usually overlap a bit. Complete UTF-8 support is planned.
+
+PEARSax3 - uses the PEAR package XML_HTMLSax3 to parse
+PEAR, not surprisingly, also has a SAX parser for HTML. I don't know
+very much about its implementation, but it's fairly well written. You need
+to have PEAR added to your path to use it, though. Not sure whether or
+not it's UTF-8 aware.
+
+DOMLex - uses the PHP5 core extension DOM to parse
+In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
+It gives us a forgiving HTML parser, which we use to transform the HTML
+into a DOM, and then into the tokens. It is extremely fast, and is the
+default choice for PHP 5. However, entity resolution may be troublesome,
+though its UTF-8 support is excellent.
+
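As a rough illustration of what lexer.txt describes (markup in, a flat list of tokens out, no correctness checking), here is a minimal sketch. It is written in Python rather than PHP and uses the standard library's SAX-style html.parser. The names Start, End, Text, SketchLexer and tokenize are made up for this example and are not HTML Purifier classes.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


# Hypothetical token types; frozen dataclasses act as immutable value objects.
@dataclass(frozen=True)
class Start:
    name: str
    attributes: tuple = ()


@dataclass(frozen=True)
class End:
    name: str


@dataclass(frozen=True)
class Text:
    data: str


class SketchLexer(HTMLParser):
    """Collects tokens in document order and performs no nesting checks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(Start(tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(End(tag))

    def handle_data(self, data):
        self.tokens.append(Text(data))


def tokenize(markup):
    """Return the token list for a markup string."""
    lexer = SketchLexer()
    lexer.feed(markup)
    lexer.close()
    return lexer.tokens
```

Calling tokenize("<b>Hi</i>") yields a Start token for b, a Text token and an End token for i. The mismatched end tag passes through untouched, which is the "doesn't check for correctness" behaviour: fixing the nesting is left to a later stage.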
@@ -1,4 +1,5 @@
-== Possible Security Issues ==
+
+Security

 Like anything that claims to afford security, HTML_Purifier can be circumvented
 through negligence of people. This class will do its job: no more, no less,
@@ -14,10 +15,11 @@ can do). Make sure any input is properly converted to UTF-8, or the parser
 will mangle it badly (though it won't be a security risk if you're outputting
 it as UTF-8).

-2. XHTML 1.0. This is what the parser is outputting. For the most part, it's
-compatible with HTML 4.01, but XHTML enforces some very nice things that all
-web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode has
-waaaay too many quirks for a little parser to handle.
+2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
+part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
+that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
+has waaaay too many quirks for a little parser to handle. We did not select
+strict in order to prevent ourselves from being too draconic on users.

 3. [PROJECTED] IDs. They need to be unique, but without some knowledge of the
 rest of the document, it's difficult to know what's unique. I project default
@@ -1,5 +1,5 @@

-HTML Purifier
+HTML Purifier Specification
 by Edward Z. Yang

 == Introduction ==
@@ -39,7 +39,7 @@ with malformed input.

 In summary:

-1. Parse document into an array of tag and text tokens
+1. Parse document into an array of tag and text tokens (Lexer)
 2. Remove all elements not on whitelist and transform certain other elements
 into acceptable forms (i.e. <font>)
 3. Make document well formed while helpfully taking into account certain quirks,
@@ -49,10 +49,10 @@ In summary:
 important for tables).
 5. Validate attributes according to more restrictive definitions based on the
 RFCs.
-6. Translate back into a string.
+6. Translate back into a string. (Generator)

 HTML Purifier is best suited for documents that require a rich array of
-HTML tags. Things like blog comments are, in all likelihood, most appropriately
+HTML tags. Things like blog comments are, in all likelihood, most appropriately
 written in an extremely restrictive set of markup that doesn't require
 all this functionality (or not written in HTML at all).

@@ -60,25 +60,23 @@ all this functionality (or not written in HTML at all).

 == STAGE 1 - parsing ==

-Status: A (see source, mainly internal raw)
+Status: A (see source, mainly internals and UTF-8)

-We've got two options for this: HTMLSax or my MarkupLexer. Hopefully, we
-can make the two interfaces compatible. This means that we need a lot
-of little classes:
+The Lexer (currently we have three choices) handles parsing into Tokens.

-* StartTag(name, attributes) is openHandler
-* EndTag(name) is closeHandler
-* EmptyTag(name, attributes) is openHandler (is in array of empties)
-* Data(text) is dataHandler
+Here are the mappings for Lexer_PEARSax3
+
+* Start(name, attributes) is openHandler
+* End(name) is closeHandler
+* Empty(name, attributes) is openHandler (is in array of empties)
+* Data(parse(text)) is dataHandler
 * Comment(text) is escapeHandler (has leading -)
-* CharacterData(text) is escapeHandler (has leading [)
+* Data(text) is escapeHandler (has leading [, CDATA)

 Ignorable/not being implemented (although we probably want to output them raw):
 * ProcessingInstructions(text) is piHandler
 * JavaOrASPInstructions(text) is jaspHandler

-Prefixed with MF (Markup Fragment). We'll make 'em all immutable value objects.


 == STAGE 2 - remove foreign elements ==

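The handler-to-token mapping listed for Lexer_PEARSax3 can be sketched as below. This is Python, not the PHP source, and it rests on several assumptions: the handler methods are driven by hand instead of by XML_HTMLSax3, the contents of EMPTY_ELEMENTS are illustrative, html.unescape stands in for the unspecified parse(text) step, and the escape handler is assumed to receive everything between `<!` and `>`. Token and HandlerSketch are hypothetical names.

```python
import html
from dataclasses import dataclass

# Assumed "array of empties": elements that never take a closing tag.
EMPTY_ELEMENTS = frozenset({"br", "hr", "img", "input"})


@dataclass(frozen=True)
class Token:
    """Immutable token; kind is Start, End, Empty, Data or Comment."""
    kind: str
    name: str = ""
    attributes: tuple = ()
    text: str = ""


class HandlerSketch:
    """Turns SAX-style callbacks into a flat token list."""

    def __init__(self):
        self.tokens = []

    def open_handler(self, name, attributes):
        # openHandler yields Empty for known empties, Start otherwise.
        kind = "Empty" if name in EMPTY_ELEMENTS else "Start"
        attrs = tuple(sorted(attributes.items()))
        self.tokens.append(Token(kind, name, attrs))

    def close_handler(self, name):
        self.tokens.append(Token("End", name))

    def data_handler(self, text):
        # Data(parse(text)): entity decoding is an assumed meaning of parse().
        self.tokens.append(Token("Data", text=html.unescape(text)))

    def escape_handler(self, text):
        if len(text) >= 4 and text.startswith("--") and text.endswith("--"):
            # Leading "-": a comment.
            self.tokens.append(Token("Comment", text=text[2:-2]))
        elif text.startswith("[CDATA[") and text.endswith("]]"):
            # Leading "[": CDATA is literal text, so it is not unescaped.
            self.tokens.append(Token("Data", text=text[7:-2]))
        # Anything else (for example a doctype) is ignored in this sketch.
```

Processing instructions and JSP/ASP blocks have no handler here, matching the "ignorable/not being implemented" note in the hunk. Using one frozen Token type is a simplification for the example; the diff itself names separate Start, End, Empty, Data and Comment tokens.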