mirror of https://github.com/ezyang/htmlpurifier.git synced 2024-12-22 08:21:52 +00:00

Update docs, add lexer.txt

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@83 48356398-32a2-884e-a903-53898d9a118a
Edward Z. Yang 2006-07-22 14:57:12 +00:00
parent d22140b9a6
commit 5bcb3c60cd
3 changed files with 48 additions and 20 deletions

docs/lexer.txt Normal file

@ -0,0 +1,28 @@
Lexer
The lexer parses a string of SGML-style markup and converts it into
corresponding tokens. It doesn't check for correctness, although its
internal mechanism may make this automatic (as in the case of DOMLex).
We have several implementations of the Lexer:
DirectLex - our in-house implementation
DirectLex has absolutely no dependencies, making it a reasonably good
default for PHP4. Written with efficiency in mind, it is generally
faster than the PEAR parser, although the two are very close and their
timings usually overlap. Complete UTF-8 support is planned.
PEARSax3 - uses the PEAR package XML_HTMLSax3 to parse
PEAR, not surprisingly, also has a SAX parser for HTML. I don't know
very much about its implementation, but it's fairly well written. You need
to have PEAR added to your path to use it though. Not sure whether or
not it's UTF-8 aware.
DOMLex - uses the PHP5 core extension DOM to parse
In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
It gives us a forgiving HTML parser, which we use to transform the HTML
into a DOM, and then into the tokens. It is extremely fast, and is the
default choice for PHP 5. However, entity resolution may be troublesome,
though its UTF-8 support is excellent.
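
To make the idea of "markup in, tokens out, no correctness checking" concrete,
here is a minimal sketch in Python. It is not DirectLex or any other lexer
from this project: the Token class, the lex() function and both regular
expressions are hypothetical names invented for illustration, and only
double-quoted attributes are recognized.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token type: kind is "start", "end", "empty" or "text"."""
    kind: str
    value: str          # tag name, or the text itself for "text" tokens
    attrs: tuple = ()   # (name, value) pairs for start/empty tags


# Assumed, simplified grammar: a tag name, an optional attribute run,
# and an optional trailing slash for self-closing tags.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)((?:\s+[^<>]*?)?)\s*(/?)>")
ATTR = re.compile(r'([A-Za-z_:][-A-Za-z0-9_:.]*)\s*=\s*"([^"]*)"')


def lex(markup):
    """Split markup into tokens without checking that tags are balanced."""
    tokens = []
    pos = 0
    for match in TAG.finditer(markup):
        if match.start() > pos:
            # Anything between two tags is passed through as text.
            tokens.append(Token("text", markup[pos:match.start()]))
        closing, name, raw_attrs, self_closing = match.groups()
        attrs = tuple(ATTR.findall(raw_attrs))
        if closing:
            tokens.append(Token("end", name.lower()))
        elif self_closing:
            tokens.append(Token("empty", name.lower(), attrs))
        else:
            tokens.append(Token("start", name.lower(), attrs))
        pos = match.end()
    if pos < len(markup):
        tokens.append(Token("text", markup[pos:]))
    return tokens
```

Calling lex('</i>oops') returns an end token followed by a text token; the
stray closing tag is reported as-is, because well-formedness is left to a
later stage.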


@ -1,4 +1,5 @@
== Possible Security Issues ==
Security
Like anything that claims to afford security, HTML_Purifier can be circumvented
through negligence of people. This class will do its job: no more, no less,
@ -14,10 +15,11 @@ can do). Make sure any input is properly converted to UTF-8, or the parser
will mangle it badly (though it won't be a security risk if you're outputting
it as UTF-8).
2. XHTML 1.0. This is what the parser is outputting. For the most part, it's
compatible with HTML 4.01, but XHTML enforces some very nice things that all
web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode has
waaaay too many quirks for a little parser to handle.
2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
has waaaay too many quirks for a little parser to handle. We did not select
strict in order to prevent ourselves from being too draconian on users.
3. [PROJECTED] IDs. They need to be unique, but without some knowledge of the
rest of the document, it's difficult to know what's unique. I project default


@ -1,5 +1,5 @@
HTML Purifier
HTML Purifier Specification
by Edward Z. Yang
== Introduction ==
@ -39,7 +39,7 @@ with malformed input.
In summary:
1. Parse document into an array of tag and text tokens
1. Parse document into an array of tag and text tokens (Lexer)
2. Remove all elements not on whitelist and transform certain other elements
into acceptable forms (e.g. <font>)
3. Make document well formed while helpfully taking into account certain quirks,
@ -49,10 +49,10 @@ In summary:
important for tables).
5. Validate attributes according to more restrictive definitions based on the
RFCs.
6. Translate back into a string.
6. Translate back into a string. (Generator)
HTML Purifier is best suited for documents that require a rich array of
HTML tags. Things like blog comments are, in all likelihood, most appropriately
HTML tags. Things like blog comments are, in all likelihood, most appropriately
written in an extremely restrictive set of markup that doesn't require
all this functionality (or not written in HTML at all).
@ -60,25 +60,23 @@ all this functionality (or not written in HTML at all).
== STAGE 1 - parsing ==
Status: A (see source, mainly internal raw)
Status: A (see source, mainly internals and UTF-8)
We've got two options for this: HTMLSax or my MarkupLexer. Hopefully, we
can make the two interfaces compatible. This means that we need a lot
of little classes:
The Lexer (currently we have three choices) handles parsing into Tokens.
* StartTag(name, attributes) is openHandler
* EndTag(name) is closeHandler
* EmptyTag(name, attributes) is openHandler (is in array of empties)
* Data(text) is dataHandler
Here are the mappings for Lexer_PEARSax3:
* Start(name, attributes) is openHandler
* End(name) is closeHandler
* Empty(name, attributes) is openHandler (is in array of empties)
* Data(parse(text)) is dataHandler
* Comment(text) is escapeHandler (has leading -)
* CharacterData(text) is escapeHandler (has leading [)
* Data(text) is escapeHandler (has leading [, CDATA)
Ignorable/not being implemented (although we probably want to output them raw):
* ProcessingInstructions(text) is piHandler
* JavaOrASPInstructions(text) is jaspHandler
Prefixed with MF (Markup Fragment). We'll make 'em all immutable value objects.
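
The handler-to-token mappings above can be sketched as follows. This is a
Python illustration that uses the standard library's html.parser.HTMLParser
as a stand-in for a SAX-style parser; it is not XML_HTMLSax3 and not the
project's own code. The token classes, the EMPTIES set, SaxLexer and
tokenize() are hypothetical names. CDATA sections and the ignorable
instruction handlers are left out.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


# Immutable value objects, one per token type (names are assumptions).
@dataclass(frozen=True)
class Start:
    name: str
    attributes: tuple = ()


@dataclass(frozen=True)
class End:
    name: str


@dataclass(frozen=True)
class Empty:
    name: str
    attributes: tuple = ()


@dataclass(frozen=True)
class Data:
    text: str


@dataclass(frozen=True)
class Comment:
    text: str


# Assumed, incomplete "array of empties" used to tell Empty from Start.
EMPTIES = {"br", "hr", "img", "input"}


class SaxLexer(HTMLParser):
    """Collects one token per parser callback."""

    def __init__(self):
        # convert_charrefs=True resolves entities before handle_data runs,
        # which plays the role of Data(parse(text)).
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # The open handler emits Empty when the name is a known empty element.
        cls = Empty if tag in EMPTIES else Start
        self.tokens.append(cls(tag, tuple(attrs)))

    def handle_startendtag(self, tag, attrs):
        # Explicitly self-closed tags such as <br /> are always Empty.
        self.tokens.append(Empty(tag, tuple(attrs)))

    def handle_endtag(self, tag):
        # An Empty token already closed itself, so skip its end tag.
        if tag not in EMPTIES:
            self.tokens.append(End(tag))

    def handle_data(self, data):
        self.tokens.append(Data(data))

    def handle_comment(self, data):
        self.tokens.append(Comment(data))


def tokenize(html):
    lexer = SaxLexer()
    lexer.feed(html)
    lexer.close()  # flush any text still buffered by the parser
    return lexer.tokens
```

Frozen dataclasses are used rather than plain tuples so that a Start and an
Empty with the same name never compare equal.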
== STAGE 2 - remove foreign elements ==