Update docs, add lexer.txt

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@83 48356398-32a2-884e-a903-53898d9a118a
2024-12-22 08:21:52 +00:00 · 2006-07-22 14:57:12 +00:00 · 2006-07-22 14:57:12 +00:00 · 5bcb3c60cd
commit 5bcb3c60cd
parent d22140b9a6
3 changed files with 48 additions and 20 deletions
--- a/docs/lexer.txt
+++ b/docs/lexer.txt
@@ -0,0 +1,28 @@
+
+Lexer
+
+The lexer parses a string of SGML-style markup and converts it into
+corresponding tokens. It doesn't check for correctness, although its
+internal mechanism may make this automatic (as in the case of DOMLex).
+
+We have several implementations of the Lexer:
+
+DirectLex - our in-house implementation
+    DirectLex has absolutely no dependencies, making it a reasonably good
+    default for PHP4.  Written with efficiency in mind, it is generally
+    faster than the PEAR parser, although the two are very close and usually
+    overlap a bit.  Complete UTF-8 support is planned.
+
+PEARSax3 - uses the PEAR package XML_HTMLSax3 to parse
+    PEAR, not surprisingly, also has a SAX parser for HTML.  I don't know
+    very much about its implementation, but it's fairly well written.  You need
+    to have PEAR added to your path to use it though.  Not sure whether or
+    not it's UTF-8 aware.
+
+DOMLex - uses the PHP5 core extension DOM to parse
+    In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
+    It gives us a forgiving HTML parser, which we use to transform the HTML
+    into a DOM, and then into the tokens.  It is extremely fast, and is the
+    default choice for PHP 5.  However, entity resolution may be troublesome,
+    though its UTF-8 support is excellent.
+
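Note on lexer.txt: the contract described above (markup in, flat token list out, no correctness checking) can be illustrated with a small sketch. This is not DirectLex or any of the three implementations; it is written in Python rather than the project's PHP, and the names Start, End, Empty, Text and tokenize are hypothetical, chosen only for this example. It handles only simple tags with double-quoted attributes.

```python
import re
from dataclasses import dataclass, field

# Hypothetical token types for illustration; not the project's classes.
@dataclass
class Start:
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class End:
    name: str

@dataclass
class Empty:
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Text:
    data: str

# Matches <name ...>, </name> and <name ... />; comments, CDATA and
# unquoted attribute values are deliberately out of scope for this sketch.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)((?:\s+[^<>]*?)?)\s*(/?)>")
ATTR_RE = re.compile(r'([A-Za-z_:][-A-Za-z0-9_:.]*)\s*=\s*"([^"]*)"')

def tokenize(markup):
    """Split markup into tokens without checking that tags are balanced."""
    tokens = []
    pos = 0
    for match in TAG_RE.finditer(markup):
        if match.start() > pos:
            # Everything between two tags is emitted as a text token.
            tokens.append(Text(markup[pos:match.start()]))
        closing, name, raw_attrs, self_closing = match.groups()
        name = name.lower()
        if closing:
            tokens.append(End(name))
        else:
            attrs = dict(ATTR_RE.findall(raw_attrs))
            token_type = Empty if self_closing else Start
            tokens.append(token_type(name, attrs))
        pos = match.end()
    if pos < len(markup):
        tokens.append(Text(markup[pos:]))
    return tokens
```

Calling tokenize('<b>x</i>') yields a Start for b, a Text, and an End for i: the mismatched end tag passes through untouched, which is the "doesn't check for correctness" behaviour; repairing it is left to a later stage.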
--- a/docs/security.txt
+++ b/docs/security.txt
@@ -1,4 +1,5 @@
-== Possible Security Issues ==
+
+Security

 Like anything that claims to afford security, HTML_Purifier can be circumvented
 through negligence of people. This class will do its job: no more, no less,
@@ -14,10 +15,11 @@ can do). Make sure any input is properly converted to UTF-8, or the parser
 will mangle it badly (though it won't be a security risk if you're outputting
 it as UTF-8).

-2. XHTML 1.0. This is what the parser is outputting. For the most part, it's
-compatible with HTML 4.01, but XHTML enforces some very nice things that all
-web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode has
-waaaay too many quirks for a little parser to handle.
+2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
+part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
+that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
+has waaaay too many quirks for a little parser to handle. We did not select
+Strict, so as not to be too draconian toward users.

 3. [PROJECTED] IDs. They need to be unique, but without some knowledge of the
 rest of the document, it's difficult to know what's unique. I project default
--- a/docs/spec.txt
+++ b/docs/spec.txt
@@ -1,5 +1,5 @@

-HTML Purifier
+HTML Purifier Specification
  by Edward Z. Yang

 == Introduction ==
@@ -39,7 +39,7 @@ with malformed input.

 In summary:

-1. Parse document into an array of tag and text tokens
+1. Parse document into an array of tag and text tokens (Lexer)
 2. Remove all elements not on whitelist and transform certain other elements
   into acceptable forms (i.e. <font>)
 3. Make document well formed while helpfully taking into account certain quirks,
@@ -49,10 +49,10 @@ In summary:
   important for tables).
 5. Validate attributes according to more restrictive definitions based on the
   RFCs.
-6. Translate back into a string.
+6. Translate back into a string. (Generator)

 HTML Purifier is best suited for documents that require a rich array of
-HTML tags. Things like blog comments are, in all likelihood, most appropriately
+HTML tags.  Things like blog comments are, in all likelihood, most appropriately
 written in an extremely restrictive set of markup that doesn't require
 all this functionality (or not written in HTML at all).

@@ -60,25 +60,23 @@ all this functionality (or not written in HTML at all).

 == STAGE 1 - parsing ==

-    Status: A (see source, mainly internal raw)
+    Status: A (see source, mainly internals and UTF-8)

-We've got two options for this: HTMLSax or my MarkupLexer. Hopefully, we
-can make the two interfaces compatible. This means that we need a lot
-of little classes:
+The Lexer (currently we have three choices) handles parsing into Tokens.

-* StartTag(name, attributes)    is openHandler
-* EndTag(name)                  is closeHandler
-* EmptyTag(name, attributes)    is openHandler   (is in array of empties)
-* Data(text)                    is dataHandler
+Here are the mappings for Lexer_PEARSax3:
+
+* Start(name, attributes)       is openHandler
+* End(name)                     is closeHandler
+* Empty(name, attributes)       is openHandler   (is in array of empties)
+* Data(parse(text))             is dataHandler
 * Comment(text)                 is escapeHandler (has leading -)
-* CharacterData(text)           is escapeHandler (has leading [)
+* Data(text)                    is escapeHandler (has leading [, CDATA)

 Ignorable/not being implemented (although we probably want to output them raw):
 * ProcessingInstructions(text)  is piHandler
 * JavaOrASPInstructions(text)   is jaspHandler

-Prefixed with MF (Markup Fragment). We'll make 'em all immutable value objects.
-


 == STAGE 2 - remove foreign elements ==
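Note on the Lexer_PEARSax3 mappings in the STAGE 1 hunk above: the handler-to-token table can be sketched as a small collector object whose callbacks append tokens. This is a Python illustration, not the project's PHP code and not the XML_HTMLSax3 API. The class and method names, the list of empty elements, the use of html.unescape as a stand-in for parse(text), and the assumption that the escape handler receives the text between "<!" and ">" are all hypothetical.

```python
import html
from dataclasses import dataclass, field

# Illustrative subset standing in for the "array of empties".
EMPTY_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}

@dataclass
class Start:
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class End:
    name: str

@dataclass
class Empty:
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Data:
    text: str

@dataclass
class Comment:
    text: str

def _strip(text, prefix, suffix):
    """Remove prefix and suffix from text when present."""
    if text.startswith(prefix):
        text = text[len(prefix):]
    if suffix and text.endswith(suffix):
        text = text[:-len(suffix)]
    return text

class TokenCollector:
    """Hypothetical SAX-style handler object that records tokens."""

    def __init__(self):
        self.tokens = []

    def open_handler(self, name, attributes):
        # Start and Empty share the open handler; the empties list decides.
        token_type = Empty if name in EMPTY_ELEMENTS else Start
        self.tokens.append(token_type(name, dict(attributes)))

    def close_handler(self, name):
        # Assumption: an empty element was already emitted as Empty.
        if name not in EMPTY_ELEMENTS:
            self.tokens.append(End(name))

    def data_handler(self, text):
        # html.unescape stands in for the spec's parse(text) step.
        self.tokens.append(Data(html.unescape(text)))

    def escape_handler(self, text):
        if text.startswith("-"):
            # Leading "-": a comment, delimited by "--" on both sides.
            self.tokens.append(Comment(_strip(text, "--", "--")))
        elif text.startswith("["):
            # Leading "[": CDATA, emitted as Data without entity parsing.
            self.tokens.append(Data(_strip(text, "[CDATA[", "]]")))
```

Feeding the collector open_handler("br", {}) produces an Empty token, and escape_handler("[CDATA[<raw>]]") produces Data("<raw>") with the content left unparsed, matching the last two rows of the table.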