Update docs, add lexer.txt

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@83 48356398-32a2-884e-a903-53898d9a118a
2024-12-22 16:31:53 +00:00 · 2006-07-22 14:57:12 +00:00 · 2006-07-22 14:57:12 +00:00 · 5bcb3c60cd
commit 5bcb3c60cd
parent d22140b9a6
3 changed files with 48 additions and 20 deletions
--- a/docs/lexer.txt
+++ b/docs/lexer.txt
@@ -0,0 +1,28 @@
+
+Lexer
+
+The lexer parses a string of SGML-style markup and converts it into
+corresponding tokens. It doesn't check for correctness, although its
+internal mechanism may make this automatic (as is the case with DOMLex).
+
+We have several implementations of the Lexer:
+
+DirectLex - our in-house implementation
+    DirectLex has absolutely no dependencies, making it a reasonably good
+    default for PHP4.  Written with efficiency in mind, it is generally
+    faster than the PEAR parser, although the two are close enough that their
+    results usually overlap.  It will eventually support UTF-8 completely.
+
+PEARSax3 - uses the PEAR package XML_HTMLSax3 to parse
+    PEAR, not surprisingly, also has a SAX parser for HTML.  I don't know
+    very much about its implementation, but it's fairly well written.  You
+    need to have PEAR added to your path to use it, though.  I'm not sure
+    whether or not it's UTF-8 aware.
+
+DOMLex - uses the PHP5 core extension DOM to parse
+    In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
+    It gives us a forgiving HTML parser, which we use to transform the HTML
+    into a DOM, and then into the tokens.  It is extremely fast, and is the
+    default choice for PHP 5.  Entity resolution may be troublesome, but
+    its UTF-8 support is excellent.
+
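A minimal sketch of what "converts markup into tokens without checking correctness" can look like. It is written in Python with the standard-library html.parser purely for illustration; HTML Purifier's lexers are PHP, and the names SketchLexer and tokenize, as well as the tuple token shapes, are assumptions rather than the library's API.

```python
from html.parser import HTMLParser


class SketchLexer(HTMLParser):
    """Hypothetical lexer: records tokens, performs no well-formedness checks."""

    def __init__(self):
        # convert_charrefs=True resolves entities inside text before handle_data.
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("Start", tag, dict(attrs)))

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing syntax such as <br />.
        self.tokens.append(("Empty", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("End", tag))

    def handle_data(self, data):
        self.tokens.append(("Data", data))


def tokenize(markup):
    """Return the token list for a markup string (assumed helper)."""
    lexer = SketchLexer()
    lexer.feed(markup)
    lexer.close()  # flush any buffered text
    return lexer.tokens
```

Tokenizing '<b>Bold<br />' with this sketch yields a Start token for b, a Data token, and an Empty token for br, with no End token for b. The unbalanced tag passes through untouched, which is the sense in which the lexer "doesn't check for correctness".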
--- a/docs/security.txt
+++ b/docs/security.txt
@@ -1,4 +1,5 @@
-== Possible Security Issues ==
+
+Security

 Like anything that claims to afford security, HTML_Purifier can be circumvented
 through negligence of people. This class will do its job: no more, no less,
@@ -14,10 +15,11 @@ can do). Make sure any input is properly converted to UTF-8, or the parser
 will mangle it badly (though it won't be a security risk if you're outputting
 it as UTF-8).

-2. XHTML 1.0. This is what the parser is outputting. For the most part, it's
-compatible with HTML 4.01, but XHTML enforces some very nice things that all
-web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode has
-waaaay too many quirks for a little parser to handle.
+2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
+part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
+that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
+has waaaay too many quirks for a little parser to handle. We did not select
+Strict in order to prevent ourselves from being too draconian on users.

 3. [PROJECTED] IDs. They need to be unique, but without some knowledge of the
 rest of the document, it's difficult to know what's unique. I project default
--- a/docs/spec.txt
+++ b/docs/spec.txt
@@ -1,5 +1,5 @@

-HTML Purifier
+HTML Purifier Specification
  by Edward Z. Yang

 == Introduction ==
@@ -39,7 +39,7 @@ with malformed input.

 In summary:

-1. Parse document into an array of tag and text tokens
+1. Parse document into an array of tag and text tokens (Lexer)
 2. Remove all elements not on whitelist and transform certain other elements
   into acceptable forms (i.e. <font>)
 3. Make document well formed while helpfully taking into account certain quirks,
@@ -49,7 +49,7 @@ In summary:
   important for tables).
 5. Validate attributes according to more restrictive definitions based on the
   RFCs.
-6. Translate back into a string.
+6. Translate back into a string (Generator).

 HTML Purifier is best suited for documents that require a rich array of
 HTML tags.  Things like blog comments are, in all likelihood, most appropriately
@@ -60,25 +60,23 @@ all this functionality (or not written in HTML at all).

 == STAGE 1 - parsing ==

-    Status: A (see source, mainly internal raw)
+    Status: A (see source, mainly internals and UTF-8)

-We've got two options for this: HTMLSax or my MarkupLexer. Hopefully, we
-can make the two interfaces compatible. This means that we need a lot
-of little classes:
+The Lexer (currently we have three choices) handles parsing into Tokens.

-* StartTag(name, attributes)    is openHandler
-* EndTag(name)                  is closeHandler
-* EmptyTag(name, attributes)    is openHandler   (is in array of empties)
-* Data(text)                    is dataHandler
+Here are the mappings for Lexer_PEARSax3:
+
+* Start(name, attributes)       is openHandler
+* End(name)                     is closeHandler
+* Empty(name, attributes)       is openHandler   (is in array of empties)
+* Data(parse(text))             is dataHandler
 * Comment(text)                 is escapeHandler (has leading -)
-* CharacterData(text)           is escapeHandler (has leading [)
+* Data(text)                    is escapeHandler (has leading [, CDATA)

 Ignorable/not being implemented (although we probably want to output them raw):
 * ProcessingInstructions(text)  is piHandler
 * JavaOrASPInstructions(text)   is jaspHandler
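
The handler mappings above can be sketched as a small adapter that turns SAX callbacks into tokens. This is a Python illustration only, not the PHP code of Lexer_PEARSax3. The class name SaxToTokens, the contents of EMPTY_ELEMENTS, and the exact text passed to the escape handler ("-- ... --" for comments, "[CDATA[ ... ]]" for character data) are assumptions made for the example.

```python
import html

# Assumed subset of the "array of empties"; the real list is not given here.
EMPTY_ELEMENTS = {"br", "hr", "img", "input"}


def _unwrap(text, prefix, suffix):
    """Remove one leading prefix and one trailing suffix, if present."""
    if text.startswith(prefix):
        text = text[len(prefix):]
    if suffix and text.endswith(suffix):
        text = text[:-len(suffix)]
    return text


class SaxToTokens:
    """Hypothetical adapter: one method per SAX handler in the mapping."""

    def __init__(self):
        self.tokens = []

    def open_handler(self, name, attributes):
        # openHandler yields Empty when the element is in the empties list.
        kind = "Empty" if name in EMPTY_ELEMENTS else "Start"
        self.tokens.append((kind, name, dict(attributes)))

    def close_handler(self, name):
        self.tokens.append(("End", name))

    def data_handler(self, text):
        # Data(parse(text)): entities are resolved before storing the text.
        self.tokens.append(("Data", html.unescape(text)))

    def escape_handler(self, text):
        if text.startswith("-"):
            # Leading "-" marks a comment.
            self.tokens.append(("Comment", _unwrap(text, "--", "--")))
        elif text.startswith("["):
            # Leading "[" marks CDATA, stored as Data without entity parsing.
            self.tokens.append(("Data", _unwrap(text, "[CDATA[", "]]")))
        # Other escapes are ignored in this sketch.
```

Note the asymmetry the mapping implies: text from the data handler goes through entity parsing, while CDATA from the escape handler is stored verbatim.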

-Prefixed with MF (Markup Fragment). We'll make 'em all immutable value objects.
-


 == STAGE 2 - remove foreign elements ==