Update docs, add lexer.txt

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@83 48356398-32a2-884e-a903-53898d9a118a
2024-12-23 00:41:52 +00:00 · 2006-07-22 14:57:12 +00:00 · 5bcb3c60cd
commit 5bcb3c60cd
parent d22140b9a6
3 changed files with 48 additions and 20 deletions
--- a/docs/lexer.txt
+++ b/docs/lexer.txt
@@ -0,0 +1,28 @@
 Lexer
 The lexer parses a string of SGML-style markup and converts it into
 corresponding tokens. It doesn't check for correctness, although its
 internal mechanism may make this automatic (as is the case with DOMLex).
 We have several implementations of the Lexer:
 DirectLex - our in-house implementation
    DirectLex has absolutely no dependencies, making it a reasonably good
    default for PHP4.  Written with efficiency in mind, it is generally
    faster than the PEAR parser, although the two are very close and their
    timings usually overlap.  Full UTF-8 support is planned.
 PEARSax3 - uses the PEAR package XML_HTMLSax3 to parse
    PEAR, not surprisingly, also has a SAX parser for HTML.  I don't know
    very much about its implementation, but it's fairly well written.  You
    need to have PEAR added to your path to use it though.  Not sure whether
    or not it's UTF-8 aware.
 DOMLex - uses the PHP5 core extension DOM to parse
    In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
    It gives us a forgiving HTML parser, which we use to transform the HTML
    into a DOM, and then into the tokens.  It is extremely fast, and is the
    default choice for PHP 5.  However, entity resolution may be troublesome,
    though its UTF-8 support is excellent.
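A rough illustration of what the lexer description above means by "converts markup into tokens without checking correctness". This is a minimal sketch in Python, not the project's PHP code: the `Token` class, the `tokenize` function, and both regular expressions are hypothetical names invented for this example, and only double-quoted attributes are recognized.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    """Hypothetical token type; not the project's actual class."""
    kind: str            # "start", "end", "empty", or "text"
    value: str           # tag name, or the text itself
    attrs: tuple = ()    # (name, value) pairs for start/empty tags

# Assumption: a deliberately simple tag pattern, good enough for a sketch.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)((?:\s+[^<>]*?)?)\s*(/?)>")
ATTR = re.compile(r'([A-Za-z_:][-A-Za-z0-9_:.]*)\s*=\s*"([^"]*)"')

def tokenize(markup):
    """Split markup into tokens. No nesting or balance checks are made."""
    tokens = []
    pos = 0
    for m in TAG.finditer(markup):
        if m.start() > pos:
            # Everything between two tags is emitted as one text token.
            tokens.append(Token("text", markup[pos:m.start()]))
        closing, name, raw_attrs, self_closing = m.groups()
        attrs = tuple(ATTR.findall(raw_attrs))
        if closing:
            tokens.append(Token("end", name))
        elif self_closing:
            tokens.append(Token("empty", name, attrs))
        else:
            tokens.append(Token("start", name, attrs))
        pos = m.end()
    if pos < len(markup):
        tokens.append(Token("text", markup[pos:]))
    return tokens
```

Feeding this sketch mismatched input such as `<b class="x">hi</i>` yields a start token for `b` followed by an end token for `i`; catching that mismatch is left to a later stage, as the text says.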
--- a/docs/security.txt
+++ b/docs/security.txt
@@ -1,4 +1,5 @@
-== Possible Security Issues ==
+
 Security
 Like anything that claims to afford security, HTML_Purifier can be circumvented
 through negligence of people. This class will do its job: no more, no less,
@@ -14,10 +15,11 @@ can do). Make sure any input is properly converted to UTF-8, or the parser
 will mangle it badly (though it won't be a security risk if you're outputting
 it as UTF-8).
-2. XHTML 1.0. This is what the parser is outputting. For the most part, it's
+2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
-compatible with HTML 4.01, but XHTML enforces some very nice things that all
+part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
-web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode has
+that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
-waaaay too many quirks for a little parser to handle.
+has waaaay too many quirks for a little parser to handle. We did not select
 strict in order to prevent ourselves from being too draconic on users.
 3. [PROJECTED] IDs. They need to be unique, but without some knowledge of the
 rest of the document, it's difficult to know what's unique. I project default
--- a/docs/spec.txt
+++ b/docs/spec.txt
@@ -1,5 +1,5 @@
-HTML Purifier
+HTML Purifier Specification
  by Edward Z. Yang
 == Introduction ==
@@ -39,7 +39,7 @@ with malformed input.
 In summary:
-1. Parse document into an array of tag and text tokens
+1. Parse document into an array of tag and text tokens (Lexer)
 2. Remove all elements not on whitelist and transform certain other elements
   into acceptable forms (e.g. <font>)
 3. Make document well formed while helpfully taking into account certain quirks,
@@ -49,7 +49,7 @@ In summary:
   important for tables).
 5. Validate attributes according to more restrictive definitions based on the
   RFCs.
-6. Translate back into a string.
+6. Translate back into a string. (Generator)
 HTML Purifier is best suited for documents that require a rich array of
 HTML tags.  Things like blog comments are, in all likelihood, most appropriately
@@ -60,25 +60,23 @@ all this functionality (or not written in HTML at all).
 == STAGE 1 - parsing ==
-    Status: A (see source, mainly internal raw)
+    Status: A (see source, mainly internals and UTF-8)
-We've got two options for this: HTMLSax or my MarkupLexer. Hopefully, we
+The Lexer (currently we have three choices) handles parsing into Tokens.
 can make the two interfaces compatible. This means that we need a lot
 of little classes:
-* StartTag(name, attributes)    is openHandler
+Here are the mappings for Lexer_PEARSax3
-* EndTag(name)                  is closeHandler
+
-* EmptyTag(name, attributes)    is openHandler   (is in array of empties)
+* Start(name, attributes)       is openHandler
-* Data(text)                    is dataHandler
+* End(name)                     is closeHandler
 * Empty(name, attributes)       is openHandler   (is in array of empties)
 * Data(parse(text))             is dataHandler
 * Comment(text)                 is escapeHandler (has leading -)
-* CharacterData(text)           is escapeHandler (has leading [)
+* Data(text)                    is escapeHandler (has leading [, CDATA)
 Ignorable/not being implemented (although we probably want to output them raw):
 * ProcessingInstructions(text)  is piHandler
 * JavaOrASPInstructions(text)   is jaspHandler
 Prefixed with MF (Markup Fragment). We'll make 'em all immutable value objects.
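The handler-to-token mapping above can be sketched with any SAX-style parser. The following uses Python's standard `html.parser.HTMLParser` as a stand-in for XML_HTMLSax3, so the handler names differ from the PEAR ones; the token classes, the `EMPTIES` set, `TokenCollector`, and `lex` are hypothetical names for illustration only. CDATA and processing instructions are left out of the sketch.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Immutable value objects, one per token kind (names are illustrative).
@dataclass(frozen=True)
class Start:
    name: str
    attributes: tuple

@dataclass(frozen=True)
class End:
    name: str

@dataclass(frozen=True)
class Empty:
    name: str
    attributes: tuple

@dataclass(frozen=True)
class Data:
    text: str

@dataclass(frozen=True)
class Comment:
    text: str

# Assumption: a small sample of empty elements, not a complete list.
EMPTIES = frozenset({"br", "hr", "img", "input"})

class TokenCollector(HTMLParser):
    """Maps parser callbacks onto token objects."""

    def __init__(self):
        # convert_charrefs=True hands decoded text to handle_data,
        # roughly the Data(parse(text)) case in the mapping.
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # The open handler yields Empty when the name is in the empties set.
        cls = Empty if tag in EMPTIES else Start
        self.tokens.append(cls(tag, tuple(attrs)))

    def handle_startendtag(self, tag, attrs):
        # Explicit self-closing syntax such as <br /> is always Empty.
        self.tokens.append(Empty(tag, tuple(attrs)))

    def handle_endtag(self, tag):
        if tag not in EMPTIES:
            self.tokens.append(End(tag))

    def handle_data(self, data):
        self.tokens.append(Data(data))

    def handle_comment(self, data):
        self.tokens.append(Comment(data))

def lex(html):
    """Run the collector over a string and return the token list."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens
```

Because the token classes are frozen dataclasses, they compare by value and cannot be mutated after creation, which is one way to get the "immutable value objects" the text asks for.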
 == STAGE 2 - remove foreign elements ==