htmlpurifier/docs/lexer.txt


Lexer

The lexer parses a string of SGML-style markup and converts it into
corresponding tokens. It doesn't check for well-formedness, although its
internal mechanism may make this automatic (as is the case with DOMLex).
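
To make "markup in, tokens out" concrete, here is a minimal sketch in Python.
It is not the code of any lexer described in this document: the function name
tokenize, the (kind, value) token shape, and the regular expression are all
assumptions made for illustration, and attributes, comments, and entities are
ignored.

```python
import re
from typing import List, Tuple

# Hypothetical token shape for this sketch: (kind, value), where kind is
# one of "start", "end", "empty", or "text".
Token = Tuple[str, str]

# Matches <name ...>, </name>, and <name ... />. Attributes are skipped.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*?(/?)>")


def tokenize(markup: str) -> List[Token]:
    """Split markup into a flat list of tokens without checking nesting."""
    tokens: List[Token] = []
    pos = 0
    for match in TAG.finditer(markup):
        # Anything between the previous tag and this one is text.
        if match.start() > pos:
            tokens.append(("text", markup[pos:match.start()]))
        closing, name, self_closing = match.groups()
        if closing:
            tokens.append(("end", name))
        elif self_closing:
            tokens.append(("empty", name))
        else:
            tokens.append(("start", name))
        pos = match.end()
    # Trailing text after the last tag.
    if pos < len(markup):
        tokens.append(("text", markup[pos:]))
    return tokens
```

Note that tokenize("<b>bold<i>text</b>") happily returns a start token for i
that is never closed. The sketch, like the lexer described above, leaves
well-formedness to a later stage.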

We have several implementations of the Lexer:

DirectLex [4,5] - our in-house implementation
    DirectLex has absolutely no dependencies, making it a reasonably good
    default for PHP 4.  Written with efficiency in mind, it is up to two
    times faster than the PEAR parser.  It does not yet support UTF-8
    completely, but eventually will.

PEARSax3 [4,5] - uses the PEAR package XML_HTMLSax3 to parse
    PEAR, not surprisingly, also has a SAX parser for HTML.  I don't know
    very much about its implementation, but it's fairly well written.  However,
    that abstraction comes at a price: performance.  You need to have it
    installed, and if the API changes, it might break our adapter.  I'm not sure
    whether it's UTF-8 aware, but it has some trouble parsing entities.

DOMLex [5] - uses the PHP5 core extension DOM to parse
    In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
    It gives us a forgiving HTML parser, which we use to transform the HTML
    into a DOM, and then into tokens.  It is blazingly fast, and is the
    default choice for PHP 5.  Entity resolution may be troublesome,
    though its UTF-8 support is excellent.  Also, any empty element will have
    an empty token associated with it, even if this is prohibited.
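
The "HTML into a DOM, and then into tokens" step can be sketched as a
depth-first walk over the tree.  The Python below is only an analogy, not
DOMLex itself: it uses the standard library's xml.dom.minidom, which is a
strict XML parser rather than a forgiving HTML one, and the names lex and
dom_to_tokens, the wrapping root element, and the (kind, value) token shape
are assumptions made for illustration.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def dom_to_tokens(node, tokens):
    """Depth-first walk that flattens a DOM subtree into (kind, value) tokens."""
    if node.nodeType == Node.TEXT_NODE:
        tokens.append(("text", node.data))
    elif node.nodeType == Node.ELEMENT_NODE:
        if not node.hasChildNodes():
            # The tree cannot tell <div></div> from <div/>, so every
            # childless element becomes an "empty" token.
            tokens.append(("empty", node.tagName))
            return
        tokens.append(("start", node.tagName))
        for child in node.childNodes:
            dom_to_tokens(child, tokens)
        tokens.append(("end", node.tagName))


def lex(fragment):
    """Parse a well-formed fragment and return its tokens as a flat list."""
    # A hypothetical wrapper element gives the parser a single root.
    doc = minidom.parseString("<root>" + fragment + "</root>")
    tokens = []
    for child in doc.documentElement.childNodes:
        dom_to_tokens(child, tokens)
    return tokens
```

Calling lex("<p>Hi<br/></p><div></div>") yields an empty token for the div as
well as for the br, which is the behaviour noted above: once the markup is a
tree, an element with no children looks the same however it was written.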

We use tokens because a DOM representation would:

1. Require more processing power to create,
2. Require recursion to iterate,
3. Need to be compatible with PHP 5's DOM,
4. Contain the entire document structure (html and body are not needed), and
5. Offer an unknown readability improvement.

What the last item means is that the functions for manipulating tokens are
already fairly compact, and when well-commented, more abstraction may not
be needed.
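
As a rough illustration of why a flat token list is easy to work with, the
Python sketch below filters and measures a token stream with plain loops and
no recursion.  The (kind, value) token shape and the helper names strip_tags
and max_depth are hypothetical; they are not functions from this library.

```python
def strip_tags(tokens, allowed):
    """Drop start/end/empty tokens whose tag name is not allowed; keep text."""
    return [
        (kind, value)
        for kind, value in tokens
        if kind == "text" or value in allowed
    ]


def max_depth(tokens):
    """Return the deepest element nesting level seen in one linear pass."""
    depth = deepest = 0
    for kind, _ in tokens:
        if kind == "start":
            depth += 1
            deepest = max(deepest, depth)
        elif kind == "end":
            depth -= 1
    return deepest


# A small token stream standing in for <div><b>x</b><br /></div>.
tokens = [
    ("start", "div"),
    ("start", "b"),
    ("text", "x"),
    ("end", "b"),
    ("empty", "br"),
    ("end", "div"),
]
```

Here strip_tags(tokens, {"b", "br"}) removes the div tokens while leaving
everything inside them in place.  The equivalent operation on a tree would
have to re-parent the children of each removed node.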