2006-07-22 14:57:12 +00:00
Lexer

The lexer parses a string of SGML-style markup and converts it into
corresponding tokens. It doesn't check for well-formedness, although its
internal mechanism may make this automatic (as in the case of DOMLex).
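
As a rough illustration of what converting markup into tokens means, here is
a minimal sketch. It is written in Python rather than PHP and leans on the
standard library's html.parser module. The tuple layout of the tokens and the
names TokenCollector and lex are assumptions made for this example; they are
not the classes the lexer implementations actually use.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collects a flat list of tokens (hypothetical, for illustration only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br /> becomes a single "empty" token.
        self.tokens.append(("empty", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def lex(markup):
    """Turn a markup string into a flat list of token tuples."""
    collector = TokenCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered at the end
    return collector.tokens


# The mismatched </i> is passed through untouched: the sketch, like the
# lexer described above, makes no attempt to check well-formedness.
for token in lex("<b>Bold<br />text</i>"):
    print(token)
```

Checking that tags are balanced, or repairing them when they are not, would
be the job of a later stage that consumes the token list.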

We have several implementations of the Lexer (the bracketed numbers are the
PHP versions each one supports):

DirectLex [4,5] - our in-house implementation
  DirectLex has absolutely no dependencies, making it a reasonably good
  default for PHP 4. Written with efficiency in mind, it is up to two
  times faster than the PEAR parser. Complete UTF-8 support is planned
  but not yet in place.

PEARSax3 [4,5] - uses the PEAR package XML_HTMLSax3 to parse
  PEAR, not surprisingly, also has a SAX parser for HTML. I don't know
  very much about its implementation, but it's fairly well written. However,
  its abstraction comes at a price: performance. It also has to be installed
  separately, and if its API changes, it might break our adapter. I'm not
  sure whether it's UTF-8 aware, and it has some trouble parsing entities.

DOMLex [5] - uses the PHP 5 core extension DOM to parse
  In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
  It gives us a forgiving HTML parser, which we use to transform the HTML
  into a DOM, and then into tokens. It is blazingly fast, and is the
  default choice for PHP 5. Its UTF-8 support is excellent, but entity
  resolution may be troublesome. Also, any empty element will have an empty
  token associated with it, even if this is prohibited.

We use tokens because a DOM representation would:

1. Require more processing power to create,
2. Require recursion to iterate,
3. Need to be compatible with PHP 5's DOM,
4. Include the entire document structure (html and body are not needed), and
5. Offer an unknown improvement in readability.

What the last item means is that the functions for manipulating tokens are
already fairly compact, and when they are well commented, more abstraction
may not be needed.
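
To make the point about recursion concrete, the sketch below walks a flat
token list with a single loop and a counter. It is in Python rather than PHP,
and the tuple-based tokens and the max_depth helper are hypothetical, chosen
only for this illustration. The same question asked of a DOM tree would
normally be answered by recursing into child nodes.

```python
def max_depth(tokens):
    """Return the deepest nesting level found in a flat token list."""
    depth = 0
    deepest = 0
    for token in tokens:
        kind = token[0]
        if kind == "start":
            depth += 1
            deepest = max(deepest, depth)
        elif kind == "end":
            depth -= 1
        # "empty" and "text" tokens leave the depth unchanged.
    return deepest


# Hypothetical tokens for: <div><p>Hi<br /></p></div>
tokens = [
    ("start", "div"),
    ("start", "p"),
    ("text", "Hi"),
    ("empty", "br"),
    ("end", "p"),
    ("end", "div"),
]

print(max_depth(tokens))  # prints 2
```

Because nesting is implied by the order of start and end tokens, a counter
(or an explicit stack, when tag names matter) is enough to track it.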