htmlpurifier/library/HTMLPurifier/Lexer.php

<?php

require_once 'HTMLPurifier/Token.php';

/**
 * Forgivingly lexes HTML (SGML-style) markup into tokens.
 * 
 * A lexer parses a string of SGML-style markup and converts it into
 * corresponding tokens.  It doesn't check for well-formedness, although its
 * internal mechanism may make this automatic (such as the case of
 * HTMLPurifier_Lexer_DOMLex).  There are several implementations to choose
 * from.
 * 
 * A lexer is HTML-oriented: it might work with XML, but it's not
 * recommended, as we adhere to a subset of the specification for optimization
 * reasons.
 * 
 * This class should not be directly instantiated, but you may use create() to
 * retrieve a default copy of the lexer.  Being a supertype, this class
 * does not actually define any implementation, but offers commonly used
 * convenience functions for subclasses.
 * 
 * @note The unit tests will instantiate this class for testing purposes, as
 *       many of the utility functions require a class to be instantiated.
 *       Be careful when porting this class to PHP 5.
 * 
 * @par
 * 
 * @note
 * We use tokens rather than create a DOM representation because DOM would:
 * 
 * @par
 *  -# Require more processing power to create,
 *  -# Require recursion to iterate,
 *  -# Need to stay compatible with PHP 5's DOM (otherwise duplication),
 *  -# Carry the entire document structure (html and body not needed), and
 *  -# Offer an unknown readability improvement.
 * 
 * @par
 * What the last item means is that the functions for manipulating tokens are
 * already fairly compact, and when well-commented, more abstraction may not
 * be needed.
 * 
 * @see HTMLPurifier_Token
 */
class HTMLPurifier_Lexer
{
    
    /**
     * Lexes an HTML string into tokens.
     * 
     * @param $string String HTML.
     * @return HTMLPurifier_Token array representation of HTML.
     */
    function tokenizeHTML($string) {
        trigger_error('Call to abstract class', E_USER_ERROR);
    }
    
    /**
     * Retrieves or sets the default Lexer as a Prototype Factory.
     * 
     * Depending on what PHP version you are running, the abstract base
     * Lexer class will determine which concrete Lexer is best for you:
     * HTMLPurifier_Lexer_DirectLex for PHP 4, and HTMLPurifier_Lexer_DOMLex
     * for PHP 5 and beyond.
     * 
     * Passing the optional prototype lexer parameter will override the
     * default with your own implementation.  A copy/reference of the prototype
     * lexer will now be returned when you request a new lexer.
     * 
     * @note
     * Though it is possible to call this factory method from subclasses,
     * such usage is not recommended.
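     * 
     * @par Example
     * A minimal usage sketch, assuming the HTMLPurifier library directory is
     * on the include path; the input string is purely illustrative:
     * @code
     * require_once 'HTMLPurifier/Lexer.php';
     * $lexer  = HTMLPurifier_Lexer::create();   // default lexer for this PHP version
     * $tokens = $lexer->tokenizeHTML('<b>Bold</b> text');
     * @endcode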
     * 
     * @param $prototype Optional prototype lexer.
     * @return Concrete lexer.
     */
    function create($prototype = null) {
        // we don't really care if it's a reference or a copy
        static $lexer = null;
        if ($prototype) {
            $lexer = $prototype;
        }
        if (empty($lexer)) {
            if (version_compare(PHP_VERSION, '5', '>=')) {
                require_once 'HTMLPurifier/Lexer/DOMLex.php';
                $lexer = new HTMLPurifier_Lexer_DOMLex();
            } else {
                require_once 'HTMLPurifier/Lexer/DirectLex.php';
                $lexer = new HTMLPurifier_Lexer_DirectLex();
            }
        }
        return $lexer;
    }
    
    /**
     * Decimal to parsed string conversion table for special entities.
     * @protected
     */
    var $_special_dec2str =
            array(
                    34 => '"',
                    38 => '&',
                    39 => "'",
                    60 => '<',
                    62 => '>'
            );
    
    /**
     * Stripped entity names to decimal conversion table for special entities.
     * @protected
     */
    var $_special_ent2dec =
            array(
                    'quot' => 34,
                    'amp'  => 38,
                    'lt'   => 60,
                    'gt'   => 62
            );
    
    /**
     * Most common entity to raw value conversion table for special entities.
     * @protected
     */
    var $_special_entity2str =
            array(
                    '&quot;' => '"',
                    '&amp;'  => '&',
                    '&lt;'   => '<',
                    '&gt;'   => '>',
                    '&#39;'  => "'",
                    '&#039;' => "'",
                    '&#x27;' => "'"
            );
    
    /**
     * Callback regex string for parsing entities.
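     * 
     * @par Example
     * An illustrative sketch (not part of the library's API) of which capture
     * group is filled for each entity form:
     * @code
     * $regex = '/&(?:[#]x([a-fA-F0-9]+)|[#]0*(\d+)|([A-Za-z]+));?/';
     * preg_match($regex, '&#x41;',   $m); // $m[1] === '41'     (hex digits)
     * preg_match($regex, '&#065;',   $m); // $m[2] === '65'     (leading zeros dropped)
     * preg_match($regex, '&eacute;', $m); // $m[3] === 'eacute' (entity name)
     * @endcode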
     * @protected
     */                             
    var $_substituteEntitiesRegex =
'/&(?:[#]x([a-fA-F0-9]+)|[#]0*(\d+)|([A-Za-z]+));?/';
//     1. hex             2. dec      3. string
    
    /**
     * Substitutes non-special entities with their parsed equivalents. Since
     * running this on every piece of parsed character data would be slow, we
     * run it once, before everything else.
     * 
     * @protected
     * @param $string String to have non-special entities parsed.
     * @returns Parsed string.
     */
    function substituteNonSpecialEntities($string) {
        // it will try to detect missing semicolons, but don't rely on it
        return preg_replace_callback(
            $this->_substituteEntitiesRegex,
            array($this, 'nonSpecialEntityCallback'),
            $string
            );
    }
    
    /**
     * Callback function for substituteNonSpecialEntities() that does the work.
     * 
     * @warning Though this is public in order to let the callback happen,
     *          calling it directly is not recommended.
     * @param $matches  PCRE matches array, with 0 the entire match, and
     *                  either index 1, 2 or 3 set with a hex value, dec value,
     *                  or string (respectively).
     * @returns Replacement string.
     * @todo Implement string translations
     */
    function nonSpecialEntityCallback($matches) {
        // replaces every entity except the five special characters (" & ' < >)
        $entity = $matches[0];
        $is_num = (@$matches[0][1] === '#');
        if ($is_num) {
            $is_hex = (@$entity[2] === 'x');
            $int = $is_hex ? hexdec($matches[1]) : (int) $matches[2];
            if (isset($this->_special_dec2str[$int]))  return $entity;
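            // NOTE: chr() yields a single byte, so code points above 255
            // wrap around instead of becoming multi-byte UTF-8 sequences
            return chr($int);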
        } else {
            if (isset($this->_special_ent2dec[$matches[3]])) return $entity;
            if (!$this->_entity_lookup) {
                require_once 'HTMLPurifier/EntityLookup.php';
                $this->_entity_lookup = HTMLPurifier_EntityLookup::instance();
            }
            if (isset($this->_entity_lookup->table[$matches[3]])) {
                return $this->_entity_lookup->table[$matches[3]];
            } else {
                return $entity;
            }
        }
    }
    
    /**
     * Cached HTMLPurifier_EntityLookup instance, loaded on first use.
     * @protected
     */
    var $_entity_lookup;
    
    /**
     * Translates CDATA sections into regular sections (through escaping).
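     * 
     * @par Example
     * A sketch of the intended effect, called here from inside a subclass
     * method; the input string is illustrative:
     * @code
     * $html = $this->escapeCDATA('<b><![CDATA[1 < 2]]></b>');
     * // $html === '<b>1 &lt; 2</b>'
     * @endcode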
     * 
     * @protected
     * @param $string HTML string to process.
     * @returns HTML with CDATA sections escaped.
     */
    function escapeCDATA($string) {
        return preg_replace_callback(
            '/<!\[CDATA\[(.+?)\]\]>/s', // "s" lets "." span newlines inside CDATA
            array('HTMLPurifier_Lexer', 'CDATACallback'),
            $string
        );
    }
    
    /**
     * Callback function for escapeCDATA() that does the work.
     * 
     * @warning Though this is public in order to let the callback happen,
     *          calling it directly is not recommended.
     * @param $matches  PCRE matches array, with index 0 the entire match
     *                  and 1 the inside of the CDATA section.
     * @returns Escaped internals of the CDATA section.
     */
    function CDATACallback($matches) {
        // pass the charset explicitly: the default used by htmlspecialchars()
        // differs between PHP versions
        return htmlspecialchars($matches[1], ENT_COMPAT, 'UTF-8');
    }
    
}

?>