mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-11-14 01:08:41 +00:00
0767bbc12d
This mega-patch rips out the FixNesting implementation and the related ChildDef components. The primary algorithmic change is to convert from use of tokens to tree nodes, which are far more amenable to the style of processing that FixNesting uses. Additionally, FixNesting has been changed to go bottom-up rather than top-down, in order to avoid needing to implement backtracking.

This patch simplifies a good deal of the relevant logic, since we no longer need to continually recalculate the nesting structure when processing things. However, the conversion to the alternate format incurs some overhead, so for small inputs these changes are not a win. One possibility to greatly reduce the constant factors here is to switch to entirely using libxml's representation, and never serializing tokens; this would require one to rewrite injectors, however.

The iterative post-order traversal in FixNesting is a bit subtle, but we have essentially reified the stack and continuations.

We've removed support for %Core.EscapeInvalidChildren.

Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
55 lines
1.3 KiB
PHP
<?php

/**
 * Concrete text node class.
 *
 * Text nodes consist of regular parsed character data (PCDATA) and raw
 * character data (from CDATA sections). Internally, their
 * data is parsed with all entities expanded. Surprisingly, the text node
 * does have a "tag name" called #PCDATA, which is how the DTD represents it
 * in permissible child nodes.
 */
class HTMLPurifier_Node_Text extends HTMLPurifier_Node
{

    /**
     * PCDATA tag name compatible with DTD, see
     * HTMLPurifier_ChildDef_Custom for details.
     * @type string
     */
    public $name = '#PCDATA';

    /**
     * Parsed character data of text.
     * @type string
     */
    public $data;

    /**
     * Bool indicating if node is whitespace.
     * @type bool
     */
    public $is_whitespace;

    /**
     * Constructor, accepts data and a flag saying whether it is whitespace.
     * The flag is stored as given; it is not computed from $data here.
     * @param string $data String parsed character data.
     * @param bool $is_whitespace Bool indicating if node is whitespace.
     * @param int $line
     * @param int $col
     */
    public function __construct($data, $is_whitespace, $line = null, $col = null)
    {
        $this->data = $data;
        $this->is_whitespace = $is_whitespace;
        $this->line = $line;
        $this->col = $col;
    }

    /**
     * Converts this node back into tokens. A text node has no end token,
     * so the second element of the pair is always null.
     * @return array Pair of (HTMLPurifier_Token_Text, null).
     */
    public function toTokenPair()
    {
        return array(new HTMLPurifier_Token_Text($this->data, $this->line, $this->col), null);
    }
}

// vim: et sw=4 sts=4