This mega-patch rips out the FixNesting implementation and the related
ChildDef components. The primary algorithmic change is to convert from
use of tokens to tree nodes, which are far more amenable to the style
of processing that FixNesting uses. Additionally, FixNesting has been
changed to go bottom-up rather than top-down, in order to avoid needing
to implement backtracking.
This patch simplifies a good deal of the relevant logic, since we no
longer need to continually recalculate the nesting structure when
processing things. However, the conversion to the alternate format
incurs some overhead, so for small inputs these changes are not a win.
One possibility to greatly reduce the constant factors here is to switch
to entirely using libxml's representation, and never serializing tokens;
this would require one to rewrite injectors, however.
The iterative post-order traversal in FixNesting is a bit subtle, but
we have essentially reified the stack and continuations.
We've removed support for %Core.EscapeInvalidChildren.
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
The first bug is that we will repeatedly write out the result
of a customized raw definition to the filesystem, even when a cache
entry already exists.
The second bug is that caching these definitions doesn't actually
work (the cache entry is written but never used.) A new API
for retrieving raw definitions permits the user to take advantage
of caching.
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
This commit is a limited implementation of the "active formatting
elements" algorithm implemented in HTML5, which preserves certain
formatting elements such as <a> and <b> when exiting or entering nodes.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previously, handleEnd was called for any end tag, except ones that were obviously
spurious because there were no parent tags. Now, it is only called for end tags
that are "approved." If an injector operates on the end tag, we automatically
punt. There may be some optimizations that could be made to this procedure,
but for now it's much more consistent.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Major paradigm shift in this commit is bailing ship on the "skip" integers, which
were extremely buggy and error prone, and simply mark tokens as processed or
not processed by injectors. Other notable changes:
- Removed ad hoc decrements to inputIndex in favor of $reprocess flag variable
- Moved rewind outside of processToken()
- Make rewind properly ignore all other injectors
- Cleanup end of document code
- Reconfigure injector loops to account for skips and rewinds
- Punt the empty to start/end transformation
- Completely rewrite processToken to be array based
- Added skip and rewind member variables to tokens
- Fixed a longstanding bug with remove empty!
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previous design of injector streaming involved editability only to start, empty
and text tokens, because they could be safely modified without causing formedness
errors. By modifying notifyEnd to operate before MakeWellFormed's safeguards
kick into effect, it can be converted into a handle function, allowing for
arbitrary modification of end tags.
This change involved quite a bit of restructuring of the MakeWellFormed code,
including the moving of end of document tags to inside the loop, so rewinding
on those tags would be functional, increased reuse of the end tag codepath by
code that inserts end tags (as they could be changed out from under you), and
processToken modified to have an extra parameter to force re-processing of
a token if the original token was an end token.
We're not exactly sure if handleEnd works at this point, but the important
talking point about this refactoring is that nothing else broke. Also, a number
of convenience functions were moved from AutoParagraph to the Injector
supertype (specifically: forward, forwardToEndToken, backward, and current).
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
This is a very large commit that includes numerous improvements to the
AutoParagraph injector. These are:
* Rewritten flow control of the injector to use almost exclusively
binary conditionals.
* Improved inline documentation with "State" comments, which give concise
examples of what the token stack looks like at flow points.
* Documentation for all flow branches, even those with no actions.
* Factoring out of common operations to improve readability, especially the
new iterator private methods.
* Expanded test-suite which covers new flow points, and corrects some errors
in previous cases.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
The newest autoclose code uses the elements property in whether or not an
element should be closed by a particular tag. The heuristic is simple; if
the element doesn't allow that tag as a child, it closes the parent
container. This doesn't work, however, with <blockquote>, which while not
allowing inline styles under Strict doctypes, requires them to be passed
through MakeWellFormed.
The fix was to transition MakeWellFormed to call a method to retrieve the
elements, and then have StrictBlockquote implement a special version of
this method. Future versions of HTML Purifier may be more flexible in this
regard--further study of the HTML5 specification is required.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
The MakeWellFormed strategy uses metadata from HTMLDefinition in order to
determine whether or not tokens need to be converted or tags need to be
auto-closed. While this functionality is good to have, it is by no means
essential, and MakeWellFormed should not error when this information is not
available.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Injector rewind: Injectors can now use the method rewind() in order to move
the input index backwards, so that they can reprocess tokens (other injectors
are not affected by a rewind). This functionality was necessary to implement
nested node removals in %AutoFormat.RemoveEmpty.
End to start ref: To facilitate rewinding, HTMLPurifier_Token_End now
maintains a reference called $start to the starting token for their node.
%AutoFormat.RemoveEmpty removes empty nodes. Lots of people have requested
it, so here is a partially effective implementation. Because it is implemented
as an Injector, it's not possible for it to handle newly introduced empty
nodes by later validators, specifically auto-closing and child validation.
The Injector is only meant to be used on HTML-ish languages.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Some implementation notes: not all comments are valid; HTML makes sure
double-hyphens and trailing hyphens are not found in comments. In addition,
two new localizable messages were added.
Requested-by: Waldo Jaquith <waldo@vqronline.org>
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
- Change API for HTMLPurifier_AttrDef_CSS_Length
- Implement HTMLPurifier_AttrDef_Switch class
- Implement HTMLPurifier_Length->compareTo, and make make() accept object instances
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1754 48356398-32a2-884e-a903-53898d9a118a
- Removed HTMLPurifier_Error
- Documentation updates
- Removed more copy() methods in favor of clone
- HTMLPurifier::getInstance() to HTMLPurifier::instance()
- Fix InterchangeBuilder to use HTMLPURIFIER_PREFIX
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1689 48356398-32a2-884e-a903-53898d9a118a
- Resync standalone with includes
- Fix error queue code
- Fix heisenbug with flush and certain definition error messages
- Fix premature $end in ini files
- Make $php config variable actually do something
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1659 48356398-32a2-884e-a903-53898d9a118a
- Work around unnecessary DOMElement type-cast in PH5P that caused errors in PHP 5.1
- Work around PHP 4 SimpleTest lack-of-error complaining for one-time-only HTMLDefinition errors, this may indicate problems with error-collecting facilities in PHP 5
- Make ErrorCollectorEMock work in both PHP 4 and PHP 5
. tests/multitest.php allows you to test multiple versions by running tests/index.php through multiple interpreters using `phpv` shell script (you must provide this script!)
. Minor cosmetic change to flush-definition-cache.php: trailing newline is outputted
. Maintenance script for generating PH5P patch added, original PH5P source file also added under version control
. Full unit test runner script title made more descriptive with PHP version
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1440 48356398-32a2-884e-a903-53898d9a118a
- HTMLDefinition->addElement now returns a reference to the created element object, as implied by the documentation
. Extend Injector hooks to allow for more powerful injector routines
. HTMLDefinition->addBlankElement created, as according to the HTMLModule method
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1425 48356398-32a2-884e-a903-53898d9a118a
- Buggy treatment of end tags of elements that have required attributes fixed (does not manifest on default tag-set)
- Spurious internal content reorganization error suppressed
. Error unit tests can now specify the expectation of no errors. Future iterations of the harness will be extremely strict about what errors are allowed
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1424 48356398-32a2-884e-a903-53898d9a118a
- Support executing a single unit tests using __only prefix
- Hook in Email classes to main code, even if they're unused
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1373 48356398-32a2-884e-a903-53898d9a118a