This fix is slightly hackish, as we simply treat comments as whitespace.
This should largely be correct, and breaks no current test cases,
although it could result in noncompliant behavior.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
This commit is a limited implementation of the "active formatting
elements" algorithm implemented in HTML5, which preserves certain
formatting elements such as <a> and <b> when exiting or entering nodes.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
When precision dictates that a number be zero padded, we cannot give sprintf()
a negative precision specifier. This commit implements manual negative precision
printing of floats, taking into account common rounding errors with floating
point numbers.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
When viewing potentially hostile html, it may be helpful to see what
a given link was pointing to. This new injector takes the href
attribute and adds the text after the link, and deletes the href
attribute.
Other forms of display could easily be contrived, but this seems to be
a good basic way to present the information.
Signed-off-by: David Morton <mortonda@dgrmm.net>
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previously, handleEnd was called for any end tag, except ones that were obviously
spurious because there were no parent tags. Now, it is only called for end tags
that are "approved." If an injector operates on the end tag, we automatically
punt. There may be some optimizations that could be made to this procedure,
but for now it's much more consistent.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Major paradigm shift in this commit is bailing ship on the "skip" integers, which
were extremely buggy and error prone, and simply mark tokens as processed or
not processed by injectors. Other notable changes:
- Removed ad hoc decrements to inputIndex in favor of $reprocess flag variable
- Moved rewind outside of processToken()
- Make rewind properly ignore all other injectors
- Cleanup end of document code
- Reconfigure injector loops to account for skips and rewinds
- Punt the empty to start/end transformation
- Completely rewrite processToken to be array based
- Added skip and rewind member variables to tokens
- Fixed a longstanding bug with remove empty!
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previous design of injector streaming involved editability only to start, empty
and text tokens, because they could be safely modified without causing formedness
errors. By modifying notifyEnd to operate before MakeWellFormed's safeguards
kick into effect, it can be converted into a handle function, allowing for
arbitrary modification of end tags.
This change involved quite a bit of restructuring of the MakeWellFormed code,
including the moving of end of document tags to inside the loop, so rewinding
on those tags would be functional, increased reuse of the end tag codepath by
code that inserts end tags (as they could be changed out from under you), and
processToken modified to have an extra parameter to force re-processing of
a token if the original token was an end token.
We're not exactly sure if handleEnd works at this point, but the important
talking point about this refactoring is that nothing else broke. Also, a number
of convenience functions were moved from AutoParagraph to the Injector
supertype (specifically: forward, forwardToEndToken, backward, and current).
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Also changed:
- DirectLex keeps track of column numbers in context
- New class HTMLPurifier_ErrorStruct
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
AttrValidator's changes are fairly self-explanatory, but ErrorCollector's
changes are worth a little discussion. ErrorCollector can use generators at
various points during its flow control; there are two distinct generators that
it should use: 1. The one used for the output, and 2. The one used for the
error output. These will usually be the same, but in the odd case where they
need to be different, getHTMLFormatted() will accept an alterate configuration
object with an appropriate doctype.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
* Added Charsets and Character attribute types
* Fix a heavily recursive form of ContentSets, this allows a content-set
to include another content-set which includes another content-set, and
so forth.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
This is a very large commit that includes numerous improvements to the
AutoParagraph injector. These are:
* Rewritten flow control of the injector to use almost exclusively
binary conditionals.
* Improved inline documentation with "State" comments, which give concise
examples of what the token stack looks like at flow points.
* Documentation for all flow branches, even those with no actions.
* Factoring out of common operations to improve readability, especially the
new iterator private methods.
* Expanded test-suite which covers new flow points, and corrects some errors
in previous cases.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
The newest autoclose code uses the elements property in whether or not an
element should be closed by a particular tag. The heuristic is simple; if
the element doesn't allow that tag as a child, it closes the parent
container. This doesn't work, however, with <blockquote>, which while not
allowing inline styles under Strict doctypes, requires them to be passed
through MakeWellFormed.
The fix was to transition MakeWellFormed to call a method to retrieve the
elements, and then have StrictBlockquote implement a special version of
this method. Future versions of HTML Purifier may be more flexible in this
regard--further study of the HTML5 specification is required.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previously, if an absolute munge URL location was used, HTML passed through
HTML Purifier multiple times would be munged multiple times. This patch
checks if the output URI has the same URI as the input URI; if they do,
the munge is considered unnecessary and discarded.
Requested-by: Chris <justbittin@gmail.com>
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
As part of its duties, URIDefinition determine the base URL and the host URL
of the page based on the two corresponding configuration directives. The
DisableExternal URIFilter, however, bypassed this check by directly checking
%URI.Host. This fix forwards the call through URIDefinition.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
HTMLT tests are a compact and easy-to-use way of making assertPurification
type tests. They take the format of:
--INI--
Ns.Directive = "directive value"
--HTML--
Input HTML
--EXPECT--
Expected HTML
Expect more features and migration to be coming soon.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
By default, the DirectLex and DOMLex behavior with stray angled brackets
varied a great deal due to their implementations. A little known directive
%Core.AggressivelyFixLt attempted to match DOMLex's behavior with DirectLex's,
but it was off by default. By turning it on by default, users now enjoy these
benefits, and performance-minded users can turn it back off.
Also, several refinements to stray angled bracket parsing was made. Specifically:
* DirectLex: Handle each left angled bracket individually, which prevents
strange behavior as reported by eon.
* DOMLex: Iterate aggressive lt fix, so that stacked brackets like << are
handled.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>