Basically, browsers don't parse what should be valid URIs correctly, so
we have to go through some backbends to accomodate them. Specifically,
for browseable URIs, the following URIs have unintended behavior:
- ///example.com
- http:/example.com
- http:///example.com
Furthermore, if the path begins with //, modifying these URLs must
be done with care, as if you remove the host-name component, the
parse tree changes.
I've modified the engine to follow correct URI semantics as much
as possible while outputting browser compatible code, and invalidate
the URI in cases where we can't deal. There has been a refactoring
of URIScheme so that this important check is always performed,
introducing a new member variable allow_empty_host which is true
on data, file, mailto and news schemes.
This also fixes bypass bugs on URI.Munge.
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
There are some deep DOMs you can hit the maximum nesting level
limit in tokenizeDOM (we've experienced this even with maximum nesting
level of 300). Here is an iterative version of the same function with
simple queue/dequeue approach.
Signed-off-by: Maxim Krizhanovsky <darhazer@gmail.com>
The first bug is that we will repeatedly write out the result
of a customized raw definition to the filesystem, even when a cache
entry already exists.
The second bug is that caching these definitions doesn't actually
work (the cache entry is written but never used.) A new API
for retrieving raw definitions permits the user to take advantage
of caching.
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
HTML Purifier loads itself as the first autoload function by
unregistering all existing functions and re-registering them after
registering itself.
Originally an exception was thrown when a non-static object method was
encountered as the behaviour of spl_autoload_functions() did not return
the object instance, but only the class name. This was filed on PHP
bugs (#44144).
The bug was fixed for PHP >= 5.2.11 and >= 5.3
Signed-off-by: Nick Pope <nick@nickpope.me.uk>
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
The new logic is as follows:
* Given a URL to insert into url(), check that it is properly URL
encoded (in particular, a doublequote and backslash never occurs
within it) and then place it as url("http://example.com").
* Given a font name, if it is strictly alphanumeric, it is safe to omit
quotes. Otherwise, wrap in double quotes and replace '"' with '\22 '
(note trailing space) and '\' with '\5C ' (ditto).
We introduce expandCSSEscape() which is a hack for common parsing
idioms in CSS; this means that CSS escapes are now recognized inside
URLs as well as unquoted font names.
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
When specifying source material for <object> tags, you must use
data inside the object tag as well as specify movie in a param.
If you specify a src (which is the appropriate markup for <embed>)
we now convert and fill in the other attributes appropriately.
Also, fix a PHP warning in Generator code.
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
YouTube slideshows contain a /cp/, not a /v/, in their URL;
relax the YouTube filter to allow them.
Signed-off-by: Nigel McNie <nigel@catalyst.net.nz>
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
Previously, my development environment was not running the PEARSax3
tests because my environment was set to E_STRICT error handling, and
thus the tests were skipped. Relax this requirement by making the
wrapper class E_STRICT safe. This introduces a few failing tests.
Also update TODO and add another fresh test.
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
Previously, if two </body> tags were present, HTML Purifier
would truncate everything after the first </body>. This is
not ideal behavior; so HTML Purifier has been changed to
match up to the last </body>.
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
%URI.Munge incorrectly munged URIs that pointed to the
same host as the current website (it did, however, have
the correct behavior for when the munge URL was on the
same server).
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
This fix is slightly hackish, as we simply treat comments as whitespace.
This should largely be correct, and breaks no current test cases,
although it could result in noncompliant behavior.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
This commit is a limited implementation of the "active formatting
elements" algorithm implemented in HTML5, which preserves certain
formatting elements such as <a> and <b> when exiting or entering nodes.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
When precision dictates that a number be zero padded, we cannot give sprintf()
a negative precision specifier. This commit implements manual negative precision
printing of floats, taking into account common rounding errors with floating
point numbers.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
When viewing potentially hostile html, it may be helpful to see what
a given link was pointing to. This new injector takes the href
attribute and adds the text after the link, and deletes the href
attribute.
Other forms of display could easily be contrived, but this seems to be
a good basic way to present the information.
Signed-off-by: David Morton <mortonda@dgrmm.net>
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previously, handleEnd was called for any end tag, except ones that were obviously
spurious because there were no parent tags. Now, it is only called for end tags
that are "approved." If an injector operates on the end tag, we automatically
punt. There may be some optimizations that could be made to this procedure,
but for now it's much more consistent.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Major paradigm shift in this commit is bailing ship on the "skip" integers, which
were extremely buggy and error prone, and simply mark tokens as processed or
not processed by injectors. Other notable changes:
- Removed ad hoc decrements to inputIndex in favor of $reprocess flag variable
- Moved rewind outside of processToken()
- Make rewind properly ignore all other injectors
- Cleanup end of document code
- Reconfigure injector loops to account for skips and rewinds
- Punt the empty to start/end transformation
- Completely rewrite processToken to be array based
- Added skip and rewind member variables to tokens
- Fixed a longstanding bug with remove empty!
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previous design of injector streaming involved editability only to start, empty
and text tokens, because they could be safely modified without causing formedness
errors. By modifying notifyEnd to operate before MakeWellFormed's safeguards
kick into effect, it can be converted into a handle function, allowing for
arbitrary modification of end tags.
This change involved quite a bit of restructuring of the MakeWellFormed code,
including the moving of end of document tags to inside the loop, so rewinding
on those tags would be functional, increased reuse of the end tag codepath by
code that inserts end tags (as they could be changed out from under you), and
processToken modified to have an extra parameter to force re-processing of
a token if the original token was an end token.
We're not exactly sure if handleEnd works at this point, but the important
talking point about this refactoring is that nothing else broke. Also, a number
of convenience functions were moved from AutoParagraph to the Injector
supertype (specifically: forward, forwardToEndToken, backward, and current).
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Also changed:
- DirectLex keeps track of column numbers in context
- New class HTMLPurifier_ErrorStruct
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
AttrValidator's changes are fairly self-explanatory, but ErrorCollector's
changes are worth a little discussion. ErrorCollector can use generators at
various points during its flow control; there are two distinct generators that
it should use: 1. The one used for the output, and 2. The one used for the
error output. These will usually be the same, but in the odd case where they
need to be different, getHTMLFormatted() will accept an alterate configuration
object with an appropriate doctype.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
* Added Charsets and Character attribute types
* Fix a heavily recursive form of ContentSets, this allows a content-set
to include another content-set which includes another content-set, and
so forth.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
This is a very large commit that includes numerous improvements to the
AutoParagraph injector. These are:
* Rewritten flow control of the injector to use almost exclusively
binary conditionals.
* Improved inline documentation with "State" comments, which give concise
examples of what the token stack looks like at flow points.
* Documentation for all flow branches, even those with no actions.
* Factoring out of common operations to improve readability, especially the
new iterator private methods.
* Expanded test-suite which covers new flow points, and corrects some errors
in previous cases.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
The newest autoclose code uses the elements property in whether or not an
element should be closed by a particular tag. The heuristic is simple; if
the element doesn't allow that tag as a child, it closes the parent
container. This doesn't work, however, with <blockquote>, which while not
allowing inline styles under Strict doctypes, requires them to be passed
through MakeWellFormed.
The fix was to transition MakeWellFormed to call a method to retrieve the
elements, and then have StrictBlockquote implement a special version of
this method. Future versions of HTML Purifier may be more flexible in this
regard--further study of the HTML5 specification is required.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previously, if an absolute munge URL location was used, HTML passed through
HTML Purifier multiple times would be munged multiple times. This patch
checks if the output URI has the same URI as the input URI; if they do,
the munge is considered unnecessary and discarded.
Requested-by: Chris <justbittin@gmail.com>
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
As part of its duties, URIDefinition determine the base URL and the host URL
of the page based on the two corresponding configuration directives. The
DisableExternal URIFilter, however, bypassed this check by directly checking
%URI.Host. This fix forwards the call through URIDefinition.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
By default, the DirectLex and DOMLex behavior with stray angled brackets
varied a great deal due to their implementations. A little known directive
%Core.AggressivelyFixLt attempted to match DOMLex's behavior with DirectLex's,
but it was off by default. By turning it on by default, users now enjoy these
benefits, and performance-minded users can turn it back off.
Also, several refinements to stray angled bracket parsing was made. Specifically:
* DirectLex: Handle each left angled bracket individually, which prevents
strange behavior as reported by eon.
* DOMLex: Iterate aggressive lt fix, so that stacked brackets like << are
handled.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previously, attempting to set %Core.Encoding to an encoding iconv didn't
know about would result in a silent failure, with the return of the
boolean false. Now it will fatally error out.
Reported-by: mcgrailm <mgm19@psu.edu>
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
The bugs are:
* Undefined $is_folder variable when path is empty, and
* Improper concatenation of host and path together.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
The MakeWellFormed strategy uses metadata from HTMLDefinition in order to
determine whether or not tokens need to be converted or tags need to be
auto-closed. While this functionality is good to have, it is by no means
essential, and MakeWellFormed should not error when this information is not
available.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Injector rewind: Injectors can now use the method rewind() in order to move
the input index backwards, so that they can reprocess tokens (other injectors
are not affected by a rewind). This functionality was necessary to implement
nested node removals in %AutoFormat.RemoveEmpty.
End to start ref: To facilitate rewinding, HTMLPurifier_Token_End now
maintains a reference called $start to the starting token for their node.
%AutoFormat.RemoveEmpty removes empty nodes. Lots of people have requested
it, so here is a partially effective implementation. Because it is implemented
as an Injector, it's not possible for it to handle newly introduced empty
nodes by later validators, specifically auto-closing and child validation.
The Injector is only meant to be used on HTML-ish languages.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Prior to this commit, the name attribute was unilaterally removed, except
for Strict doctypes or a heavy TidyLevel, when it was converted to an id
attribute. As name is actually permitted in both HTML 4.01 Strict and
XHTML 1.0 Strict, although deprecated, the more sensible default behavior
is to allow it unless TidyLevel is heavy.
Our implementation is slightly stricter than the specs, as name attributes are
treated as first class IDs, disallowing <a name="foo" id="foo"> or duplicate
names. The former should be treated as a special case, but that will be
a separate commit.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previously, MakeWellFormed processed tokens and appended them onto an output
array, which was presumably immutable and inaccessible to Injectors. By
having MakeWellFormed operate directly on the input array, the strategy
saves memory and will also allow for a rewind implementation, as a unifying
the two arrays allows Injectors to easily determine an index behind them they'd
like to reset state to.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Some implementation notes: not all comments are valid; HTML makes sure
double-hyphens and trailing hyphens are not found in comments. In addition,
two new localizable messages were added.
Requested-by: Waldo Jaquith <waldo@vqronline.org>
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>