Previous design of injector streaming involved editability only to start, empty
and text tokens, because they could be safely modified without causing formedness
errors. By modifying notifyEnd to operate before MakeWellFormed's safeguards
kick into effect, it can be converted into a handle function, allowing for
arbitrary modification of end tags.
This change involved quite a bit of restructuring of the MakeWellFormed code,
including the moving of end of document tags to inside the loop, so rewinding
on those tags would be functional, increased reuse of the end tag codepath by
code that inserts end tags (as they could be changed out from under you), and
processToken modified to have an extra parameter to force re-processing of
a token if the original token was an end token.
We're not exactly sure if handleEnd works at this point, but the important
talking point about this refactoring is that nothing else broke. Also, a number
of convenience functions were moved from AutoParagraph to the Injector
supertype (specifically: forward, forwardToEndToken, backward, and current).
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Also changed:
- DirectLex keeps track of column numbers in context
- New class HTMLPurifier_ErrorStruct
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
AttrValidator's changes are fairly self-explanatory, but ErrorCollector's
changes are worth a little discussion. ErrorCollector can use generators at
various points during its flow control; there are two distinct generators that
it should use: 1. The one used for the output, and 2. The one used for the
error output. These will usually be the same, but in the odd case where they
need to be different, getHTMLFormatted() will accept an alterate configuration
object with an appropriate doctype.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
* Added Charsets and Character attribute types
* Fix a heavily recursive form of ContentSets, this allows a content-set
to include another content-set which includes another content-set, and
so forth.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
This is a very large commit that includes numerous improvements to the
AutoParagraph injector. These are:
* Rewritten flow control of the injector to use almost exclusively
binary conditionals.
* Improved inline documentation with "State" comments, which give concise
examples of what the token stack looks like at flow points.
* Documentation for all flow branches, even those with no actions.
* Factoring out of common operations to improve readability, especially the
new iterator private methods.
* Expanded test-suite which covers new flow points, and corrects some errors
in previous cases.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
The newest autoclose code uses the elements property in whether or not an
element should be closed by a particular tag. The heuristic is simple; if
the element doesn't allow that tag as a child, it closes the parent
container. This doesn't work, however, with <blockquote>, which while not
allowing inline styles under Strict doctypes, requires them to be passed
through MakeWellFormed.
The fix was to transition MakeWellFormed to call a method to retrieve the
elements, and then have StrictBlockquote implement a special version of
this method. Future versions of HTML Purifier may be more flexible in this
regard--further study of the HTML5 specification is required.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previously, if an absolute munge URL location was used, HTML passed through
HTML Purifier multiple times would be munged multiple times. This patch
checks if the output URI has the same URI as the input URI; if they do,
the munge is considered unnecessary and discarded.
Requested-by: Chris <justbittin@gmail.com>
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
As part of its duties, URIDefinition determine the base URL and the host URL
of the page based on the two corresponding configuration directives. The
DisableExternal URIFilter, however, bypassed this check by directly checking
%URI.Host. This fix forwards the call through URIDefinition.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
HTMLT tests are a compact and easy-to-use way of making assertPurification
type tests. They take the format of:
--INI--
Ns.Directive = "directive value"
--HTML--
Input HTML
--EXPECT--
Expected HTML
Expect more features and migration to be coming soon.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
The following changes were made:
* Create --type parameter which accepts 'htmlpurifier', 'phpt', 'vtest', etc.
in order to execute only that class of tests. This supercedes --only-phpt.
* Create --quick parameter for multitest.php, run only the tips of each
release series.
* Create --distro parameter for multitest.php, supercedes --exclude-normal
and --exclude-standalone.
Also, a grep for htmlt tests was added, although add_tests() doesn't do
anything with it yet.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
By default, the DirectLex and DOMLex behavior with stray angled brackets
varied a great deal due to their implementations. A little known directive
%Core.AggressivelyFixLt attempted to match DOMLex's behavior with DirectLex's,
but it was off by default. By turning it on by default, users now enjoy these
benefits, and performance-minded users can turn it back off.
Also, several refinements to stray angled bracket parsing was made. Specifically:
* DirectLex: Handle each left angled bracket individually, which prevents
strange behavior as reported by eon.
* DOMLex: Iterate aggressive lt fix, so that stacked brackets like << are
handled.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previously, attempting to set %Core.Encoding to an encoding iconv didn't
know about would result in a silent failure, with the return of the
boolean false. Now it will fatally error out.
Reported-by: mcgrailm <mgm19@psu.edu>
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
The bugs are:
* Undefined $is_folder variable when path is empty, and
* Improper concatenation of host and path together.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
The MakeWellFormed strategy uses metadata from HTMLDefinition in order to
determine whether or not tokens need to be converted or tags need to be
auto-closed. While this functionality is good to have, it is by no means
essential, and MakeWellFormed should not error when this information is not
available.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Injector rewind: Injectors can now use the method rewind() in order to move
the input index backwards, so that they can reprocess tokens (other injectors
are not affected by a rewind). This functionality was necessary to implement
nested node removals in %AutoFormat.RemoveEmpty.
End to start ref: To facilitate rewinding, HTMLPurifier_Token_End now
maintains a reference called $start to the starting token for their node.
%AutoFormat.RemoveEmpty removes empty nodes. Lots of people have requested
it, so here is a partially effective implementation. Because it is implemented
as an Injector, it's not possible for it to handle newly introduced empty
nodes by later validators, specifically auto-closing and child validation.
The Injector is only meant to be used on HTML-ish languages.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Prior to this commit, the name attribute was unilaterally removed, except
for Strict doctypes or a heavy TidyLevel, when it was converted to an id
attribute. As name is actually permitted in both HTML 4.01 Strict and
XHTML 1.0 Strict, although deprecated, the more sensible default behavior
is to allow it unless TidyLevel is heavy.
Our implementation is slightly stricter than the specs, as name attributes are
treated as first class IDs, disallowing <a name="foo" id="foo"> or duplicate
names. The former should be treated as a special case, but that will be
a separate commit.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Previously, MakeWellFormed processed tokens and appended them onto an output
array, which was presumably immutable and inaccessible to Injectors. By
having MakeWellFormed operate directly on the input array, the strategy
saves memory and will also allow for a rewind implementation, as a unifying
the two arrays allows Injectors to easily determine an index behind them they'd
like to reset state to.
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
Some implementation notes: not all comments are valid; HTML makes sure
double-hyphens and trailing hyphens are not found in comments. In addition,
two new localizable messages were added.
Requested-by: Waldo Jaquith <waldo@vqronline.org>
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
If %Output.SortAttr is true, attributes are sorted to be
in alphabetical order. This was requested by frank farmer.
See also: http://htmlpurifier.org/phorum/read.php?2,1576
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
UTF-8 is a variable-width encoding that uses octets, UTF-16
is a variable-width encoding that uses 16-bit words, and
UCS-2 is an obsolete fixed-width encoding that doesn't not
support characters beyond the BMP. Explaining this would be
unwieldly, so we just removed the information.
See also: http://www.reddit.com/info/6mlqc/comments/c04aold
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
* Setup usage.xml to be binary, as XMLWriter does not honor operating
system's newline format.
* Setup various files to ignore (svn:ignore was not carried over)
* Add dummy files to prevent git from ignoring empty directories
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>