You've probably heard of HTML Tidy, Dave Raggett's little piece of software that cleans up poorly written HTML. Let me say it straight out:
This ain't HTML Tidy!
Rather, Tidy stands for a cool set of Tidy-inspired features in HTML Purifier that allows users to submit deprecated elements and attributes and get valid strict markup back. For example:
<center>Centered</center>
...becomes:
<div style="text-align:center;">Centered</div>
...when this particular fix is run on the HTML. This tutorial will give you the lowdown on exactly what HTML Purifier will do when Tidy is on, and how to fine-tune this behavior. Once again, you do not need Tidy installed on your PHP to use these features!
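To try the center example yourself, here is a minimal sketch. It assumes HTML Purifier is installed and that `HTMLPurifier.auto.php` is reachable on your include path (adjust the path for your setup). It uses the same three-argument `set()` form as the rest of this tutorial; newer releases prefer the dotted form, e.g. `$config->set('HTML.Doctype', ...)`.

```php
<?php
// Assumption: HTML Purifier is installed and this include path is correct.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// Pick a Strict doctype, where deprecated tags like <center> are not allowed
// and therefore have to be converted.
$config->set('HTML', 'Doctype', 'HTML 4.01 Strict');

$purifier = new HTMLPurifier($config);

// Per the example above, <center> should come back as a styled <div>.
echo $purifier->purify('<center>Centered</center>');
```

The exact output may vary slightly between versions, so treat this as a way to inspect the behavior rather than a guaranteed result.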
Tidy will do several things to your HTML, chiefly converting deprecated elements and attributes to standards-compliant alternatives.
Levels describe how aggressive the Tidy module should be when cleaning up HTML. There are four levels to pick from: none, light, medium and heavy. Each of these levels has a well-defined set of behaviors associated with it, although the set may change depending on your doctype.
By default, Tidy operates on the medium level. You can change the level of cleaning by setting the %HTML.TidyLevel configuration directive:
$config->set('HTML', 'TidyLevel', 'heavy'); // burn baby burn!
Whether the light level is really light depends on what doctype you're using. If your documents are HTML 4.01 Transitional, HTML Purifier will be lazy and won't clean up your center or font tags. But if you're using HTML 4.01 Strict, HTML Purifier has no choice: it has to convert them, or they will be nuked out of existence. So while light on Transitional will result in little to no changes, light on Strict will still result in quite a lot of fixes.
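One way to see the doctype effect for yourself is to run the same input through the light level under both doctypes. This is a rough sketch, assuming HTML Purifier is installed and `HTMLPurifier.auto.php` is on your include path; the sample input is just an illustration.

```php
<?php
// Assumption: HTML Purifier is installed and this include path is correct.
require_once 'HTMLPurifier.auto.php';

// Illustrative input containing a deprecated element.
$html = '<center>Centered</center>';

foreach (array('HTML 4.01 Transitional', 'HTML 4.01 Strict') as $doctype) {
    // Build a fresh configuration for each doctype.
    $config = HTMLPurifier_Config::createDefault();
    $config->set('HTML', 'Doctype', $doctype);
    $config->set('HTML', 'TidyLevel', 'light');

    $purifier = new HTMLPurifier($config);

    // Print the doctype next to the purified markup for comparison.
    echo $doctype, ': ', $purifier->purify($html), "\n";
}
```

If the behavior described above holds for your version, the Transitional run should leave the center tag alone while the Strict run converts it.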
This is different behavior from 1.6 or before, where deprecated tags in transitional documents would always be cleaned up regardless. This is also better behavior.
HTML Purifier is tasked with converting deprecated tags and attributes to standards-compliant alternatives, which usually need copious amounts of CSS. It's also not foolproof: sometimes things do get lost in the translation. This is why when HTML Purifier can get away with not doing cleaning, it won't; this is why the default value is medium and not heavy.
Fortunately, only a few attributes have problems with the switch over. They are described below:
Element@Attr | Changes |
---|---|
caption@align | Firefox supports stuffing the caption on the left and right side of the table, a feature that Internet Explorer, understandably, does not have. When align equals right or left, the text will simply be aligned on the left or right side. |
img@align | The implementation for align bottom is good, but not perfect. There are a few pixel differences. |
br@clear | Clear both gets a little wonky in Internet Explorer. Haven't really been able to figure out why. |
hr@noshade | All browsers implement this slightly differently: we've chosen to make noshade horizontal rules gray. |
There are a few more minor, although irritating, bugs. Some older browsers support deprecated attributes, but not CSS. Transformed elements and attributes will look unstyled to said browsers. Also, CSS precedence is slightly different for inline styles versus presentational markup. In increasing precedence: presentational attributes, external style sheets, inline styles.
This means that styling that may have been masked by external CSS declarations will start showing up (a good thing, perhaps). Finally, if you've turned off the style attribute, almost all of these transformations will not work. Sorry mates.
You can review the rendering before and after of these transformations by consulting the attrTransform.php smoketest.
So you want HTML Purifier to clean up your HTML, but you're not so happy about the br@clear implementation. That's perfectly fine! HTML Purifier will make accommodations:
$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML', 'TidyLevel', 'heavy'); // all changes, minus...
$config->set('HTML', 'TidyRemove', 'br@clear');
That third line does the magic, removing the br@clear fix from the module, ensuring that <br clear="all" /> will pass through unharmed. The reverse is possible too:
$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML', 'TidyLevel', 'none'); // no changes, plus...
$config->set('HTML', 'TidyAdd', 'p@align');
In this case, all transformations are shut off, except for the p@align one, which you found handy.
To find out the names of the fixes you want to turn on or off, you'll have to consult the source code, specifically the files in HTMLPurifier/HTMLModule/Tidy/. There is, however, a general syntax:
Name | Example | Interpretation |
---|---|---|
element | font | Tag transform for element |
element@attr | br@clear | Attribute transform for attr on element |
@attr | @lang | Global attribute transform for attr |
e#content_model_type | blockquote#content_model_type | Change of child processing implementation for e |
The lowdown is, quite frankly, HTML Purifier's default settings are probably good enough. The next step is to bump the level up to heavy, and if that still doesn't satisfy your appetite, do some fine tuning. Other than that, don't worry about it: this all works silently and effectively in the background.