mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-11-08 06:48:42 +00:00
[2.1.2] Merge in Brett Zamir's patches.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1397 48356398-32a2-884e-a903-53898d9a118a
This commit is contained in:
parent
e45cc503a2
commit
29c3c21b34
1
NEWS
1
NEWS
@ -27,6 +27,7 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
|
||||
- Hammer out a bunch of edge-case bugs in the standalone distribution
|
||||
- Inclusion reflection removed from URISchemeRegistry; you must manually
|
||||
include any new schema files you wish to use
|
||||
- Numerous typo fixes in documentation thanks to Brett Zamir
|
||||
. Unit test refactoring for one logical test per test function
|
||||
. Config and context parameters in ComplexHarness deprecated: instead, edit
|
||||
the $config and $context member variables
|
||||
|
@ -32,7 +32,7 @@
|
||||
Before we even write any code, it is paramount to consider whether or
|
||||
not the code we're writing is necessary or not. HTML Purifier, by default,
|
||||
contains a large set of elements and attributes: large enough so that
|
||||
<em>any</em> element or attribute in XHTML 1.0 (and its HTML variant)
|
||||
<em>any</em> element or attribute in XHTML 1.0 or 1.1 (and its HTML variants)
|
||||
that can be safely used by the general public is implemented.
|
||||
</p>
|
||||
|
||||
@ -76,11 +76,12 @@
|
||||
<h3>XHTML 1.1</h3>
|
||||
|
||||
<p>
|
||||
We have not implemented the
|
||||
As of HTMLPurifier 2.1.0, we have implemented the
|
||||
<a href="http://www.w3.org/TR/2001/REC-ruby-20010531/">Ruby module</a>,
|
||||
which defines a set of tags
|
||||
for publishing short annotations for text, used mostly in Japanese
|
||||
and Chinese school texts.
|
||||
and Chinese school texts, but applicable for positioning any text (not
|
||||
limited to translations) above or below other corresponding text.
|
||||
</p>
|
||||
|
||||
<h3>XHTML 2.0</h3>
|
||||
@ -492,10 +493,11 @@ $def =& $config->getHTMLDefinition(true);
|
||||
<p>
|
||||
The <code>(%flow;)*</code> indicates the allowed children of the
|
||||
<code>li</code> tag: <code>li</code> allows any number of flow
|
||||
elements as its children. In HTML Purifier, we'd write it like
|
||||
<code>Flow</code> (here's where the content sets we were
|
||||
discussing earlier come into play). There are three shorthand content models you
|
||||
can specify:
|
||||
elements as its children. (The <code>- O</code> allows the closing tag to be
|
||||
omitted, though in XML this is not allowed.) In HTML Purifier,
|
||||
we'd write it like <code>Flow</code> (here's where the content sets
|
||||
we were discussing earlier come into play). There are three shorthand
|
||||
content models you can specify:
|
||||
</p>
|
||||
|
||||
<table class="table">
|
||||
@ -668,12 +670,22 @@ $def =& $config->getHTMLDefinition(true);
|
||||
Common is a combination of the above-mentioned collections.
|
||||
</p>
|
||||
|
||||
<p class="aside">
|
||||
Readers familiar with the modularization may have noticed that the Core
|
||||
attribute collection differs from that specified by the <a
|
||||
href="http://www.w3.org/TR/xhtml-modularization/abstract_modules.html#s_commonatts">abstract
|
||||
modules of the XHTML Modularization 1.1</a>. We believe this section
|
||||
to be in error, as <code>br</code> permits the use of the <code>style</code>
|
||||
attribute even though it uses the <code>Core</code> collection, and
|
||||
the DTD and XML Schemas supplied by W3C support our interpretation.
|
||||
</p>
|
||||
|
||||
<h3>Attributes</h3>
|
||||
|
||||
<p>
|
||||
If you didn't read the <a href="#addAttribute">previous section on
|
||||
If you didn't read the <a href="#addAttribute">earlier section on
|
||||
adding attributes</a>, read it now. The last parameter is simply
|
||||
array of attribute names to attribute implementations, in the exact
|
||||
an array of attribute names to attribute implementations, in the exact
|
||||
same format as <code>addAttribute()</code>.
|
||||
</p>
|
||||
|
||||
|
@ -58,7 +58,7 @@ appear elsewhere on the document. The method is simple:</p>
|
||||
|
||||
<pre>$config->set('HTML', 'EnableAttrID', true);
|
||||
$config->set('Attr', 'IDBlacklist' array(
|
||||
'list', 'of', 'attributes', 'that', 'are', 'forbidden'
|
||||
'list', 'of', 'attribute', 'values', 'that', 'are', 'forbidden'
|
||||
));</pre>
|
||||
|
||||
<p>That being said, there are some notable drawbacks. First of all, you have to
|
||||
@ -71,9 +71,9 @@ to possible standards-compliance issues.</p>
|
||||
<p>Furthermore, this position becomes untenable when a single web page must hold
|
||||
multiple portions of user-submitted content. Since there's obviously no way
|
||||
to find out before-hand what IDs users will use, the blacklist is helpless.
|
||||
And even since HTML Purifier validates each segment seperately, perhaps doing
|
||||
And since HTML Purifier validates each segment separately, perhaps doing
|
||||
so at different times, it would be extremely difficult to dynamically update
|
||||
the blacklist inbetween runs.</p>
|
||||
the blacklist in between runs.</p>
|
||||
|
||||
<p>Finally, simply destroying the ID is extremely un-userfriendly behavior: after
|
||||
all, they might have simply specified a duplicate ID by accident.</p>
|
||||
|
@ -22,7 +22,7 @@ out:</p>
|
||||
|
||||
<p class="emphasis">This ain't HTML Tidy!</p>
|
||||
|
||||
<p>Rather, Tidy stands for a cool set of Tidy-inspired in HTML Purifier
|
||||
<p>Rather, Tidy stands for a cool set of Tidy-inspired features in HTML Purifier
|
||||
that allows users to submit deprecated elements and attributes and get
|
||||
valid strict markup back. For example:</p>
|
||||
|
||||
@ -33,8 +33,8 @@ valid strict markup back. For example:</p>
|
||||
<pre><div style="text-align:center;">Centered</div></pre>
|
||||
|
||||
<p>...when this particular fix is run on the HTML. This tutorial will give
|
||||
you down the lowdown of what exactly HTML Purifier will do when Tidy
|
||||
is on, and how to fine tune this behavior. Once again, <strong>you do
|
||||
you the lowdown of what exactly HTML Purifier will do when Tidy
|
||||
is on, and how to fine-tune this behavior. Once again, <strong>you do
|
||||
not need Tidy installed on your PHP to use these features!</strong></p>
|
||||
|
||||
<h2>What does it do?</h2>
|
||||
@ -221,7 +221,7 @@ general syntax:</p>
|
||||
|
||||
<p>The lowdown is, quite frankly, HTML Purifier's default settings are
|
||||
probably good enough. The next step is to bump the level up to heavy,
|
||||
and if that still doesn't satisfy your appetite, do some fine tuning.
|
||||
and if that still doesn't satisfy your appetite, do some fine-tuning.
|
||||
Other than that, don't worry about it: this all works silently and
|
||||
effectively in the background.</p>
|
||||
|
||||
|
@ -96,7 +96,7 @@ which can be a rewarding (but difficult) task.</p>
|
||||
<h2 id="findcharset">Finding the real encoding</h2>
|
||||
|
||||
<p>In the beginning, there was ASCII, and things were simple. But they
|
||||
weren't good, for no one could write in Cryllic or Thai. So there
|
||||
weren't good, for no one could write in Cyrillic or Thai. So there
|
||||
exploded a proliferation of character encodings to remedy the problem
|
||||
by extending the characters ASCII could express. This ridiculously
|
||||
simplified version of the history of character encodings shows us that
|
||||
@ -138,7 +138,7 @@ browser:</p>
|
||||
<dd>View > Encoding: bulleted item is unofficial name</dd>
|
||||
</dl>
|
||||
|
||||
<p>Internet Explorer won't give you the mime (i.e. useful/real) name of the
|
||||
<p>Internet Explorer won't give you the MIME (i.e. useful/real) name of the
|
||||
character encoding, so you'll have to look it up using their description.
|
||||
Some common ones:</p>
|
||||
|
||||
@ -216,6 +216,12 @@ if your <code>META</code> tag claims that either:</p>
|
||||
|
||||
<h2 id="fixcharset">Fixing the encoding</h2>
|
||||
|
||||
<p class="aside">The advice given here is for pages being served as
|
||||
vanilla <code>text/html</code>. Different practices must be used
|
||||
for <code>application/xml</code> or <code>application/xml+xhtml</code>, see
|
||||
<a href="http://www.w3.org/TR/2002/NOTE-xhtml-media-types-20020430/">W3C's
|
||||
document on XHTML media types</a> for more information.</p>
|
||||
|
||||
<p>If your <code>META</code> encoding and your real encoding match,
|
||||
savvy! You can skip this section. If they don't...</p>
|
||||
|
||||
@ -302,7 +308,8 @@ languages</a>. The appropriate code is:</p>
|
||||
|
||||
<p>...replacing UTF-8 with whatever your embedded encoding is.
|
||||
This code must come before any output, so be careful about
|
||||
stray whitespace in your application.</p>
|
||||
stray whitespace in your application (i.e., any whitespace before
|
||||
output excluding whitespace within <?php ?> tags).</p>
|
||||
|
||||
<h4 id="fixcharset-server-phpini">PHP ini directive</h4>
|
||||
|
||||
@ -313,8 +320,8 @@ header call: <code><a href="http://php.net/ini.core#ini.default-charset">default
|
||||
|
||||
<p>...will also do the trick. If PHP is running as an Apache module (and
|
||||
not as FastCGI, consult
|
||||
<a href="http://php.net/phpinfo">phpinfo</a>() for details), you can even use htaccess do apply this property
|
||||
globally:</p>
|
||||
<a href="http://php.net/phpinfo">phpinfo</a>() for details), you can even use htaccess to apply this property
|
||||
across many PHP files:</p>
|
||||
|
||||
<pre><a href="http://php.net/configuration.changes#configuration.changes.apache">php_value</a> default_charset "UTF-8"</pre>
|
||||
|
||||
@ -360,10 +367,11 @@ to send anything at all:</p>
|
||||
|
||||
<pre><a href="http://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> Off</pre>
|
||||
|
||||
<p>...making your <code>META</code> tags the sole source of
|
||||
character encoding information. In these cases, it is
|
||||
<em>especially</em> important to make sure you have valid <code>META</code>
|
||||
tags on your pages and all the text before them is ASCII.</p>
|
||||
<p>...making your internal charset declaration (usually the <code>META</code> tags)
|
||||
the sole source of character encoding
|
||||
information. In these cases, it is <em>especially</em> important to make
|
||||
sure you have valid <code>META</code> tags on your pages and all the
|
||||
text before them is ASCII.</p>
|
||||
|
||||
<blockquote class="aside"><p>These directives can also be
|
||||
placed in httpd.conf file for Apache, but
|
||||
@ -428,28 +436,30 @@ IIS to change character encodings, I'd be grateful.</p>
|
||||
|
||||
<p><code>META</code> tags are the most common source of embedded
|
||||
encodings, but they can also come from somewhere else: XML
|
||||
processing instructions. They look like:</p>
|
||||
Declarations. They look like:</p>
|
||||
|
||||
<pre><?xml version="1.0" encoding="UTF-8"?></pre>
|
||||
|
||||
<p>...and are most often found in XML documents (including XHTML).</p>
|
||||
|
||||
<p>For XHTML, this processing instruction theoretically
|
||||
<p>For XHTML, this XML Declaration theoretically
|
||||
overrides the <code>META</code> tag. In reality, this happens only when the
|
||||
XHTML is actually served as legit XML and not HTML, which is almost always
|
||||
never due to Internet Explorer's lack of support for
|
||||
<code>application/xhtml+xml</code> (even though doing so is often
|
||||
argued to be <a href="http://www.hixie.ch/advocacy/xhtml">good practice</a>).</p>
|
||||
argued to be <a href="http://www.hixie.ch/advocacy/xhtml">good
|
||||
practice</a> and is required by the XHTML 1.1 specification).</p>
|
||||
|
||||
<p>For XML, however, this processing instruction is extremely important.
|
||||
<p>For XML, however, this XML Declaration is extremely important.
|
||||
Since most webservers are not configured to send charsets for .xml files,
|
||||
this is the only thing a parser has to go on. Furthermore, the default
|
||||
for XML files is UTF-8, which often butts heads with more common
|
||||
ISO-8859-1 encoding (you see this in garbled RSS feeds).</p>
|
||||
|
||||
<p>In short, if you use XHTML and have gone through the
|
||||
trouble of adding the XML header, make sure it jives
|
||||
with your <code>META</code> tags and HTTP headers.</p>
|
||||
trouble of adding the XML Declaration, make sure it jives
|
||||
with your <code>META</code> tags (which should only be present
|
||||
if served in text/html) and HTTP headers.</p>
|
||||
|
||||
<h3 id="fixcharset-internals">Inside the process</h3>
|
||||
|
||||
@ -545,7 +555,7 @@ an application that originally used ISO-8859-1 but switched to UTF-8
|
||||
when it became far to cumbersome to support foreign languages. Bots
|
||||
will now actually go through articles and convert character entities
|
||||
to their corresponding real characters for the sake of user-friendliness
|
||||
and searcheability. See
|
||||
and searchability. See
|
||||
<a href="http://meta.wikimedia.org/wiki/Help:Special_characters">Meta's
|
||||
page on special characters</a> for more details.
|
||||
</p></blockquote>
|
||||
@ -609,7 +619,7 @@ since UTF-8 supports every character.</p>
|
||||
|
||||
<h4 id="whyutf8-forms-multipart"><code>multipart/form-data</code></h4>
|
||||
|
||||
<p>Multipart form submission takes a way a lot of the ambiguity
|
||||
<p>Multipart form submission takes away a lot of the ambiguity
|
||||
that percent-encoding had: the server now can explicitly ask for
|
||||
certain encodings, and the client can explicitly tell the server
|
||||
during the form submission what encoding the fields are in.</p>
|
||||
@ -678,7 +688,7 @@ set the encoding correctly using %Core.Encoding):</p>
|
||||
|
||||
<ul>
|
||||
<li>The <code>Encoder</code> will transform the text from ISO 8859-1 to UTF-8
|
||||
(note that theta is preserved since it doesn't actually use
|
||||
(note that theta is preserved here since it doesn't actually use
|
||||
any non-ASCII characters): <code>&theta;</code></li>
|
||||
<li>The <code>EntityParser</code> will transform all named and numeric
|
||||
character entities to their corresponding raw UTF-8 equivalents:
|
||||
@ -723,7 +733,7 @@ by the target encoding, but that would require reimplementing iconv
|
||||
with HTML awareness, something I will not do.</p>
|
||||
|
||||
<p>So there: either it's UTF-8 or crippled international support. Your pick! (and I'm
|
||||
not being sarcastic here: some people could care less about other languages)</p>
|
||||
not being sarcastic here: some people could care less about other languages).</p>
|
||||
|
||||
<h2 id="migrate">Migrate to UTF-8</h2>
|
||||
|
||||
@ -985,7 +995,7 @@ and yes, it is variable width. Other traits:</p>
|
||||
in different ways. It is beyond the scope of this document to explain
|
||||
what precisely these implications are. PHPWact provides
|
||||
a very good <a href="http://www.phpwact.org/php/i18n/utf-8">reference document</a>
|
||||
on what to expect from each functions, although coverage is spotty in
|
||||
on what to expect from each function, although coverage is spotty in
|
||||
some areas. Their more general notes on
|
||||
<a href="http://www.phpwact.org/php/i18n/charsets">character sets</a>
|
||||
are also worth looking at for information on UTF-8. Some rules of thumb
|
||||
|
Loading…
Reference in New Issue
Block a user