mirror of https://github.com/ezyang/htmlpurifier.git synced 2024-12-22 08:21:52 +00:00

[2.1.2] Merge in Brett Zamir's patches.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1397 48356398-32a2-884e-a903-53898d9a118a
Edward Z. Yang 2007-08-26 18:20:46 +00:00
parent e45cc503a2
commit 29c3c21b34
5 changed files with 59 additions and 36 deletions

NEWS

@ -27,6 +27,7 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
- Hammer out a bunch of edge-case bugs in the standalone distribution - Hammer out a bunch of edge-case bugs in the standalone distribution
- Inclusion reflection removed from URISchemeRegistry; you must manually - Inclusion reflection removed from URISchemeRegistry; you must manually
include any new schema files you wish to use include any new schema files you wish to use
- Numerous typo fixes in documentation thanks to Brett Zamir
. Unit test refactoring for one logical test per test function . Unit test refactoring for one logical test per test function
. Config and context parameters in ComplexHarness deprecated: instead, edit . Config and context parameters in ComplexHarness deprecated: instead, edit
the $config and $context member variables the $config and $context member variables


@ -32,7 +32,7 @@
Before we even write any code, it is paramount to consider whether or Before we even write any code, it is paramount to consider whether or
not the code we're writing is necessary or not. HTML Purifier, by default, not the code we're writing is necessary or not. HTML Purifier, by default,
contains a large set of elements and attributes: large enough so that contains a large set of elements and attributes: large enough so that
<em>any</em> element or attribute in XHTML 1.0 (and its HTML variant) <em>any</em> element or attribute in XHTML 1.0 or 1.1 (and its HTML variants)
that can be safely used by the general public is implemented. that can be safely used by the general public is implemented.
</p> </p>
@ -76,11 +76,12 @@
<h3>XHTML 1.1</h3> <h3>XHTML 1.1</h3>
<p> <p>
We have not implemented the As of HTMLPurifier 2.1.0, we have implemented the
<a href="http://www.w3.org/TR/2001/REC-ruby-20010531/">Ruby module</a>, <a href="http://www.w3.org/TR/2001/REC-ruby-20010531/">Ruby module</a>,
which defines a set of tags which defines a set of tags
for publishing short annotations for text, used mostly in Japanese for publishing short annotations for text, used mostly in Japanese
and Chinese school texts. and Chinese school texts, but applicable for positioning any text (not
limited to translations) above or below other corresponding text.
</p> </p>
<h3>XHTML 2.0</h3> <h3>XHTML 2.0</h3>
@ -492,10 +493,11 @@ $def =& $config->getHTMLDefinition(true);
<p> <p>
The <code>(%flow;)*</code> indicates the allowed children of the The <code>(%flow;)*</code> indicates the allowed children of the
<code>li</code> tag: <code>li</code> allows any number of flow <code>li</code> tag: <code>li</code> allows any number of flow
elements as its children. In HTML Purifier, we'd write it like elements as its children. (The <code>- O</code> allows the closing tag to be
<code>Flow</code> (here's where the content sets we were omitted, though in XML this is not allowed.) In HTML Purifier,
discussing earlier come into play). There are three shorthand content models you we'd write it like <code>Flow</code> (here's where the content sets
can specify: we were discussing earlier come into play). There are three shorthand
content models you can specify:
</p> </p>
<table class="table"> <table class="table">
@ -668,12 +670,22 @@ $def =& $config->getHTMLDefinition(true);
Common is a combination of the above-mentioned collections. Common is a combination of the above-mentioned collections.
</p> </p>
<p class="aside">
Readers familiar with the modularization may have noticed that the Core
attribute collection differs from that specified by the <a
href="http://www.w3.org/TR/xhtml-modularization/abstract_modules.html#s_commonatts">abstract
modules of the XHTML Modularization 1.1</a>. We believe this section
to be in error, as <code>br</code> permits the use of the <code>style</code>
attribute even though it uses the <code>Core</code> collection, and
the DTD and XML Schemas supplied by W3C support our interpretation.
</p>
<h3>Attributes</h3> <h3>Attributes</h3>
<p> <p>
If you didn't read the <a href="#addAttribute">previous section on If you didn't read the <a href="#addAttribute">earlier section on
adding attributes</a>, read it now. The last parameter is simply adding attributes</a>, read it now. The last parameter is simply
array of attribute names to attribute implementations, in the exact an array of attribute names to attribute implementations, in the exact
same format as <code>addAttribute()</code>. same format as <code>addAttribute()</code>.
</p> </p>


@ -58,7 +58,7 @@ appear elsewhere on the document. The method is simple:</p>
<pre>$config->set('HTML', 'EnableAttrID', true); <pre>$config->set('HTML', 'EnableAttrID', true);
$config->set('Attr', 'IDBlacklist', array( $config->set('Attr', 'IDBlacklist', array(
'list', 'of', 'attributes', 'that', 'are', 'forbidden' 'list', 'of', 'attribute', 'values', 'that', 'are', 'forbidden'
));</pre> ));</pre>
<p>That being said, there are some notable drawbacks. First of all, you have to <p>That being said, there are some notable drawbacks. First of all, you have to
@ -71,9 +71,9 @@ to possible standards-compliance issues.</p>
<p>Furthermore, this position becomes untenable when a single web page must hold <p>Furthermore, this position becomes untenable when a single web page must hold
multiple portions of user-submitted content. Since there's obviously no way multiple portions of user-submitted content. Since there's obviously no way
to find out before-hand what IDs users will use, the blacklist is helpless. to find out before-hand what IDs users will use, the blacklist is helpless.
And even since HTML Purifier validates each segment seperately, perhaps doing And since HTML Purifier validates each segment separately, perhaps doing
so at different times, it would be extremely difficult to dynamically update so at different times, it would be extremely difficult to dynamically update
the blacklist inbetween runs.</p> the blacklist in between runs.</p>
<p>Finally, simply destroying the ID is extremely un-userfriendly behavior: after <p>Finally, simply destroying the ID is extremely un-userfriendly behavior: after
all, they might have simply specified a duplicate ID by accident.</p> all, they might have simply specified a duplicate ID by accident.</p>
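The multi-segment drawback described in this hunk can be illustrated with a small self-contained sketch. It does not call HTML Purifier: `filter_ids()` and the sample IDs are hypothetical and only mimic what a blacklist does to the IDs in one piece of user content.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: drop every ID that
// appears on the blacklist and keep the rest in their original order.
function filter_ids(array $ids, array $blacklist)
{
    return array_values(array_diff($ids, $blacklist));
}

// IDs the surrounding page template already uses (assumed for illustration).
$blacklist = array('header', 'footer');

// Two user-submitted segments, each validated on its own run.
$segmentA = filter_ids(array('header', 'intro'), $blacklist);
$segmentB = filter_ids(array('intro', 'notes'), $blacklist);

// Both segments pass the blacklist, yet 'intro' now occurs twice on the page.
$collisions = array_values(array_intersect($segmentA, $segmentB));
```

Each run only knows about the fixed blacklist, so an ID that is harmless within one segment still collides once both segments are placed on the same page.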


@ -22,7 +22,7 @@ out:</p>
<p class="emphasis">This ain't HTML Tidy!</p> <p class="emphasis">This ain't HTML Tidy!</p>
<p>Rather, Tidy stands for a cool set of Tidy-inspired in HTML Purifier <p>Rather, Tidy stands for a cool set of Tidy-inspired features in HTML Purifier
that allows users to submit deprecated elements and attributes and get that allows users to submit deprecated elements and attributes and get
valid strict markup back. For example:</p> valid strict markup back. For example:</p>
@ -33,8 +33,8 @@ valid strict markup back. For example:</p>
<pre>&lt;div style=&quot;text-align:center;&quot;&gt;Centered&lt;/div&gt;</pre> <pre>&lt;div style=&quot;text-align:center;&quot;&gt;Centered&lt;/div&gt;</pre>
<p>...when this particular fix is run on the HTML. This tutorial will give <p>...when this particular fix is run on the HTML. This tutorial will give
you down the lowdown of what exactly HTML Purifier will do when Tidy you the lowdown of what exactly HTML Purifier will do when Tidy
is on, and how to fine tune this behavior. Once again, <strong>you do is on, and how to fine-tune this behavior. Once again, <strong>you do
not need Tidy installed on your PHP to use these features!</strong></p> not need Tidy installed on your PHP to use these features!</strong></p>
<h2>What does it do?</h2> <h2>What does it do?</h2>
@ -221,7 +221,7 @@ general syntax:</p>
<p>The lowdown is, quite frankly, HTML Purifier's default settings are <p>The lowdown is, quite frankly, HTML Purifier's default settings are
probably good enough. The next step is to bump the level up to heavy, probably good enough. The next step is to bump the level up to heavy,
and if that still doesn't satisfy your appetite, do some fine tuning. and if that still doesn't satisfy your appetite, do some fine-tuning.
Other than that, don't worry about it: this all works silently and Other than that, don't worry about it: this all works silently and
effectively in the background.</p> effectively in the background.</p>


@ -96,7 +96,7 @@ which can be a rewarding (but difficult) task.</p>
<h2 id="findcharset">Finding the real encoding</h2> <h2 id="findcharset">Finding the real encoding</h2>
<p>In the beginning, there was ASCII, and things were simple. But they <p>In the beginning, there was ASCII, and things were simple. But they
weren't good, for no one could write in Cryllic or Thai. So there weren't good, for no one could write in Cyrillic or Thai. So there
exploded a proliferation of character encodings to remedy the problem exploded a proliferation of character encodings to remedy the problem
by extending the characters ASCII could express. This ridiculously by extending the characters ASCII could express. This ridiculously
simplified version of the history of character encodings shows us that simplified version of the history of character encodings shows us that
@ -138,7 +138,7 @@ browser:</p>
<dd>View &gt; Encoding: bulleted item is unofficial name</dd> <dd>View &gt; Encoding: bulleted item is unofficial name</dd>
</dl> </dl>
<p>Internet Explorer won't give you the mime (i.e. useful/real) name of the <p>Internet Explorer won't give you the MIME (i.e. useful/real) name of the
character encoding, so you'll have to look it up using their description. character encoding, so you'll have to look it up using their description.
Some common ones:</p> Some common ones:</p>
@ -216,6 +216,12 @@ if your <code>META</code> tag claims that either:</p>
<h2 id="fixcharset">Fixing the encoding</h2> <h2 id="fixcharset">Fixing the encoding</h2>
<p class="aside">The advice given here is for pages being served as
vanilla <code>text/html</code>. Different practices must be used
for <code>application/xml</code> or <code>application/xhtml+xml</code>, see
<a href="http://www.w3.org/TR/2002/NOTE-xhtml-media-types-20020430/">W3C's
document on XHTML media types</a> for more information.</p>
<p>If your <code>META</code> encoding and your real encoding match, <p>If your <code>META</code> encoding and your real encoding match,
savvy! You can skip this section. If they don't...</p> savvy! You can skip this section. If they don't...</p>
@ -302,7 +308,8 @@ languages</a>. The appropriate code is:</p>
<p>...replacing UTF-8 with whatever your embedded encoding is. <p>...replacing UTF-8 with whatever your embedded encoding is.
This code must come before any output, so be careful about This code must come before any output, so be careful about
stray whitespace in your application.</p> stray whitespace in your application (i.e., any whitespace before
output excluding whitespace within &lt;?php ?&gt; tags).</p>
<h4 id="fixcharset-server-phpini">PHP ini directive</h4> <h4 id="fixcharset-server-phpini">PHP ini directive</h4>
@ -313,8 +320,8 @@ header call: <code><a href="http://php.net/ini.core#ini.default-charset">default
<p>...will also do the trick. If PHP is running as an Apache module (and <p>...will also do the trick. If PHP is running as an Apache module (and
not as FastCGI, consult not as FastCGI, consult
<a href="http://php.net/phpinfo">phpinfo</a>() for details), you can even use htaccess do apply this property <a href="http://php.net/phpinfo">phpinfo</a>() for details), you can even use htaccess to apply this property
globally:</p> across many PHP files:</p>
<pre><a href="http://php.net/configuration.changes#configuration.changes.apache">php_value</a> default_charset &quot;UTF-8&quot;</pre> <pre><a href="http://php.net/configuration.changes#configuration.changes.apache">php_value</a> default_charset &quot;UTF-8&quot;</pre>
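The two PHP-side approaches touched by the hunks above (an explicit header call and the `default_charset` directive) might look roughly like this in application code. This is a minimal sketch: `content_type_header()` is a hypothetical helper, and UTF-8 stands in for whatever the embedded encoding actually is.

```php
<?php
// Hypothetical helper: build the Content-Type header line for a charset.
function content_type_header($charset, $mime = 'text/html')
{
    return 'Content-Type: ' . $mime . '; charset=' . $charset;
}

$line = content_type_header('UTF-8');

// header() only takes effect before any output has been sent (stray
// whitespace outside the PHP tags counts as output), so guard the call.
if (!headers_sent()) {
    header($line);
}

// Alternative: set the ini directive at runtime so that PHP adds the
// charset to the Content-Type header it sends by default.
ini_set('default_charset', 'UTF-8');
$effective = ini_get('default_charset');
```

Setting the directive in php.ini or .htaccess covers many files at once, while the explicit `header()` call has to be repeated (or centralized in a shared include) for each entry point.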
@ -360,10 +367,11 @@ to send anything at all:</p>
<pre><a href="http://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> Off</pre> <pre><a href="http://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> Off</pre>
<p>...making your <code>META</code> tags the sole source of <p>...making your internal charset declaration (usually the <code>META</code> tags)
character encoding information. In these cases, it is the sole source of character encoding
<em>especially</em> important to make sure you have valid <code>META</code> information. In these cases, it is <em>especially</em> important to make
tags on your pages and all the text before them is ASCII.</p> sure you have valid <code>META</code> tags on your pages and all the
text before them is ASCII.</p>
<blockquote class="aside"><p>These directives can also be <blockquote class="aside"><p>These directives can also be
placed in httpd.conf file for Apache, but placed in httpd.conf file for Apache, but
@ -428,28 +436,30 @@ IIS to change character encodings, I'd be grateful.</p>
<p><code>META</code> tags are the most common source of embedded <p><code>META</code> tags are the most common source of embedded
encodings, but they can also come from somewhere else: XML encodings, but they can also come from somewhere else: XML
processing instructions. They look like:</p> Declarations. They look like:</p>
<pre>&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</pre> <pre>&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</pre>
<p>...and are most often found in XML documents (including XHTML).</p> <p>...and are most often found in XML documents (including XHTML).</p>
<p>For XHTML, this processing instruction theoretically <p>For XHTML, this XML Declaration theoretically
overrides the <code>META</code> tag. In reality, this happens only when the overrides the <code>META</code> tag. In reality, this happens only when the
XHTML is actually served as legit XML and not HTML, which is almost always XHTML is actually served as legit XML and not HTML, which is almost always
never due to Internet Explorer's lack of support for never due to Internet Explorer's lack of support for
<code>application/xhtml+xml</code> (even though doing so is often <code>application/xhtml+xml</code> (even though doing so is often
argued to be <a href="http://www.hixie.ch/advocacy/xhtml">good practice</a>).</p> argued to be <a href="http://www.hixie.ch/advocacy/xhtml">good
practice</a> and is required by the XHTML 1.1 specification).</p>
<p>For XML, however, this processing instruction is extremely important. <p>For XML, however, this XML Declaration is extremely important.
Since most webservers are not configured to send charsets for .xml files, Since most webservers are not configured to send charsets for .xml files,
this is the only thing a parser has to go on. Furthermore, the default this is the only thing a parser has to go on. Furthermore, the default
for XML files is UTF-8, which often butts heads with more common for XML files is UTF-8, which often butts heads with more common
ISO-8859-1 encoding (you see this in garbled RSS feeds).</p> ISO-8859-1 encoding (you see this in garbled RSS feeds).</p>
<p>In short, if you use XHTML and have gone through the <p>In short, if you use XHTML and have gone through the
trouble of adding the XML header, make sure it jibes trouble of adding the XML Declaration, make sure it jibes
with your <code>META</code> tags and HTTP headers.</p> with your <code>META</code> tags (which should only be present
if served in text/html) and HTTP headers.</p>
<h3 id="fixcharset-internals">Inside the process</h3> <h3 id="fixcharset-internals">Inside the process</h3>
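The advice in the hunk above, that the XML Declaration should agree with the `META` tag, can be checked mechanically. The sketch below is one possible approach using regular expressions; both helper functions are hypothetical, the patterns are deliberately simple rather than a full parser, and the HTTP header (which should also agree) is not examined.

```php
<?php
// Hypothetical helper: return the encoding named in a leading XML
// Declaration, uppercased, or null if the document has none.
function xml_declared_encoding($doc)
{
    $pattern = '/^<\?xml[^>]*\bencoding\s*=\s*(["\'])([A-Za-z0-9._-]+)\1/';
    if (preg_match($pattern, $doc, $m)) {
        return strtoupper($m[2]);
    }
    return null;
}

// Hypothetical helper: return the charset given in a META tag,
// uppercased, or null if no such tag is found.
function meta_declared_encoding($doc)
{
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)/i';
    if (preg_match($pattern, $doc, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Sample document whose two declarations disagree.
$doc = '<?xml version="1.0" encoding="ISO-8859-1"?>' . "\n"
     . '<html><head><meta http-equiv="Content-Type" '
     . 'content="text/html; charset=UTF-8" /></head></html>';

$xml  = xml_declared_encoding($doc);
$meta = meta_declared_encoding($doc);

// True when both declarations exist and name different encodings.
$mismatch = ($xml !== null && $meta !== null && $xml !== $meta);
```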
@ -545,7 +555,7 @@ an application that originally used ISO-8859-1 but switched to UTF-8
when it became far too cumbersome to support foreign languages. Bots when it became far too cumbersome to support foreign languages. Bots
will now actually go through articles and convert character entities will now actually go through articles and convert character entities
to their corresponding real characters for the sake of user-friendliness to their corresponding real characters for the sake of user-friendliness
and searcheability. See and searchability. See
<a href="http://meta.wikimedia.org/wiki/Help:Special_characters">Meta's <a href="http://meta.wikimedia.org/wiki/Help:Special_characters">Meta's
page on special characters</a> for more details. page on special characters</a> for more details.
</p></blockquote> </p></blockquote>
@ -609,7 +619,7 @@ since UTF-8 supports every character.</p>
<h4 id="whyutf8-forms-multipart"><code>multipart/form-data</code></h4> <h4 id="whyutf8-forms-multipart"><code>multipart/form-data</code></h4>
<p>Multipart form submission takes a way a lot of the ambiguity <p>Multipart form submission takes away a lot of the ambiguity
that percent-encoding had: the server now can explicitly ask for that percent-encoding had: the server now can explicitly ask for
certain encodings, and the client can explicitly tell the server certain encodings, and the client can explicitly tell the server
during the form submission what encoding the fields are in.</p> during the form submission what encoding the fields are in.</p>
@ -678,7 +688,7 @@ set the encoding correctly using %Core.Encoding):</p>
<ul> <ul>
<li>The <code>Encoder</code> will transform the text from ISO 8859-1 to UTF-8 <li>The <code>Encoder</code> will transform the text from ISO 8859-1 to UTF-8
(note that theta is preserved since it doesn't actually use (note that theta is preserved here since it doesn't actually use
any non-ASCII characters): <code>&amp;theta;</code></li> any non-ASCII characters): <code>&amp;theta;</code></li>
<li>The <code>EntityParser</code> will transform all named and numeric <li>The <code>EntityParser</code> will transform all named and numeric
character entities to their corresponding raw UTF-8 equivalents: character entities to their corresponding raw UTF-8 equivalents:
@ -723,7 +733,7 @@ by the target encoding, but that would require reimplementing iconv
with HTML awareness, something I will not do.</p> with HTML awareness, something I will not do.</p>
<p>So there: either it's UTF-8 or crippled international support. Your pick! (and I'm <p>So there: either it's UTF-8 or crippled international support. Your pick! (and I'm
not being sarcastic here: some people could care less about other languages)</p> not being sarcastic here: some people could care less about other languages).</p>
<h2 id="migrate">Migrate to UTF-8</h2> <h2 id="migrate">Migrate to UTF-8</h2>
@ -985,7 +995,7 @@ and yes, it is variable width. Other traits:</p>
in different ways. It is beyond the scope of this document to explain in different ways. It is beyond the scope of this document to explain
what precisely these implications are. PHPWact provides what precisely these implications are. PHPWact provides
a very good <a href="http://www.phpwact.org/php/i18n/utf-8">reference document</a> a very good <a href="http://www.phpwact.org/php/i18n/utf-8">reference document</a>
on what to expect from each functions, although coverage is spotty in on what to expect from each function, although coverage is spotty in
some areas. Their more general notes on some areas. Their more general notes on
<a href="http://www.phpwact.org/php/i18n/charsets">character sets</a> <a href="http://www.phpwact.org/php/i18n/charsets">character sets</a>
are also worth looking at for information on UTF-8. Some rules of thumb are also worth looking at for information on UTF-8. Some rules of thumb