mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-12-22 16:31:53 +00:00
[2.1.2] Correct usage of entity -> character entity reference.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1398 48356398-32a2-884e-a903-53898d9a118a
This commit is contained in:
parent
29c3c21b34
commit
fb367dc871
@ -516,7 +516,7 @@ usage in one language sometimes requires the occasional special character
|
|||||||
that, without surprise, is not available in your character set. Sometimes
|
that, without surprise, is not available in your character set. Sometimes
|
||||||
developers get around this by adding support for multiple encodings: when
|
developers get around this by adding support for multiple encodings: when
|
||||||
using Chinese, use Big5, when using Japanese, use Shift-JIS, when
|
using Chinese, use Big5, when using Japanese, use Shift-JIS, when
|
||||||
using Greek, etc. Other times, they use character entities with great
|
using Greek, etc. Other times, they use character references with great
|
||||||
zeal.</p>
|
zeal.</p>
|
||||||
|
|
||||||
<p>UTF-8, however, obviates the need for any of these complicated
|
<p>UTF-8, however, obviates the need for any of these complicated
|
||||||
@ -530,14 +530,14 @@ you don't have to use those user-unfriendly entities.</p>
|
|||||||
|
|
||||||
<p>Websites encoded in Latin-1 (ISO-8859-1) which ocassionally need
|
<p>Websites encoded in Latin-1 (ISO-8859-1) which ocassionally need
|
||||||
a special character outside of their scope often will use a character
|
a special character outside of their scope often will use a character
|
||||||
entity to achieve the desired effect. For instance, θ can be
|
entity reference to achieve the desired effect. For instance, θ can be
|
||||||
written <code>&theta;</code>, regardless of the character encoding's
|
written <code>&theta;</code>, regardless of the character encoding's
|
||||||
support of Greek letters.</p>
|
support of Greek letters.</p>
|
||||||
|
|
||||||
<p>This works nicely for limited use of special characters, but
|
<p>This works nicely for limited use of special characters, but
|
||||||
say you wanted this sentence of Chinese text: 激光,
|
say you wanted this sentence of Chinese text: 激光,
|
||||||
這兩個字是甚麼意思.
|
這兩個字是甚麼意思.
|
||||||
The entity-ized version would look like this:</p>
|
The ampersand encoded version would look like this:</p>
|
||||||
|
|
||||||
<pre>&#28608;&#20809;, &#36889;&#20841;&#20491;&#23383;&#26159;&#29978;&#40636;&#24847;&#24605;</pre>
|
<pre>&#28608;&#20809;, &#36889;&#20841;&#20491;&#23383;&#26159;&#29978;&#40636;&#24847;&#24605;</pre>
|
||||||
|
|
||||||
@ -603,7 +603,7 @@ browser you're using, they might:</p>
|
|||||||
<ul>
|
<ul>
|
||||||
<li>Replace the unsupported characters with useless question marks,</li>
|
<li>Replace the unsupported characters with useless question marks,</li>
|
||||||
<li>Attempt to fix the characters (example: smart quotes to regular quotes),</li>
|
<li>Attempt to fix the characters (example: smart quotes to regular quotes),</li>
|
||||||
<li>Replace the character with a character entity, or</li>
|
<li>Replace the character with a character entity reference, or</li>
|
||||||
<li>Send it anyway as a different character encoding mixed in
|
<li>Send it anyway as a different character encoding mixed in
|
||||||
with the original encoding (usually Windows-1252 rather than
|
with the original encoding (usually Windows-1252 rather than
|
||||||
iso-8859-1 or UTF-8 interspersed in 8-bit)</li>
|
iso-8859-1 or UTF-8 interspersed in 8-bit)</li>
|
||||||
@ -632,9 +632,9 @@ Each method has deficiencies, especially the former.</p>
|
|||||||
<p>If you tell the browser to send the form in the same encoding as
|
<p>If you tell the browser to send the form in the same encoding as
|
||||||
the page, you still have the trouble of what to do with characters
|
the page, you still have the trouble of what to do with characters
|
||||||
that are outside of the character encoding's range. The behavior, once
|
that are outside of the character encoding's range. The behavior, once
|
||||||
again, varies: Firefox 2.0 entity-izes them while Internet Explorer
|
again, varies: Firefox 2.0 converts them to character entity references
|
||||||
7.0 mangles them beyond intelligibility. For serious internationalization purposes,
|
while Internet Explorer 7.0 mangles them beyond intelligibility. For
|
||||||
this is not an option.</p>
|
serious internationalization purposes, this is not an option.</p>
|
||||||
|
|
||||||
<p>The other possibility is to set Accept-Encoding to UTF-8, which
|
<p>The other possibility is to set Accept-Encoding to UTF-8, which
|
||||||
begs the question: Why aren't you using UTF-8 for everything then?
|
begs the question: Why aren't you using UTF-8 for everything then?
|
||||||
@ -674,12 +674,12 @@ it up to the module iconv to do the dirty work.</p>
|
|||||||
<p>This approach, however, is not perfect. iconv is blithely unaware
|
<p>This approach, however, is not perfect. iconv is blithely unaware
|
||||||
of HTML character entities. HTML Purifier, in order to
|
of HTML character entities. HTML Purifier, in order to
|
||||||
protect against sophisticated escaping schemes, normalizes all character
|
protect against sophisticated escaping schemes, normalizes all character
|
||||||
and numeric entities before processing the text. This leads to
|
and numeric entitie references before processing the text. This leads to
|
||||||
one important ramification:</p>
|
one important ramification:</p>
|
||||||
|
|
||||||
<p><strong>Any character that is not supported by the target character
|
<p><strong>Any character that is not supported by the target character
|
||||||
set, regardless of whether or not it is in the form of a character
|
set, regardless of whether or not it is in the form of a character
|
||||||
entity or a raw character, will be silently ignored.</strong></p>
|
entity reference or a raw character, will be silently ignored.</strong></p>
|
||||||
|
|
||||||
<p>Example of this principle at work: say you have <code>&theta;</code>
|
<p>Example of this principle at work: say you have <code>&theta;</code>
|
||||||
in your HTML, but the output is in Latin-1 (which, understandably,
|
in your HTML, but the output is in Latin-1 (which, understandably,
|
||||||
@ -711,7 +711,7 @@ Purifier has provided a slightly more palatable workaround using
|
|||||||
<li>The <code>EntityParser</code> transforms entities: <code>θ</code></li>
|
<li>The <code>EntityParser</code> transforms entities: <code>θ</code></li>
|
||||||
<li>HTML Purifier processes the code: <code>θ</code></li>
|
<li>HTML Purifier processes the code: <code>θ</code></li>
|
||||||
<li>The <code>Encoder</code> replaces all non-ASCII characters
|
<li>The <code>Encoder</code> replaces all non-ASCII characters
|
||||||
with numeric entities: <code>&#952;</code></li>
|
with numeric entity reference: <code>&#952;</code></li>
|
||||||
<li>For good measure, <code>Encoder</code> transforms encoding back to
|
<li>For good measure, <code>Encoder</code> transforms encoding back to
|
||||||
original (which is strictly unnecessary for 99% of encodings
|
original (which is strictly unnecessary for 99% of encodings
|
||||||
out there): <code>&#952;</code> (remember, it's all ASCII!)</li>
|
out there): <code>&#952;</code> (remember, it's all ASCII!)</li>
|
||||||
@ -721,12 +721,12 @@ Purifier has provided a slightly more palatable workaround using
|
|||||||
the land of Unicode characters, and is totally unacceptable for Chinese
|
the land of Unicode characters, and is totally unacceptable for Chinese
|
||||||
or Japanese texts. The even bigger kicker is that, supposing the
|
or Japanese texts. The even bigger kicker is that, supposing the
|
||||||
input encoding was actually ISO-8859-7, which <em>does</em> support
|
input encoding was actually ISO-8859-7, which <em>does</em> support
|
||||||
theta, the character would get entity-ized anyway! (The Encoder does
|
theta, the character would get converted into a character entity reference
|
||||||
not discriminate).</p>
|
anyway! (The Encoder does not discriminate).</p>
|
||||||
|
|
||||||
<p>The current functionality is about where HTML Purifier will be for
|
<p>The current functionality is about where HTML Purifier will be for
|
||||||
the rest of eternity. HTML Purifier could attempt to preserve the original
|
the rest of eternity. HTML Purifier could attempt to preserve the original
|
||||||
form of the entities so that they could be substituted back in, only the
|
form of the character references so that they could be substituted back in, only the
|
||||||
DOM extension kills them off irreversibly. HTML Purifier could also attempt
|
DOM extension kills them off irreversibly. HTML Purifier could also attempt
|
||||||
to be smart and only convert non-ASCII characters that weren't supported
|
to be smart and only convert non-ASCII characters that weren't supported
|
||||||
by the target encoding, but that would require reimplementing iconv
|
by the target encoding, but that would require reimplementing iconv
|
||||||
@ -1009,7 +1009,7 @@ when dealing with Unicode text:</p>
|
|||||||
<li>Think twice before using functions that:<ul>
|
<li>Think twice before using functions that:<ul>
|
||||||
<li>...count characters (strlen will return bytes, not characters;
|
<li>...count characters (strlen will return bytes, not characters;
|
||||||
str_split and word_wrap may corrupt)</li>
|
str_split and word_wrap may corrupt)</li>
|
||||||
<li>...entity-ize things (UTF-8 doesn't need entities)</li>
|
<li>...convert characters to entity references (UTF-8 doesn't need entities)</li>
|
||||||
<li>...do very complex string processing (*printf)</li>
|
<li>...do very complex string processing (*printf)</li>
|
||||||
</ul></li>
|
</ul></li>
|
||||||
</ul>
|
</ul>
|
||||||
|
Loading…
Reference in New Issue
Block a user