diff --git a/docs/enduser-utf8.html b/docs/enduser-utf8.html index 9933f1dd..062eed7b 100644 --- a/docs/enduser-utf8.html +++ b/docs/enduser-utf8.html @@ -516,7 +516,7 @@ usage in one language sometimes requires the occasional special character that, without surprise, is not available in your character set. Sometimes developers get around this by adding support for multiple encodings: when using Chinese, use Big5, when using Japanese, use Shift-JIS, when -using Greek, etc. Other times, they use character entities with great +using Greek, etc. Other times, they use character references with great zeal.
UTF-8, however, obviates the need for any of these complicated @@ -530,14 +530,14 @@ you don't have to use those user-unfriendly entities.
Websites encoded in Latin-1 (ISO-8859-1) which ocassionally need
a special character outside of their scope often will use a character
-entity to achieve the desired effect. For instance, θ can be
+entity reference to achieve the desired effect. For instance, θ can be
written &theta;
, regardless of the character encoding's
support of Greek letters.
This works nicely for limited use of special characters, but say you wanted this sentence of Chinese text: 激光, 這兩個字是甚麼意思. -The entity-ized version would look like this:
+The ampersand encoded version would look like this:激光, 這兩個字是甚麼意思@@ -603,7 +603,7 @@ browser you're using, they might:
If you tell the browser to send the form in the same encoding as the page, you still have the trouble of what to do with characters that are outside of the character encoding's range. The behavior, once -again, varies: Firefox 2.0 entity-izes them while Internet Explorer -7.0 mangles them beyond intelligibility. For serious internationalization purposes, -this is not an option.
+again, varies: Firefox 2.0 converts them to character entity references +while Internet Explorer 7.0 mangles them beyond intelligibility. For +serious internationalization purposes, this is not an option.The other possibility is to set Accept-Encoding to UTF-8, which begs the question: Why aren't you using UTF-8 for everything then? @@ -674,12 +674,12 @@ it up to the module iconv to do the dirty work.
This approach, however, is not perfect. iconv is blithely unaware of HTML character entities. HTML Purifier, in order to protect against sophisticated escaping schemes, normalizes all character -and numeric entities before processing the text. This leads to +and numeric entity references before processing the text. This leads to one important ramification:
Any character that is not supported by the target character set, regardless of whether or not it is in the form of a character -entity or a raw character, will be silently ignored.
+entity reference or a raw character, will be silently ignored.Example of this principle at work: say you have θ
in your HTML, but the output is in Latin-1 (which, understandably,
@@ -711,7 +711,7 @@ Purifier has provided a slightly more palatable workaround using
EntityParser
transforms entities: θ
θ
Encoder
replaces all non-ASCII characters
- with numeric entities: θ
θ
Encoder
transforms encoding back to
original (which is strictly unnecessary for 99% of encodings
out there): θ
(remember, it's all ASCII!)The current functionality is about where HTML Purifier will be for the rest of eternity. HTML Purifier could attempt to preserve the original -form of the entities so that they could be substituted back in, only the +form of the character references so that they could be substituted back in, only the DOM extension kills them off irreversibly. HTML Purifier could also attempt to be smart and only convert non-ASCII characters that weren't supported by the target encoding, but that would require reimplementing iconv @@ -1009,7 +1009,7 @@ when dealing with Unicode text: