From 159a1cced1b3f92bc080dac305262aa57ea16db6 Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
Date: Tue, 23 Jan 2007 03:27:10 +0000
Subject: [PATCH] Complete HTML Purifier segment.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@693 48356398-32a2-884e-a903-53898d9a118a
---
 docs/enduser-utf8.html | 72 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 71 insertions(+), 1 deletion(-)

diff --git a/docs/enduser-utf8.html b/docs/enduser-utf8.html
index e9a5bc88..fda59c20 100644
--- a/docs/enduser-utf8.html
+++ b/docs/enduser-utf8.html
@@ -604,7 +604,75 @@ hounding you about broken pages.

HTML Purifier

-

And finally, we get to HTML Purifier.

+

And finally, we get to HTML Purifier. HTML Purifier is built to
+deal with UTF-8: any indications otherwise are the result of an
+encoder that converts text from your preferred encoding to UTF-8, and
+back again. HTML Purifier never touches anything else, and leaves
+it up to the module iconv to do the dirty work.

+ +

This approach, however, is not perfect. iconv is blithely unaware
+of HTML character entities. HTML Purifier, in order to
+protect against sophisticated escaping schemes, normalizes all character
+and numeric entities before processing the text. This leads to
+one important ramification:

+ +

Any character that is not supported by the target character
+set, regardless of whether or not it is in the form of a character
+entity or a raw character, will be silently ignored.

+ +

Here is an example of this principle at work: say you have θ
+in your HTML, but the output is in Latin-1 (which, understandably,
+does not understand Greek). The following process will occur (assuming you've
+set the encoding correctly using %Core.Encoding):

+ + + +
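The silent loss described above can be approximated in a few lines of Python. This is only a sketch and not HTML Purifier's code: `purify_to_charset` is a hypothetical name, `html.unescape` stands in for entity normalization (it also decodes special entities such as `&lt;`, which a real purifier would not), and `errors="ignore"` stands in for an iconv conversion that discards unsupported characters.

```python
import html


def purify_to_charset(text, charset):
    """Hypothetical stand-in for the normalize-then-convert pipeline."""
    # Entities become raw characters first, so an entity and a raw
    # character are indistinguishable by the time conversion happens.
    normalized = html.unescape(text)
    # Characters the target charset cannot represent are dropped
    # without any error or warning.
    return normalized.encode(charset, errors="ignore")


# Both spellings of theta vanish when the target is Latin-1,
# while a character Latin-1 does support survives.
print(purify_to_charset("&theta;", "latin-1"))
print(purify_to_charset("θ", "latin-1"))
print(purify_to_charset("caf&eacute;", "latin-1"))
```

Whether the input used an entity or a raw character makes no difference to the result, which is the ramification stated in the quoted rule.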

This behaviour is quite unsatisfactory. It is a deal-breaker for
+I18N applications, and it can be mildly annoying for the provincial
+soul who occasionally needs a special character. Since 1.4.0, HTML
+Purifier has provided a slightly more palatable workaround using
+%Core.EscapeNonASCIICharacters. The process now looks like:

+ + + +

...which means that this is only good for an occasional foray into
+the land of Unicode characters, and is totally unacceptable for Chinese
+or Japanese texts. The even bigger kicker is that, supposing the
+input encoding was actually ISO-8859-7, which does support
+theta, the character would get entity-ized anyway! (The Encoder does
+not discriminate.)
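The trade-off can be illustrated with a short Python sketch. It is an approximation under stated assumptions, not the library's implementation: `escape_non_ascii` is a hypothetical name, and Python's `xmlcharrefreplace` error handler stands in for whatever %Core.EscapeNonASCIICharacters does internally.

```python
def escape_non_ascii(text):
    """Hypothetical stand-in: turn every non-ASCII character into a numeric entity."""
    # The escaping ignores the target charset entirely, so a character
    # is converted even if the output encoding could have held it.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Theta survives as an entity instead of being dropped.
print(escape_non_ascii("θ"))

# ISO-8859-7 can represent theta as a single byte, yet the escaped
# form is used regardless, as the paragraph above complains.
print(len("θ".encode("iso-8859-7")), len(escape_non_ascii("θ")))

# For text that is almost entirely non-ASCII, the escaped form is far
# longer than the UTF-8 bytes it replaces.
sample = "日本語"
print(len(sample.encode("utf-8")), len(escape_non_ascii(sample)))
```

The length comparison is the reason the workaround suits an occasional special character but not whole documents in Chinese or Japanese.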

+ +

The current functionality is about where HTML Purifier will be for
+the rest of eternity. HTML Purifier could attempt to preserve the original
+form of the entities so that they could be substituted back in, but the
+DOM extension kills them off irreversibly. HTML Purifier could also attempt
+to be smart and only convert non-ASCII characters that weren't supported
+by the target encoding, but that would require reimplementing iconv
+with HTML awareness, something I will not do.

+ +

So there: either it's UTF-8 or crippled I18N support. Your pick! (And I'm
+not being sarcastic here: some people couldn't care less about other languages.)

Migrate to UTF-8

@@ -618,6 +686,8 @@ hounding you about broken pages.

Dealing with variable width in functions

+

Fonts

+

Many other developers have already discussed the subject of Unicode,