From 159a1cced1b3f92bc080dac305262aa57ea16db6 Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
And finally, we get to HTML Purifier.
+And finally, we get to HTML Purifier. HTML Purifier is built to
+deal with UTF-8: any indications otherwise are the result of an
+encoder that converts text from your preferred encoding to UTF-8, and
+back again. HTML Purifier never touches anything else, and leaves
+it up to the module iconv to do the dirty work.
+
+This approach, however, is not perfect. iconv is blithely unaware
+of HTML character entities. HTML Purifier, in order to
+protect against sophisticated escaping schemes, normalizes all character
+and numeric entities before processing the text. This leads to
+one important ramification:
+
+Any character that is not supported by the target character
+set, regardless of whether or not it is in the form of a character
+entity or a raw character, will be silently ignored.
+
+Example of this principle at work: say you have &theta;
+in your HTML, but the output is in Latin-1 (which, understandably,
+does not understand Greek). The following process will occur (assuming you've
+set the encoding correctly using %Core.Encoding):
+
+1. Encoder will transform the text from ISO 8859-1 to UTF-8
+   (note that theta is preserved since it doesn't actually use
+   any non-ASCII characters): &theta;
+2. EntityParser will transform all named and numeric
+   character entities to their corresponding raw UTF-8 equivalents:
+   &theta; becomes θ
+3. Encoder now transforms the text back from UTF-8
+   to ISO 8859-1. Since Greek is not supported by ISO 8859-1, it
+   will be either ignored or replaced with a question mark: ?
+
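The three steps above can be reproduced outside HTML Purifier. What follows is a minimal sketch in Python, not HTML Purifier's own code: it assumes Python's standard `html.unescape` as a stand-in for the EntityParser and `bytes.decode` / `str.encode` as a stand-in for the iconv-backed Encoder. The function name `roundtrip_lossy` is hypothetical.

```python
import html


def roundtrip_lossy(raw: bytes, encoding: str = "iso-8859-1") -> bytes:
    """Mimic the decode -> resolve entities -> re-encode pipeline."""
    # Step 1: convert from the document encoding to Unicode text.
    # "&theta;" is plain ASCII, so it passes through unchanged.
    text = raw.decode(encoding)
    # Step 2: resolve named and numeric entities to raw characters.
    # (Unlike a real HTML filter, this also unescapes &lt; and &amp;.)
    text = html.unescape(text)
    # Step 3: convert back. Characters the target encoding cannot
    # represent become "?"; errors="ignore" would drop them instead.
    return text.encode(encoding, errors="replace")


result = roundtrip_lossy(b"caf\xe9 &theta;")
print(result)  # b'caf\xe9 ?'
```

The Latin-1 character survives the round trip, while the theta is lost even though it entered the pipeline as a harmless ASCII entity.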
+This behaviour is quite unsatisfactory. It is a deal-breaker for
+I18N applications, and it can be mildly annoying for the provincial
+soul who occasionally needs a special character. Since 1.4.0, HTML
+Purifier has provided a slightly more palatable workaround using
+%Core.EscapeNonASCIICharacters. The process now looks like:
+
+1. Encoder transforms encoding to UTF-8: &theta;
+2. EntityParser transforms entities: &theta; becomes θ
+3. Encoder replaces all non-ASCII characters
+   with numeric entities: &#952;
+4. Encoder transforms encoding back to
+   original (which is strictly unnecessary for 99% of encodings
+   out there): &#952; (remember, it's all ASCII!)
+
+...which means that this is only good for an occasional foray into
+the land of Unicode characters, and is totally unacceptable for Chinese
+or Japanese texts. The even bigger kicker is that, supposing the
+input encoding was actually ISO-8859-7, which does support
+theta, the character would get entity-ized anyway! (The Encoder does
+not discriminate).
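The workaround, including the ISO-8859-7 kicker, can be illustrated the same way. This is again a hedged sketch in Python rather than HTML Purifier's implementation: `html.unescape` stands in for the EntityParser, the comprehension stands in for the escaping step that %Core.EscapeNonASCIICharacters enables, and `roundtrip_escaped` is a hypothetical name.

```python
import html


def roundtrip_escaped(raw: bytes, encoding: str = "iso-8859-1") -> bytes:
    """Mimic the pipeline with non-ASCII escaping switched on."""
    # Steps 1 and 2: decode to Unicode, then resolve entities.
    text = html.unescape(raw.decode(encoding))
    # Step 3: replace every non-ASCII character with a decimal numeric
    # entity, whether or not the target encoding could represent it.
    escaped = "".join(
        ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text
    )
    # Step 4: convert back. The text is pure ASCII by now, so this is
    # a no-op for any ASCII-compatible encoding.
    return escaped.encode(encoding)


latin1_result = roundtrip_escaped(b"&theta;")
greek_result = roundtrip_escaped("\u03b8".encode("iso-8859-7"), "iso-8859-7")
print(latin1_result)  # b'&#952;'
print(greek_result)   # b'&#952;'
```

Both calls produce the same entity: the raw theta that ISO-8859-7 could have stored as a single byte is escaped just like the one Latin-1 cannot represent.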
+
+The current functionality is about where HTML Purifier will be for
+the rest of eternity. HTML Purifier could attempt to preserve the original
+form of the entities so that they could be substituted back in, only the
+DOM extension kills them off irreversibly. HTML Purifier could also attempt
+to be smart and only convert non-ASCII characters that weren't supported
+by the target encoding, but that would require reimplementing iconv
+with HTML awareness, something I will not do.
+
+So there: either it's UTF-8 or crippled I18N support. Your pick! (And I'm
+not being sarcastic here: some people couldn't care less about other languages.)
Many other developers have already discussed the subject of Unicode,