diff --git a/docs/enduser-utf8.html b/docs/enduser-utf8.html new file mode 100644 index 00000000..32526f55 --- /dev/null +++ b/docs/enduser-utf8.html @@ -0,0 +1,206 @@ + + + + + + + + +UTF-8 - HTML Purifier + + + +

UTF-8

+ +
Filed under End-User
+
Return to the index.
+ +

Character encodings and character sets are, in truth, not that +difficult to understand. But if you don't understand them, you are going +to be caught by surprise by some of HTML Purifier's behavior, namely +the fact that it operates in UTF-8 and the limitations of the character +encoding transformations it performs. This document will walk you through +determining the encoding of your system and how you should handle +this information.

+ +
Text in this formatting is an aside, + interesting tidbits for the curious but not strictly necessary material to + do the tutorial. If you read this text, you'll come out + with a greater understanding of the underlying issues.
+ +

Finding the real encoding

+ +

In the beginning, there was ASCII, and things were simple. But they +weren't good, for no one could write in Cyrillic or Thai. So there +exploded a proliferation of character encodings to remedy the problem +by extending the characters ASCII could express. This ridiculously +simplified version of the history of character encodings shows us that +there are now many character encodings floating around.

+ +
+

A character encoding tells the computer how to + interpret raw zeroes and ones into real characters. It + usually does this by pairing numbers with characters.

+

There are many different types of character encodings floating + around, but the ones we deal with most frequently are ASCII, + 8-bit encodings, and Unicode-based encodings.
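To make the "pairing numbers with characters" idea concrete, here is a minimal Python sketch (Python is used only for illustration; it is not part of HTML Purifier). It reads the same raw byte under two 8-bit encodings and then shows how UTF-8 stores the same character. The helper name `is_valid_utf8` is made up for this example.

```python
# One raw byte, read under different character encodings.
raw = bytes([0xE9])

as_latin1 = raw.decode("iso-8859-1")      # Western European reading: e with acute accent
as_cyrillic = raw.decode("windows-1251")  # Cyrillic reading: a completely different letter


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same character needs two bytes in UTF-8, so the lone byte above
# is not valid UTF-8 at all.
utf8_bytes = "\u00e9".encode("utf-8")
```

The byte never changes; only the encoding used to interpret it does, which is why knowing the real encoding matters.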

+ +
+ +

The first step of our journey is to find out what the encoding of +insert-application-here is. The most reliable way is to ask your +browser:

+ +
+
Mozilla Firefox
+
Tools > Page Info: Encoding
+
Internet Explorer
+
View > Encoding: the bulleted item is the unofficial name
+
+ +

Internet Explorer won't give you the MIME (i.e. useful/real) name of the +character encoding, so you'll have to look it up using IE's description. +Some common ones:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
IE's Description                 MIME Name
Windows
Arabic (Windows)                 Windows-1256
Baltic (Windows)                 Windows-1257
Central European (Windows)       Windows-1250
Cyrillic (Windows)               Windows-1251
Greek (Windows)                  Windows-1253
Hebrew (Windows)                 Windows-1255
Thai (Windows)                   TIS-620
Turkish (Windows)                Windows-1254
Vietnamese (Windows)             Windows-1258
Western European (Windows)       Windows-1252
ISO
Arabic (ISO)                     ISO-8859-6
Baltic (ISO)                     ISO-8859-4
Central European (ISO)           ISO-8859-2
Cyrillic (ISO)                   ISO-8859-5
Estonian (ISO)                   ISO-8859-13
Greek (ISO)                      ISO-8859-7
Hebrew (ISO-Logical)             ISO-8859-8-I
Hebrew (ISO-Visual)              ISO-8859-8
Latin 9 (ISO)                    ISO-8859-15
Turkish (ISO)                    ISO-8859-9
Western European (ISO)           ISO-8859-1
Other
Chinese Simplified (GB18030)     GB18030
Chinese Simplified (GB2312)      GB2312
Chinese Simplified (HZ)          HZ
Chinese Traditional (Big5)       Big5
Japanese (Shift-JIS)             Shift_JIS
Japanese (EUC)                   EUC-JP
Korean                           EUC-KR
Unicode (UTF-8)                  UTF-8
+ +

Internet Explorer does not recognize some of the more obscure +character encodings, and having to look up the real names in a table +is a pain, so I recommend using Mozilla Firefox to find out your +character encoding.
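If you are stuck with Internet Explorer, the lookup can be sketched as a small dictionary. This is only an illustration in Python: the names `IE_TO_MIME` and `mime_name_for` are made up here, and the dictionary holds just a handful of rows copied from the table above.

```python
from typing import Optional

# A few rows from the table above: IE's description -> MIME name.
IE_TO_MIME = {
    "Western European (Windows)": "Windows-1252",
    "Western European (ISO)": "ISO-8859-1",
    "Cyrillic (Windows)": "Windows-1251",
    "Japanese (Shift-JIS)": "Shift_JIS",
    "Unicode (UTF-8)": "UTF-8",
}


def mime_name_for(description: str) -> Optional[str]:
    """Return the MIME name for an IE description, or None if it is not listed."""
    return IE_TO_MIME.get(description.strip())
```

For example, `mime_name_for("Cyrillic (Windows)")` gives back the name you would put after `charset=`.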

+ +

Finding the embedded encoding

+ +

At this point, you may be asking, "Didn't we already find out our +encoding?" Well, as it turns out, there are multiple places where +a web developer can specify a character encoding, and one such place +is in a META tag:

+ +
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+ +

You'll find this in the HEAD section of an HTML document. +The text to the right of charset= is the "claimed" +encoding: the HTML claims to be this encoding, but whether or not this +is actually the case depends on other factors. For now, take note +of which of these cases applies to your META tag:

+ +
    +
  1. The character encoding is the same as the one reported by the browser,
  2. The character encoding is different from the browser's, or
  3. There is no META tag at all! (horror, horror!)
+ +
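The three cases above can also be checked mechanically. The following is a rough Python sketch, not anything HTML Purifier ships: `MetaCharsetFinder` and `classify` are hypothetical names, and `real_encoding` is whatever name your browser reported in the previous step.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Collect the charset claimed by the first Content-Type META tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        # content looks like: text/html; charset=UTF-8
        for part in (attributes.get("content") or "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.strip().lower() == "charset" and value.strip():
                self.charset = value.strip()


def classify(html_text: str, real_encoding: str) -> str:
    """Return 'match', 'mismatch', or 'missing' for the embedded encoding."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    if finder.charset is None:
        return "missing"
    if finder.charset.lower() == real_encoding.lower():
        return "match"
    return "mismatch"
```

The comparison ignores case because encoding names are case-insensitive; it does not try to resolve aliases, so two different spellings of the same encoding would be reported as a mismatch.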

Fixing the embedded encoding

+ +

If your META encoding and your real encoding match, +savvy! You can skip this section. If they don't...

+ +

I have no embedded encoding!

+ +

If this is the case, you'll want to add in the appropriate +META tag to your website. It's as simple as copy-pasting +the code snippet above and replacing UTF-8 with the MIME name +of your real encoding.
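If your pages come out of a script rather than a hand-edited file, the same substitution can be sketched in a few lines of Python. The function name `meta_tag` is invented for this example; the only thing it does is drop the MIME name into the snippet shown earlier.

```python
from html import escape


def meta_tag(mime_name: str) -> str:
    """Build the Content-Type META tag for the given encoding name."""
    # escape() guards against stray quotes or angle brackets in the name.
    charset = escape(mime_name.strip(), quote=True)
    return '<meta http-equiv="Content-Type" content="text/html; charset=%s" />' % charset
```

Calling `meta_tag("ISO-8859-1")` produces the tag to paste into the HEAD section of a page whose real encoding is ISO-8859-1.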

+ +
+

For all those skeptics out there, there is a very good reason + why the character encoding should be explicitly stated. When the + browser isn't told what the character encoding of a text is, it + has to guess, and sometimes the guess is wrong. Hackers can manipulate + this guess in order to slip XSS past filters and then fool the + browser into executing it as active code. A great example of this + is the Google UTF-7 + exploit.
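Here is a rough Python sketch of why a wrong guess is dangerous. It is an illustration of the general idea, not a reconstruction of the actual exploit: `naive_filter_allows` is a made-up filter that only looks for literal angle brackets, and the payload is a hand-written UTF-7 string.

```python
def naive_filter_allows(data: bytes) -> bool:
    """Hypothetical filter: reject anything containing a literal < or > byte."""
    return b"<" not in data and b">" not in data


# Plain ASCII bytes with no angle brackets anywhere in them.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# What the text becomes if something decides to read it as UTF-7.
as_utf7 = payload.decode("utf-7")
```

The filter sees nothing suspicious, yet a reader that guesses UTF-7 ends up with a script tag. Stating the encoding explicitly takes the guess away.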

+

You might be able to get away with not specifying a character + encoding with the META tag as long as your webserver + sends the right Content-Type header, but why risk it?
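If you want to see what your webserver claims, the charset can be pulled out of a Content-Type header value with Python's standard library. This is a small sketch; `charset_from_content_type` is a name invented here, and how you obtain the header value in the first place is left to you.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset from a Content-Type value, or None if absent."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_charset()
```

A header of `text/html; charset=UTF-8` yields `utf-8`, while a bare `text/html` yields `None`, which is exactly the situation where the browser is left guessing.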

+
+ +

Huh? The embedded encoding disagrees!

+ +

Further Reading

+ +

Many other developers have already discussed the subject of Unicode, +UTF-8 and internationalization, and I would like to defer to them for +a more in-depth look into character sets and encodings.

+ + + + + \ No newline at end of file