diff --git a/docs/enduser-utf8.html b/docs/enduser-utf8.html new file mode 100644 index 00000000..32526f55 --- /dev/null +++ b/docs/enduser-utf8.html @@ -0,0 +1,206 @@ + + +
+ + + + + +Character encoding and character sets, in truth, are not that +difficult to understand. But if you don't understand them, you are going +to be caught by surprise by some of HTML Purifier's behavior, namely +the fact that it operates on UTF-8 or the limitations of the character +encoding transformations it does. This document will walk you through +determining the encoding of your system and how you should handle +this information.
+ +Text in this formatting is an aside, + interesting tidbits for the curious but not strictly necessary material to + do the tutorial. If you read this text, you'll come out + with a greater understanding of the underlying issues.+ +
In the beginning, there was ASCII, and things were simple. But they +weren't good, for no one could write in Cyrillic or Thai. So there +exploded a proliferation of character encodings to remedy the problem +by extending the characters ASCII could express. This ridiculously +simplified version of the history of character encodings shows us that +there are now many character encodings floating around.
+ +++ +A character encoding tells the computer how to + interpret raw zeroes and ones into real characters. It + usually does this by pairing numbers with characters.
+There are many different types of character encodings floating + around, but the ones we deal most frequently with are ASCII, + 8-bit encodings, and Unicode-based encodings.
++
+- ASCII is a 7-bit encoding based on the + English alphabet.
+- 8-bit encodings are extensions to ASCII + that add a potpourri of useful, non-standard characters + like é and æ. They can only add 128 characters, + so usually only support one script at a time. When you + see a page on the web, chances are it's encoded in one + of these encodings.
+- Unicode-based encodings implement the + Unicode standard and include UTF-8, UCS-2 and UTF-16. + They go beyond 8 bits (UTF-8 and UTF-16 are variable length, + while UCS-2 uses a fixed 16 bits), and support almost + every language in the world. UTF-8 is gaining traction + as the dominant international encoding of the web.
+
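As a rough illustration of the three families listed above, the following Python sketch encodes one character several ways. The choice of "é" and of the specific codecs is only for demonstration; nothing here is specific to HTML Purifier.

```python
# How one character looks under the three encoding families described above.
text = "é"

# ASCII is 7-bit, so it has no slot for "é" at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

latin1_bytes = text.encode("iso-8859-1")  # 8-bit encoding: one byte
utf8_bytes = text.encode("utf-8")         # variable length: two bytes here
utf16_bytes = text.encode("utf-16-be")    # one 16-bit unit, big-endian, no BOM

print(ascii_ok)            # False
print(latin1_bytes.hex())  # e9
print(utf8_bytes.hex())    # c3a9
print(utf16_bytes.hex())   # 00e9

# The same byte means different things in different 8-bit encodings,
# which is why each one usually supports only one script at a time.
western = b"\xe9".decode("iso-8859-1")  # é
greek = b"\xe9".decode("iso-8859-7")    # Greek small letter iota
print(western, greek)
```

The last two lines show why knowing the real encoding matters: the byte itself carries no hint about which table to use.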
The first step of our journey is to find out what the encoding of +insert-application-here is. The most reliable way is to ask your +browser:
+ +Internet Explorer won't give you the mime (i.e. useful/real) name of the +character encoding, so you'll have to look it up using their description. +Some common ones:
| IE's Description | Mime Name |
|---|---|
| Windows | |
| Arabic (Windows) | Windows-1256 |
| Baltic (Windows) | Windows-1257 |
| Central European (Windows) | Windows-1250 |
| Cyrillic (Windows) | Windows-1251 |
| Greek (Windows) | Windows-1253 |
| Hebrew (Windows) | Windows-1255 |
| Thai (Windows) | TIS-620 |
| Turkish (Windows) | Windows-1254 |
| Vietnamese (Windows) | Windows-1258 |
| Western European (Windows) | Windows-1252 |
| ISO | |
| Arabic (ISO) | ISO-8859-6 |
| Baltic (ISO) | ISO-8859-4 |
| Central European (ISO) | ISO-8859-2 |
| Cyrillic (ISO) | ISO-8859-5 |
| Estonian (ISO) | ISO-8859-13 |
| Greek (ISO) | ISO-8859-7 |
| Hebrew (ISO-Logical) | ISO-8859-8-I |
| Hebrew (ISO-Visual) | ISO-8859-8 |
| Latin 9 (ISO) | ISO-8859-15 |
| Turkish (ISO) | ISO-8859-9 |
| Western European (ISO) | ISO-8859-1 |
| Other | |
| Chinese Simplified (GB18030) | GB18030 |
| Chinese Simplified (GB2312) | GB2312 |
| Chinese Simplified (HZ) | HZ |
| Chinese Traditional (Big5) | Big5 |
| Japanese (Shift-JIS) | Shift_JIS |
| Japanese (EUC) | EUC-JP |
| Korean | EUC-KR |
| Unicode (UTF-8) | UTF-8 |
Internet Explorer does not recognize some of the more obscure +character encodings, and having to look up the real names with a table +is a pain, so I recommend using Mozilla Firefox to find out your +character encoding.
+ +At this point, you may be asking, "Didn't we already find out our
+encoding?" Well, as it turns out, there are multiple places where
+a web developer can specify a character encoding, and one such place
+is in a META
tag:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />+ +
You'll find this in the HEAD
section of an HTML document.
+The text to the right of charset=
is the "claimed"
+encoding: the HTML claims to be this encoding, but whether or not this
+is actually the case depends on other factors. For now, take note
+if your META
tag claims that either:
META
tag at all! (horror, horror!)

If your META
encoding and your real encoding match,
+savvy! You can skip this section. If they don't...
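The comparison described above, between the encoding the META tag claims and the real encoding, can also be sketched in code. The following is a minimal Python illustration, not part of HTML Purifier: the names `MetaCharsetFinder`, `claimed_charset` and `compare` are hypothetical, and the "real" encoding is assumed to be whatever your browser reported.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Collects the charset claimed by a Content-Type META tag, if any."""

    def __init__(self):
        super().__init__()
        self.claimed = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.claimed is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "content-type":
            return
        content = attrs.get("content") or ""
        # The claimed encoding is the text to the right of "charset=".
        for part in content.split(";"):
            name, _, value = part.strip().partition("=")
            if name.strip().lower() == "charset" and value:
                self.claimed = value.strip()


def claimed_charset(html_text):
    """Return the charset named in the META tag, or None if there is none."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.claimed


def compare(html_text, real_encoding):
    """Classify a page as 'match', 'mismatch' or 'missing' (hypothetical labels)."""
    claimed = claimed_charset(html_text)
    if claimed is None:
        return "missing"
    return "match" if claimed.lower() == real_encoding.lower() else "mismatch"


page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=UTF-8" /></head><body></body></html>')
print(claimed_charset(page))                         # UTF-8
print(compare(page, "utf-8"))                        # match
print(compare(page, "ISO-8859-1"))                   # mismatch
print(compare("<html><head></head></html>", "UTF-8"))  # missing
```

The three return values correspond to the three situations the tutorial asks you to take note of.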
If this is the case, you'll want to add in the appropriate
+META
tag to your website. It's as simple as copy-pasting
+the code snippet above and replacing UTF-8 with whatever is the mime name
+of your real encoding.
++ +For all those skeptics out there, there is a very good reason + why the character encoding should be explicitly stated. When the + browser isn't told what the character encoding of a text is, it + has to guess: and sometimes the guess is wrong. Hackers can manipulate + this guess in order to slip XSS past filters and then fool the + browser into executing it as active code. A great example of this + is the Google UTF-7 + exploit.
+You might be able to get away with not specifying a character + encoding with the
+META
tag as long as your webserver + sends the right Content-Type header, but why risk it?
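To make the guessing problem concrete, here is a small Python sketch of the general mechanism behind UTF-7 based attacks. It is an illustration only, not a reproduction of the Google exploit, and the payload is a made-up example.

```python
# Bytes that contain no angle brackets when read as plain ASCII.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

as_ascii = payload.decode("ascii")
# A naive filter looking for "<" finds nothing to remove.
looks_harmless = "<" not in as_ascii

# If the browser guesses UTF-7 instead, the very same bytes become markup.
as_utf7 = payload.decode("utf-7")

print(looks_harmless)  # True
print(as_utf7)         # <script>alert(1)</script>
```

Stating the encoding explicitly removes the guess, so the bytes are interpreted the same way by the filter and by the browser.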
Many other developers have already discussed the subject of Unicode, +UTF-8 and internationalization, and I would like to defer to them for +a more in-depth look into character sets and encodings.
+ +FORM
submission and i18n by A.J. Flavell,
+ discusses the pitfalls of attempting to create an internationalized
+ application without using UTF-8.