Character encoding and character sets, in truth, are not that difficult to understand. But if you don't understand them, you are going to be caught by surprise by some of HTML Purifier's behavior, namely the fact that it operates on UTF-8 or the limitations of the character encoding transformations it does. This document will walk you through determining the encoding of your system and how you should handle this information.
Text in this formatting is an aside, interesting tidbits for the curious but not strictly necessary material to do the tutorial. If you read this text, you'll come out with a greater understanding of the underlying issues.
In the beginning, there was ASCII, and things were simple. But they weren't good, for no one could write in Cyrillic or Thai. So there exploded a proliferation of character encodings to remedy the problem by extending the characters ASCII could express. This ridiculously simplified version of the history of character encodings shows us that there are now many character encodings floating around.
A character encoding tells the computer how to interpret raw zeroes and ones into real characters. It usually does this by pairing numbers with characters.
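To make the "pairing numbers with characters" idea concrete, here is a minimal Python sketch (Python is my choice for illustration, not something HTML Purifier uses). It takes one raw byte and reads it under two of the 8-bit encodings from the table further down; the variable names are made up for the example.

```python
# One raw byte, with no encoding attached to it yet.
raw = bytes([0xE9])

# The same byte read under two 8-bit encodings gives two different characters.
as_western = raw.decode("iso-8859-1")     # Latin small letter e with acute
as_cyrillic = raw.decode("windows-1251")  # Cyrillic small letter short i

print(repr(as_western), repr(as_cyrillic))
```

The byte never changes; only the table used to look it up does. That is why knowing the encoding matters.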
There are many different types of character encodings floating around, but the ones we deal with most frequently are ASCII, 8-bit encodings, and Unicode-based encodings.
- ASCII is a 7-bit encoding based on the English alphabet.
- 8-bit encodings are extensions to ASCII that add a potpourri of useful, non-standard characters like é and æ. They can only add 128 characters, so they usually support only one script at a time. When you see a page on the web, chances are it's encoded in one of these encodings.
- Unicode-based encodings implement the Unicode standard and include UTF-8, UTF-16 and UCS-2. They go beyond 8 bits (the first two are variable length, while the third always uses 16 bits), and support almost every language in the world. UTF-8 is gaining traction as the dominant international encoding of the web.
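The differences between these three families can be sketched in a few lines of Python. The helper `encoded_length` is a hypothetical name invented for this example; it reports how many bytes a character needs in a given encoding, or `None` when the encoding cannot express it at all.

```python
from typing import Optional


def encoded_length(char: str, encoding: str) -> Optional[int]:
    """Return the number of bytes `char` takes in `encoding`, or None if unsupported."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None


samples = ["A", "\u00e9", "\u0e01", "\U0001d11e"]  # A, e-acute, Thai ko kai, a musical clef
encodings = ["ascii", "iso-8859-1", "utf-8", "utf-16-be"]

for char in samples:
    lengths = {name: encoded_length(char, name) for name in encodings}
    print(f"U+{ord(char):04X}", lengths)
```

Under these assumptions you should see ASCII give up on everything but the plain letter, ISO-8859-1 handle é but not Thai, and the two Unicode encodings handle all four characters with varying byte counts.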
The first step of our journey is to find out what the encoding of insert-application-here is. The most reliable way is to ask your browser:
Internet Explorer won't give you the MIME (i.e. useful/real) name of the character encoding, so you'll have to look it up using their description. Some common ones:
IE's Description | MIME Name |
---|---|
Windows | |
Arabic (Windows) | Windows-1256 |
Baltic (Windows) | Windows-1257 |
Central European (Windows) | Windows-1250 |
Cyrillic (Windows) | Windows-1251 |
Greek (Windows) | Windows-1253 |
Hebrew (Windows) | Windows-1255 |
Thai (Windows) | TIS-620 |
Turkish (Windows) | Windows-1254 |
Vietnamese (Windows) | Windows-1258 |
Western European (Windows) | Windows-1252 |
ISO | |
Arabic (ISO) | ISO-8859-6 |
Baltic (ISO) | ISO-8859-4 |
Central European (ISO) | ISO-8859-2 |
Cyrillic (ISO) | ISO-8859-5 |
Estonian (ISO) | ISO-8859-13 |
Greek (ISO) | ISO-8859-7 |
Hebrew (ISO-Logical) | ISO-8859-8-I |
Hebrew (ISO-Visual) | ISO-8859-8 |
Latin 9 (ISO) | ISO-8859-15 |
Turkish (ISO) | ISO-8859-9 |
Western European (ISO) | ISO-8859-1 |
Other | |
Chinese Simplified (GB18030) | GB18030 |
Chinese Simplified (GB2312) | GB2312 |
Chinese Simplified (HZ) | HZ |
Chinese Traditional (Big5) | Big5 |
Japanese (Shift-JIS) | Shift_JIS |
Japanese (EUC) | EUC-JP |
Korean | EUC-KR |
Unicode (UTF-8) | UTF-8 |
Internet Explorer does not recognize some of the more obscure character encodings, and having to look up the real names with a table is a pain, so I recommend using Mozilla Firefox to find out your character encoding.
At this point, you may be asking, "Didn't we already find out our encoding?" Well, as it turns out, there are multiple places where a web developer can specify a character encoding, and one such place is in a META tag:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
You'll find this in the HEAD section of an HTML document.
The text to the right of charset= is the "claimed" encoding: the HTML claims to be this encoding, but whether or not this is actually the case depends on other factors. For now, take note of whether your META tag claims:
- the same encoding that your browser reported,
- a different encoding from the one your browser reported, or
- nothing, because the page has no META tag at all! (horror, horror!)

If your META encoding and your real encoding match, savvy! You can skip this section. If they don't...
If your page has no META tag at all, you'll want to add the appropriate META tag to your website. It's as simple as copy-pasting the code snippet above and replacing UTF-8 with the MIME name of your real encoding.
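If you'd rather not eyeball the page source, the comparison between claimed and real encoding can be roughed out in Python. This is only a sketch using the standard library's `html.parser`; the names `MetaCharsetFinder`, `claimed_encoding` and `decodes_as` are hypothetical helpers made up for this example. Keep in mind that a successful decode is only weak evidence: almost any byte sequence decodes under an 8-bit encoding, so the check is mostly useful for catching pages that claim UTF-8 but aren't.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaCharsetFinder(HTMLParser):
    """Records the charset claimed by the first META tag that names one."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if attr.get("charset"):
            # Short form: <meta charset="...">
            self.charset = attr["charset"].strip()
        elif (attr.get("http-equiv") or "").lower() == "content-type":
            # Long form: the charset is a parameter inside the content attribute.
            for part in (attr.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value.strip():
                    self.charset = value.strip()


def claimed_encoding(raw_page: bytes) -> Optional[str]:
    """Return the encoding the page claims in its META tag, or None if absent."""
    finder = MetaCharsetFinder()
    # ISO-8859-1 maps every byte to a character, so this decode cannot fail;
    # the META tag itself is plain ASCII and survives intact.
    finder.feed(raw_page.decode("iso-8859-1"))
    return finder.charset


def decodes_as(raw_page: bytes, encoding: str) -> bool:
    """Return True if the bytes are at least valid in the given encoding."""
    try:
        raw_page.decode(encoding)
        return True
    except (UnicodeDecodeError, LookupError):
        return False


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=UTF-8" /></head>'
        b'<body>caf\xe9</body></html>')
claim = claimed_encoding(page)
print(claim, decodes_as(page, claim) if claim else "no META tag")
```

The sample page claims UTF-8 but contains a lone 0xE9 byte, which is not valid UTF-8, so the claim and the real encoding disagree.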
For all those skeptics out there, there is a very good reason why the character encoding should be explicitly stated. When the browser isn't told what the character encoding of a text is, it has to guess: and sometimes the guess is wrong. Hackers can manipulate this guess in order to slip XSS past filters and then fool the browser into executing it as active code. A great example of this is the Google UTF-7 exploit.
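To see why a wrong guess is dangerous, consider this small Python sketch. The payload is an illustrative string I made up, not the one used in the actual Google exploit, and `naive_filter_allows` is a hypothetical stand-in for a filter that only looks for angle brackets in the raw bytes.

```python
# A script tag written in UTF-7: the angle brackets are spelled +ADw- and +AD4-.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"


def naive_filter_allows(raw: bytes) -> bool:
    """Hypothetical filter: accepts anything without a literal angle bracket."""
    return b"<" not in raw and b">" not in raw


# Read as an 8-bit encoding, the payload is harmless-looking text.
as_text = payload.decode("iso-8859-1")

# If the browser guesses UTF-7 instead, the same bytes become active markup.
as_utf7 = payload.decode("utf-7")

print(naive_filter_allows(payload), as_text, as_utf7, sep="\n")
```

The filter waves the payload through because it contains no angle brackets, yet a browser that guesses UTF-7 sees a script tag. Declaring the encoding removes the guess.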
You might be able to get away with not specifying a character encoding with the META tag as long as your webserver sends the right Content-Type header, but why risk it?
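If you want to check what your webserver's header actually claims, the charset parameter can be pulled out of a Content-Type value with Python's standard `email.message` module. This is a sketch only: `header_charset` is a hypothetical helper, and fetching the header from your server is left out.

```python
from email.message import Message
from typing import Optional


def header_charset(content_type: str) -> Optional[str]:
    """Return the lowercased charset from a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(header_charset("text/html; charset=UTF-8"))
print(header_charset("text/html"))
```

A result of `None` means the server is not declaring an encoding, and the META tag is the only thing standing between your page and a guessing browser.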
Many other developers have already discussed the subject of Unicode, UTF-8 and internationalization, and I would like to defer to them for a more in-depth look into character sets and encodings.
FORM submission and i18n by A.J. Flavell discusses the pitfalls of attempting to create an internationalized application without using UTF-8.