UTF-8

Filed under End-User

Character encodings and character sets, in truth, are not that difficult to understand. But if you don't understand them, you are going to be caught by surprise by some of HTML Purifier's behavior, namely the fact that it operates in UTF-8 and the limitations of the character encoding transformations it does. This document will walk you through determining the encoding of your system and how you should handle this information.

Text in this formatting is an aside: interesting tidbits for the curious, but not strictly necessary to complete the tutorial. If you read this text, you'll come out with a greater understanding of the underlying issues.

Finding the real encoding

In the beginning, there was ASCII, and things were simple. But they weren't good, for no one could write in Cyrillic or Thai. So there exploded a proliferation of character encodings to remedy the problem by extending the characters ASCII could express. This ridiculously simplified version of the history of character encodings shows us that there are now many character encodings floating around.

A character encoding tells the computer how to interpret raw zeroes and ones into real characters. It usually does this by pairing numbers with characters.

There are many different types of character encodings floating around, but the ones we deal with most frequently are ASCII, 8-bit encodings, and Unicode-based encodings.
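To make the idea of "pairing numbers with characters" concrete, here is a small sketch in Python (my choice for illustration only; it has nothing to do with HTML Purifier's own code). It shows one byte being read as two different characters under two 8-bit encodings, and one character being stored as different bytes depending on the encoding.

```python
# One byte, two interpretations: the meaning of 0xE9 depends entirely
# on which 8-bit encoding you assume when reading it.
raw = bytes([0xE9])
as_western = raw.decode("iso-8859-1")     # U+00E9, Latin small e with acute
as_cyrillic = raw.decode("windows-1251")  # U+0439, Cyrillic small short i

# One character, different bytes: the same letter is stored differently
# in an 8-bit encoding and in a Unicode-based one.
e_acute = "\u00e9"
latin1_bytes = e_acute.encode("iso-8859-1")  # a single byte
utf8_bytes = e_acute.encode("utf-8")         # two bytes

# Plain ASCII cannot express the character at all.
try:
    e_acute.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
```

The bytes never change; only the table used to look them up does. That is why knowing the real encoding matters so much.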

The first step of our journey is to find out what the encoding of insert-application-here is. The most reliable way is to ask your browser:

Mozilla Firefox
  Tools > Page Info: Encoding
Internet Explorer
  View > Encoding: the bulleted item is the unofficial name

Internet Explorer won't give you the MIME (i.e. useful/real) name of the character encoding, so you'll have to look it up using their description. Some common ones:

IE's Description                 MIME Name

Windows
  Arabic (Windows)               Windows-1256
  Baltic (Windows)               Windows-1257
  Central European (Windows)     Windows-1250
  Cyrillic (Windows)             Windows-1251
  Greek (Windows)                Windows-1253
  Hebrew (Windows)               Windows-1255
  Thai (Windows)                 TIS-620
  Turkish (Windows)              Windows-1254
  Vietnamese (Windows)           Windows-1258
  Western European (Windows)     Windows-1252

ISO
  Arabic (ISO)                   ISO-8859-6
  Baltic (ISO)                   ISO-8859-4
  Central European (ISO)         ISO-8859-2
  Cyrillic (ISO)                 ISO-8859-5
  Estonian (ISO)                 ISO-8859-13
  Greek (ISO)                    ISO-8859-7
  Hebrew (ISO-Logical)           ISO-8859-8-I
  Hebrew (ISO-Visual)            ISO-8859-8
  Latin 9 (ISO)                  ISO-8859-15
  Turkish (ISO)                  ISO-8859-9
  Western European (ISO)         ISO-8859-1

Other
  Chinese Simplified (GB18030)   GB18030
  Chinese Simplified (GB2312)    GB2312
  Chinese Simplified (HZ)        HZ
  Chinese Traditional (Big5)     Big5
  Japanese (Shift-JIS)           Shift_JIS
  Japanese (EUC)                 EUC-JP
  Korean                         EUC-KR
  Unicode (UTF-8)                UTF-8

Internet Explorer does not recognize some of the more obscure character encodings, and having to look up the real names in a table is a pain, so I recommend using Mozilla Firefox to find out your character encoding.

Finding the embedded encoding

At this point, you may be asking, "Didn't we already find out our encoding?" Well, as it turns out, there are multiple places where a web developer can specify a character encoding, and one such place is in a META tag:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

You'll find this in the HEAD section of an HTML document. The text to the right of charset= is the "claimed" encoding: the HTML claims to be in this encoding, but whether or not this is actually the case depends on other factors. For now, take note of which of the following holds for the encoding claimed by your META tag:

  1. The character encoding is the same as the one reported by the browser,
  2. The character encoding is different from the browser's, or
  3. There is no META tag at all! (horror, horror!)
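If you would rather check this with a script than by eye, the three cases can be sketched roughly as follows. This is a minimal illustration in Python, not anything HTML Purifier provides: the function names claimed_encoding and compare_encodings are made up for this example, the regular expression is deliberately crude, and the comparison does not know about encoding aliases (for instance, "latin1" and "ISO-8859-1" would count as different).

```python
import re

# Crude pattern (an assumption of this sketch, not a full HTML parser):
# find "charset=..." somewhere inside a <meta ...> tag.
META_CHARSET = re.compile(
    r"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)

def claimed_encoding(html):
    """Return the encoding named in a META tag, or None if there is none."""
    match = META_CHARSET.search(html)
    return match.group(1) if match else None

def compare_encodings(html, real_encoding):
    """Classify a page as 'match', 'mismatch', or 'missing'.

    real_encoding is the name your browser reported for the page.
    """
    claimed = claimed_encoding(html)
    if claimed is None:
        return "missing"
    if claimed.lower() == real_encoding.lower():
        return "match"
    return "mismatch"
```

For example, calling compare_encodings() on a page containing the META tag shown earlier, with "utf-8" as the real encoding, falls into the first case.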

Fixing the embedded encoding

If your META encoding and your real encoding match, savvy! You can skip this section. If they don't...

I have no embedded encoding!

If this is the case, you'll want to add the appropriate META tag to your website. It's as simple as copy-pasting the code snippet above and replacing UTF-8 with the MIME name of your real encoding.

For all those skeptics out there, there is a very good reason why the character encoding should be explicitly stated. When the browser isn't told what the character encoding of a text is, it has to guess, and sometimes the guess is wrong. Hackers can manipulate this guess in order to slip XSS past filters and then fool the browser into executing it as active code. A great example of this is the Google UTF-7 exploit.
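To see why a wrong guess is dangerous, consider this rough Python sketch of the general UTF-7 trick. It is an illustration of the idea only, not a reconstruction of the actual Google exploit, and the "filter" here is a deliberately naive stand-in.

```python
# The same markup a filter would normally catch, written in UTF-7.
# Note that the raw bytes contain no angle brackets at all.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# A naive filter that scans the raw bytes for "<" or ">" finds nothing.
looks_harmless = b"<" not in payload and b">" not in payload

# A browser that guesses the page is UTF-7 reads it very differently.
as_seen_by_browser = payload.decode("utf-7")
```

Here looks_harmless is True, yet as_seen_by_browser is a complete script element. Stating the encoding explicitly takes the guess away from the browser.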

You might be able to get away with not specifying a character encoding with the META tag as long as your webserver sends the right Content-Type header, but why risk it?
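If you are curious what encoding a given Content-Type header names, here is a small Python sketch that reads it using the standard library's email.message module. The helper name charset_from_content_type and the sample header values are made up for this example.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value,
    lowercased, or None if the header does not name one."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

# A header that states its encoding, and one that leaves it out.
explicit = charset_from_content_type("text/html; charset=UTF-8")
silent = charset_from_content_type("text/html")
```

When the header is silent, the browser has to fall back on the META tag or on guessing.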

Huh? The embedded encoding disagrees!

Further Reading

Many other developers have already discussed the subject of Unicode, UTF-8 and internationalization, and I would like to defer to them for a more in-depth look into character sets and encodings.