UTF-8

Filed under End-User

Character encoding and character sets are not, in truth, that difficult to understand. But if you don't understand them, you are going to be caught by surprise by some of HTML Purifier's behavior, namely the fact that it operates in UTF-8, or the limitations of the character encoding transformations it performs. This document will walk you through determining the encoding of your system and how you should handle this information.

Text in this formatting is an aside, interesting tidbits for the curious but not strictly necessary material to do the tutorial. If you read this text, you'll come out with a greater understanding of the underlying issues.

Finding the real encoding

In the beginning, there was ASCII, and things were simple. But they weren't good, for no one could write in Cyrillic or Thai. So there exploded a proliferation of character encodings to remedy the problem by extending the characters ASCII could express. This ridiculously simplified version of the history of character encodings shows us that there are now many character encodings floating around.

A character encoding tells the computer how to interpret raw zeroes and ones into real characters. It usually does this by pairing numbers with characters.
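
To make the pairing concrete, here is a minimal Python sketch (Python is used purely for illustration; it is not part of HTML Purifier). It decodes the same raw byte under two different encodings. The byte value and the two encodings are arbitrary choices for the example.

```python
# One raw byte, with no encoding information attached.
raw = bytes([0xE9])

# The encoding decides which character the number 0xE9 is paired with.
as_western = raw.decode("iso-8859-1")     # Western European
as_cyrillic = raw.decode("windows-1251")  # Cyrillic

print(as_western, as_cyrillic)
```

The byte never changes; only the interpretation does. This is why knowing the real encoding of your text matters.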

There are many different types of character encodings floating around, but the ones we deal with most frequently are ASCII, 8-bit encodings, and Unicode-based encodings.

ASCII is a 7-bit encoding based on the English alphabet.

8-bit encodings are extensions to ASCII that add a potpourri of useful, non-standard characters like é and æ. They can only add 128 characters (the values 128 through 255), so they usually support only one script at a time. When you see a page on the web, chances are it's encoded in one of these encodings.

Unicode-based encodings implement the Unicode standard and include UTF-8, UTF-16 and UCS-2. They go beyond 8 bits (the first two are variable length, while UCS-2 uses a fixed 16 bits per character), and support almost every language in the world. UTF-8 is gaining traction as the dominant international encoding of the web.
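
As a rough illustration of how the size of a character varies, the Python sketch below counts the bytes a few sample characters occupy in UTF-8 and UTF-16. The helper name `encoded_lengths` and the sample characters are made up for this example.

```python
samples = ["A", "é", "€", "😀"]

def encoded_lengths(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for the given string."""
    # "utf-16-le" is used so that no byte-order mark inflates the count.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for ch in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Plain ASCII characters stay at one byte in UTF-8, which is a large part of why it coexists so comfortably with older content.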

The first step of our journey is to find out what the encoding of insert-application-here is. The most reliable way is to ask your browser:

Mozilla Firefox: Tools > Page Info: Encoding
Internet Explorer: View > Encoding: the bulleted item is the unofficial name

Internet Explorer won't give you the MIME (i.e. useful/real) name of the character encoding, so you'll have to look it up using its description. Some common ones:

IE's Description	MIME Name
Windows
Arabic (Windows)	Windows-1256
Baltic (Windows)	Windows-1257
Central European (Windows)	Windows-1250
Cyrillic (Windows)	Windows-1251
Greek (Windows)	Windows-1253
Hebrew (Windows)	Windows-1255
Thai (Windows)	TIS-620
Turkish (Windows)	Windows-1254
Vietnamese (Windows)	Windows-1258
Western European (Windows)	Windows-1252
ISO
Arabic (ISO)	ISO-8859-6
Baltic (ISO)	ISO-8859-4
Central European (ISO)	ISO-8859-2
Cyrillic (ISO)	ISO-8859-5
Estonian (ISO)	ISO-8859-13
Greek (ISO)	ISO-8859-7
Hebrew (ISO-Logical)	ISO-8859-8-I
Hebrew (ISO-Visual)	ISO-8859-8
Latin 9 (ISO)	ISO-8859-15
Turkish (ISO)	ISO-8859-9
Western European (ISO)	ISO-8859-1
Other
Chinese Simplified (GB18030)	GB18030
Chinese Simplified (GB2312)	GB2312
Chinese Simplified (HZ)	HZ
Chinese Traditional (Big5)	Big5
Japanese (Shift-JIS)	Shift_JIS
Japanese (EUC)	EUC-JP
Korean	EUC-KR
Unicode (UTF-8)	UTF-8

Internet Explorer does not recognize some of the more obscure character encodings, and having to look up the real names in a table is a pain, so I recommend using Mozilla Firefox to find out your character encoding.
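
If you have access to the raw bytes of a page (for instance, a file saved to disk), one way to cross-check the browser's answer is to test whether those bytes decode as UTF-8. The Python sketch below does this; `looks_like_utf8` is a hypothetical helper name. Passing the check only makes UTF-8 likely, and failing it does not tell you which 8-bit encoding is actually in use.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes are valid UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

# The same text saved under two different encodings.
utf8_page = "café".encode("utf-8")
latin1_page = "café".encode("iso-8859-1")

print(looks_like_utf8(utf8_page), looks_like_utf8(latin1_page))
```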

Finding the embedded encoding

At this point, you may be asking, "Didn't we already find out our encoding?" Well, as it turns out, there are multiple places where a web developer can specify a character encoding, and one such place is in a META tag:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

You'll find this in the HEAD section of an HTML document. The text to the right of charset= is the "claimed" encoding: the HTML claims to be this encoding, but whether or not this is actually the case depends on other factors. For now, take note of which of the following applies:

The character encoding is the same as the one reported by the browser,
The character encoding is different from the browser's, or
There is no META tag at all! (horror, horror!)
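
The three cases above can also be checked mechanically. The following Python sketch uses the standard library's `html.parser` to pull the claimed encoding out of a page and compare it with the real one; the names `MetaCharsetFinder` and `classify` are invented for this example, and it only understands the http-equiv form of the META tag shown earlier.

```python
from html.parser import HTMLParser

class MetaCharsetFinder(HTMLParser):
    """Records the charset claimed by the first Content-Type META tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "content-type":
            return
        # The content attribute looks like "text/html; charset=UTF-8".
        for part in (attrs.get("content") or "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.lower() == "charset" and value:
                self.charset = value.strip()

def classify(html, real_encoding):
    """Return "match", "mismatch" or "missing" for the embedded encoding."""
    finder = MetaCharsetFinder()
    finder.feed(html)
    if finder.charset is None:
        return "missing"
    if finder.charset.lower() == real_encoding.lower():
        return "match"
    return "mismatch"

page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=UTF-8" /></head></html>')
print(classify(page, "UTF-8"), classify(page, "ISO-8859-1"))
```

The comparison is case-insensitive because encoding names are conventionally treated that way.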

Fixing the embedded encoding

If your META encoding and your real encoding match, savvy! You can skip this section. If they don't...

I have no embedded encoding!

If this is the case, you'll want to add the appropriate META tag to your website. It's as simple as copy-pasting the code snippet above and replacing UTF-8 with the MIME name of your real encoding.

For all those skeptics out there, there is a very good reason why the character encoding should be explicitly stated. When the browser isn't told what the character encoding of a text is, it has to guess, and sometimes the guess is wrong. Hackers can manipulate this guess in order to slip XSS past filters and then fool the browser into executing it as active code. A great example of this is the Google UTF-7 exploit.
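
To see why a wrong guess is dangerous, consider the following Python sketch. The payload and the naive filter are made up for illustration (this is not how HTML Purifier filters input): the bytes contain no angle brackets at all, yet a decoder that assumes UTF-7 turns them into a script tag.

```python
# ASCII-only bytes that spell out markup when interpreted as UTF-7.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

def naive_filter_passes(data: bytes) -> bool:
    """A deliberately simplistic filter: reject anything containing < or >."""
    return b"<" not in data and b">" not in data

# What a consumer sees if it guesses (or is tricked into) UTF-7.
decoded = payload.decode("utf-7")

print(naive_filter_passes(payload), decoded)
```

Stating the encoding explicitly removes the guess, so the filter and the browser agree on what the bytes mean.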

You might be able to get away with not specifying a character encoding with the META tag as long as your webserver sends the right Content-Type header, but why risk it?
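
For reference, the charset in a Content-Type header uses the same syntax as the content attribute of the META tag. A small Python sketch of reading it with the standard library's `email.message` module follows; the helper name `charset_from_content_type` is hypothetical, and the header values are examples rather than output from any particular server.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None when missing.
    return msg.get_content_charset()

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

A header without a charset parameter leaves the browser back at guessing, which is the situation the META tag guards against.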
