The first step of our journey is to find out what the encoding of your website is. The most reliable way is to ask your browser. Once you know what you're currently serving, you can set the encoding explicitly from PHP by sending a Content-Type header:
header('Content-Type: text/html; charset=UTF-8');
...replacing UTF-8 with whatever your embedded encoding is. This code must come before any output, so be careful about stray whitespace in your application.
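To make the "before any output" requirement concrete, here is a minimal sketch of how the top of such a page might look (the markup is illustrative, not from the original text):

<?php
// Nothing may precede the opening tag: even one stray space or
// blank line counts as output, triggers a "headers already sent"
// warning, and leaves the charset unset.
header('Content-Type: text/html; charset=UTF-8');
?>
<!DOCTYPE html>
<html>
<head><title>Example page</title></head>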
PHP also has a neat little ini directive that can save you a header() call: default_charset. Using this code:
ini_set('default_charset', 'UTF-8');
...will also do the trick. If PHP is running as an Apache module (and not as FastCGI; consult phpinfo() for details), you can even use .htaccess to apply this setting globally:
php_value default_charset "UTF-8"
As with all INI directives, this can also go in your php.ini file. Some hosting providers allow you to customize your own php.ini file; ask your support staff for details. Use:
default_charset = "utf-8"
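However you end up setting it, a quick sanity check never hurts. This snippet is my own sketch, using PHP's standard ini_get() to confirm which value actually took effect:

<?php
// Prints the default_charset PHP ended up with after the header
// call, .htaccess and php.ini have all had their say.
echo ini_get('default_charset');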
You may, for whatever reason, need to set the character encoding somewhere other than the HTTP headers; just bear in mind that a real Content-Type header will override any encoding embedded in the document, much to the chagrin of HTML authors who can't set these headers.
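For completeness: the usual way to embed an encoding in the document itself is the META tag. This is the conventional HTML 4 form (the exact markup is mine, not quoted from the original):

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />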
So, you've gone through all the trouble of ensuring that your server and embedded characters all line up properly and are present. Good job: at this point, you could quit and rest easy knowing that your pages are not vulnerable to character encoding style XSS attacks. However, just as having a character encoding is better than having no character encoding at all, having UTF-8 as your character encoding is better than having some other random character encoding, and the next step is to convert to UTF-8. But why?
Many software projects, at one point or another, suddenly realize that they should be supporting more than one language. Even regular usage in one language sometimes requires the occasional special character that, unsurprisingly, is not available in your character set. Sometimes developers get around this by adding support for multiple encodings: when using Chinese, use Big5; when using Japanese, use Shift-JIS; when using Greek, etc. Other times, they use character entities with great zeal.
UTF-8, however, obviates the need for any of these complicated measures. After getting the system to use UTF-8 and adjusting for sources that are outside the hands of the browser (more on this later), UTF-8 just works. You can use it for any language, even many languages at once; you don't have to worry about managing multiple encodings; and you don't have to use those user-unfriendly entities.
Websites encoded in Latin-1 (ISO-8859-1) which occasionally need a special character outside of their scope will often use a character entity to achieve the desired effect. For instance, θ can be written &theta;, regardless of whether the character encoding supports Greek letters.
This works nicely for limited use of special characters, but say you wanted this sentence of Chinese text: 激光, 這兩個字是甚麼意思. The entity-ized version would look like this:
&#28608;&#20809;, &#36889;&#20841;&#20491;&#23383;&#26159;&#29978;&#40636;&#24847;&#24605;
Extremely inconvenient for those of us who actually know what character entities are, totally unintelligible to poor users who don't! Even the slightly more user-friendly, "intelligible" character entities like &theta; will leave users who are uninterested in learning HTML scratching their heads. On the other hand, if they see θ in an edit box, they'll know that it's a special character, and treat it accordingly, even if they don't know how to write that character themselves.
Wikipedia is a great case study for an application that originally used ISO-8859-1 but switched to UTF-8 when it became far too cumbersome to support foreign languages. Bots will now actually go through articles and convert character entities to their corresponding real characters for the sake of user-friendliness and searchability. See Meta's page on special characters for more details.
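As a rough illustration of what such a conversion looks like in PHP (my own sketch using the stock mbstring functions, not whatever the Wikipedia bots actually run):

<?php
$text = '激光, 這兩個字是甚麼意思';

// Entity-ize everything above ASCII, as a Latin-1 site would be
// forced to store this text:
$entities = mb_encode_numericentity(
    $text, array(0x80, 0x10FFFF, 0, 0x10FFFF), 'UTF-8'
);

// ...and decode the entities back into real UTF-8 characters,
// which is essentially what the cleanup bots do:
$restored = html_entity_decode($entities, ENT_QUOTES, 'UTF-8');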
While we're on the tack of users, how do non-UTF-8 web forms deal with characters that are outside of their character set? Rather than discuss what UTF-8 does right, we're going to show what could go wrong if you didn't use UTF-8 and people tried to use characters outside of your character encoding.
The troubles are large, extensive, and extremely difficult to fix (or, at least, difficult enough that if you had the time and resources to invest in doing the fix, you would probably be better off migrating to UTF-8). There are two types of form submission: application/x-www-form-urlencoded, which is used for GET and by default for POST, and multipart/form-data, which may be used by POST and is required when you want to upload files.
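In markup, the distinction looks roughly like this (illustrative forms, not taken from the original):

<!-- GET, and POST without an enctype, use
     application/x-www-form-urlencoded: -->
<form method="get" action="search.php">

<!-- File uploads require multipart/form-data: -->
<form method="post" enctype="multipart/form-data" action="upload.php">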
The following is a summary of notes from FORM submission and i18n. That document contains lots of useful information, but is written in a rambling manner, so here I try to get right to the point.
application/x-www-form-urlencoded
This is the Content-Type that GET requests must use, and POST requests use by default. It involves the ubiquitous percent-encoding format that looks something like %C3%86. There is no official way of determining the character encoding of such a request, since the percent encoding operates on a byte level, so it is usually assumed to be the same as the encoding of the page containing the form. You'll run into very few problems if you only use characters in the character encoding you chose.
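To see the ambiguity in action, here is a small sketch of mine (assuming the mbstring extension) showing that %C3%86 is just two bytes whose meaning depends entirely on the encoding you assume:

<?php
$raw = urldecode('%C3%86');  // two raw bytes: 0xC3 0x86

// Read as UTF-8, the bytes form one character:
echo mb_convert_encoding($raw, 'UTF-8', 'UTF-8');        // Æ

// Read as Windows-1252 (and transcoded for display), the very
// same bytes form two characters:
echo mb_convert_encoding($raw, 'UTF-8', 'Windows-1252'); // Ã†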
However, once you start adding characters outside of your encoding (and this is a lot more common than you may think: take curly "smart" quotes from Microsoft as an example), all manner of strange things start to happen. Depending on the browser you're using, it might:

- replace the unsupported characters with useless question marks,
- attempt to "fix" them (for example, turning smart quotes into plain quotes),
- convert them into character entities, or
- do something completely random and unexpected.
To properly guard against these behaviors, you'd have to sniff out the browser agent, compile a database of different behaviors, and take appropriate conversion action against the string (disregarding a spate of extremely mysterious, random and devastating bugs Internet Explorer manifests every once in a while). Or you could use UTF-8 and rest easy knowing that none of this could possibly happen, since UTF-8 supports every character.
multipart/form-data
Multipart form submission takes away a lot of the ambiguity that percent-encoding had: the server can now explicitly ask for certain encodings, and the client can explicitly tell the server during the form submission what encoding the fields are in.
There are two ways you can go with this functionality: leave it unset and have the browser send in the same encoding as the page, or set it to UTF-8 and then do another conversion server-side. Each method has deficiencies, especially the former.
If you tell the browser to send the form in the same encoding as the page, you still have the trouble of what to do with characters that are outside of the character encoding's range. The behavior, once again, varies: Firefox 2.0 entity-izes them, while Internet Explorer 7.0 mangles them beyond intelligibility. For serious I18N purposes, this is not an option.
The other possibility is to set accept-charset to UTF-8, which raises the question: why aren't you using UTF-8 for everything, then? This route is more palatable, but there's a notable caveat: your data will come in as UTF-8, so you will have to explicitly convert it into your favored local character encoding.
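Concretely, that setup might look like the following sketch (the accept-charset attribute is the mechanism assumed above; the field name and target encoding are mine):

<form method="post" accept-charset="UTF-8" action="comment.php">
    <textarea name="comment"></textarea>
</form>

<?php
// comment.php: the field arrives as UTF-8, so a site insisting
// on staying Latin-1 must transcode it on every request.
$comment = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $_POST['comment']);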
I object to this approach on ideological grounds: you're digging yourself deeper into the hole when you could have been converting to UTF-8 instead. And, of course, you can't use this method for GET requests.
FORM submission and i18n by A.J. Flavell discusses the pitfalls of attempting to create an internationalized application without using UTF-8.