From e75b67665697ae17b50f85b7a1f45727980c22d0 Mon Sep 17 00:00:00 2001 From: "Edward Z. Yang" Date: Mon, 15 Jan 2007 19:18:17 +0000 Subject: [PATCH] Done up to Forms. git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@645 48356398-32a2-884e-a903-53898d9a118a --- docs/enduser-utf8.html | 195 +++++++++++++++++++++++++++++++++++++++-- 1 file changed, 186 insertions(+), 9 deletions(-) diff --git a/docs/enduser-utf8.html b/docs/enduser-utf8.html index 8a24f7ac..79db106f 100644 --- a/docs/enduser-utf8.html +++ b/docs/enduser-utf8.html @@ -14,7 +14,8 @@ +to send HTML as ISO-8859-1. So I will, many times, go against my +own advice for sake of portability. --> @@ -76,7 +77,7 @@ there are now many character encodings floating around.

The first step of our journey is to find out what the encoding of -insert-application-here is. The most reliable way is to ask your +your website is. The most reliable way is to ask your browser:

@@ -246,12 +247,31 @@ similar things in other languages. The appropriate code is:

-
header('Content-Type:text/html; charset=UTF-8');
+
header('Content-Type:text/html; charset=UTF-8');

...replacing UTF-8 with whatever your embedded encoding is. This code must come before any output, so be careful about stray whitespace in your application.
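As a rough sketch of the "before any output" rule (the headers_sent() check is only an illustrative precaution, not something the original example requires):

<?php
// Send the charset header first thing, before the script produces any output.
if (!headers_sent($file, $line)) {
    header('Content-Type: text/html; charset=UTF-8');
} else {
    // Stray whitespace or a BOM before this point already started output,
    // so the header can no longer be changed.
    error_log("Could not set charset; output started at $file:$line");
}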

+

PHP ini directive

+ +

PHP also has a neat little ini directive that can save you a +header call: default_charset. Using this code:

+ +
ini_set('default_charset', 'UTF-8');
+ +

...will also do the trick. If PHP is running as an Apache module (and not as FastCGI; consult phpinfo() for details), you can even use a .htaccess file to apply this setting globally:

+ +
php_value default_charset "UTF-8"
+ +

As with all INI directives, this can also go in your php.ini file. Some hosting providers allow you to customize your own php.ini file; ask your support staff for details. Use:

+
default_charset = "utf-8"
+

Non-PHP

You may, for whatever reason, need to set the character encoding @@ -416,9 +436,170 @@ much to the chagrin of HTML authors who can't set these headers.

Why UTF-8?

-

So, you've gone through all the trouble of ensuring that...

+

So, you've gone through all the trouble of ensuring that your server and embedded character encodings all line up properly and are present. Good job: at this point, you could quit and rest easy knowing that your pages are not vulnerable to character-encoding-based XSS attacks. However, just as having a character encoding is better than having no character encoding at all, having UTF-8 as your character encoding is better than having some other random character encoding, and the next step is to convert to UTF-8. But why?

-

Needs completion!

+

Internationalization

+ +

Many software projects, at one point or another, suddenly realize that they should be supporting more than one language. Even regular usage in one language sometimes requires the occasional special character that, unsurprisingly, is not available in your character set. Sometimes developers get around this by adding support for multiple encodings: when using Chinese, use Big5; when using Japanese, use Shift-JIS; when using Greek, use yet another encoding; and so on. Other times, they use character entities with great zeal.

+ +

UTF-8, however, obviates the need for any of these complicated measures. After getting the system to use UTF-8 and adjusting for sources that are outside the browser's control (more on this later), UTF-8 just works. You can use it for any language, even many languages at once; you don't have to worry about managing multiple encodings, and you don't have to use those user-unfriendly entities.

+ +

User-friendly

+ +

Websites encoded in Latin-1 (ISO-8859-1) which occasionally need a special character outside of their scope will often use a character entity to achieve the desired effect. For instance, θ can be written as the entity &theta;, regardless of the character encoding's support of Greek letters.

+ +

This works nicely for limited use of special characters, but +say you wanted this sentence of Chinese text: 激光, +這兩個字是甚麼意思. +The entity-ized version would look like this:

+ +
&#28608;&#20809;, &#36889;&#20841;&#20491;&#23383;&#26159;&#29978;&#40636;&#24847;&#24605;
+ +

Extremely inconvenient for those of us who actually know what character entities are, and totally unintelligible to poor users who don't! Even the slightly more user-friendly, "intelligible" character entities like &theta; will leave users who are uninterested in learning HTML scratching their heads. On the other hand, if they see θ in an edit box, they'll know that it's a special character and treat it accordingly, even if they don't know how to write that character themselves.
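For the curious, here is a rough sketch of how text typically ends up entity-ized in PHP; it assumes the mbstring extension is available and is purely illustrative:

<?php
// Rough sketch: convert raw UTF-8 text into decimal character references.
// Requires the mbstring extension.
$text = '激光, 這兩個字是甚麼意思';
echo mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');
// Outputs: &#28608;&#20809;, &#36889;&#20841;...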

+ +

Wikipedia is a great case study of an application that originally used ISO-8859-1 but switched to UTF-8 when it became far too cumbersome to support foreign languages. Bots now actually go through articles and convert character entities to their corresponding real characters for the sake of user-friendliness and searchability. See Meta's page on special characters for more details.
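The conversion those bots perform amounts to something like the following sketch (the sample string is hypothetical):

<?php
// Rough sketch: fold named and numeric character references back into
// real UTF-8 characters, much like the cleanup bots described above.
$wikitext = 'The angle &theta; is about &#960;/4 radians.';
echo html_entity_decode($wikitext, ENT_QUOTES, 'UTF-8');
// Outputs: The angle θ is about π/4 radians.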

+ +

Forms

+ +

While we're on the subject of users, how do non-UTF-8 web forms deal with characters that are outside of their character set? Rather than discuss what UTF-8 does right, we're going to show what could go wrong if you didn't use UTF-8 and people tried to use characters outside of your character encoding.

+ +

The troubles are large, extensive, and extremely difficult to fix (or, at least, difficult enough that if you had the time and resources to invest in doing the fix, you would probably be better off migrating to UTF-8). There are two types of form submission: application/x-www-form-urlencoded, which is used for GET and by default for POST, and multipart/form-data, which may be used for POST and is required when you want to upload files.

+ +

The following is a summary of notes from FORM submission and i18n. That document contains lots of useful information, but it is written in a rambling manner, so here I try to get right to the point.

+ +

application/x-www-form-urlencoded

+ +

This is the Content-Type that GET requests must use, and that POST requests use by default. It involves the ubiquitous percent-encoding format, which looks something like %C3%86. There is no official way of determining the character encoding of such a request, since percent encoding operates on the byte level, so it is usually assumed to be the same as the encoding the page containing the form was served in. You'll run into very few problems if you only use characters in the character encoding you chose.
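A short sketch of that ambiguity, assuming the mbstring extension: the byte pair %C3%86 is Æ when read as UTF-8, but something quite different when read as Windows-1252.

<?php
// Rough sketch: the same percent-encoded bytes decode to different text
// depending on which character encoding you assume they were sent in.
$bytes = urldecode('%C3%86');                              // bytes 0xC3 0x86
echo mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');        // Æ
echo mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252'); // Ã†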

+ +

However, once you start adding characters outside of your encoding (and this is a lot more common than you may think: take curly "smart" quotes from Microsoft as an example), all manner of strange things start to happen. Depending on the browser you're using, they might:

+ + + +

To properly guard against these behaviors, you'd have to sniff out the user agent, compile a database of different behaviors, and take appropriate conversion action against the string (disregarding a spate of extremely mysterious, random, and devastating bugs Internet Explorer manifests every once in a while). Or you could use UTF-8 and rest easy knowing that none of this could possibly happen, since UTF-8 supports every character.
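If you do standardize on UTF-8, the only defensive check left is fairly painless; a rough sketch, with a hypothetical 'comment' field and the mbstring extension assumed:

<?php
// Rough sketch: make sure the submitted bytes really are well-formed UTF-8.
$comment = isset($_POST['comment']) ? $_POST['comment'] : '';
if (!mb_check_encoding($comment, 'UTF-8')) {
    // Malformed sequences mean the client ignored the declared encoding;
    // reject the value rather than letting mangled bytes into the database.
    $comment = '';
}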

+ +

multipart/form-data

+ +

Multipart form submission takes away a lot of the ambiguity that percent encoding had: the server can now explicitly ask for certain encodings, and the client can explicitly tell the server during the form submission what encoding the fields are in.

+ +

There are two ways you can go with this functionality: leave it unset and have the browser send in the same encoding as the page, or set it to UTF-8 and then do another conversion server-side. Each method has deficiencies, especially the former.

+ +

If you tell the browser to send the form in the same encoding as +the page, you still have the trouble of what to do with characters +that are outside of the character encoding's range. The behavior, once +again, varies: Firefox 2.0 entity-izes them while Internet Explorer +7.0 mangles them beyond intelligibility. For serious I18N purposes, +this is not an option.

+ +

The other possibility is to set accept-charset to UTF-8, which raises the question: why aren't you using UTF-8 for everything, then? This route is more palatable, but there's a notable caveat: your data will come in as UTF-8, so you will have to explicitly convert it into your favored local character encoding.
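Concretely, that setup looks something like this sketch; the form field, the use of iconv, and ISO-8859-1 as the local encoding are all just stand-ins for whatever your site actually uses:

<?php
// Rough sketch: ask the browser for UTF-8 explicitly, then convert the
// submitted value back into the site's local encoding on the server.
if (!empty($_POST)) {
    $name  = isset($_POST['name']) ? $_POST['name'] : '';
    $local = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $name);
}
?>
<form method="post" accept-charset="UTF-8">
    <input type="text" name="name" />
    <input type="submit" />
</form>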

+ +

I object to this approach on ideological grounds: you're digging yourself deeper into the hole when you could have been converting to UTF-8 instead. And, of course, you can't use this method for GET requests.

+ +

Well supported

+ +

HTML Purifier

+ +

Migrate to UTF-8

+ +

Text editor

+ +

Configuring your database

+ +

Convert old text

+ +

Byte Order Mark (headers already sent!)

+ +

Dealing with variable width in functions

@@ -436,10 +617,6 @@ a more in-depth look into character sets and encodings.

provides a lot of useful details into the innards of UTF-8, although it may be a little off-putting to people who don't know much about Unicode to begin with. -
  • - FORM submission and i18n by A.J. Flavell, - discusses the pitfalls of attempting to create an internationalized - application without using UTF-8.