From 0ead9558b4b89d9361b6980f43a3dea33b1c1928 Mon Sep 17 00:00:00 2001 From: "Edward Z. Yang" Date: Wed, 24 Jan 2007 01:29:25 +0000 Subject: [PATCH] Finish up to BOM. git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@694 48356398-32a2-884e-a903-53898d9a118a --- docs/enduser-utf8.html | 150 ++++++++++++++++++++++++++++++++++++++++- 1 file changed, 147 insertions(+), 3 deletions(-) diff --git a/docs/enduser-utf8.html b/docs/enduser-utf8.html index fda59c20..cd0dc7ec 100644 --- a/docs/enduser-utf8.html +++ b/docs/enduser-utf8.html @@ -676,18 +676,162 @@ not being sarcastic here: some people could care less about other languages)

Migrate to UTF-8

-

Text editor

+

So, you've decided to bite the bullet, and want to migrate to UTF-8. +Note that this is not for the faint-hearted, and you should expect +the process to take longer than you think it will take.

+ +

The general idea is that you convert all existing text to UTF-8, +and then you set all the headers and META tags we discussed earlier +to UTF-8. There are many ways going about doing this: you could +write a conversion script that runs through the database and re-encodes +everything as UTF-8 or you could do the conversion on the fly when someone +reads the page. The details depend on your system, but I will cover +some of the more subtle points of migration that may trip you up.

Configuring your database

-

Convert old text

+

Most modern databases, the most prominent open-source ones being MySQL +4.1+ and PostgreSQL, support character encodings. If you're switching +to UTF-8, logically speaking, you'd want to make sure your database +knows about the change too. There are some caveats though:

+ +

Legit method

+ +

Standardization in terms of SQL syntax for specifying character +encodings is notoriously spotty. Refer to your respective database's +documentation on how to do this properly.

+ +

For MySQL, ALTER will magically perform the +character encoding conversion for you. However, you have +to make sure that the text inside the column is what is says it is: +if you had put Shift-JIS in an ISO 8859-1 column, MySQL will irreversibly mangle +the text when you try to convert it to UTF-8. You'll have to convert +it to a binary field, convert it to a Shift-JIS field (the real encoding), +and then finally to UTF-8. Many a website had pages irreversibly mangled +because they didn't realize that they'd been deluding themselves about +the character encoding all along, don't become the next victim.

+ +

For PostgreSQL, there appears to be no direct way to change the +encoding of a database (as of 8.2). You will have to dump the data, and then reimport +it into a new table. Make sure that your client encoding is set properly: +this is how PostgreSQL knows to perform an encoding conversion.

+ +

Many times, you will be also asked about the "collation" of +the new column. Collation is how a DBMS sorts text, like ordering +B, C and A into A, B and C (the problem gets surprisingly complicated +when you get to languages like Thai and Japanese). If in doubt, +going with the default setting is usually a safe bet.

+ +

Once the conversion is all said and done, you still have to remember +to set the client encoding (your encoding) properly on each database +connection using SET NAMES (which is standard SQL and is +usually supported).

+ +

Binary

+ +

Due to the abovementioned compatibility issues, a more interoperable +way of storing UTF-8 text is to stuff it in a binary datatype. +CHAR becomes BINARY, VARCHAR becomes +VARBINARY and TEXT becomes BLOB. +Doing so can save you some huge headaches:

+ + + +

MediaWiki, a very prominent I18N application, uses binary fields +for storing their data because of point three.

+ +

There are drawbacks, of course:

+ + + +

Choose based on your circumstances.

+ +

Text editor

+ +

For more flat-file oriented systems, you will often be tasked with +converting reams of existing text and HTML files into UTF-8, as well as +making sure that all new files uploaded are properly encoded. Once again, +I can only point vaguely in the right direction for converting your +existing files: make sure you backup, make sure you use +iconv(), and +make sure you know what the original character encoding of the files +is (or are, depending on the tidiness of your system).

+ +

However, I can proffer more specific advice on the subject of +text editors. Many text editors have notoriously spotty Unicode support. +To find out how your editor is doing, you can check out this list +or Wikipedia's list. +I personally use Notepad++, which works like a charm when it comes to UTF-8. +You will usually have to explicitly tell the editor through some dialogue +(usually Save as or Format) what encoding you want it to use. An editor +will often offer "Unicode" as a method of saving, which is +ambiguous. Make sure you know whether or not they really mean UTF-8 +or UTF-16 (which is another flavor of Unicode).

+ +

The two things to look out for are whether or not the editor +supports font mixing (multiple +fonts in one document) and whether or not it adds a BOM. +Font mixing is important because fonts rarely have support for every +language known to mankind: in order to be flexible, an editor must +be able to take a little from here and a little from there, otherwise +all your Chinese characters will come as nice boxes. We'll discuss +BOM below.

Byte Order Mark (headers already sent!)

-

Dealing with variable width in functions

+

The BOM, or Byte +Order Mark, is a magical, invisible character placed at +the beginning of UTF-8 files to tell people what the encoding is and +what the endianness of the text is. It is also unnecessary.

+ +

Because it's invisible, it often +catches people by surprise when it starts doing things it shouldn't +be doing. For example, this PHP file:

+ +
BOM<?php
+header('Location: index.php');
+?>
+ +

...will fail with the all too familiar Headers already sent +PHP error. And because the BOM is invisible, this culprit will go unnoticed. +My suggestion is to only use ASCII in PHP pages, but if you must, make +sure the page is saved WITHOUT the BOM.

+ +
+

The headers the error is referring to are HTTP headers, + which are sent to the browser before any HTML to tell it various + information. The moment any regular text (and yes, a BOM counts as + ordinary text) is output, the headers must be sent, and you are + not allowed to send anymore. Thus, the error.

+
+ +

If you are reading in text files to insert into the middle of another +page, it is strongly advised that you replace out the UTF-8 byte +sequence for BOM "\xEF\xBB\xBF" before inserting it in.

Fonts

+

Dealing with variable width in functions

+

Many other developers have already discussed the subject of Unicode,