Commit initial draft of UTF-8 document. Incomplete.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@637 48356398-32a2-884e-a903-53898d9a118a
2024-12-22 16:31:53 +00:00 · 2007-01-13 03:58:02 +00:00 · 2007-01-13 03:58:02 +00:00 · 02006d6e64
commit 02006d6e64
parent dcaa374dae
1 changed files with 206 additions and 0 deletions
--- a/docs/enduser-utf8.html
+++ b/docs/enduser-utf8.html
@@ -0,0 +1,206 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
+    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
+<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+<meta name="description" content="Describes the rationale for using UTF-8, the ramifications otherwise, and how to make the switch." />
+<link rel="stylesheet" type="text/css" href="./style.css" />
+<style type="text/css">
+    .minor td {font-style:italic;}
+</style>
+
+<title>UTF-8 - HTML Purifier</title>
+
+</head><body>
+
+<h1>UTF-8</h1>
+
+<div id="filing">Filed under End-User</div>
+<div id="index">Return to the <a href="index.html">index</a>.</div>
+
+<p>Character encoding and character sets, in truth, are not that
+difficult to understand. But if you don't understand them, you are going
+to be caught by surprise by some of HTML Purifier's behavior, namely
+the fact that it operates in UTF-8 and the limitations of the character
+encoding transformations it performs. This document will walk you through
+determining the encoding of your system and how you should handle
+this information.</p>
+
+<blockquote class="aside">Text in this formatting is an <strong>aside</strong>:
+    an interesting tidbit for the curious that is not strictly necessary
+    to complete the tutorial. If you read this text, you'll come out
+    with a greater understanding of the underlying issues.</blockquote>
+
+<h2>Finding the real encoding</h2>
+
+<p>In the beginning, there was ASCII, and things were simple. But they
+weren't good, for no one could write in Cyrillic or Thai. So there
+exploded a proliferation of character encodings to remedy the problem
+by extending the characters ASCII could express. This ridiculously
+simplified version of the history of character encodings shows us that
+there are now many character encodings floating around.</p>
+
+<blockquote class="aside">
+    <p>A <strong>character encoding</strong> tells the computer how to
+    interpret raw zeroes and ones into real characters. It
+    usually does this by pairing numbers with characters.</p>
+    <p>There are many different types of character encodings floating
+    around, but the ones we deal with most frequently are ASCII,
+    8-bit encodings, and Unicode-based encodings.</p>
+    <ul>
+        <li><strong>ASCII</strong> is a 7-bit encoding based on the
+            English alphabet.</li>
+        <li><strong>8-bit encodings</strong> are extensions to ASCII
+            that add a potpourri of useful, non-standard characters
+            like &eacute; and &aelig;. They can only add 128 characters,
+            so they usually support only one script at a time. When you
+            see a page on the web, chances are it's encoded in one
+            of these encodings.</li>
+        <li><strong>Unicode-based encodings</strong> implement the
+            Unicode standard and include UTF-8, UCS-2 and UTF-16.
+            They go beyond 8 bits (UTF-8 and UTF-16 are variable length,
+            while UCS-2 always uses 16 bits), and support almost
+            every language in the world. UTF-8 is gaining traction
+            as the dominant international encoding of the web.</li>
+    </ul>
+</blockquote>
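+
+<blockquote class="aside">
+    <p>As a rough illustration of the point above, here is how a few
+    characters are stored, byte by byte (in hexadecimal), in three
+    common encodings. The characters and encodings shown were chosen
+    only for this example:</p>
+<pre>Character   ISO-8859-1   Windows-1252   UTF-8
+A           41           41             41
+&eacute;           E9           E9             C3 A9
+&euro;           (none)       80             E2 82 AC</pre>
+    <p>The ASCII letter is identical everywhere, but the other two
+    characters are not, which is why a page read with the wrong
+    encoding shows garbled accented characters and symbols.</p>
+</blockquote>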
+
+<p>The first step of our journey is to find out what the encoding of
+<em>insert-application-here</em> is. The most reliable way is to ask your
+browser:</p>
+
+<dl>
+    <dt>Mozilla Firefox</dt>
+    <dd>Tools &gt; Page Info: Encoding</dd>
+    <dt>Internet Explorer</dt>
+    <dd>View &gt; Encoding: the bulleted item is the unofficial name</dd>
+</dl>
+
+<p>Internet Explorer won't give you the MIME (i.e. useful/real) name of the
+character encoding, so you'll have to look it up using their description.
+Some common ones:</p>
+
+<table class="table">
+    <thead><tr>
+        <th>IE's Description</th>
+        <th>MIME Name</th>
+    </tr></thead>
+    <tr><th colspan="2">Windows</th></tr>
+    <tbody>
+        <tr><td>Arabic (Windows)</td><td>Windows-1256</td></tr>
+        <tr><td>Baltic (Windows)</td><td>Windows-1257</td></tr>
+        <tr><td>Central European (Windows)</td><td>Windows-1250</td></tr>
+        <tr><td>Cyrillic (Windows)</td><td>Windows-1251</td></tr>
+        <tr><td>Greek (Windows)</td><td>Windows-1253</td></tr>
+        <tr><td>Hebrew (Windows)</td><td>Windows-1255</td></tr>
+        <tr><td>Thai (Windows)</td><td>TIS-620</td></tr>
+        <tr><td>Turkish (Windows)</td><td>Windows-1254</td></tr>
+        <tr><td>Vietnamese (Windows)</td><td>Windows-1258</td></tr>
+        <tr><td>Western European (Windows)</td><td>Windows-1252</td></tr>
+    </tbody>
+    <tr><th colspan="2">ISO</th></tr>
+    <tbody>
+        <tr><td>Arabic (ISO)</td><td>ISO-8859-6</td></tr>
+        <tr><td>Baltic (ISO)</td><td>ISO-8859-4</td></tr>
+        <tr><td>Central European (ISO)</td><td>ISO-8859-2</td></tr>
+        <tr><td>Cyrillic (ISO)</td><td>ISO-8859-5</td></tr>
+        <tr class="minor"><td>Estonian (ISO)</td><td>ISO-8859-13</td></tr>
+        <tr class="minor"><td>Greek (ISO)</td><td>ISO-8859-7</td></tr>
+        <tr><td>Hebrew (ISO-Logical)</td><td>ISO-8859-8-I</td></tr>
+        <tr><td>Hebrew (ISO-Visual)</td><td>ISO-8859-8</td></tr>
+        <tr class="minor"><td>Latin 9 (ISO)</td><td>ISO-8859-15</td></tr>
+        <tr class="minor"><td>Turkish (ISO)</td><td>ISO-8859-9</td></tr>
+        <tr><td>Western European (ISO)</td><td>ISO-8859-1</td></tr>
+    </tbody>
+    <tr><th colspan="2">Other</th></tr>
+    <tbody>
+        <tr><td>Chinese Simplified (GB18030)</td><td>GB18030</td></tr>
+        <tr><td>Chinese Simplified (GB2312)</td><td>GB2312</td></tr>
+        <tr><td>Chinese Simplified (HZ)</td><td>HZ</td></tr>
+        <tr><td>Chinese Traditional (Big5)</td><td>Big5</td></tr>
+        <tr><td>Japanese (Shift-JIS)</td><td>Shift_JIS</td></tr>
+        <tr><td>Japanese (EUC)</td><td>EUC-JP</td></tr>
+        <tr><td>Korean</td><td>EUC-KR</td></tr>
+        <tr><td>Unicode (UTF-8)</td><td>UTF-8</td></tr>
+    </tbody>
+</table>
+
+<p>Internet Explorer does not recognize some of the more obscure
+character encodings, and having to look up the real names with a table
+is a pain, so I recommend using Mozilla Firefox to find out your
+character encoding.</p>
+
+<h2>Finding the embedded encoding</h2>
+
+<p>At this point, you may be asking, &quot;Didn't we already find out our
+encoding?&quot; Well, as it turns out, there are multiple places where
+a web developer can specify a character encoding, and one such place
+is in a <code>META</code> tag:</p>
+
+<pre>&lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=UTF-8&quot; /&gt;</pre>
+
+<p>You'll find this in the <code>HEAD</code> section of an HTML document.
+The text to the right of <code>charset=</code> is the &quot;claimed&quot;
+encoding: the HTML claims to be this encoding, but whether or not this
+is actually the case depends on other factors. For now, take note
+of which of these is true of your page:</p>
+
+<ol>
+    <li>The character encoding is the same as the one reported by the
+        browser,</li>
+    <li>The character encoding is different from the browser's, or</li>
+    <li>There is no <code>META</code> tag at all! (horror, horror!)</li>
+</ol>
+
+<h2>Fixing the embedded encoding</h2>
+
+<p>If your <code>META</code> encoding and your real encoding match,
+savvy! You can skip this section. If they don't...</p>
+
+<h3>I have no embedded encoding!</h3>
+
+<p>If this is the case, you'll want to add in the appropriate
+<code>META</code> tag to your website. It's as simple as copy-pasting
+the code snippet above and replacing UTF-8 with whatever is the MIME name
+of your real encoding.</p>
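+
+<p>For instance, assuming purely for illustration that your browser
+reported Western European (Windows), the tag would look something
+like this:</p>
+
+<pre>&lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=Windows-1252&quot; /&gt;</pre>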
+
+<blockquote class="aside">
+    <p>For all those skeptics out there, there is a very good reason
+    why the character encoding should be explicitly stated. When the
+    browser isn't told what the character encoding of a text is, it
+    has to guess: and sometimes the guess is wrong. Hackers can manipulate
+    this guess in order to slip XSS past filters and then fool the
+    browser into executing it as active code. A great example of this
+    is the <a href="http://shiflett.org/archive/177">Google UTF-7
+    exploit</a>.</p>
+    <p>You might be able to get away with not specifying a character
+    encoding with the <code>META</code> tag as long as your webserver
+    sends the right Content-Type header, but why risk it?</p>
+</blockquote>
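+
+<blockquote class="aside">
+    <p>For reference, such a header might look like the line below,
+    assuming a UTF-8 page. How you make your webserver send it depends
+    on your setup and is not covered here:</p>
+<pre>Content-Type: text/html; charset=UTF-8</pre>
+</blockquote>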
+
+<h3>Huh? The embedded encoding disagrees!</h3>
+
+<h2>Further Reading</h2>
+
+<p>Many other developers have already discussed the subject of Unicode,
+UTF-8 and internationalization, and I would like to defer to them for
+a more in-depth look into character sets and encodings.</p>
+
+<ul>
+    <li><a href="http://www.joelonsoftware.com/articles/Unicode.html">
+        The Absolute Minimum Every Software Developer Absolutely,
+        Positively Must Know About Unicode and Character Sets
+        (No Excuses!)</a> by Joel Spolsky, provides a <em>very</em>
+        good high-level look at Unicode and character sets in general.</li>
+    <li><a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8 on Wikipedia</a>,
+        provides a lot of useful details into the innards of UTF-8, although
+        it may be a little off-putting to people who don't know much
+        about Unicode to begin with.</li>
+    <li><a href="http://ppewww.physics.gla.ac.uk/~flavell/charset/form-i18n.html">
+        <code>FORM</code> submission and i18n</a> by A.J. Flavell,
+        discusses the pitfalls of attempting to create an internationalized
+        application without using UTF-8.</li>
+</ul>
+
+</body>
+</html>