From 29c3c21b34c116448c9efd648cc977fa24a28ff6 Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
Date: Sun, 26 Aug 2007 18:20:46 +0000
Subject: [PATCH] [2.1.2] Merge in Brett Zamir's patches.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1397 48356398-32a2-884e-a903-53898d9a118a
---
 NEWS                        |  1 +
 docs/enduser-customize.html | 30 +++++++++++++++-------
 docs/enduser-id.html        |  6 ++---
 docs/enduser-tidy.html      |  8 +++---
 docs/enduser-utf8.html      | 50 ++++++++++++++++++++++---------------
 5 files changed, 59 insertions(+), 36 deletions(-)

diff --git a/NEWS b/NEWS
index 8bd8c3b6..ca3e02eb 100644
--- a/NEWS
+++ b/NEWS
@@ -27,6 +27,7 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
 - Hammer out a bunch of edge-case bugs in the standalone distribution
 - Inclusion reflection removed from URISchemeRegistry; you must manually include any new schema files you wish to use
+- Numerous typo fixes in documentation thanks to Brett Zamir
 . Unit test refactoring for one logical test per test function
 . Config and context parameters in ComplexHarness deprecated: instead, edit the $config and $context member variables

diff --git a/docs/enduser-customize.html b/docs/enduser-customize.html
index 8e9fe1dd..8634021c 100644
--- a/docs/enduser-customize.html
+++ b/docs/enduser-customize.html
@@ -32,7 +32,7 @@
 Before we even write any code, it is paramount to consider whether or not the code we're writing is necessary or not. HTML Purifier, by default, contains a large set of elements and attributes: large enough so that
- any element or attribute in XHTML 1.0 (and its HTML variant)
+ any element or attribute in XHTML 1.0 or 1.1 (and its HTML variants)
 that can be safely used by the general public is implemented.

@@ -76,11 +76,12 @@

XHTML 1.1

- We have not implemented the + As of HTMLPurifier 2.1.0, we have implemented the Ruby module, which defines a set of tags for publishing short annotations for text, used mostly in Japanese - and Chinese school texts. + and Chinese school texts, but applicable for positioning any text (not + limited to translations) above or below other corresponding text.

XHTML 2.0

@@ -492,10 +493,11 @@ $def =& $config->getHTMLDefinition(true);

The (%flow;)* indicates the allowed children of the li tag: li allows any number of flow - elements as its children. In HTML Purifier, we'd write it like - Flow (here's where the content sets we were - discussing earlier come into play). There are three shorthand content models you - can specify: + elements as its children. (The - O allows the closing tag to be + omitted, though in XML this is not allowed.) In HTML Purifier, + we'd write it like Flow (here's where the content sets + we were discussing earlier come into play). There are three shorthand + content models you can specify:

@@ -668,12 +670,22 @@ $def =& $config->getHTMLDefinition(true); Common is a combination of the above-mentioned collections.

+

+ Readers familiar with the modularization may have noticed that the Core + attribute collection differs from that specified by the abstract + modules of the XHTML Modularization 1.1. We believe this section + to be in error, as br permits the use of the style + attribute even though it uses the Core collection, and + the DTD and XML Schemas supplied by W3C support our interpretation. +

+

Attributes

- If you didn't read the previous section on + If you didn't read the earlier section on adding attributes, read it now. The last parameter is simply - array of attribute names to attribute implementations, in the exact + an array of attribute names to attribute implementations, in the exact same format as addAttribute().

diff --git a/docs/enduser-id.html b/docs/enduser-id.html index 8321a0a2..051ae7ca 100644 --- a/docs/enduser-id.html +++ b/docs/enduser-id.html @@ -58,7 +58,7 @@ appear elsewhere on the document. The method is simple:

$config->set('HTML', 'EnableAttrID', true);
 $config->set('Attr', 'IDBlacklist', array(
-    'list', 'of', 'attributes', 'that', 'are', 'forbidden'
+    'list', 'of', 'attribute', 'values', 'that', 'are', 'forbidden'
 ));

That being said, there are some notable drawbacks. First of all, you have to @@ -71,9 +71,9 @@ to possible standards-compliance issues.

Furthermore, this position becomes untenable when a single web page must hold multiple portions of user-submitted content. Since there's obviously no way to find out before-hand what IDs users will use, the blacklist is helpless. -And even since HTML Purifier validates each segment seperately, perhaps doing +And since HTML Purifier validates each segment separately, perhaps doing so at different times, it would be extremely difficult to dynamically update -the blacklist inbetween runs.

+the blacklist in between runs.

Finally, simply destroying the ID is extremely un-userfriendly behavior: after all, they might have simply specified a duplicate ID by accident.

diff --git a/docs/enduser-tidy.html b/docs/enduser-tidy.html index b3f79f60..56c9b288 100644 --- a/docs/enduser-tidy.html +++ b/docs/enduser-tidy.html @@ -22,7 +22,7 @@ out:

This ain't HTML Tidy!

-

Rather, Tidy stands for a cool set of Tidy-inspired in HTML Purifier +

Rather, Tidy stands for a cool set of Tidy-inspired features in HTML Purifier that allows users to submit deprecated elements and attributes and get valid strict markup back. For example:

@@ -33,8 +33,8 @@ valid strict markup back. For example:

<div style="text-align:center;">Centered</div>

...when this particular fix is run on the HTML. This tutorial will give -you down the lowdown of what exactly HTML Purifier will do when Tidy -is on, and how to fine tune this behavior. Once again, you do +you the lowdown of what exactly HTML Purifier will do when Tidy +is on, and how to fine-tune this behavior. Once again, you do not need Tidy installed on your PHP to use these features!

What does it do?

@@ -221,7 +221,7 @@ general syntax:

The lowdown is, quite frankly, HTML Purifier's default settings are probably good enough. The next step is to bump the level up to heavy, -and if that still doesn't satisfy your appetite, do some fine tuning. +and if that still doesn't satisfy your appetite, do some fine-tuning. Other than that, don't worry about it: this all works silently and effectively in the background.

diff --git a/docs/enduser-utf8.html b/docs/enduser-utf8.html index b8cee57d..9933f1dd 100644 --- a/docs/enduser-utf8.html +++ b/docs/enduser-utf8.html @@ -96,7 +96,7 @@ which can be a rewarding (but difficult) task.

Finding the real encoding

In the beginning, there was ASCII, and things were simple. But they -weren't good, for no one could write in Cryllic or Thai. So there +weren't good, for no one could write in Cyrillic or Thai. So there exploded a proliferation of character encodings to remedy the problem by extending the characters ASCII could express. This ridiculously simplified version of the history of character encodings shows us that @@ -138,7 +138,7 @@ browser:

View > Encoding: bulleted item is unofficial name
-

Internet Explorer won't give you the mime (i.e. useful/real) name of the +

Internet Explorer won't give you the MIME (i.e. useful/real) name of the character encoding, so you'll have to look it up using their description. Some common ones:

@@ -216,6 +216,12 @@ if your META tag claims that either:

Fixing the encoding

+

The advice given here is for pages being served as +vanilla text/html. Different practices must be used +for application/xml or application/xhtml+xml; see +W3C's +document on XHTML media types for more information.

+

If your META encoding and your real encoding match, savvy! You can skip this section. If they don't...

@@ -302,7 +308,8 @@ languages. The appropriate code is:

...replacing UTF-8 with whatever your embedded encoding is. This code must come before any output, so be careful about -stray whitespace in your application.

+stray whitespace in your application (i.e., any whitespace before +output excluding whitespace within <?php ?> tags).

PHP ini directive

@@ -313,8 +320,8 @@ header call: default

...will also do the trick. If PHP is running as an Apache module (and not as FastCGI, consult -phpinfo() for details), you can even use htaccess do apply this property -globally:

+phpinfo() for details), you can even use htaccess to apply this property +across many PHP files:

php_value default_charset "UTF-8"
@@ -360,10 +367,11 @@ to send anything at all:

AddDefaultCharset Off
-

...making your META tags the sole source of -character encoding information. In these cases, it is -especially important to make sure you have valid META -tags on your pages and all the text before them is ASCII.

+

...making your internal charset declaration (usually the META tags) +the sole source of character encoding +information. In these cases, it is especially important to make +sure you have valid META tags on your pages and all the +text before them is ASCII.

These directives can also be placed in httpd.conf file for Apache, but @@ -428,28 +436,30 @@ IIS to change character encodings, I'd be grateful.

META tags are the most common source of embedded encodings, but they can also come from somewhere else: XML -processing instructions. They look like:

+Declarations. They look like:

<?xml version="1.0" encoding="UTF-8"?>

...and are most often found in XML documents (including XHTML).

-

For XHTML, this processing instruction theoretically +

For XHTML, this XML Declaration theoretically overrides the META tag. In reality, this happens only when the XHTML is actually served as legit XML and not HTML, which is almost never the case due to Internet Explorer's lack of support for application/xhtml+xml (even though doing so is often -argued to be good practice).

+argued to be good +practice and is required by the XHTML 1.1 specification).

-

For XML, however, this processing instruction is extremely important. +

For XML, however, this XML Declaration is extremely important. Since most webservers are not configured to send charsets for .xml files, this is the only thing a parser has to go on. Furthermore, the default for XML files is UTF-8, which often butts heads with the more common ISO-8859-1 encoding (you see this in garbled RSS feeds).

In short, if you use XHTML and have gone through the -trouble of adding the XML header, make sure it jives -with your META tags and HTTP headers.

+trouble of adding the XML Declaration, make sure it jibes +with your META tags (which should only be present +if served in text/html) and HTTP headers.

Inside the process

@@ -545,7 +555,7 @@ an application that originally used ISO-8859-1 but switched to UTF-8 when it became far too cumbersome to support foreign languages. Bots will now actually go through articles and convert character entities to their corresponding real characters for the sake of user-friendliness -and searcheability. See +and searchability. See Meta's page on special characters for more details.

@@ -609,7 +619,7 @@ since UTF-8 supports every character.

multipart/form-data

-

Multipart form submission takes a way a lot of the ambiguity +

Multipart form submission takes away a lot of the ambiguity that percent-encoding had: the server now can explicitly ask for certain encodings, and the client can explicitly tell the server during the form submission what encoding the fields are in.

@@ -678,7 +688,7 @@ set the encoding correctly using %Core.Encoding):

  • The Encoder will transform the text from ISO 8859-1 to UTF-8 - (note that theta is preserved since it doesn't actually use + (note that theta is preserved here since it doesn't actually use any non-ASCII characters): &theta;
  • The EntityParser will transform all named and numeric character entities to their corresponding raw UTF-8 equivalents: @@ -723,7 +733,7 @@ by the target encoding, but that would require reimplementing iconv with HTML awareness, something I will not do.

    So there: either it's UTF-8 or crippled international support. Your pick! (and I'm -not being sarcastic here: some people could care less about other languages)

    +not being sarcastic here: some people couldn't care less about other languages).

    Migrate to UTF-8

    @@ -985,7 +995,7 @@ and yes, it is variable width. Other traits:

    in different ways. It is beyond the scope of this document to explain what precisely these implications are. PHPWact provides a very good reference document -on what to expect from each functions, although coverage is spotty in +on what to expect from each function, although coverage is spotty in some areas. Their more general notes on character sets are also worth looking at for information on UTF-8. Some rules of thumb