diff --git a/INSTALL b/INSTALL index 52991e33..0013705c 100644 --- a/INSTALL +++ b/INSTALL @@ -8,6 +8,7 @@ installation GUI, you've come to the wrong place!) The impatient can scroll down to the bottom of this INSTALL document to see the code, but you really should make sure a few things are properly done. +Todo: Convert to using the array syntax for configuration. 1. Compatibility diff --git a/NEWS b/NEWS index 34d9e1c8..b8ae6e24 100644 --- a/NEWS +++ b/NEWS @@ -10,10 +10,14 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier ========================== 1.4.0, unknown release date -(major feature release) +! Implemented list-style-image, URIs now allowed in list-style +! Implemented background-image, background-repeat and background-attachment + CSS properties. background shorthand property HAS NOT been extended + to allow these, and background-position IS NOT implemented yet. +. Implemented AttrDef_CSSURI for url(http://google.com) style declarations -1.3.3, unknown release date, may be dropped -(security/bugfix/minor feature release) +1.3.3, unknown release date, likely to be dropped +! Moved SLOW to docs/enduser-slow.html and added code examples 1.3.2, released 2006-12-25 ! HTMLPurifier object now accepts configuration arrays, no need to manually diff --git a/README b/README index 78e171ad..bfd270d8 100644 --- a/README +++ b/README @@ -1,13 +1,22 @@ README - All about HTMLPurifier + All about HTML Purifier -HTMLPurifier is an HTML filtering solution. It uses a unique combination of -robust whitelists and agressive parsing to ensure that not only are XSS -attacks thwarted, but the resulting HTML is standards compliant. +HTML Purifier is an HTML filtering solution that uses a unique combination +of robust whitelists and agressive parsing to ensure that not only are +XSS attacks thwarted, but the resulting HTML is standards compliant. -See INSTALL on how to use the library. See docs/ for more developer-oriented -documentation as well as some code examples. 
Users of TinyMCE or FCKeditor -may be especially interested in WYSIWYG. +HTML Purifier is oriented towards richly formatted documents from +untrusted sources that require CSS and a full tag-set. This library can +be configured to accept a more restrictive set of tags, but it won't be +as efficient as more bare-bones parsers. It will, however, do the job +right, which may be more important. -HTMLPurifier can be found on the web at: http://hp.jpsband.org/ +Places to go: + +* See INSTALL for a quick installation guide +* See docs/ for developer-oriented documentation, code examples and + an in-depth installation guide. +* See WYSIWYG for information on editors like TinyMCE and FCKeditor + +HTML Purifier can be found on the web at: http://hp.jpsband.org/ diff --git a/SLOW b/SLOW deleted file mode 100644 index bc8616d9..00000000 --- a/SLOW +++ /dev/null @@ -1,40 +0,0 @@ - -SLOW - also known as the HELP ME LIBRARY IS TOO SLOW MY PAGE TAKE TOO LONG LOAD page - -HTML Purifier is a very powerful library. But with power comes great -responsibility, or, at least, longer execution times. Remember, this -library isn't lightly grazing over submitted HTML: it's deconstructing -the whole thing, rigorously checking the parts, and then putting it -back together. - -So, if it so turns out that HTML Purifier is kinda too slow for outbound -filtering, you've got a few options: - -1. Inbound filtering - perform filtering of HTML when it's submitted by the -user. Since the user is already submitting something, an extra half a -second tacked on to the load time probably isn't going to be that huge of -a problem. Then, displaying the content is a simple a manner of outputting -it directly from your database/filesystem. The trouble with this method is -that your user loses the original text, and when doing edits, will be -handling the filtered text. While this may be a good thing, especially if -you're using a WYSIWYG editor, it can also result in data-loss if a user -makes a typo. - -2. 
Caching the filtered output - accept the submitted text and put it -unaltered into the database, but then also generate a filtered version and -stash that in the database. Serve the filtered version to readers, and the -unaltered version to editors. If need be, you can invalidate the cache and -have the cached filtered version be regenerated on the first page view. Pros? -Full data retention. Cons? It's more complicated, and opens other editors -up to XSS if they are using a WYSIWYG editor (to fix that, they'd have to -be able to get their hands on the *really* original text served in plaintext -mode). - -In short, inbound filtering is almost as simple as outbound filtering, but -it has some drawbacks which cannot be fixed unless you save both the original -and the filtered versions. - -There is a third option: profile and optimize HTMLPurifier yourself. Be sure -to report back your results if you decide to do that! Especially if you -port HTML Purifier to C++. ;-) diff --git a/WYSIWYG b/WYSIWYG index 6fab8bcc..718f8959 100644 --- a/WYSIWYG +++ b/WYSIWYG @@ -18,4 +18,5 @@ HTML Purifier is perfect for filtering pure-HTML input from WYSIWYG editors. Enough said. There is a proof-of-concept integration of HTML Purifier with the Mantis -bugtracker at http://hp.jpsband.org/mantis/ +bugtracker at http://hp.jpsband.org/mantis/ You can see notes on how +this integration was acheived at http://hp.jpsband.org/mantis_notes.txt diff --git a/docs/dev-progress.html b/docs/dev-progress.html index 78cd56fc..0262f170 100644 --- a/docs/dev-progress.html +++ b/docs/dev-progress.html @@ -59,7 +59,7 @@ thead th {text-align:left;padding:0.1em;background-color:#EEE;}
HTML Purifier is a very powerful library. But with power comes great +responsibility, in the form of longer execution times. Remember, this +library isn't lightly grazing over submitted HTML: it's deconstructing +the whole thing, rigorously checking the parts, and then putting it back +together.
+ +So, if it so turns out that HTML Purifier is kinda too slow for outbound +filtering, you've got a few options:
+ +Perform filtering of HTML when it's submitted by the user. Since the +user is already submitting something, an extra half a second tacked on +to the load time probably isn't going to be that huge of a problem. +Then, displaying the content is simply a matter of outputting it +directly from your database/filesystem. The trouble with this method is +that your user loses the original text, and when doing edits, will be +handling the filtered text. While this may be a good thing, especially +if you're using a WYSIWYG editor, it can also result in data loss if a +user makes a typo.
+ +Example (non-functional):
+
+<?php
+    /**
+     * FORM SUBMISSION PAGE
+     * display_error($message) : displays nice error page with message
+     * display_success() : displays a nice success page
+     * display_form() : displays the HTML submission form
+     * database_insert($html) : inserts data into database as new row
+     */
+    if (!empty($_POST)) {
+        require_once '/path/to/library/HTMLPurifier.auto.php';
+        require_once 'HTMLPurifier.func.php';
+        $dirty_html = isset($_POST['html']) ? $_POST['html'] : false;
+        if (!$dirty_html) {
+            display_error('You must write some HTML!');
+            exit; // stop here so an empty submission is never stored
+        }
+        $html = HTMLPurifier($dirty_html);
+        database_insert($html);
+        display_success();
+        // notice that $dirty_html is *not* saved
+    } else {
+        display_form();
+    }
+?>
+
+
Accept the submitted text and put it unaltered into the database, but +then also generate a filtered version and stash that in the database. +Serve the filtered version to readers, and the unaltered version to +editors. If need be, you can invalidate the cache and have the cached +filtered version be regenerated on the first page view. Pros? Full data +retention. Cons? It's more complicated, and opens other editors up to +XSS if they are using a WYSIWYG editor (to fix that, they'd have to be +able to get their hands on the *really* original text served in +plaintext mode).
+ +Example (non-functional):
+
+<?php
+    /**
+     * VIEW PAGE
+     * display_error($message) : displays nice error page with message
+     * cache_get($id) : retrieves HTML from fast cache (db or file)
+     * cache_insert($id, $html) : inserts good HTML into cache system
+     * database_get($id) : retrieves raw HTML from database
+     */
+    $id = isset($_GET['id']) ? (int) $_GET['id'] : false;
+    if (!$id) {
+        display_error('Must specify ID.');
+        exit;
+    }
+    $html = cache_get($id); // filesystem or database
+    if ($html === false) {
+        // cache didn't have the HTML, generate it
+        $raw_html = database_get($id);
+        require_once '/path/to/library/HTMLPurifier.auto.php';
+        require_once 'HTMLPurifier.func.php';
+        $html = HTMLPurifier($raw_html);
+        cache_insert($id, $html);
+    }
+    echo $html;
+?>
+
+
In short, inbound filtering is the simple option and caching is the +robust option (albeit with bigger storage requirements).
+ +There is a third option, independent of the two we've discussed: profile +and optimize HTMLPurifier yourself. Be sure to report back your results +if you decide to do that! Especially if you port HTML Purifier to C++. +;-)
+ + + \ No newline at end of file diff --git a/docs/enduser-utf8.html b/docs/enduser-utf8.html new file mode 100644 index 00000000..2b8338f4 --- /dev/null +++ b/docs/enduser-utf8.html @@ -0,0 +1,623 @@ + + + + + + + + + +Character encoding and character sets, in truth, are not that +difficult to understand. But if you don't understand them, you are going +to be caught by surprise by some of HTML Purifier's behavior, namely +the fact that it operates in UTF-8 or the limitations of the character +encoding transformations it does. This document will walk you through +determining the encoding of your system and how you should handle +this information. It will stay away from excessive discussion on +the internals of character encoding, but offer the information in +asides that can easily be skipped.
+ +++ +Asides+Text in this formatting is an aside, + interesting tidbits for the curious but not strictly necessary material to + do the tutorial. If you read this text, you'll come out + with a greater understanding of the underlying issues.
+
In the beginning, there was ASCII, and things were simple. But they +weren't good, for no one could write in Cyrillic or Thai. So there +exploded a proliferation of character encodings to remedy the problem +by extending the characters ASCII could express. This ridiculously +simplified version of the history of character encodings shows us that +there are now many character encodings floating around.
+ +++ +A character encoding tells the computer how to + interpret raw zeroes and ones into real characters. It + usually does this by pairing numbers with characters.
+There are many different types of character encodings floating + around, but the ones we deal most frequently with are ASCII, + 8-bit encodings, and Unicode-based encodings.
++
+- ASCII is a 7-bit encoding based on the + English alphabet.
+- 8-bit encodings are extensions to ASCII + that add a potpourri of useful, non-standard characters + like é and æ. They can only add 128 characters, + so usually only support one script at a time. When you + see a page on the web, chances are it's encoded in one + of these encodings.
+- Unicode-based encodings implement the + Unicode standard and include UTF-8, UCS-2 and UTF-16. + They go beyond 8 bits (UTF-8 and UTF-16 are variable length, + while UCS-2 uses a fixed 16 bits), and support almost + every language in the world. UTF-8 is gaining traction + as the dominant international encoding of the web.
+
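To make the difference between these families concrete, here is a minimal sketch that converts the same character, é, into three encodings and counts the resulting bytes. It assumes PHP's iconv extension is loaded; the helper name `encoded_length` is made up for this illustration and is not part of HTML Purifier.

```php
<?php
// Illustrative helper (not part of HTML Purifier): number of bytes a
// UTF-8 string occupies once converted to another encoding.
// Assumes the iconv extension is loaded.
function encoded_length($utf8_text, $target_encoding) {
    $converted = iconv('UTF-8', $target_encoding, $utf8_text);
    if ($converted === false) {
        return false; // conversion failed, e.g. character not representable
    }
    return strlen($converted); // strlen() counts bytes, not characters
}

$e_acute = "\xC3\xA9"; // "é" written out as its two UTF-8 bytes

echo encoded_length($e_acute, 'UTF-8'), "\n";      // 2 bytes
echo encoded_length($e_acute, 'ISO-8859-1'), "\n"; // 1 byte
echo encoded_length($e_acute, 'UTF-16BE'), "\n";   // 2 bytes
```

The same character costs one byte in an 8-bit encoding and two in either Unicode encoding shown, which is why the bytes alone cannot tell you which encoding a page is in.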
The first step of our journey is to find out what the encoding of +your website is. The most reliable way is to ask your +browser:
+ +Internet Explorer won't give you the mime (i.e. useful/real) name of the +character encoding, so you'll have to look it up using their description. +Some common ones:
+ +IE's Description | +Mime Name | +
---|---|
Windows | |
Arabic (Windows) | Windows-1256 |
Baltic (Windows) | Windows-1257 |
Central European (Windows) | Windows-1250 |
Cyrillic (Windows) | Windows-1251 |
Greek (Windows) | Windows-1253 |
Hebrew (Windows) | Windows-1255 |
Thai (Windows) | TIS-620 |
Turkish (Windows) | Windows-1254 |
Vietnamese (Windows) | Windows-1258 |
Western European (Windows) | Windows-1252 |
ISO | |
Arabic (ISO) | ISO-8859-6 |
Baltic (ISO) | ISO-8859-4 |
Central European (ISO) | ISO-8859-2 |
Cyrillic (ISO) | ISO-8859-5 |
Estonian (ISO) | ISO-8859-13 |
Greek (ISO) | ISO-8859-7 |
Hebrew (ISO-Logical) | ISO-8859-8-I |
Hebrew (ISO-Visual) | ISO-8859-8 |
Latin 9 (ISO) | ISO-8859-15 |
Turkish (ISO) | ISO-8859-9 |
Western European (ISO) | ISO-8859-1 |
Other | |
Chinese Simplified (GB18030) | GB18030 |
Chinese Simplified (GB2312) | GB2312 |
Chinese Simplified (HZ) | HZ |
Chinese Traditional (Big5) | Big5 |
Japanese (Shift-JIS) | Shift_JIS |
Japanese (EUC) | EUC-JP |
Korean | EUC-KR |
Unicode (UTF-8) | UTF-8 |
Internet Explorer does not recognize some of the more obscure +character encodings, and having to look up the real names with a table +is a pain, so I recommend using Mozilla Firefox to find out your +character encoding.
+ +At this point, you may be asking, "Didn't we already find out our
+encoding?" Well, as it turns out, there are multiple places where
+a web developer can specify a character encoding, and one such place
+is in a META
tag:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />+ +
You'll find this in the HEAD
section of an HTML document.
+The text to the right of charset=
is the "claimed"
+encoding: the HTML claims to be this encoding, but whether or not this
+is actually the case depends on other factors. For now, take note
+if your META
tag claims that either:
META
tag at all! (horror, horror!)If your META
encoding and your real encoding match,
+savvy! You can skip this section. If they don't...
If this is the case, you'll want to add in the appropriate
+META
tag to your website. It's as simple as copy-pasting
+the code snippet above and replacing UTF-8 with whatever is the mime name
+of your real encoding.
++ +For all those skeptics out there, there is a very good reason + why the character encoding should be explicitly stated. When the + browser isn't told what the character encoding of a text is, it + has to guess: and sometimes the guess is wrong. Hackers can manipulate + this guess in order to slip XSS past filters and then fool the + browser into executing it as active code. A great example of this + is the Google UTF-7 + exploit.
+You might be able to get away with not specifying a character + encoding with the
+META
tag as long as your webserver + sends the right Content-Type header, but why risk it? Besides, if + the user downloads the HTML file, there is no longer any webserver + to define the character encoding.
This is an extremely common mistake: another source is telling +the browser what the +character encoding is and is overriding the embedded encoding. This +source usually is the Content-Type HTTP header that the webserver (e.g. +Apache) sends. A usual Content-Type header sent with a page might +look like this:
+ +Content-Type: text/html; charset=ISO-8859-1+ +
Notice how there is a charset parameter: this is the webserver's
+way of telling a browser what the character encoding is, much like
+the META
tags we touched upon previously.
+ +In fact, the
META
tag is +designed as a substitute for the HTTP header for contexts where +sending headers is impossible (such as locally stored files without +a webserver). Thus the namehttp-equiv
(HTTP equivalent). +
There are two ways to go about fixing this: changing the META
+tag to match the HTTP header, or changing the HTTP header to match
+the META
tag. How do we know which to do? It depends
+on the website's content: after all, headers and tags are only ways of
+describing the actual characters on the web page.
If your website:
+ +Changing a META tag is easy: just swap out the old encoding +for the new. Changing the server (HTTP header) encoding, however, +is slightly more difficult.
+ +The simplest way to handle this problem is to send the encoding +yourself, via your programming language. Since you're using HTML +Purifier, I'll assume PHP, although it's not too difficult to do +similar things in +other +languages. The appropriate code is:
+ +header('Content-Type:text/html; charset=UTF-8');+ +
...replacing UTF-8 with whatever your embedded encoding is. +This code must come before any output, so be careful about +stray whitespace in your application.
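One way to keep the HTTP header and the META tag from drifting apart is to build both from a single value. The following is only a sketch: `charset_declarations` is a hypothetical helper, and the character check on the charset name is a conservative assumption rather than a rule taken from any specification.

```php
<?php
// Hypothetical helper: build the HTTP header line and the matching META
// tag from one charset name, so the two declarations cannot disagree.
function charset_declarations($charset = 'UTF-8') {
    // Conservative assumption: accept only characters that appear in
    // common charset names, which also keeps stray markup out.
    if (!preg_match('/^[A-Za-z0-9][A-Za-z0-9._:-]*$/', $charset)) {
        return false;
    }
    $value = 'text/html; charset=' . $charset;
    return array(
        'header' => 'Content-Type: ' . $value,
        'meta'   => '<meta http-equiv="Content-Type" content="' . $value . '" />',
    );
}

$decl = charset_declarations('UTF-8');
if ($decl !== false && !headers_sent()) {
    header($decl['header']); // must run before any output
}
// Later, inside the HEAD section of the page: echo $decl['meta'];
```

The `headers_sent()` guard reflects the warning above: once any output (even stray whitespace) has gone out, the header can no longer be set.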
+ +PHP also has a neat little ini directive that can save you a
+header call: default_charset
. Using this code:
ini_set('default_charset', 'UTF-8');+ +
...will also do the trick. If PHP is running as an Apache module (and +not as FastCGI, consult +phpinfo() for details), you can even use htaccess to apply this property +globally:
+ +php_value default_charset "UTF-8"+ +
+ +As with all INI directives, this can +also go in your php.ini file. Some hosting providers allow you to customize +your own php.ini file; ask your host's support for details. Use:
+default_charset = "utf-8"
You may, for whatever reason, need to set the character encoding +on non-PHP files, usually plain ol' HTML files. Doing this +is more of a hit-or-miss process: depending on the software being +used as a webserver and the configuration of that software, certain +techniques may work, or may not work.
+ +On Apache, you can use an .htaccess file to change the character +encoding. I'll defer to +W3C +for the in-depth explanation, but it boils down to creating a file +named .htaccess with the contents:
+ +AddCharset UTF-8 .html+ +
Where UTF-8 is replaced with the character encoding you want to +use and .html is a file extension that this will be applied to. This +character encoding will then be set for any file directly in +or in the subdirectories of the directory you place this file in.
+ +If you're feeling particularly courageous, you can use:
+ +AddDefaultCharset UTF-8+ +
...which changes the character set Apache adds to any document that
+doesn't have any Content-Type parameters. This directive, which the
+default configuration file sets to iso-8859-1 for security
+reasons, is probably why your headers mismatch
+with the META
tag. If you would prefer Apache not to be
+butting in on your character encodings, you can tell it not
+to send anything at all:
AddDefaultCharset Off+ +
...making your META
tags the sole source of
+character encoding information. In these cases, it is
+especially important to make sure you have valid META
+tags on your pages and all the text before them is ASCII.
+ +These directives can also be +placed in httpd.conf file for Apache, but +in most shared hosting situations you won't be able to edit this file. +
If you're not allowed to use .htaccess files, you can often +piggy-back off of Apache's default AddCharset declarations to get +your files in the proper extension. Here are Apache's default +character set declarations:
+ +Charset | +File extension(s) | +
---|---|
ISO-8859-1 | .iso8859-1 .latin1 |
ISO-8859-2 | .iso8859-2 .latin2 .cen |
ISO-8859-3 | .iso8859-3 .latin3 |
ISO-8859-4 | .iso8859-4 .latin4 |
ISO-8859-5 | .iso8859-5 .latin5 .cyr .iso-ru |
ISO-8859-6 | .iso8859-6 .latin6 .arb |
ISO-8859-7 | .iso8859-7 .latin7 .grk |
ISO-8859-8 | .iso8859-8 .latin8 .heb |
ISO-8859-9 | .iso8859-9 .latin9 .trk |
ISO-2022-JP | .iso2022-jp .jis |
ISO-2022-KR | .iso2022-kr .kis |
ISO-2022-CN | .iso2022-cn .cis |
Big5 | .Big5 .big5 .b5 |
WINDOWS-1251 | .cp-1251 .win-1251 |
CP866 | .cp866 |
KOI8-r | .koi8-r .koi8-ru |
KOI8-ru | .koi8-uk .ua |
ISO-10646-UCS-2 | .ucs2 |
ISO-10646-UCS-4 | .ucs4 |
UTF-8 | .utf8 |
GB2312 | .gb2312 .gb |
utf-7 | .utf7 |
EUC-TW | .euc-tw |
EUC-JP | .euc-jp |
EUC-KR | .euc-kr |
shift_jis | .sjis |
So, for example, a file named page.utf8.html
or
+page.html.utf8
will probably be sent with the UTF-8 charset
+attached, the difference being that if there is an
+AddCharset charset .html
declaration, it will override
+the .utf8 extension in page.utf8.html
(precedence moves
+from right to left). By default, Apache has no such declaration.
If anyone can contribute information on how to configure Microsoft +IIS to change character encodings, I'd be grateful.
+ +META
tags are the most common source of embedded
+encodings, but they can also come from somewhere else: XML
+processing instructions. They look like:
<?xml version="1.0" encoding="UTF-8"?>+ +
...and are most often found in XML documents (including XHTML).
+ +For XHTML, this processing instruction theoretically
+overrides the META
tag. In reality, this happens only when the
+XHTML is actually served as legit XML and not HTML, which almost
+never happens due to Internet Explorer's lack of support for
+application/xhtml+xml
(even though doing so is often
+argued to be good practice).
For XML, however, this processing instruction is extremely important. +Since most webservers are not configured to send charsets for .xml files, +this is the only thing a parser has to go on. Furthermore, the default +for XML files is UTF-8, which often butts heads with the more common +ISO-8859-1 encoding (you see this in garbled RSS feeds).
+ +In short, if you use XHTML and have gone through the
+trouble of adding the XML header, make sure it jibes
+with your META
tags and HTTP headers.
This section is not required reading, +but may answer some of your questions on what's going on in all +this character encoding hocus pocus. If you're interested in +moving on to the next phase, skip this section.
+ +A logical question that follows all of our wheeling and dealing +with multiple sources of character encodings is "Why are there +so many options?" To answer this question, we have to turn +back to our definition of character encodings: they allow a program +to interpret bytes into human-readable characters.
+ +Thus, a chicken-egg problem: a character encoding
+is necessary to interpret the
+text of a document. A META
tag is in the text of a document.
+The META
tag gives the character encoding. How can we
+determine the contents of a META
tag, inside the text,
+if we don't know its character encoding? And how do we figure out
+the character encoding, if we don't know the contents of the
+META
tag?
Fortunately for us, the characters we need to write the
+META
are in ASCII, which is pretty much universal
+over every character encoding that is in common use today. So,
+all the web-browser has to do is parse all the way down until
+it gets to the Content-Type tag, extract the character encoding
+tag, then re-parse the document according to this new information.
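The first half of that process can be sketched in a few lines. This is a deliberately simplified illustration, not what any real browser does: the function name `sniff_meta_charset`, the 1024-byte scan limit and the regular expression are all assumptions made for the example.

```php
<?php
// Simplified sketch of META sniffing; real browsers follow a far more
// detailed algorithm. It works on raw bytes because the tag is ASCII.
function sniff_meta_charset($bytes, $scan_limit = 1024) {
    $head = substr($bytes, 0, $scan_limit); // only look near the top
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i';
    if (preg_match($pattern, $head, $matches)) {
        return $matches[1];
    }
    return false; // no claimed encoding found
}

$page = '<html><head><meta http-equiv="Content-Type" '
      . 'content="text/html; charset=ISO-8859-1" /></head>'
      . "<body>caf\xE9</body></html>"; // \xE9 is "é" in ISO-8859-1

echo sniff_meta_charset($page), "\n"; // ISO-8859-1
```

Note that the pattern is matched without knowing the document's encoding at all, which only works because every byte of the tag means the same thing in ASCII-compatible encodings.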
Obviously this is complicated, so browsers prefer the simpler +and more efficient solution: get the character encoding from +somewhere other than the document itself, i.e. the HTTP headers, +much to the chagrin of HTML authors who can't set these headers.
+ +So, you've gone through all the trouble of ensuring that your +server and embedded characters all line up properly and are +present. Good job: at +this point, you could quit and rest easy knowing that your pages +are not vulnerable to character encoding style XSS attacks. +However, just as having a character encoding is better than +having no character encoding at all, having UTF-8 as your +character encoding is better than having some other random +character encoding, and the next step is to convert to UTF-8. +But why?
+ +Many software projects, at one point or another, suddenly realize +that they should be supporting more than one language. Even regular +usage in one language sometimes requires the occasional special character +that, without surprise, is not available in your character set. Sometimes +developers get around this by adding support for multiple encodings: when +using Chinese, use Big5, when using Japanese, use Shift-JIS, when +using Greek, etc. Other times, they use character entities with great +zeal.
+ +UTF-8, however, obviates the need for any of these complicated +measures. After getting the system to use UTF-8 and adjusting for +sources that are outside the hand of the browser (more on this later), +UTF-8 just works. You can use it for any language, even many languages +at once, you don't have to worry about managing multiple encodings, +you don't have to use those user-unfriendly entities.
Websites encoded in Latin-1 (ISO-8859-1) which occasionally need
+a special character outside of their scope often will use a character
+entity to achieve the desired effect. For instance, θ can be
+written θ
, regardless of the character encoding's
+support of Greek letters.
This works nicely for limited use of special characters, but +say you wanted this sentence of Chinese text: 激光, +這兩個字是甚麼意思. +The entity-ized version would look like this:
+ +激光, 這兩個字是甚麼意思+ +
Extremely inconvenient for those of us who actually know what
+character entities are, totally unintelligible to poor users who don't!
+Even the slightly more user-friendly, "intelligible" character
+entities like θ
will leave users who are
+uninterested in learning HTML scratching their heads. On the other
+hand, if they see θ in an edit box, they'll know that it's a
+special character, and treat it accordingly, even if they don't know
+how to write that character themselves.
+ +Wikipedia is a great case study for +an application that originally used ISO-8859-1 but switched to UTF-8 +when it became far too cumbersome to support foreign languages. Bots +will now actually go through articles and convert character entities +to their corresponding real characters for the sake of user-friendliness +and searchability. See +Meta's +page on special characters for more details. +
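Converting entities back into real characters is straightforward once the target encoding is UTF-8. The sketch below uses PHP's built-in `html_entity_decode`; the wrapper name `entities_to_utf8` is hypothetical, and this is not a description of how Wikipedia's bots actually work.

```php
<?php
// Illustrative only: decodes every entity, including &lt; and &amp;, so
// apply it to plain text, not to markup you still need to parse.
function entities_to_utf8($text) {
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

echo entities_to_utf8('&theta;'), "\n";          // θ
echo entities_to_utf8('&#28608;&#20809;'), "\n"; // 激光
```

With ISO-8859-1 as the target, neither of these characters could be stored directly, which is exactly why the entities were needed in the first place.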
While we're on the tack of users, how do non-UTF-8 web forms deal +with characters that are outside of their character set? Rather than +discuss what UTF-8 does right, we're going to show what could go wrong +if you didn't use UTF-8 and people tried to use characters outside +of your character encoding.
+ +The troubles are large, extensive, and extremely difficult to fix (or,
+at least, difficult enough that if you had the time and resources to invest
+in doing the fix, you would be probably better off migrating to UTF-8).
+There are two types of form submission: application/x-www-form-urlencoded
+which is used for GET and by default for POST, and multipart/form-data
+which may be used by POST, and is required when you want to upload
+files.
The following is a summarization of notes from
+
+FORM
submission and i18n. That document contains lots
+of useful information, but is written in a rambly manner, so
+here I try to get right to the point.
application/x-www-form-urlencoded
This is the Content-Type that GET requests must use, and POST requests
+use by default. It involves the ubiquitous percent encoding format that
+looks something like: %C3%86
. There is no official way of
+determining the character encoding of such a request, since the percent
+encoding operates on a byte level, so it is usually assumed that it
+is the same as the encoding the page containing the form was submitted
+in. You'll run into very few problems if you only use characters in
+the character encoding you chose.
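The byte-level nature of percent encoding is easy to see with PHP's own `rawurlencode` and `rawurldecode`. In this small sketch the variable names are illustrative, and the character Æ is written out as explicit bytes so the example does not depend on the encoding of the source file.

```php
<?php
// "Æ" as bytes in two different encodings
$ae_utf8   = "\xC3\x86"; // UTF-8: two bytes
$ae_latin1 = "\xC6";     // ISO-8859-1: one byte

echo rawurlencode($ae_utf8), "\n";   // %C3%86
echo rawurlencode($ae_latin1), "\n"; // %C6

// Decoding gives back bytes, not characters: nothing in "%C3%86" says
// whether it is one UTF-8 character or two ISO-8859-1 characters.
echo strlen(rawurldecode('%C3%86')), "\n"; // 2
```

The server receiving `%C3%86` has to already know (or assume) the encoding before those two bytes mean anything.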
However, once you start adding characters outside of your encoding +(and this is a lot more common than you may think: take curly +"smart" quotes from Microsoft as an example), +all manner of strange things start to happen. Depending on the +browser you're using, they might:
+ +To properly guard against these behaviors, you'd have to sniff out +the browser agent, compile a database of different behaviors, and +take appropriate conversion action against the string (disregarding +a spate of extremely mysterious, random and devastating bugs Internet +Explorer manifests every once in a while). Or you could +use UTF-8 and rest easy knowing that none of this could possibly happen +since UTF-8 supports every character.
+ +multipart/form-data
Multipart form submission takes away a lot of the ambiguity +that percent-encoding had: the server now can explicitly ask for +certain encodings, and the client can explicitly tell the server +during the form submission what encoding the fields are in.
+ +There are two ways you can go with this functionality: leave it +unset and have the browser send in the same encoding as the page, +or set it to UTF-8 and then do another conversion server-side. +Each method has deficiencies, especially the former.
+ +If you tell the browser to send the form in the same encoding as +the page, you still have the trouble of what to do with characters +that are outside of the character encoding's range. The behavior, once +again, varies: Firefox 2.0 entity-izes them while Internet Explorer +7.0 mangles them beyond intelligibility. For serious I18N purposes, +this is not an option.
+ +The other possibility is to set the form's accept-charset attribute to UTF-8, which +raises the question: Why aren't you using UTF-8 for everything then? +This route is more palatable, but there's a notable caveat: your data +will come in as UTF-8, so you will have to explicitly convert it into +your favored local character encoding.
+ +I object to this approach on ideological grounds: you're +digging yourself deeper into +the hole when you could have been converting to UTF-8 +instead. And, of course, you can't use this method for GET requests.
+ +Many other developers have already discussed the subject of Unicode, +UTF-8 and internationalization, and I would like to defer to them for +a more in-depth look into character sets and encodings.
+ +This page will allow you to see precisely what HTML Purifier's internal + +
HTML Purifier claims to have a robust yet permissive whitelist: this +page will allow you to see precisely what HTML Purifier's internal whitelist is. You can also twiddle with the configuration settings to see how a directive influences the internal workings of the definition objects.
+You can specify an array by typing in a comma-separated
diff --git a/smoketests/utf8.php b/smoketests/utf8.php
index e5e57857..2d23330b 100644
--- a/smoketests/utf8.php
+++ b/smoketests/utf8.php
@@ -1,5 +1,7 @@ ';
diff --git a/smoketests/xssAttacks.xml b/smoketests/xssAttacks.xml
index dd8a5feb..5b833f8d 100644
--- a/smoketests/xssAttacks.xml
+++ b/smoketests/xssAttacks.xml
@@ -978,8 +978,6 @@ alert(a.source)</SCRIPT>
 -onErrorUpdate() (fires on a databound object when an error occurs while updating the associated data in the data source object)
--onExit() (fires when someone clicks on a link or presses the back button)
-
 -onFilterChange() (fires when a visual filter completes state change)
 -onFinish() (attacker could create the exploit when marquee is finished looping)
diff --git a/tests/Debugger.php b/tests/Debugger.php
index 3213af3c..0bde21bb 100644
--- a/tests/Debugger.php
+++ b/tests/Debugger.php
@@ -70,7 +70,7 @@ class Debugger
         $this->add_pre = !extension_loaded('xdebug');
     }
-    function &instance() {
+    static function &instance() {
         static $soleInstance = false;
         if (!$soleInstance) $soleInstance = new Debugger();
         return $soleInstance;
diff --git a/tests/HTMLPurifier/AttrDef/CSSTest.php b/tests/HTMLPurifier/AttrDef/CSSTest.php
index 7afa2172..cb5e8083 100644
--- a/tests/HTMLPurifier/AttrDef/CSSTest.php
+++ b/tests/HTMLPurifier/AttrDef/CSSTest.php
@@ -1,6 +1,7 @@ assertDef('vertical-align:12px;');
         $this->assertDef('vertical-align:50%;');
         $this->assertDef('table-layout:fixed;');
+        $this->assertDef('list-style-image:url(nice.jpg);');
+        $this->assertDef('list-style:disc url(nice.jpg) inside;');
+        $this->assertDef('background-image:url(foo.jpg);');
+        $this->assertDef('background-image:none;');
+        $this->assertDef('background-repeat:repeat-y;');
+        $this->assertDef('background-attachment:fixed;');
         // duplicates
         $this->assertDef('text-align:right;text-align:left;',
diff --git a/tests/HTMLPurifier/AttrDef/CSSURITest.php b/tests/HTMLPurifier/AttrDef/CSSURITest.php
new file mode 100644
index 00000000..1fe1a3dc
--- /dev/null
+++ b/tests/HTMLPurifier/AttrDef/CSSURITest.php
@@ -0,0 +1,37 @@
+def = new HTMLPurifier_AttrDef_CSSURI();
+
+        $this->assertDef('', false);
+
+        // we could be nice but we won't be
+        $this->assertDef('http://www.example.com/', false);
+
+        // no quotes are used, since that's the most widely supported
+        // syntax
+        $this->assertDef('url(', false);
+        $this->assertDef('url()', true);
+        $result = "url(http://www.example.com/)";
+        $this->assertDef('url(http://www.example.com/)', $result);
+        $this->assertDef('url("http://www.example.com/")', $result);
+        $this->assertDef("url('http://www.example.com/')", $result);
+        $this->assertDef(
+            ' url( "http://www.example.com/" ) ', $result);
+
+        // escaping
+        $this->assertDef("url(http://www.example.com/foo,bar\))",
+            "url(http://www.example.com/foo\,bar\))");
+
+    }
+
+}
+
+?>
\ No newline at end of file
diff --git a/tests/HTMLPurifier/AttrDef/CompositeTest.php b/tests/HTMLPurifier/AttrDef/CompositeTest.php
index 9c49a289..a61db20c 100644
--- a/tests/HTMLPurifier/AttrDef/CompositeTest.php
+++ b/tests/HTMLPurifier/AttrDef/CompositeTest.php
@@ -28,10 +28,10 @@ class HTMLPurifier_AttrDef_CompositeTest extends HTMLPurifier_AttrDefHarness
         // first test: value properly validates on first definition
         // so second def is never called
-        $def1 =& new HTMLPurifier_AttrDefMock($this);
-        $def2 =& new HTMLPurifier_AttrDefMock($this);
+        $def1 = new HTMLPurifier_AttrDefMock($this);
+        $def2 = new HTMLPurifier_AttrDefMock($this);
         $defs = array(&$def1, &$def2);
-        $def =& new HTMLPurifier_AttrDef_Composite_Testable($defs);
+        $def = new HTMLPurifier_AttrDef_Composite_Testable($defs);
         $input = 'FOOBAR';
         $output = 'foobar';
         $def1_params = array($input, $config, $context);
@@ -47,10 +47,10 @@ class HTMLPurifier_AttrDef_CompositeTest extends HTMLPurifier_AttrDefHarness
         // second test, first def fails, second def works
-        $def1 =& new HTMLPurifier_AttrDefMock($this);
-        $def2 =& new HTMLPurifier_AttrDefMock($this);
+        $def1 = new HTMLPurifier_AttrDefMock($this);
+        $def2 = new HTMLPurifier_AttrDefMock($this);
         $defs = array(&$def1, &$def2);
-        $def =& new HTMLPurifier_AttrDef_Composite_Testable($defs);
+        $def = new HTMLPurifier_AttrDef_Composite_Testable($defs);
         $input = 'BOOMA';
         $output = 'booma';
         $def_params = array($input, $config, $context);
@@ -67,10 +67,10 @@ class HTMLPurifier_AttrDef_CompositeTest extends HTMLPurifier_AttrDefHarness
         // third test, all fail, so composite faiils
-        $def1 =& new HTMLPurifier_AttrDefMock($this);
-        $def2 =& new HTMLPurifier_AttrDefMock($this);
+        $def1 = new HTMLPurifier_AttrDefMock($this);
+        $def2 = new HTMLPurifier_AttrDefMock($this);
         $defs = array(&$def1, &$def2);
-        $def =& new HTMLPurifier_AttrDef_Composite_Testable($defs);
+        $def = new HTMLPurifier_AttrDef_Composite_Testable($defs);
         $input = 'BOOMA';
         $output = false;
         $def_params = array($input, $config, $context);
diff --git a/tests/HTMLPurifier/AttrDef/ListStyleTest.php b/tests/HTMLPurifier/AttrDef/ListStyleTest.php
index a12080f8..95ef9444 100644
--- a/tests/HTMLPurifier/AttrDef/ListStyleTest.php
+++ b/tests/HTMLPurifier/AttrDef/ListStyleTest.php
@@ -15,9 +15,20 @@ class HTMLPurifier_AttrDef_ListStyleTest extends HTMLPurifier_AttrDefHarness
         $this->assertDef('circle outside');
         $this->assertDef('inside');
         $this->assertDef('none');
+        $this->assertDef('url(foo.gif)');
+        $this->assertDef('circle url(foo.gif) inside');
+        // invalid values
         $this->assertDef('outside inside', 'outside');
+
+        // ordering
+        $this->assertDef('url(foo.gif) none', 'none url(foo.gif)');
         $this->assertDef('circle lower-alpha', 'circle');
+        // the spec is ambiguous about what happens in these
+        // cases, so we're going off the W3C CSS validator
+        $this->assertDef('disc none', 'disc');
+        $this->assertDef('none disc', 'none');
+
     }
diff --git a/tests/HTMLPurifier/AttrDef/URITest.php b/tests/HTMLPurifier/AttrDef/URITest.php
index a80c436f..f9a9ab41 100644
--- a/tests/HTMLPurifier/AttrDef/URITest.php
+++ b/tests/HTMLPurifier/AttrDef/URITest.php
@@ -206,7 +206,7 @@ class HTMLPurifier_AttrDef_URITest extends HTMLPurifier_AttrDefHarness
         $registry =& HTMLPurifier_URISchemeRegistry::instance($fake_registry);
         // now, let's add a pseudo-scheme to the registry
-        $this->scheme =& new HTMLPurifier_URISchemeMock($this);
+        $this->scheme = new HTMLPurifier_URISchemeMock($this);
         // here are the schemes we will support with overloaded mocks
         $registry->setReturnReference('getScheme', $this->scheme, array('http', $this->config, $this->context));
diff --git a/tests/HTMLPurifier/ContextTest.php b/tests/HTMLPurifier/ContextTest.php
index 68604d5c..88c0f615 100644
--- a/tests/HTMLPurifier/ContextTest.php
+++ b/tests/HTMLPurifier/ContextTest.php
@@ -20,7 +20,7 @@ class HTMLPurifier_ContextTest extends UnitTestCase
         $this->assertFalse($this->context->exists('IDAccumulator'));
-        $accumulator =& new HTMLPurifier_IDAccumulatorMock($this);
+        $accumulator = new HTMLPurifier_IDAccumulatorMock($this);
         $this->context->register('IDAccumulator', $accumulator);
         $this->assertTrue($this->context->exists('IDAccumulator'));
diff --git a/tests/HTMLPurifier/LexerTest.php b/tests/HTMLPurifier/LexerTest.php
index a690466b..26875181 100644
--- a/tests/HTMLPurifier/LexerTest.php
+++ b/tests/HTMLPurifier/LexerTest.php
@@ -16,7 +16,9 @@ class HTMLPurifier_LexerTest extends UnitTestCase
         $this->DirectLex = new HTMLPurifier_Lexer_DirectLex();
-        if ( $GLOBALS['HTMLPurifierTest']['PEAR'] ) {
+        if ( $GLOBALS['HTMLPurifierTest']['PEAR'] &&
+            ((error_reporting() & E_STRICT) != E_STRICT)
+        ) {
             $this->_has_pear = true;
             require_once 'HTMLPurifier/Lexer/PEARSax3.php';
             $this->PEARSax3 = new HTMLPurifier_Lexer_PEARSax3();
diff --git a/tests/index.php b/tests/index.php
index 92c845fe..3f9775aa 100644
--- a/tests/index.php
+++ b/tests/index.php
@@ -1,6 +1,6 @@