diff --git a/NEWS b/NEWS
index 34d9e1c8..9ef2fe26 100644
--- a/NEWS
+++ b/NEWS
@@ -13,7 +13,7 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
 (major feature release)
 
 1.3.3, unknown release date, may be dropped
-(security/bugfix/minor feature release)
+! Moved SLOW to docs/enduser-slow.html and added code examples
 
 1.3.2, released 2006-12-25
 ! HTMLPurifier object now accepts configuration arrays, no need to manually
diff --git a/README b/README
index 78e171ad..bfd270d8 100644
--- a/README
+++ b/README
@@ -1,13 +1,22 @@
 
 README
- All about HTMLPurifier
+ All about HTML Purifier
 
-HTMLPurifier is an HTML filtering solution. It uses a unique combination of
-robust whitelists and agressive parsing to ensure that not only are XSS
-attacks thwarted, but the resulting HTML is standards compliant.
+HTML Purifier is an HTML filtering solution that uses a unique combination
+of robust whitelists and agressive parsing to ensure that not only are
+XSS attacks thwarted, but the resulting HTML is standards compliant.
 
-See INSTALL on how to use the library. See docs/ for more developer-oriented
-documentation as well as some code examples. Users of TinyMCE or FCKeditor
-may be especially interested in WYSIWYG.
+HTML Purifier is oriented towards richly formatted documents from
+untrusted sources that require CSS and a full tag-set. This library can
+be configured to accept a more restrictive set of tags, but it won't be
+as efficient as more bare-bones parsers. It will, however, do the job
+right, which may be more important.
 
-HTMLPurifier can be found on the web at: http://hp.jpsband.org/
+Places to go:
+
+* See INSTALL for a quick installation guide
+* See docs/ for developer-oriented documentation, code examples and
+ an in-depth installation guide.
+* See WYSIWYG for information on editors like TinyMCE and FCKeditor
+
+HTML Purifier can be found on the web at: http://hp.jpsband.org/
diff --git a/SLOW b/SLOW
deleted file mode 100644
index bc8616d9..00000000
--- a/SLOW
+++ /dev/null
@@ -1,40 +0,0 @@
-
-SLOW
- also known as the HELP ME LIBRARY IS TOO SLOW MY PAGE TAKE TOO LONG LOAD page
-
-HTML Purifier is a very powerful library. But with power comes great
-responsibility, or, at least, longer execution times. Remember, this
-library isn't lightly grazing over submitted HTML: it's deconstructing
-the whole thing, rigorously checking the parts, and then putting it
-back together.
-
-So, if it so turns out that HTML Purifier is kinda too slow for outbound
-filtering, you've got a few options:
-
-1. Inbound filtering - perform filtering of HTML when it's submitted by the
-user. Since the user is already submitting something, an extra half a
-second tacked on to the load time probably isn't going to be that huge of
-a problem. Then, displaying the content is a simple a manner of outputting
-it directly from your database/filesystem. The trouble with this method is
-that your user loses the original text, and when doing edits, will be
-handling the filtered text. While this may be a good thing, especially if
-you're using a WYSIWYG editor, it can also result in data-loss if a user
-makes a typo.
-
-2. Caching the filtered output - accept the submitted text and put it
-unaltered into the database, but then also generate a filtered version and
-stash that in the database. Serve the filtered version to readers, and the
-unaltered version to editors. If need be, you can invalidate the cache and
-have the cached filtered version be regenerated on the first page view. Pros?
-Full data retention. Cons? It's more complicated, and opens other editors
-up to XSS if they are using a WYSIWYG editor (to fix that, they'd have to
-be able to get their hands on the *really* original text served in plaintext
-mode).
-
-In short, inbound filtering is almost as simple as outbound filtering, but
-it has some drawbacks which cannot be fixed unless you save both the original
-and the filtered versions.
-
-There is a third option: profile and optimize HTMLPurifier yourself. Be sure
-to report back your results if you decide to do that! Especially if you
-port HTML Purifier to C++. ;-)
diff --git a/WYSIWYG b/WYSIWYG
index 6fab8bcc..718f8959 100644
--- a/WYSIWYG
+++ b/WYSIWYG
@@ -18,4 +18,5 @@ HTML Purifier is perfect for filtering pure-HTML input from WYSIWYG editors.
 Enough said.
 
 There is a proof-of-concept integration of HTML Purifier with the Mantis
-bugtracker at http://hp.jpsband.org/mantis/
+bugtracker at http://hp.jpsband.org/mantis/ You can see notes on how
+this integration was acheived at http://hp.jpsband.org/mantis_notes.txt
diff --git a/docs/enduser-slow.html b/docs/enduser-slow.html
new file mode 100644
index 00000000..bac0704d
--- /dev/null
+++ b/docs/enduser-slow.html
@@ -0,0 +1,116 @@
+
+
+
+
+
+
+
+Speeding up HTML Purifier - HTML Purifier
+
+
+
+

Speeding up HTML Purifier

+
...also known as the HELP ME LIBRARY IS TOO SLOW MY PAGE TAKE TOO LONG page
+ +
Filed under End-User
+
Return to the index.
+ +

HTML Purifier is a very powerful library. But with power comes great
+responsibility, in the form of longer execution times. Remember, this
+library isn't lightly grazing over submitted HTML: it's deconstructing
+the whole thing, rigorously checking the parts, and then putting it back
+together.

+ +

So, if it so turns out that HTML Purifier is kinda too slow for outbound
+filtering, you've got a few options:

+ +

Inbound filtering

+ +

Perform filtering of HTML when it's submitted by the user. Since the
+user is already submitting something, an extra half a second tacked on
+to the load time probably isn't going to be that huge of a problem.
+Then, displaying the content is simply a matter of outputting it
+directly from your database/filesystem. The trouble with this method is
+that your user loses the original text, and when doing edits, will be
+handling the filtered text. While this may be a good thing, especially
+if you're using a WYSIWYG editor, it can also result in data loss if a
+user makes a typo.

+ +

Example (non-functional):

+ +
<?php
+    /**
+     * FORM SUBMISSION PAGE
+     * display_error($message) : displays nice error page with message
+     * display_success() : displays a nice success page
+     * display_form() : displays the HTML submission form
+     * database_insert($html) : inserts data into database as new row
+     */
+    if (!empty($_POST)) {
+        require_once '/path/to/library/HTMLPurifier.auto.php';
+        require_once 'HTMLPurifier.func.php';
+        $dirty_html = isset($_POST['html']) ? $_POST['html'] : false;
+        if (!$dirty_html) {
+            display_error('You must write some HTML!');
+            exit; // stop here so the empty input is never purified or stored
+        }
+        $html = HTMLPurifier($dirty_html);
+        database_insert($html);
+        display_success();
+        // notice that $dirty_html is *not* saved
+    } else {
+        display_form();
+    }
+?>
+ +

Caching the filtered output

+ +

Accept the submitted text and put it unaltered into the database, but
+then also generate a filtered version and stash that in the database.
+Serve the filtered version to readers, and the unaltered version to
+editors. If need be, you can invalidate the cache and have the cached
+filtered version be regenerated on the first page view. Pros? Full data
+retention. Cons? It's more complicated, and opens other editors up to
+XSS if they are using a WYSIWYG editor (to fix that, they'd have to be
+able to get their hands on the *really* original text served in
+plaintext mode).

+ +

Example (non-functional):

+ +
<?php
+    /**
+     * VIEW PAGE
+     * display_error($message) : displays nice error page with message
+     * cache_get($id) : retrieves HTML from fast cache (db or file)
+     * cache_insert($id, $html) : inserts good HTML into cache system
+     * database_get($id) : retrieves raw HTML from database
+     */
+    $id = isset($_GET['id']) ? (int) $_GET['id'] : false;
+    if (!$id) {
+        display_error('Must specify ID.');
+        exit;
+    }
+    $html = cache_get($id); // filesystem or database
+    if ($html === false) {
+        // cache didn't have the HTML, generate it
+        $raw_html = database_get($id);
+        require_once '/path/to/library/HTMLPurifier.auto.php';
+        require_once 'HTMLPurifier.func.php';
+        $html = HTMLPurifier($raw_html);
+        cache_insert($id, $html);
+    }
+    echo $html;
+?>
+ +

Summary

+ +

In short, inbound filtering is the simple option and caching is the
+robust option (albeit with bigger storage requirements).

+ +

There is a third option, independent of the two we've discussed: profile
+and optimize HTML Purifier yourself. Be sure to report back your results
+if you decide to do that! Especially if you port HTML Purifier to C++.
+;-)

+
+
+
\ No newline at end of file
diff --git a/docs/index.html b/docs/index.html
index 12d839db..5179205a 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -28,6 +28,9 @@ information for casual developers using HTML Purifier.

Embedding YouTube videos
Explains how to safely allow the embedding of flash from trusted sites.
+
Speeding up HTML Purifier
+
Explains how to speed up HTML Purifier through caching or inbound filtering.
+

Development

diff --git a/library/HTMLPurifier.func.php b/library/HTMLPurifier.func.php
index c9a81eb1..876ad7b2 100644
--- a/library/HTMLPurifier.func.php
+++ b/library/HTMLPurifier.func.php
@@ -6,6 +6,7 @@
  * this is efficient for instances when you only use HTML Purifier
  * on a few of your pages, it murders bytecode caching. You still
  * need to add HTML Purifier to your path.
+ * @note ''HTMLPurifier()'' is NOT the same as ''new HTMLPurifier()''
  */
 function HTMLPurifier($html, $config = null) {