diff --git a/INSTALL b/INSTALL
index b3382056..168d1026 100644
--- a/INSTALL
+++ b/INSTALL
@@ -2,145 +2,183 @@
 Install
 How to install HTML Purifier
-Being a library, there's no fancy GUI that will take you step-by-step through
-configuring database credentials and other mumbo-jumbo. HTML Purifier is
-designed to run "out of the box." Regardless, there are still a couple of
-things you should be mindful of.
+HTML Purifier is designed to run out of the box, so actually using the library
+is extremely easy. (Although, if you were looking for a step-by-step
+installation GUI, you've come to the wrong place!) The impatient can scroll
+down to the bottom of this INSTALL document to see the code, but you really
+should make sure a few things are properly done.
-0. Compatibility
+1. Compatibility
-HTML Purifier works in both PHP 4 and PHP 5. I have run the test suite on
-these versions:
+HTML Purifier works in both PHP 4 and PHP 5, from PHP 4.3.9 and up. It has no
+core dependencies with other libraries. (Whoopee!)
- - 4.3.9, 4.3.11
- - 4.4.0, 4.4.4
- - 5.0.0, 5.0.4
- - 5.1.0, 5.1.6
-
-And can confidently say that HTML Purifier should work in all versions
-between and afterwards. HTML Purifier definitely does not support PHP 4.2,
-and PHP 4.3 branch support may go further back than that, but I haven't tested
-any earlier versions.
-
-I have been unable to get PHP 5.0.5 working on my computer, so if someone
-wants to test that, be my guest. All tests were done on Windows XP Home,
-but operating system should not be a major factor in the library.
+Optional extensions are iconv (usually installed) and tidy (also common).
+If you use UTF-8 and don't plan on pretty-printing HTML, you can get away with
+not having either of these extensions.
-1. Including the proper files
+2. Including the library
-The library/ directory must be added to your path: HTML Purifier will not be
-able to find the necessary includes otherwise. This is as simple as:
+Simply use:
- set_include_path('/path/to/htmlpurifier/library' . PATH_SEPARATOR .
- get_include_path() );
+ require_once '/path/to/library/HTMLPurifier.auto.php';
-...replacing /path/to/htmlpurifier with the actual location of the folder. Don't
-worry, HTML Purifier is namespaced so unless you have another file named
-HTMLPurifier.php, the files won't collide with any of your includes.
+...and you're good to go. Since HTML Purifier's codebase is fairly
+large, I recommend only including HTML Purifier when you need it.
-Then, it's a simple matter of including the base file:
+If you don't like your include_path to be fiddled around with, simply set
+HTML Purifier's library/ directory to the include path yourself and then:
- require_once 'HTMLPurifier.php';
+ require_once 'HTMLPurifier.php';
-...and you're good to go. The library/ folder contains all the files you need,
-so you can get rid of most of everything else when using the library in a
-production environment.
+Only the contents in the library/ folder are necessary, so you can remove
+everything else when using HTML Purifier in a production environment.
-2. Preparing the proper environment
+3. Preparing the proper output environment
-While no configuration is necessary, you first should take precautions regarding
-the other output HTML that the filtered content will be going along with. Here
-is a (short) checklist:
+HTML Purifier is all about web-standards, so accordingly your webpages should
+be standards compliant. HTML Purifier can deal with these doctypes:
- * Have I specified XHTML 1.0 Transitional as the doctype?
- * Have I specified UTF-8 as the character encoding?
+* XHTML 1.0 Transitional (default)
+* HTML 4.01 Transitional
+
+...and these character encodings:
+
+* UTF-8 (default)
+* Any encoding iconv supports (support is crippled for i18n though)
+
+The defaults are there for a reason: they are best-practice choices that
+should not be changed lightly. For those of you in the dark, you can determine
+the doctype from this code in your HTML documents:
-To find out what these are, browse to your website and view its source code.
-You can figure out the doctype from the a declaration that looks like
-or no doctype. You can figure out the character encoding by looking for
+
+...and the character encoding from this code:
+
-I cannot stress the importance of these two bullets enough. Omitting either
-of them could have dire consequences not only for security but for plain
-old usability. You can find a more in-depth discussion of why this is needed
-in docs/security.txt, in the meantime, try to change your output so this is
-the case. If you can't, well, we might be able to accomodate you (read
-section 3).
+For legacy codebases these declarations may be missing. If that is the case,
+STOP, and read up on character encodings and doctypes (in that order). Here
+are some links:
+
+* http://www.joelonsoftware.com/articles/Unicode.html
+* http://alistapart.com/stories/doctype/
+
+You may currently be vulnerable to XSS and other security threats, and HTML
+Purifier won't be able to fix that.
-3. Configuring HTML Purifier
+4. Configuration
 HTML Purifier is designed to run out-of-the-box, but occasionally HTML
-Purifier needs to be told what to do.
+Purifier needs to be told what to do. If you answered no to any of these
+questions, read on, otherwise, you can skip to the next section (or, if you're
+into configuring things just for the heck of it, skip to 4.3).
-If, for some reason, you are unable to switch to UTF-8 immediately, you can
-switch HTML Purifier's encoding. Note that the availability of encodings is
-dependent on iconv, and you'll be missing characters if the charset you
-choose doesn't have them.
+* Am I using UTF-8?
+* Am I using XHTML 1.0 Transitional?
+
+If you answered yes to any of these questions, instantiate a configuration
+object and read on:
+
+ $config = HTMLPurifier_Config::createDefault();
+
+
+
+4.1. Setting a different character encoding
+
+You really shouldn't use any other encoding except UTF-8, especially if you
+plan to support multilingual websites (read section three for more details).
+However, switching to UTF-8 is not always immediately feasible, so we can
+adapt.
+
+HTML Purifier uses iconv to support other character encodings, as such,
+any encoding that iconv supports
+HTML Purifier supports with this code:
 $config->set('Core', 'Encoding', /* put your encoding here */);
-An example usage for Latin-1 websites:
+An example usage for Latin-1 websites (the most common encoding for English
+websites):
 $config->set('Core', 'Encoding', 'ISO-8859-1');
+Note that HTML Purifier's support for non-Unicode encodings is crippled by the
+fact that any character not supported by that encoding will be silently
+dropped, EVEN if it is ampersand escaped. This is a current limitation of
+HTML Purifier that we are NOT actively working to fix. Patches are welcome,
+but there are so many other gotchas and problems in I18N for non-Unicode
+encodings that this functionality is low priority. See
+ for a more
+detailed lowdown on the topic.
+
+
+
+4.2. Setting a different doctype
+
 For those of you stuck using HTML 4.01 Transitional, you can disable XHTML
 output like this:
 $config->set('Core', 'XHTML', false);
-However, I strongly recommend that you use XHTML. Currently, we can only
-guarantee transitional-complaint output, future versions will also allow strict
-output. There are more configuration directives which can be read about
-here: http://hp.jpsband.org/live/configdoc/plain.html
+I recommend that you use XHTML, although not as much as I recommend UTF-8. If
+your HTML 4.01 page validates, good for you!
+
+Currently, we can only guarantee transitional-complaint output, future
+versions will also allow strict-compliant output.
-3. Using the code
+4.3. Other settings
+
+There are more configuration directives which can be read about
+here: They're a bit boring,
+but they can help out for those of you who like to exert maximum control over
+your code.
+
+
+
+5. Using the code
 The interface is mind-numbingly simple:
 $purifier = new HTMLPurifier();
- $clean_html = $purifier->purify($dirty_html);
+ $clean_html = $purifier->purify( $dirty_html );
-Or, if you're using the configuration object:
+...or, if you're using the configuration object:
 $purifier = new HTMLPurifier($config);
- $clean_html = $purifier->purify($dirty_html);
+ $clean_html = $purifier->purify( $dirty_html );
-That's it. For more examples, check out docs/examples/. Also, SLOW gives
-advice on what to do if HTML Purifier is slowing down your application.
+That's it! For more examples, check out docs/examples/ (they aren't very
+different though). Also, SLOW gives advice on what to do if HTML Purifier
+is slowing down your application.
-4. Quick install
+6. Quick install
 If your website is in UTF-8 and XHTML Transitional, use this code:
 purify($dirty_html); ?>
 If your website is in a different encoding or doctype, use this code:
 set('Core', 'Encoding', 'ISO-8859-1'); //replace with your encoding
diff --git a/TODO b/TODO
index 79c32c89..e6a971eb 100644
--- a/TODO
+++ b/TODO
@@ -45,6 +45,8 @@ Unknown release (on a scratch-an-itch basis)
   empty-cells:show is applied to have compatibility with Internet Explorer
  - Non-lossy dumb alternate character encoding transformations, achieved by
   numerically encoding all non-ASCII characters
+ - Semi-lossy dumb alternate character encoding transformations, achieved by
+ encoding all characters that have string entity equivalents
 Wontfix
  - Non-lossy smart alternate character encoding transformations
diff --git a/library/HTMLPurifier.auto.php b/library/HTMLPurifier.auto.php
new file mode 100644
index 00000000..a66fd2e2
--- /dev/null
+++ b/library/HTMLPurifier.auto.php
@@ -0,0 +1,10 @@
+
\ No newline at end of file