mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-11-08 14:58:42 +00:00
e99520ab96
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1253 48356398-32a2-884e-a903-53898d9a118a
231 lines
7.4 KiB
Plaintext
Install
How to install HTML Purifier

HTML Purifier is designed to run out of the box, so actually using the library
is extremely easy. (Although, if you were looking for a step-by-step
installation GUI, you've come to the wrong place!) The impatient can skip
ahead to Section 6 (Quick install) of this INSTALL document to see the code,
but you really should make sure a few things are properly done.


1. Compatibility

HTML Purifier works in both PHP 4 and PHP 5, from PHP 4.3.2 and up. It has no
core dependencies on other libraries.

Optional extensions are iconv (usually installed) and tidy (also common).
If you use UTF-8 and don't plan on pretty-printing HTML, you can get away with
not having either of these extensions.
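
The requirements above can be checked from PHP itself. The following is a
minimal sketch, not part of HTML Purifier: the function name
purifier_environment_report() is hypothetical, and it relies only on PHP's
built-in version_compare() and extension_loaded().

```php
<?php
// Hypothetical helper (not part of HTML Purifier): reports whether the
// running PHP meets the requirements listed in this section.
function purifier_environment_report()
{
    return array(
        // HTML Purifier requires PHP 4.3.2 or later.
        'php_ok' => version_compare(PHP_VERSION, '4.3.2', '>='),
        // Optional: used for encodings other than UTF-8.
        'iconv'  => extension_loaded('iconv'),
        // Optional: used for pretty-printing HTML.
        'tidy'   => extension_loaded('tidy'),
    );
}

// Print one "name: yes/no" line per check.
foreach (purifier_environment_report() as $name => $ok) {
    echo $name, ': ', ($ok ? 'yes' : 'no'), "\n";
}
```

A "no" for iconv or tidy is not fatal; it only matters if you need the
features those extensions provide.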


2. Including the library

Simply use:

    require_once '/path/to/library/HTMLPurifier.auto.php';

...and you're good to go. Since HTML Purifier's codebase is fairly
large, I recommend only including HTML Purifier when you need it.

If you don't like your include_path to be fiddled around with, simply add
HTML Purifier's library/ directory to the include path yourself and then:

    require_once 'HTMLPurifier.php';
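
Adding the directory to the include path yourself might look like the sketch
below. The helper name prepend_include_path() is hypothetical and
'/path/to/library' is a placeholder; only PHP's own get_include_path() and
set_include_path() are used.

```php
<?php
// Hypothetical helper: prepends a directory to PHP's include_path unless it
// is already listed, and returns the resulting include_path.
function prepend_include_path($dir)
{
    $paths = explode(PATH_SEPARATOR, get_include_path());
    if (!in_array($dir, $paths)) {
        array_unshift($paths, $dir);
        set_include_path(implode(PATH_SEPARATOR, $paths));
    }
    return get_include_path();
}

// Placeholder location of the library/ directory; adjust to your installation.
prepend_include_path('/path/to/library');

// With the include path in place, the plain include from above works:
// require_once 'HTMLPurifier.php';
```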

Only the contents of the library/ folder are necessary, so you can remove
everything else when using HTML Purifier in a production environment.


3. Preparing the proper output environment

HTML Purifier is all about web standards, so accordingly your webpages should
be standards-compliant. HTML Purifier can deal with these doctypes:

* XHTML 1.0 Transitional (default)
* XHTML 1.0 Strict
* HTML 4.01 Transitional
* HTML 4.01 Strict
* XHTML 1.1 (sans Ruby)

...and these character encodings:

* UTF-8 (default)
* Any encoding iconv supports (support is crippled for i18n though)

The defaults are there for a reason: they are best-practice choices that
should not be changed lightly. For those of you in the dark, you can determine
the doctype from this code in your HTML documents:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

...and the character encoding from this code:

    <meta http-equiv="Content-type" content="text/html;charset=ENCODING">

For legacy codebases these declarations may be missing. If that is the case,
STOP, and read docs/enduser-utf8.html

You may currently be vulnerable to XSS and other security threats, and HTML
Purifier won't be able to fix that.
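
If your templates are generated from PHP, the two declarations above could be
emitted from one place. This is only an illustrative sketch for the default
combination (XHTML 1.0 Transitional and UTF-8); page_declarations() is a
hypothetical name, and the meta tag is written in self-closing form because
the doctype is XHTML.

```php
<?php
// Hypothetical helper: builds the doctype and character encoding
// declarations described above for an XHTML 1.0 Transitional page.
function page_declarations($encoding = 'UTF-8')
{
    $doctype = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"' . "\n"
             . '    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
    $meta = '<meta http-equiv="Content-type" content="text/html;charset='
          . htmlspecialchars($encoding) . '" />';
    return array('doctype' => $doctype, 'meta' => $meta);
}

$decl = page_declarations();
echo $decl['doctype'], "\n", $decl['meta'], "\n";
```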


4. Configuration

HTML Purifier is designed to run out-of-the-box, but occasionally HTML
Purifier needs to be told what to do. Ask yourself these questions:

* Am I using UTF-8?
* Am I using XHTML 1.0 Transitional?

If you answered yes to both, you can skip to the next section (or, if you're
into configuring things just for the heck of it, skip to 4.3). If you
answered no to either of these questions, instantiate a configuration
object and read on:

    $config = HTMLPurifier_Config::createDefault();


4.1. Setting a different character encoding

You really shouldn't use any encoding other than UTF-8, especially if you
plan to support multilingual websites (see Section 3 for more details).
However, switching to UTF-8 is not always immediately feasible, so we can
adapt.

HTML Purifier uses iconv to support other character encodings. As such,
HTML Purifier supports any encoding that iconv supports
<http://www.gnu.org/software/libiconv/> with this code:

    $config->set('Core', 'Encoding', /* put your encoding here */);

An example usage for Latin-1 websites (the most common encoding for English
websites):

    $config->set('Core', 'Encoding', 'ISO-8859-1');

Note that HTML Purifier's support for non-Unicode encodings is crippled by the
fact that any character not supported by that encoding will be silently
dropped, EVEN if it is ampersand escaped. If you want to work around
this, you are welcome to read docs/enduser-utf8.html for a fix,
but please be cognizant of the issues the "solution" creates (for this
reason, I do not include the solution in this document).
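
Before setting a non-UTF-8 encoding, it may be worth checking that your iconv
build actually knows the encoding name. The sketch below is one way to do
that; iconv_supports_encoding() is a hypothetical helper, not something HTML
Purifier provides, and it only probes PHP's iconv() function.

```php
<?php
// Hypothetical helper: returns true if the iconv extension is loaded and can
// convert from the named encoding to UTF-8.
function iconv_supports_encoding($encoding)
{
    if (!function_exists('iconv')) {
        return false; // iconv extension is not installed
    }
    // iconv() returns false (and raises an error message, suppressed here)
    // when it does not recognize the encoding name.
    return @iconv($encoding, 'UTF-8', 'a') !== false;
}

var_dump(iconv_supports_encoding('ISO-8859-1'));
var_dump(iconv_supports_encoding('no-such-encoding'));
```

This check says nothing about the dropped-character problem described above;
it only confirms that the conversion can run at all.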


4.2. Setting a different doctype

For those of you using HTML 4.01 Transitional, you can disable
XHTML output like this:

    $config->set('HTML', 'Doctype', 'HTML 4.01 Transitional');

The supported doctypes are:

* HTML 4.01 Strict
* HTML 4.01 Transitional
* XHTML 1.0 Strict
* XHTML 1.0 Transitional
* XHTML 1.1


4.3. Other settings

There are more configuration directives, which you can read about
here: <http://htmlpurifier.org/live/configdoc/plain.html>. They're a bit boring,
but they can help out for those of you who like to exert maximum control over
your code. Some of the more interesting ones are configurable at the
demo <http://htmlpurifier.org/demo.php> and are well worth looking into
for your own system.


5. Using the code

The interface is mind-numbingly simple:

    $purifier = new HTMLPurifier();
    $clean_html = $purifier->purify( $dirty_html );

...or, if you're using the configuration object:

    $purifier = new HTMLPurifier($config);
    $clean_html = $purifier->purify( $dirty_html );

That's it! For more examples, check out docs/examples/ (they aren't very
different though). Also, docs/enduser-slow.html gives advice on what to
do if HTML Purifier is slowing down your application.
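
Since the library is large, one possible pattern is to load it only on first
use and reuse a single instance afterwards. The wrapper below is a sketch
under assumptions: purify_html() is a hypothetical name, and the path is a
placeholder for wherever you installed the library.

```php
<?php
// Hypothetical wrapper: loads HTML Purifier on first use and reuses one
// instance for every later call within the same request.
function purify_html($dirty_html)
{
    static $purifier = null;
    if ($purifier === null) {
        // Placeholder install location; adjust to your own path.
        require_once '/path/to/library/HTMLPurifier.auto.php';
        $purifier = new HTMLPurifier();
    }
    return $purifier->purify($dirty_html);
}

// Usage, once the path above points at a real installation:
// echo purify_html($_POST['comment']);
```

Pages that never call purify_html() never pay the cost of including the
library.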


6. Quick install

First, make sure library/HTMLPurifier/DefinitionCache/Serializer is
writable by the webserver (see Section 7: Caching below for details).
If your website is in UTF-8 and XHTML 1.0 Transitional, use this code:

    <?php
        require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';

        $purifier = new HTMLPurifier();
        $clean_html = $purifier->purify($dirty_html);
    ?>

If your website is in a different encoding or doctype, use this code:

    <?php
        require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';

        $config = HTMLPurifier_Config::createDefault();
        $config->set('Core', 'Encoding', 'ISO-8859-1'); // replace with your encoding
        $config->set('HTML', 'Doctype', 'HTML 4.01 Transitional'); // replace with your doctype
        $purifier = new HTMLPurifier($config);

        $clean_html = $purifier->purify($dirty_html);
    ?>


7. Caching

HTML Purifier generates some cache files (generally one or two) to speed up
its execution. For maximum performance, make sure that
library/HTMLPurifier/DefinitionCache/Serializer is writable by the webserver.

If you are in the library/ folder of HTML Purifier, you can set the
appropriate permissions using:

    chmod -R 0755 HTMLPurifier/DefinitionCache/Serializer

If the above command doesn't work, you may need to assign write permissions
to all. This may be necessary if your webserver runs as nobody, but is
not recommended since it means any other user can write files in the
directory. Use:

    chmod -R 0777 HTMLPurifier/DefinitionCache/Serializer

You can also chmod files via your FTP client; this option
is usually accessible by right-clicking the corresponding directory and
then selecting "chmod" or "file permissions".

Starting with 2.0.1, HTML Purifier will generate friendly error messages
that will tell you exactly what you have to chmod the directory to. If in
doubt, follow its advice.
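
To see what the webserver itself thinks of the cache directory, you can ask
PHP directly. This is a minimal sketch using PHP's is_dir() and is_writable();
cache_dir_status() is a hypothetical helper, and the relative path assumes the
script runs from HTML Purifier's library/ folder.

```php
<?php
// Hypothetical helper: describes whether PHP (running as the webserver
// user) can write to the given cache directory.
function cache_dir_status($dir)
{
    if (!is_dir($dir)) {
        return 'missing';
    }
    return is_writable($dir) ? 'writable' : 'not writable';
}

// Assumed path, relative to HTML Purifier's library/ folder.
echo cache_dir_status('HTMLPurifier/DefinitionCache/Serializer'), "\n";
```

Run it through the webserver rather than from a shell, since the two often
run as different users.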

If you are unable or unwilling to give write permissions to the cache
directory, you can either disable the cache (and suffer a performance
hit):

    $config->set('Core', 'DefinitionCache', null);

Or move the cache directory somewhere else (no trailing slash):

    $config->set('Cache', 'SerializerPath', '/home/user/absolute/path');
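
If you move the cache, the new directory has to exist and be writable, and the
path must not end in a slash. The sketch below prepares such a path before it
is handed to the configuration object. prepare_cache_path() is a hypothetical
helper, and the temp-folder location is only an example.

```php
<?php
// Hypothetical helper: strips any trailing slash, creates the directory if
// it does not exist yet, and returns the path only if it ends up writable.
function prepare_cache_path($path)
{
    $path = rtrim($path, '/\\');
    if (!is_dir($path) && !@mkdir($path, 0755, true)) {
        return false; // could not create the directory
    }
    return is_writable($path) ? $path : false;
}

// Example location under the system temp folder (an assumption; pick any
// directory the webserver user owns).
$cache_path = prepare_cache_path(sys_get_temp_dir() . '/htmlpurifier-cache/');
if ($cache_path !== false) {
    // $config->set('Cache', 'SerializerPath', $cache_path);
    echo "Cache path ready: $cache_path\n";
}
```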