mirror of https://github.com/ezyang/htmlpurifier.git synced 2025-01-03 05:11:52 +00:00
- Add HTMLPurifier.auto.php stub class that automatically configures include path
- Rewrite INSTALL document
- Add semi-lossy dumb character entity conversion to TODO list

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@469 48356398-32a2-884e-a903-53898d9a118a
Edward Z. Yang 2006-09-28 00:31:12 +00:00
parent cbdd48811d
commit 32c5b5080b
3 changed files with 123 additions and 73 deletions

178
INSTALL

@@ -2,145 +2,183 @@
Install Install
How to install HTML Purifier How to install HTML Purifier
Being a library, there's no fancy GUI that will take you step-by-step through HTML Purifier is designed to run out of the box, so actually using the library
configuring database credentials and other mumbo-jumbo. HTML Purifier is is extremely easy. (Although, if you were looking for a step-by-step
designed to run "out of the box." Regardless, there are still a couple of installation GUI, you've come to the wrong place!) The impatient can scroll
things you should be mindful of. down to the bottom of this INSTALL document to see the code, but you really
should make sure a few things are properly done.
0. Compatibility 1. Compatibility
HTML Purifier works in both PHP 4 and PHP 5. I have run the test suite on HTML Purifier works in both PHP 4 and PHP 5, from PHP 4.3.9 and up. It has no
these versions: core dependencies with other libraries. (Whoopee!)
- 4.3.9, 4.3.11 Optional extensions are iconv (usually installed) and tidy (also common).
- 4.4.0, 4.4.4 If you use UTF-8 and don't plan on pretty-printing HTML, you can get away with
- 5.0.0, 5.0.4 not having either of these extensions.
- 5.1.0, 5.1.6
And can confidently say that HTML Purifier should work in all versions
between and afterwards. HTML Purifier definitely does not support PHP 4.2,
and PHP 4.3 branch support may go further back than that, but I haven't tested
any earlier versions.
I have been unable to get PHP 5.0.5 working on my computer, so if someone
wants to test that, be my guest. All tests were done on Windows XP Home,
but operating system should not be a major factor in the library.
1. Including the proper files 2. Including the library
The library/ directory must be added to your path: HTML Purifier will not be Simply use:
able to find the necessary includes otherwise. This is as simple as:
set_include_path('/path/to/htmlpurifier/library' . PATH_SEPARATOR . require_once '/path/to/library/HTMLPurifier.auto.php';
get_include_path() );
...replacing /path/to/htmlpurifier with the actual location of the folder. Don't ...and you're good to go. Since HTML Purifier's codebase is fairly
worry, HTML Purifier is namespaced so unless you have another file named large, I recommend only including HTML Purifier when you need it.
HTMLPurifier.php, the files won't collide with any of your includes.
Then, it's a simple matter of including the base file: If you don't like your include_path to be fiddled around with, simply set
HTML Purifier's library/ directory to the include path yourself and then:
require_once 'HTMLPurifier.php'; require_once 'HTMLPurifier.php';
...and you're good to go. The library/ folder contains all the files you need, Only the contents in the library/ folder are necessary, so you can remove
so you can get rid of most of everything else when using the library in a everything else when using HTML Purifier in a production environment.
production environment.
2. Preparing the proper environment 3. Preparing the proper output environment
While no configuration is necessary, you first should take precautions regarding HTML Purifier is all about web-standards, so accordingly your webpages should
the other output HTML that the filtered content will be going along with. Here be standards compliant. HTML Purifier can deal with these doctypes:
is a (short) checklist:
* Have I specified XHTML 1.0 Transitional as the doctype? * XHTML 1.0 Transitional (default)
* Have I specified UTF-8 as the character encoding? * HTML 4.01 Transitional
...and these character encodings:
* UTF-8 (default)
* Any encoding iconv supports (support is crippled for i18n though)
The defaults are there for a reason: they are best-practice choices that
should not be changed lightly. For those of you in the dark, you can determine
the doctype from this code in your HTML documents:
To find out what these are, browse to your website and view its source code.
You can figure out the doctype from the a declaration that looks like
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
or no doctype. You can figure out the character encoding by looking for
...and the character encoding from this code:
<meta http-equiv="Content-type" content="text/html;charset=ENCODING"> <meta http-equiv="Content-type" content="text/html;charset=ENCODING">
I cannot stress the importance of these two bullets enough. Omitting either For legacy codebases these declarations may be missing. If that is the case,
of them could have dire consequences not only for security but for plain STOP, and read up on character encodings and doctypes (in that order). Here
old usability. You can find a more in-depth discussion of why this is needed are some links:
in docs/security.txt, in the meantime, try to change your output so this is
the case. If you can't, well, we might be able to accomodate you (read * http://www.joelonsoftware.com/articles/Unicode.html
section 3). * http://alistapart.com/stories/doctype/
You may currently be vulnerable to XSS and other security threats, and HTML
Purifier won't be able to fix that.
3. Configuring HTML Purifier 4. Configuration
HTML Purifier is designed to run out-of-the-box, but occasionally HTML HTML Purifier is designed to run out-of-the-box, but occasionally HTML
Purifier needs to be told what to do. Purifier needs to be told what to do. If you answered no to any of these
questions, read on, otherwise, you can skip to the next section (or, if you're
into configuring things just for the heck of it, skip to 4.3).
If, for some reason, you are unable to switch to UTF-8 immediately, you can * Am I using UTF-8?
switch HTML Purifier's encoding. Note that the availability of encodings is * Am I using XHTML 1.0 Transitional?
dependent on iconv, and you'll be missing characters if the charset you
choose doesn't have them. If you answered yes to any of these questions, instantiate a configuration
object and read on:
$config = HTMLPurifier_Config::createDefault();
4.1. Setting a different character encoding
You really shouldn't use any other encoding except UTF-8, especially if you
plan to support multilingual websites (read section three for more details).
However, switching to UTF-8 is not always immediately feasible, so we can
adapt.
HTML Purifier uses iconv to support other character encodings, as such,
any encoding that iconv supports <http://www.gnu.org/software/libiconv/>
HTML Purifier supports with this code:
$config->set('Core', 'Encoding', /* put your encoding here */); $config->set('Core', 'Encoding', /* put your encoding here */);
An example usage for Latin-1 websites: An example usage for Latin-1 websites (the most common encoding for English
websites):
$config->set('Core', 'Encoding', 'ISO-8859-1'); $config->set('Core', 'Encoding', 'ISO-8859-1');
Note that HTML Purifier's support for non-Unicode encodings is crippled by the
fact that any character not supported by that encoding will be silently
dropped, EVEN if it is ampersand escaped. This is a current limitation of
HTML Purifier that we are NOT actively working to fix. Patches are welcome,
but there are so many other gotchas and problems in I18N for non-Unicode
encodings that this functionality is low priority. See
<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html> for a more
detailed lowdown on the topic.
4.2. Setting a different doctype
For those of you stuck using HTML 4.01 Transitional, you can disable For those of you stuck using HTML 4.01 Transitional, you can disable
XHTML output like this: XHTML output like this:
$config->set('Core', 'XHTML', false); $config->set('Core', 'XHTML', false);
However, I strongly recommend that you use XHTML. Currently, we can only I recommend that you use XHTML, although not as much as I recommend UTF-8. If
guarantee transitional-complaint output, future versions will also allow strict your HTML 4.01 page validates, good for you!
output. There are more configuration directives which can be read about
here: http://hp.jpsband.org/live/configdoc/plain.html Currently, we can only guarantee transitional-complaint output, future
versions will also allow strict-compliant output.
3. Using the code 4.3. Other settings
There are more configuration directives which can be read about
here: <http://hp.jpsband.org/live/configdoc/plain.html> They're a bit boring,
but they can help out for those of you who like to exert maximum control over
your code.
5. Using the code
The interface is mind-numbingly simple: The interface is mind-numbingly simple:
$purifier = new HTMLPurifier(); $purifier = new HTMLPurifier();
$clean_html = $purifier->purify( $dirty_html ); $clean_html = $purifier->purify( $dirty_html );
Or, if you're using the configuration object: ...or, if you're using the configuration object:
$purifier = new HTMLPurifier($config); $purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify( $dirty_html ); $clean_html = $purifier->purify( $dirty_html );
That's it. For more examples, check out docs/examples/. Also, SLOW gives That's it! For more examples, check out docs/examples/ (they aren't very
advice on what to do if HTML Purifier is slowing down your application. different though). Also, SLOW gives advice on what to do if HTML Purifier
is slowing down your application.
4. Quick install 6. Quick install
If your website is in UTF-8 and XHTML Transitional, use this code: If your website is in UTF-8 and XHTML Transitional, use this code:
<?php <?php
set_include_path('/path/to/htmlpurifier/library' require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';
. PATH_SEPARATOR . get_include_path() );
require_once 'HTMLPurifier.php';
$purifier = new HTMLPurifier();
$purifier = new HTMLPurifier();
$clean_html = $purifier->purify($dirty_html); $clean_html = $purifier->purify($dirty_html);
?> ?>
If your website is in a different encoding or doctype, use this code: If your website is in a different encoding or doctype, use this code:
<?php <?php
set_include_path('/path/to/htmlpurifier/library' require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';
. PATH_SEPARATOR . get_include_path() );
require_once 'HTMLPurifier.php';
$config = HTMLPurifier_Config::createDefault(); $config = HTMLPurifier_Config::createDefault();
$config->set('Core', 'Encoding', 'ISO-8859-1'); //replace with your encoding $config->set('Core', 'Encoding', 'ISO-8859-1'); //replace with your encoding

2
TODO

@@ -45,6 +45,8 @@ Unknown release (on a scratch-an-itch basis)
empty-cells:show is applied to have compatibility with Internet Explorer empty-cells:show is applied to have compatibility with Internet Explorer
- Non-lossy dumb alternate character encoding transformations, achieved by - Non-lossy dumb alternate character encoding transformations, achieved by
numerically encoding all non-ASCII characters numerically encoding all non-ASCII characters
- Semi-lossy dumb alternate character encoding transformations, achieved by
encoding all characters that have string entity equivalents
Wontfix Wontfix
- Non-lossy smart alternate character encoding transformations - Non-lossy smart alternate character encoding transformations


@@ -0,0 +1,10 @@
<?php
/**
* This is a stub include that automatically configures the include path.
*/
set_include_path(dirname(__FILE__) . PATH_SEPARATOR . get_include_path() );
require_once 'HTMLPurifier.php';
?>
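
The stub above works by prepending its own directory to the include path before requiring the base file. A minimal sketch of that technique on its own, runnable without HTML Purifier installed, might look like the following; the variable names $before and $entries are illustrative and not part of the commit.

```php
<?php
// Sketch only: reproduces the stub's include-path step in isolation.
// $before and $entries are hypothetical names used for illustration.

// Remember the include path as it was before the change.
$before = get_include_path();

// Prepend this file's directory, exactly as the stub does.
set_include_path(dirname(__FILE__) . PATH_SEPARATOR . $before);

// The first entry is now this file's directory, so a bare
// require_once 'HTMLPurifier.php' would be looked up there first.
$entries = explode(PATH_SEPARATOR, get_include_path());
echo $entries[0], "\n";
?>
```

Because the directory is prepended rather than appended, the library's own files are searched before anything else on the path; the previous entries are kept, so existing includes should continue to resolve.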