mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2025-01-08 07:01:53 +00:00
Merged 463:474 for 1.1.2 release.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/branches/1.1@475 48356398-32a2-884e-a903-53898d9a118a
This commit is contained in:
parent
6ef8abd04f
commit
8104145580
11
Doxyfile
11
Doxyfile
@ -4,7 +4,7 @@
|
||||
# Project related configuration options
|
||||
#---------------------------------------------------------------------------
|
||||
PROJECT_NAME = HTML Purifier
|
||||
PROJECT_NUMBER = 1.0.0
|
||||
PROJECT_NUMBER = 1.1.2
|
||||
OUTPUT_DIRECTORY = "C:/Documents and Settings/Edward/My Documents/My Webs/htmlpurifier/docs/doxygen"
|
||||
CREATE_SUBDIRS = NO
|
||||
OUTPUT_LANGUAGE = English
|
||||
@ -89,9 +89,12 @@ EXCLUDE =
|
||||
EXCLUDE_SYMLINKS = NO
|
||||
EXCLUDE_PATTERNS = */tests/* \
|
||||
*/benchmarks/* \
|
||||
*/docs/phpdoc/* \
|
||||
*/docs/doxygen/* \
|
||||
*/test-settings.php
|
||||
*/docs/* \
|
||||
*/test-settings.php \
|
||||
*/configdoc/* \
|
||||
*/test-settings.php \
|
||||
*/maintenance/* \
|
||||
*/smoketests/*
|
||||
EXAMPLE_PATH =
|
||||
EXAMPLE_PATTERNS = *
|
||||
EXAMPLE_RECURSIVE = NO
|
||||
|
178
INSTALL
178
INSTALL
@ -2,145 +2,183 @@
|
||||
Install
|
||||
How to install HTML Purifier
|
||||
|
||||
Being a library, there's no fancy GUI that will take you step-by-step through
|
||||
configuring database credentials and other mumbo-jumbo. HTML Purifier is
|
||||
designed to run "out of the box." Regardless, there are still a couple of
|
||||
things you should be mindful of.
|
||||
HTML Purifier is designed to run out of the box, so actually using the library
|
||||
is extremely easy. (Although, if you were looking for a step-by-step
|
||||
installation GUI, you've come to the wrong place!) The impatient can scroll
|
||||
down to the bottom of this INSTALL document to see the code, but you really
|
||||
should make sure a few things are properly done.
|
||||
|
||||
|
||||
|
||||
0. Compatibility
|
||||
1. Compatibility
|
||||
|
||||
HTML Purifier works in both PHP 4 and PHP 5. I have run the test suite on
|
||||
these versions:
|
||||
HTML Purifier works in both PHP 4 and PHP 5, from PHP 4.3.9 and up. It has no
|
||||
core dependencies with other libraries. (Whoopee!)
|
||||
|
||||
- 4.3.9, 4.3.11
|
||||
- 4.4.0, 4.4.4
|
||||
- 5.0.0, 5.0.4
|
||||
- 5.1.0, 5.1.6
|
||||
|
||||
And can confidently say that HTML Purifier should work in all versions
|
||||
between and afterwards. HTML Purifier definitely does not support PHP 4.2,
|
||||
and PHP 4.3 branch support may go further back than that, but I haven't tested
|
||||
any earlier versions.
|
||||
|
||||
I have been unable to get PHP 5.0.5 working on my computer, so if someone
|
||||
wants to test that, be my guest. All tests were done on Windows XP Home,
|
||||
but operating system should not be a major factor in the library.
|
||||
Optional extensions are iconv (usually installed) and tidy (also common).
|
||||
If you use UTF-8 and don't plan on pretty-printing HTML, you can get away with
|
||||
not having either of these extensions.
|
||||
|
||||
|
||||
|
||||
1. Including the proper files
|
||||
2. Including the library
|
||||
|
||||
The library/ directory must be added to your path: HTML Purifier will not be
|
||||
able to find the necessary includes otherwise. This is as simple as:
|
||||
Simply use:
|
||||
|
||||
set_include_path('/path/to/htmlpurifier/library' . PATH_SEPARATOR .
|
||||
get_include_path() );
|
||||
require_once '/path/to/library/HTMLPurifier.auto.php';
|
||||
|
||||
...replacing /path/to/htmlpurifier with the actual location of the folder. Don't
|
||||
worry, HTML Purifier is namespaced so unless you have another file named
|
||||
HTMLPurifier.php, the files won't collide with any of your includes.
|
||||
...and you're good to go. Since HTML Purifier's codebase is fairly
|
||||
large, I recommend only including HTML Purifier when you need it.
|
||||
|
||||
Then, it's a simple matter of including the base file:
|
||||
If you don't like your include_path to be fiddled around with, simply set
|
||||
HTML Purifier's library/ directory to the include path yourself and then:
|
||||
|
||||
require_once 'HTMLPurifier.php';
|
||||
|
||||
...and you're good to go. The library/ folder contains all the files you need,
|
||||
so you can get rid of most of everything else when using the library in a
|
||||
production environment.
|
||||
Only the contents in the library/ folder are necessary, so you can remove
|
||||
everything else when using HTML Purifier in a production environment.
|
||||
|
||||
|
||||
|
||||
2. Preparing the proper environment
|
||||
3. Preparing the proper output environment
|
||||
|
||||
While no configuration is necessary, you first should take precautions regarding
|
||||
the other output HTML that the filtered content will be going along with. Here
|
||||
is a (short) checklist:
|
||||
HTML Purifier is all about web-standards, so accordingly your webpages should
|
||||
be standards compliant. HTML Purifier can deal with these doctypes:
|
||||
|
||||
* Have I specified XHTML 1.0 Transitional as the doctype?
|
||||
* Have I specified UTF-8 as the character encoding?
|
||||
* XHTML 1.0 Transitional (default)
|
||||
* HTML 4.01 Transitional
|
||||
|
||||
...and these character encodings:
|
||||
|
||||
* UTF-8 (default)
|
||||
* Any encoding iconv supports (support is crippled for i18n though)
|
||||
|
||||
The defaults are there for a reason: they are best-practice choices that
|
||||
should not be changed lightly. For those of you in the dark, you can determine
|
||||
the doctype from this code in your HTML documents:
|
||||
|
||||
To find out what these are, browse to your website and view its source code.
|
||||
You can figure out the doctype from the a declaration that looks like
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
||||
or no doctype. You can figure out the character encoding by looking for
|
||||
|
||||
...and the character encoding from this code:
|
||||
|
||||
<meta http-equiv="Content-type" content="text/html;charset=ENCODING">
|
||||
|
||||
I cannot stress the importance of these two bullets enough. Omitting either
|
||||
of them could have dire consequences not only for security but for plain
|
||||
old usability. You can find a more in-depth discussion of why this is needed
|
||||
in docs/security.txt, in the meantime, try to change your output so this is
|
||||
the case. If you can't, well, we might be able to accomodate you (read
|
||||
section 3).
|
||||
For legacy codebases these declarations may be missing. If that is the case,
|
||||
STOP, and read up on character encodings and doctypes (in that order). Here
|
||||
are some links:
|
||||
|
||||
* http://www.joelonsoftware.com/articles/Unicode.html
|
||||
* http://alistapart.com/stories/doctype/
|
||||
|
||||
You may currently be vulnerable to XSS and other security threats, and HTML
|
||||
Purifier won't be able to fix that.
|
||||
|
||||
|
||||
|
||||
3. Configuring HTML Purifier
|
||||
4. Configuration
|
||||
|
||||
HTML Purifier is designed to run out-of-the-box, but occasionally HTML
|
||||
Purifier needs to be told what to do.
|
||||
Purifier needs to be told what to do. If you answered no to any of these
|
||||
questions, read on, otherwise, you can skip to the next section (or, if you're
|
||||
into configuring things just for the heck of it, skip to 4.3).
|
||||
|
||||
If, for some reason, you are unable to switch to UTF-8 immediately, you can
|
||||
switch HTML Purifier's encoding. Note that the availability of encodings is
|
||||
dependent on iconv, and you'll be missing characters if the charset you
|
||||
choose doesn't have them.
|
||||
* Am I using UTF-8?
|
||||
* Am I using XHTML 1.0 Transitional?
|
||||
|
||||
If you answered yes to any of these questions, instantiate a configuration
|
||||
object and read on:
|
||||
|
||||
$config = HTMLPurifier_Config::createDefault();
|
||||
|
||||
|
||||
|
||||
4.1. Setting a different character encoding
|
||||
|
||||
You really shouldn't use any other encoding except UTF-8, especially if you
|
||||
plan to support multilingual websites (read section three for more details).
|
||||
However, switching to UTF-8 is not always immediately feasible, so we can
|
||||
adapt.
|
||||
|
||||
HTML Purifier uses iconv to support other character encodings, as such,
|
||||
any encoding that iconv supports <http://www.gnu.org/software/libiconv/>
|
||||
HTML Purifier supports with this code:
|
||||
|
||||
$config->set('Core', 'Encoding', /* put your encoding here */);
|
||||
|
||||
An example usage for Latin-1 websites:
|
||||
An example usage for Latin-1 websites (the most common encoding for English
|
||||
websites):
|
||||
|
||||
$config->set('Core', 'Encoding', 'ISO-8859-1');
|
||||
|
||||
Note that HTML Purifier's support for non-Unicode encodings is crippled by the
|
||||
fact that any character not supported by that encoding will be silently
|
||||
dropped, EVEN if it is ampersand escaped. This is a current limitation of
|
||||
HTML Purifier that we are NOT actively working to fix. Patches are welcome,
|
||||
but there are so many other gotchas and problems in I18N for non-Unicode
|
||||
encodings that this functionality is low priority. See
|
||||
<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html> for a more
|
||||
detailed lowdown on the topic.
|
||||
|
||||
|
||||
|
||||
4.2. Setting a different doctype
|
||||
|
||||
For those of you stuck using HTML 4.01 Transitional, you can disable
|
||||
XHTML output like this:
|
||||
|
||||
$config->set('Core', 'XHTML', false);
|
||||
|
||||
However, I strongly recommend that you use XHTML. Currently, we can only
|
||||
guarantee transitional-complaint output, future versions will also allow strict
|
||||
output. There are more configuration directives which can be read about
|
||||
here: http://hp.jpsband.org/live/configdoc/plain.html
|
||||
I recommend that you use XHTML, although not as much as I recommend UTF-8. If
|
||||
your HTML 4.01 page validates, good for you!
|
||||
|
||||
Currently, we can only guarantee transitional-complaint output, future
|
||||
versions will also allow strict-compliant output.
|
||||
|
||||
|
||||
|
||||
3. Using the code
|
||||
4.3. Other settings
|
||||
|
||||
There are more configuration directives which can be read about
|
||||
here: <http://hp.jpsband.org/live/configdoc/plain.html> They're a bit boring,
|
||||
but they can help out for those of you who like to exert maximum control over
|
||||
your code.
|
||||
|
||||
|
||||
|
||||
5. Using the code
|
||||
|
||||
The interface is mind-numbingly simple:
|
||||
|
||||
$purifier = new HTMLPurifier();
|
||||
$clean_html = $purifier->purify( $dirty_html );
|
||||
|
||||
Or, if you're using the configuration object:
|
||||
...or, if you're using the configuration object:
|
||||
|
||||
$purifier = new HTMLPurifier($config);
|
||||
$clean_html = $purifier->purify( $dirty_html );
|
||||
|
||||
That's it. For more examples, check out docs/examples/. Also, SLOW gives
|
||||
advice on what to do if HTML Purifier is slowing down your application.
|
||||
That's it! For more examples, check out docs/examples/ (they aren't very
|
||||
different though). Also, SLOW gives advice on what to do if HTML Purifier
|
||||
is slowing down your application.
|
||||
|
||||
|
||||
|
||||
4. Quick install
|
||||
6. Quick install
|
||||
|
||||
If your website is in UTF-8 and XHTML Transitional, use this code:
|
||||
|
||||
<?php
|
||||
set_include_path('/path/to/htmlpurifier/library'
|
||||
. PATH_SEPARATOR . get_include_path() );
|
||||
require_once 'HTMLPurifier.php';
|
||||
$purifier = new HTMLPurifier();
|
||||
require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';
|
||||
|
||||
$purifier = new HTMLPurifier();
|
||||
$clean_html = $purifier->purify($dirty_html);
|
||||
?>
|
||||
|
||||
If your website is in a different encoding or doctype, use this code:
|
||||
|
||||
<?php
|
||||
set_include_path('/path/to/htmlpurifier/library'
|
||||
. PATH_SEPARATOR . get_include_path() );
|
||||
require_once 'HTMLPurifier.php';
|
||||
require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';
|
||||
|
||||
$config = HTMLPurifier_Config::createDefault();
|
||||
$config->set('Core', 'Encoding', 'ISO-8859-1'); //replace with your encoding
|
||||
|
57
NEWS
57
NEWS
@ -1,24 +1,37 @@
|
||||
NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
|
||||
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||
|
||||
1.1.1, released 2006-09-24
|
||||
- Various documentation updates
|
||||
- Fixed parse error in configuration documentation script
|
||||
- Fixed fatal error in benchmark scripts, slightly augmented
|
||||
- As far as possible, whitespace is preserved in-between table children
|
||||
- Configuration option to optionally Tidy up output for indentation to make up
|
||||
for dropped whitespace by DOMLex (pretty-printing for the entire application
|
||||
should be done by a page-wide Tidy)
|
||||
- Sample test-settings.php file included
|
||||
= KEY ====================
|
||||
! Feature
|
||||
- Bugfix
|
||||
+ Sub-comment
|
||||
. Internal change
|
||||
==========================
|
||||
|
||||
1.1.2, released 2006-09-30
|
||||
! Add HTMLPurifier.auto.php stub file that automatically configures pathx
|
||||
- Documentation updated
|
||||
+ INSTALL document rewritten
|
||||
+ TODO added semi-lossy conversion
|
||||
+ API Doxygen docs' file exclusions updated
|
||||
+ Added notes on HTML versus XML attribute whitespace handling
|
||||
+ Noted that HTMLPurifier_ChildDef_Custom isn't being used
|
||||
+ Noted that config object's definitions are cached versions
|
||||
- Fixed lack of attribute parsing in HTMLPurifier_Lexer_PEARSax3
|
||||
- ftp:// URIs now have their typecodes checked
|
||||
- Hooked up HTMLPurifier_ChildDef_Custom's unit tests (they weren't being run)
|
||||
. Line endings standardized throughout project (svn:eol-style standardized)
|
||||
. Refactored parseData() to general Lexer class
|
||||
. Tester named "HTML Purifier" not "HTMLPurifier"
|
||||
|
||||
1.1.0, released 2006-09-16
|
||||
! Directive documentation generation using XSLT
|
||||
! XHTML can now be turned off, output becomes <br>
|
||||
- Made URI validator more forgiving: will ignore leading and trailing
|
||||
quotes, apostrophes and less than or greater than signs.
|
||||
- Enforce alphanumeric namespace and directive names for configuration.
|
||||
- Directive documentation generation using XSLT
|
||||
- Table child definition made more flexible, will fix up poorly ordered elements
|
||||
- XHTML generation can now be turned off, allowing things like <br>
|
||||
- Renamed ConfigDef to ConfigSchema
|
||||
. Renamed ConfigDef to ConfigSchema
|
||||
|
||||
1.0.1, released 2006-09-04
|
||||
- Fixed slight bug in DOMLex attribute parsing
|
||||
@ -28,17 +41,17 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
|
||||
space in them. This manifested in TinyMCE.
|
||||
|
||||
1.0.0, released 2006-09-01
|
||||
! Shorthand CSS properties implemented: font, border, background, list-style
|
||||
! Basic color keywords translated into hexadecimal values
|
||||
! Table CSS properties implemented
|
||||
! Support for charsets other than UTF-8 (defined by iconv)
|
||||
! Malformed UTF-8 and non-SGML character detection and cleaning implemented
|
||||
- Fixed broken numeric entity conversion
|
||||
- Malformed UTF-8 and non-SGML character detection and cleaning implemented
|
||||
- API documentation completed
|
||||
- Shorthand CSS properties implemented: font, border, background, list-style
|
||||
- Basic color keywords translated into hexadecimal values
|
||||
- Table CSS properties implemented
|
||||
- (HTML|CSS)Definition de-singleton-ized
|
||||
- Support for charsets other than UTF-8 (defined by iconv)
|
||||
. (HTML|CSS)Definition de-singleton-ized
|
||||
|
||||
1.0.0beta, released 2006-08-16
|
||||
- First public release, most functionality implemented. Notable omissions are:
|
||||
. Shorthand CSS properties
|
||||
. Table CSS properties
|
||||
. Deprecated attribute transformations
|
||||
! First public release, most functionality implemented. Notable omissions are:
|
||||
+ Shorthand CSS properties
|
||||
+ Table CSS properties
|
||||
+ Deprecated attribute transformations
|
||||
|
2
TODO
2
TODO
@ -45,6 +45,8 @@ Unknown release (on a scratch-an-itch basis)
|
||||
empty-cells:show is applied to have compatibility with Internet Explorer
|
||||
- Non-lossy dumb alternate character encoding transformations, achieved by
|
||||
numerically encoding all non-ASCII characters
|
||||
- Semi-lossy dumb alternate character encoding transformations, achieved by
|
||||
encoding all characters that have string entity equivalents
|
||||
|
||||
Wontfix
|
||||
- Non-lossy smart alternate character encoding transformations
|
||||
|
10
library/HTMLPurifier.auto.php
Normal file
10
library/HTMLPurifier.auto.php
Normal file
@ -0,0 +1,10 @@
|
||||
<?php
|
||||
|
||||
/**
|
||||
* This is a stub include that automatically configures the include path.
|
||||
*/
|
||||
|
||||
set_include_path(dirname(__FILE__) . PATH_SEPARATOR . get_include_path() );
|
||||
require_once 'HTMLPurifier.php';
|
||||
|
||||
?>
|
@ -48,7 +48,16 @@ class HTMLPurifier_AttrDef
|
||||
*
|
||||
* @note This method is not entirely standards compliant, as trim() removes
|
||||
* more types of whitespace than specified in the spec. In practice,
|
||||
* this is rarely a problem.
|
||||
* this is rarely a problem, as those extra characters usually have
|
||||
* already been removed by HTMLPurifier_Encoder.
|
||||
*
|
||||
* @warning This processing is inconsistent with XML's whitespace handling
|
||||
* as specified by section 3.3.3 and referenced XHTML 1.0 section
|
||||
* 4.7. Compliant processing requires all line breaks normalized
|
||||
* to "\n", so the fix is not as simple as fixing it in this
|
||||
* function. Trim and whitespace collapsing are supposed to only
|
||||
* occur in NMTOKENs. However, note that we are NOT necessarily
|
||||
* parsing XML, thus, this behavior may still be correct.
|
||||
*
|
||||
* @public
|
||||
*/
|
||||
|
@ -56,6 +56,8 @@ class HTMLPurifier_ChildDef
|
||||
*
|
||||
* @warning Currently this class is an all or nothing proposition, that is,
|
||||
* it will only give a bool return value.
|
||||
* @note This class is currently not used by any code, although it is unit
|
||||
* tested.
|
||||
*/
|
||||
class HTMLPurifier_ChildDef_Custom extends HTMLPurifier_ChildDef
|
||||
{
|
||||
|
@ -26,12 +26,12 @@ class HTMLPurifier_Config
|
||||
var $def;
|
||||
|
||||
/**
|
||||
* Instance of HTMLPurifier_HTMLDefinition
|
||||
* Cached instance of HTMLPurifier_HTMLDefinition
|
||||
*/
|
||||
var $html_definition;
|
||||
|
||||
/**
|
||||
* Instance of HTMLPurifier_CSSDefinition
|
||||
* Cached instance of HTMLPurifier_CSSDefinition
|
||||
*/
|
||||
var $css_definition;
|
||||
|
||||
|
@ -60,6 +60,60 @@ class HTMLPurifier_Lexer
|
||||
$this->_entity_parser = new HTMLPurifier_EntityParser();
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* Most common entity to raw value conversion table for special entities.
|
||||
* @protected
|
||||
*/
|
||||
var $_special_entity2str =
|
||||
array(
|
||||
'"' => '"',
|
||||
'&' => '&',
|
||||
'<' => '<',
|
||||
'>' => '>',
|
||||
''' => "'",
|
||||
''' => "'",
|
||||
''' => "'"
|
||||
);
|
||||
|
||||
/**
|
||||
* Parses special entities into the proper characters.
|
||||
*
|
||||
* This string will translate escaped versions of the special characters
|
||||
* into the correct ones.
|
||||
*
|
||||
* @warning
|
||||
* You should be able to treat the output of this function as
|
||||
* completely parsed, but that's only because all other entities should
|
||||
* have been handled previously in substituteNonSpecialEntities()
|
||||
*
|
||||
* @param $string String character data to be parsed.
|
||||
* @returns Parsed character data.
|
||||
*/
|
||||
function parseData($string) {
|
||||
|
||||
// following functions require at least one character
|
||||
if ($string === '') return '';
|
||||
|
||||
// subtracts amps that cannot possibly be escaped
|
||||
$num_amp = substr_count($string, '&') - substr_count($string, '& ') -
|
||||
($string[strlen($string)-1] === '&' ? 1 : 0);
|
||||
|
||||
if (!$num_amp) return $string; // abort if no entities
|
||||
$num_esc_amp = substr_count($string, '&');
|
||||
$string = strtr($string, $this->_special_entity2str);
|
||||
|
||||
// code duplication for sake of optimization, see above
|
||||
$num_amp_2 = substr_count($string, '&') - substr_count($string, '& ') -
|
||||
($string[strlen($string)-1] === '&' ? 1 : 0);
|
||||
|
||||
if ($num_amp_2 <= $num_esc_amp) return $string;
|
||||
|
||||
// hmm... now we have some uncommon entities. Use the callback.
|
||||
$string = $this->_entity_parser->substituteSpecialEntities($string);
|
||||
return $string;
|
||||
}
|
||||
|
||||
var $_encoder;
|
||||
|
||||
/**
|
||||
|
@ -12,64 +12,12 @@ require_once 'HTMLPurifier/Lexer.php';
|
||||
* completely eventually.
|
||||
*
|
||||
* @todo Reread XML spec and document differences.
|
||||
* @todo Add support for CDATA sections.
|
||||
* @todo Determine correct behavior in outputting comment data. (preserve dashes?)
|
||||
* @todo Optimize main function tokenizeHTML().
|
||||
* @todo Less than sign (<) being prohibited (even as entity) in attr-values?
|
||||
*
|
||||
* @todo Determine correct behavior in transforming comment data. (preserve dashes?)
|
||||
*/
|
||||
class HTMLPurifier_Lexer_DirectLex extends HTMLPurifier_Lexer
|
||||
{
|
||||
|
||||
/**
|
||||
* Most common entity to raw value conversion table for special entities.
|
||||
* @protected
|
||||
*/
|
||||
var $_special_entity2str =
|
||||
array(
|
||||
'"' => '"',
|
||||
'&' => '&',
|
||||
'<' => '<',
|
||||
'>' => '>',
|
||||
''' => "'",
|
||||
''' => "'",
|
||||
''' => "'"
|
||||
);
|
||||
|
||||
/**
|
||||
* Parses special entities into the proper characters.
|
||||
*
|
||||
* This string will translate escaped versions of the special characters
|
||||
* into the correct ones.
|
||||
*
|
||||
* @warning
|
||||
* You should be able to treat the output of this function as
|
||||
* completely parsed, but that's only because all other entities should
|
||||
* have been handled previously in substituteNonSpecialEntities()
|
||||
*
|
||||
* @param $string String character data to be parsed.
|
||||
* @returns Parsed character data.
|
||||
*/
|
||||
function parseData($string) {
|
||||
|
||||
// subtracts amps that cannot possibly be escaped
|
||||
$num_amp = substr_count($string, '&') - substr_count($string, '& ') -
|
||||
($string[strlen($string)-1] === '&' ? 1 : 0);
|
||||
|
||||
if (!$num_amp) return $string; // abort if no entities
|
||||
$num_esc_amp = substr_count($string, '&');
|
||||
$string = strtr($string, $this->_special_entity2str);
|
||||
|
||||
// code duplication for sake of optimization, see above
|
||||
$num_amp_2 = substr_count($string, '&') - substr_count($string, '& ') -
|
||||
($string[strlen($string)-1] === '&' ? 1 : 0);
|
||||
|
||||
if ($num_amp_2 <= $num_esc_amp) return $string;
|
||||
|
||||
// hmm... now we have some uncommon entities. Use the callback.
|
||||
$string = $this->_entity_parser->substituteSpecialEntities($string);
|
||||
return $string;
|
||||
}
|
||||
|
||||
/**
|
||||
* Whitespace characters for str(c)spn.
|
||||
* @protected
|
||||
|
@ -18,6 +18,8 @@ require_once 'HTMLPurifier/Lexer.php';
|
||||
* whatever it does for poorly formed HTML is up to it.
|
||||
*
|
||||
* @todo Generalize so that XML_HTMLSax is also supported.
|
||||
*
|
||||
* @warning Entity-resolution inside attributes is broken.
|
||||
*/
|
||||
|
||||
class HTMLPurifier_Lexer_PEARSax3 extends HTMLPurifier_Lexer
|
||||
@ -41,6 +43,8 @@ class HTMLPurifier_Lexer_PEARSax3 extends HTMLPurifier_Lexer
|
||||
$parser->set_element_handler('openHandler','closeHandler');
|
||||
$parser->set_data_handler('dataHandler');
|
||||
$parser->set_escape_handler('escapeHandler');
|
||||
|
||||
// doesn't seem to work correctly for attributes
|
||||
$parser->set_option('XML_OPTION_ENTITIES_PARSED', 1);
|
||||
|
||||
$parser->parse($string);
|
||||
@ -53,6 +57,10 @@ class HTMLPurifier_Lexer_PEARSax3 extends HTMLPurifier_Lexer
|
||||
* Open tag event handler, interface is defined by PEAR package.
|
||||
*/
|
||||
function openHandler(&$parser, $name, $attrs, $closed) {
|
||||
// entities are not resolved in attrs
|
||||
foreach ($attrs as $key => $attr) {
|
||||
$attrs[$key] = $this->parseData($attr);
|
||||
}
|
||||
if ($closed) {
|
||||
$this->tokens[] = new HTMLPurifier_Token_Empty($name, $attrs);
|
||||
} else {
|
||||
|
@ -4,7 +4,6 @@ require_once 'HTMLPurifier/URIScheme.php';
|
||||
|
||||
/**
|
||||
* Validates ftp (File Transfer Protocol) URIs as defined by generic RFC 1738.
|
||||
* @todo Typecode check on path
|
||||
*/
|
||||
class HTMLPurifier_URIScheme_ftp extends HTMLPurifier_URIScheme {
|
||||
|
||||
@ -16,7 +15,27 @@ class HTMLPurifier_URIScheme_ftp extends HTMLPurifier_URIScheme {
|
||||
list($userinfo, $host, $port, $path, $query) =
|
||||
parent::validateComponents(
|
||||
$userinfo, $host, $port, $path, $query, $config );
|
||||
// typecode check needed on path
|
||||
$semicolon_pos = strrpos($path, ';'); // reverse
|
||||
if ($semicolon_pos !== false) {
|
||||
// typecode check
|
||||
$type = substr($path, $semicolon_pos + 1); // no semicolon
|
||||
$path = substr($path, 0, $semicolon_pos);
|
||||
$type_ret = '';
|
||||
if (strpos($type, '=') !== false) {
|
||||
// figure out whether or not the declaration is correct
|
||||
list($key, $typecode) = explode('=', $type, 2);
|
||||
if ($key !== 'type') {
|
||||
// invalid key, tack it back on encoded
|
||||
$path .= '%3B' . $type;
|
||||
} elseif ($typecode === 'a' || $typecode === 'i' || $typecode === 'd') {
|
||||
$type_ret = ";type=$typecode";
|
||||
}
|
||||
} else {
|
||||
$path .= '%3B' . $type;
|
||||
}
|
||||
$path = str_replace(';', '%3B', $path);
|
||||
$path .= $type_ret;
|
||||
}
|
||||
return array($userinfo, $host, $port, $path, null);
|
||||
}
|
||||
|
||||
|
@ -46,18 +46,23 @@ class HTMLPurifier_ChildDefTest extends UnitTestCase
|
||||
$this->def = new HTMLPurifier_ChildDef_Custom(
|
||||
'(a, b?, c*, d+, (a, b)*)');
|
||||
|
||||
$inputs = array();
|
||||
$expect = array();
|
||||
$config = array();
|
||||
|
||||
$inputs[0] = '';
|
||||
$expect[0] = false;
|
||||
|
||||
$inputs[1] = '<a /><b /><c /><d /><a /><b />';
|
||||
$expect[1] = true;
|
||||
|
||||
$inputs[2] = '<a /><d>Dob</d><a /><b>foo</b><a href="moo"><b>foo</b>';
|
||||
$inputs[2] = '<a /><d>Dob</d><a /><b>foo</b><a href="moo" /><b>foo</b>';
|
||||
$expect[2] = true;
|
||||
|
||||
$inputs[3] = '<a /><a />';
|
||||
$expect[3] = false;
|
||||
|
||||
$this->assertSeries($inputs, $expect, $config);
|
||||
}
|
||||
|
||||
function test_table() {
|
||||
|
@ -8,6 +8,7 @@ class HTMLPurifier_ConfigTest extends UnitTestCase
|
||||
var $our_copy, $old_copy;
|
||||
|
||||
function setUp() {
|
||||
// set up a dummy schema object for testing
|
||||
$our_copy = new HTMLPurifier_ConfigSchema();
|
||||
$this->old_copy = HTMLPurifier_ConfigSchema::instance();
|
||||
$this->our_copy =& HTMLPurifier_ConfigSchema::instance($our_copy);
|
||||
@ -93,6 +94,17 @@ class HTMLPurifier_ConfigTest extends UnitTestCase
|
||||
|
||||
}
|
||||
|
||||
function test_getDefinition() {
|
||||
|
||||
$config = HTMLPurifier_Config::createDefault();
|
||||
$def = $config->getHTMLDefinition();
|
||||
$this->assertIsA($def, 'HTMLPurifier_HTMLDefinition');
|
||||
|
||||
$def = $config->getCSSDefinition();
|
||||
$this->assertIsA($def, 'HTMLPurifier_CSSDefinition');
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
?>
|
@ -11,24 +11,6 @@ class HTMLPurifier_Lexer_DirectLexTest extends UnitTestCase
|
||||
$this->DirectLex = new HTMLPurifier_Lexer_DirectLex();
|
||||
}
|
||||
|
||||
function test_parseData() {
|
||||
$HP =& $this->DirectLex;
|
||||
|
||||
$this->assertIdentical('asdf', $HP->parseData('asdf'));
|
||||
$this->assertIdentical('&', $HP->parseData('&'));
|
||||
$this->assertIdentical('"', $HP->parseData('"'));
|
||||
$this->assertIdentical("'", $HP->parseData('''));
|
||||
$this->assertIdentical("'", $HP->parseData('''));
|
||||
$this->assertIdentical('&&&', $HP->parseData('&&&'));
|
||||
$this->assertIdentical('&&', $HP->parseData('&&')); // [INVALID]
|
||||
$this->assertIdentical('Procter & Gamble',
|
||||
$HP->parseData('Procter & Gamble')); // [INVALID]
|
||||
|
||||
// This is not special, thus not converted. Test of fault tolerance,
|
||||
// realistically speaking, this should never happen
|
||||
$this->assertIdentical('-', $HP->parseData('-'));
|
||||
}
|
||||
|
||||
// internals testing
|
||||
function test_parseAttributeString() {
|
||||
|
||||
|
@ -38,6 +38,25 @@ class HTMLPurifier_LexerTest extends UnitTestCase
|
||||
$this->assertIdentical($extract, $result);
|
||||
}
|
||||
|
||||
function test_parseData() {
|
||||
$HP =& $this->Lexer;
|
||||
|
||||
$this->assertIdentical('asdf', $HP->parseData('asdf'));
|
||||
$this->assertIdentical('&', $HP->parseData('&'));
|
||||
$this->assertIdentical('"', $HP->parseData('"'));
|
||||
$this->assertIdentical("'", $HP->parseData('''));
|
||||
$this->assertIdentical("'", $HP->parseData('''));
|
||||
$this->assertIdentical('&&&', $HP->parseData('&&&'));
|
||||
$this->assertIdentical('&&', $HP->parseData('&&')); // [INVALID]
|
||||
$this->assertIdentical('Procter & Gamble',
|
||||
$HP->parseData('Procter & Gamble')); // [INVALID]
|
||||
|
||||
// This is not special, thus not converted. Test of fault tolerance,
|
||||
// realistically speaking, this should never happen
|
||||
$this->assertIdentical('-', $HP->parseData('-'));
|
||||
}
|
||||
|
||||
|
||||
function test_extractBody() {
|
||||
$this->assertExtractBody('<b>Bold</b>');
|
||||
$this->assertExtractBody('<html><body><b>Bold</b></body></html>', '<b>Bold</b>');
|
||||
@ -249,13 +268,16 @@ class HTMLPurifier_LexerTest extends UnitTestCase
|
||||
,new HTMLPurifier_Token_Text('Link')
|
||||
,new HTMLPurifier_Token_End('a')
|
||||
);
|
||||
$sax_expect[16] = false; // PEARSax doesn't support it!
|
||||
|
||||
// test that UTF-8 is preserved
|
||||
$char_hearts = $this->_entity_lookup->table['hearts'];
|
||||
$input[17] = $char_hearts;
|
||||
$expect[17] = array( new HTMLPurifier_Token_Text($char_hearts) );
|
||||
|
||||
// test weird characters in attributes
|
||||
$input[18] = '<br test="x < 6" />';
|
||||
$expect[18] = array( new HTMLPurifier_Token_Empty('br', array('test' => 'x < 6')) );
|
||||
|
||||
$default_config = HTMLPurifier_Config::createDefault();
|
||||
foreach($input as $i => $discard) {
|
||||
if (!isset($config[$i])) $config[$i] = $default_config;
|
||||
|
@ -54,12 +54,34 @@ class HTMLPurifier_URISchemeTest extends UnitTestCase
|
||||
|
||||
$scheme = new HTMLPurifier_URIScheme_ftp();
|
||||
$config = HTMLPurifier_Config::createDefault();
|
||||
|
||||
$this->assertIdentical(
|
||||
$scheme->validateComponents(
|
||||
'user', 'www.example.com', 21, '/', 's=foobar', $config),
|
||||
array('user', 'www.example.com', null, '/', null)
|
||||
);
|
||||
|
||||
// valid typecode
|
||||
$this->assertIdentical(
|
||||
$scheme->validateComponents(
|
||||
null, 'www.example.com', null, '/file.txt;type=a', null, $config),
|
||||
array(null, 'www.example.com', null, '/file.txt;type=a', null)
|
||||
);
|
||||
|
||||
// remove invalid typecode
|
||||
$this->assertIdentical(
|
||||
$scheme->validateComponents(
|
||||
null, 'www.example.com', null, '/file.txt;type=z', null, $config),
|
||||
array(null, 'www.example.com', null, '/file.txt', null)
|
||||
);
|
||||
|
||||
// encode errant semicolons
|
||||
$this->assertIdentical(
|
||||
$scheme->validateComponents(
|
||||
null, 'www.example.com', null, '/too;many;semicolons=1', null, $config),
|
||||
array(null, 'www.example.com', null, '/too%3Bmany%3Bsemicolons=1', null)
|
||||
);
|
||||
|
||||
}
|
||||
|
||||
function test_news() {
|
||||
|
Loading…
Reference in New Issue
Block a user