mirror of https://github.com/ezyang/htmlpurifier.git synced 2025-01-23 13:51:54 +00:00

Merged r434:436 from trunk/ to branches/1.1

- Update documentation.
- Update TODO.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/branches/1.1@437 48356398-32a2-884e-a903-53898d9a118a
Edward Z. Yang 2006-09-17 22:08:48 +00:00
parent 64d8ca9831
commit 30d75c999d
8 changed files with 45 additions and 41 deletions

TODO

@@ -9,6 +9,7 @@ Ongoing
 - Additional support for poorly written HTML
 - Implement all non-essential attribute transforms
 - Microsoft Word HTML cleaning (i.e. MsoNormal)
+- Error logging for filtering and cleanup procedures
 1.3 release
 - Formatters for plaintext
@@ -41,6 +42,8 @@ Unknown release (on a scratch-an-itch basis)
 - Pretty-printing HTML (adds dependency of Generator to HTMLDefinition)
 - Non-lossy dumb alternate character encoding transformations, achieved by
 numerically encoding all non-ASCII characters
+- Preservation of indentation in tables (tricky since the contents can be
+shuffled around)
 Wontfix
 - Non-lossy smart alternate character encoding transformations
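
As a rough illustration of the "dumb" non-lossy transformation mentioned in the TODO entry above, the sketch below replaces every non-ASCII character of a UTF-8 string with a decimal numeric character reference, so the result can be carried in any ASCII-compatible encoding. The function name encode_non_ascii is hypothetical and not part of HTML Purifier, and the sketch assumes PHP's mbstring extension is available.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: numerically encode
// every non-ASCII character so that the result is pure ASCII.
function encode_non_ascii($utf8)
{
    // Conversion map: code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF.
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

// "caf" followed by U+00E9 (bytes C3 A9 in UTF-8).
echo encode_non_ascii("caf\xC3\xA9"), "\n"; // prints "caf&#233;"
```

ASCII input passes through unchanged, which is why the approach is called "dumb": it makes no attempt to use the target encoding's own repertoire.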

@@ -11,24 +11,24 @@ profiling.
 Here we go:
 AttrDef
-Class - doesn't support Unicode characters, uses regular expressions
-Lang - code duplication, premature optimization, doesn't consult official
-lists
-Pixels/Length/MultiLength - implemented according to HTML spec (excludes
-code reuse in CSS)
-URI - multiple regular expressions, needs host validation routines factored
-out for mailto scheme, IPv6 validation is broken (fringe), unintuitive
-variable overwriting, missing validation for query, fragment and path,
+Class - doesn't support Unicode characters (fringe); uses regular
+expressions
+Lang - code duplication; premature optimization; doesn't consult official
+lists (fringe)
+Length - easily mistaken for CSSLength
+URI - multiple regular expressions; needs host validation routines factored
+out for mailto scheme; missing validation for query; fragment and path,
 no percent-encode fixing
 CSS - parser doesn't accept advanced CSS (fringe)
 Number - constructor interface is inconsistent with Integer
-AttrTransform - doesn't accept AttrContext, non-validating
-ChildDef - not-allowed nodes translated to text, likely invalid handling
+AttrTransform - doesn't accept AttrContext
 Config - "load configuration" hooks missing, rich set* accessors missing
+ConfigSchema - redefinition is a mess
 Strategy
 FixNesting - cannot bubble nodes out of structures
 MakeWellFormed - insufficient automatic closing definitions (check HTML
-spec for optional end tags).
+spec for optional end tags, also, closing based on type (block/inline)
+might be efficient).
 RemoveForeignElements - should be run in parallel with MakeWellFormed
 URIScheme - needs to have callable generic checks
 ftp - missing typecode check

@@ -2,7 +2,8 @@
 Optimization
 Here are some possible optimization techniques we can apply to code sections if
-they turn out to be slow. Be sure not to prematurely optimize though!
+they turn out to be slow. Be sure not to prematurely optimize: if you get
+that itch, put it here!
 - Make Tokens Flyweights (may prove problematic, probably not worth it)
 - Rewrite regexps into PHP code

@@ -6,30 +6,39 @@ through negligence of people. This class will do its job: no more, no less,
 and it's up to you to provide it the proper information and proper context
 to be effective. Things to remember:
-1. UTF-8. Currently, the parser runs under the assumption that it is dealing
+1. Character Encoding: UTF-8.
+Currently, the parser runs under the assumption that it is dealing
 with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no
 character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as
-your character encoding, you should switch. Now. Make sure any input is
-properly converted to UTF-8, or the parser will mangle it badly
-(though it won't be a security risk if you're outputting it as UTF-8 though).
+your character encoding, make sure you configure HTML Purifier or switch
+to UTF-8. Now. Also, make sure any input is properly converted to UTF-8, or
+the parser will mangle it badly (though it won't be a security risk if you're
+outputting it as UTF-8 though). Character encoding is, in general, a knotty
+issue, but do yourself a favor and learn about it:
+<http://www.joelonsoftware.com/articles/Unicode.html>
-2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
+2. Doctype: XHTML 1.0 Transitional
+This is what the parser is outputting. For the most
 part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
 that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
 has waaaay too many quirks for a little parser to handle. We did not select
 strict in order to prevent ourselves from being too draconic on users, but
-this may be configurable in the future.
+this may be configurable in the future. Do you want standards compliance?
+The doctype is a good place to start.
-3. IDs. They need to be unique, but without some knowledge of the
+3. IDs
+They need to be unique, but without some knowledge of the
 rest of the document, it's difficult to know what's unique. %Attr.IDBlacklist
 needs to be set: we may want to consider disallowing IDs by default to
 save lazy programmers.
-4. [PROJECTED] Links. We're not going to try for spam protection (although
+4. [PROJECTED] Links
+We're not going to try for spam protection (although
 some hooks for such a module might be nice) but we may offer the ability to
 only accept relative URLs. Pick the one that's right for you.
-5. CSS. While we can prevent the most flagrant cases from affecting your
+5. CSS
+While we can prevent the most flagrant cases from affecting your
 layout (such as absolutely positioned elements), no amount of code is going
 to protect your pages from being attacked by garish colors and plain old
 bad taste. A neat feature would be the ability to define acceptable colors
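
To make item 3 of the security notes more concrete, here is a minimal sketch of why an ID blacklist matters: the purifier only sees a fragment, so IDs already used by the surrounding page have to be supplied from outside. The class IdTracker is hypothetical and is not the library's HTMLPurifier_IDAccumulator; it only illustrates the idea of seeding a "seen" set with blacklisted IDs and rejecting duplicates.

```php
<?php
// Hypothetical sketch, not HTML Purifier's actual implementation.
class IdTracker
{
    // Map of ID => true for every ID that is already taken.
    private $seen = array();

    // $blacklist: IDs used elsewhere on the page, outside the fragment.
    public function __construct(array $blacklist = array())
    {
        foreach ($blacklist as $id) {
            $this->seen[$id] = true;
        }
    }

    // Records the ID and returns true if it is new; returns false if taken.
    public function add($id)
    {
        if (isset($this->seen[$id])) {
            return false;
        }
        $this->seen[$id] = true;
        return true;
    }
}

$ids = new IdTracker(array('header', 'footer'));
var_dump($ids->add('intro'));  // bool(true)
var_dump($ids->add('intro'));  // bool(false): duplicate within the fragment
var_dump($ids->add('header')); // bool(false): already used by the page
```

A caller would presumably drop the id attribute whenever add() returns false, which is one way to read the suggestion of disallowing IDs by default when no blacklist is configured.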

@@ -3,15 +3,15 @@
 /**
  * Internal data-structure used in attribute validation to accumulate state.
  *
- * All it is is a data-structure that holds objects that accumulate state, like
- * HTMLPurifier_IDAccumulator.
+ * This is a data-structure that holds objects that accumulate state, like
+ * HTMLPurifier_IDAccumulator. It's better than using globals!
  *
- * @param Many functions that accept this object have it as a mandatory
+ * @note Many functions that accept this object have it as a mandatory
  * parameter, even when there is no use for it. Though this is
  * for the same reasons as why HTMLPurifier_Config is a mandatory
  * parameter, it is also because you cannot assign a default value
  * to a parameter passed by reference (passing by reference is essential
  * for context to work in PHP 4).
  */
 class HTMLPurifier_AttrContext

@@ -2,7 +2,6 @@
 /**
  * Configuration definition, defines directives and their defaults.
- * @todo Build documentation generation capabilities.
  * @todo The ability to define things multiple times is confusing and should
  * be factored out to its own function named registerDependency() or
  * addNote(), where only the namespace.name and an extra descriptions
@@ -39,7 +38,6 @@ class HTMLPurifier_ConfigSchema {
 /**
  * Lookup table of allowed types.
- * @todo Add descriptions
  */
 var $types = array(
 'string' => 'String',
@@ -82,9 +80,6 @@ class HTMLPurifier_ConfigSchema {
 /**
  * Defines a directive for configuration
  * @warning Will fail of directive's namespace is defined
- * @todo Collect information on description and allow redefinition
- * so that multiple files can register a dependency on a
- * configuration directive.
  * @param $namespace Namespace the directive is in
  * @param $name Key of directive
  * @param $default Default value of directive

@@ -88,7 +88,6 @@ class HTMLPurifier_EntityParser
  * either index 1, 2 or 3 set with a hex value, dec value,
  * or string (respectively).
  * @returns Replacement string.
- * @todo Implement string translations
  */
 // +----------+----------+----------+----------+

@@ -12,15 +12,12 @@ require_once 'HTMLPurifier/TokenFactory.php';
  * documents, it performs twenty times faster than
  * HTMLPurifier_Lexer_DirectLex,and is the default choice for PHP 5.
  *
- * @notice
- * Any empty elements will have empty tokens associated with them, even if
+ * @note Any empty elements will have empty tokens associated with them, even if
  * this is prohibited by the spec. This is cannot be fixed until the spec
  * comes into play.
  *
- * @todo Determine DOM's entity parsing behavior, point to local entity files
- * if necessary.
- * @todo Make div access less fragile, and refrain from preprocessing when
- * HTML tag and friends are already present.
+ * @note PHP's DOM extension does not actually parse any entities, we use
+ * our own function to do that.
  */
 class HTMLPurifier_Lexer_DOMLex extends HTMLPurifier_Lexer