From 30d75c999d24aed4364ea88880d340678f6ee5db Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
Date: Sun, 17 Sep 2006 22:08:48 +0000
Subject: [PATCH] Merged r434:436 from trunk/ to branches/1.1 - Update documentation. - Update TODO.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/branches/1.1@437 48356398-32a2-884e-a903-53898d9a118a
---
 TODO                                  |  3 +++
 docs/code-quality.txt                 | 22 +++++++++++-----------
 docs/optimization.txt                 |  3 ++-
 docs/security.txt                     | 27 ++++++++++++++++++---------
 library/HTMLPurifier/AttrContext.php  | 16 ++++++++--------
 library/HTMLPurifier/ConfigSchema.php |  5 -----
 library/HTMLPurifier/EntityParser.php |  1 -
 library/HTMLPurifier/Lexer/DOMLex.php |  9 +++------
 8 files changed, 45 insertions(+), 41 deletions(-)

diff --git a/TODO b/TODO
index b8b2caa1..71ffb20c 100644
--- a/TODO
+++ b/TODO
@@ -9,6 +9,7 @@ Ongoing
  - Additional support for poorly written HTML
  - Implement all non-essential attribute transforms
  - Microsoft Word HTML cleaning (i.e. MsoNormal)
+ - Error logging for filtering and cleanup procedures
 
 1.3 release
  - Formatters for plaintext
@@ -41,6 +42,8 @@ Unknown release (on a scratch-an-itch basis)
  - Pretty-printing HTML (adds dependency of Generator to HTMLDefinition)
  - Non-lossy dumb alternate character encoding transformations, achieved by
    numerically encoding all non-ASCII characters
+ - Preservation of indentation in tables (tricky since the contents can be
+   shuffled around)
 
 Wontfix
  - Non-lossy smart alternate character encoding transformations
diff --git a/docs/code-quality.txt b/docs/code-quality.txt
index 18b6a7dd..5b54b699 100644
--- a/docs/code-quality.txt
+++ b/docs/code-quality.txt
@@ -11,24 +11,24 @@ profiling.
 Here we go:
 
 AttrDef
-  Class - doesn't support Unicode characters, uses regular expressions
-  Lang - code duplication, premature optimization, doesn't consult official
-    lists
-  Pixels/Length/MultiLength - implemented according to HTML spec (excludes
-    code reuse in CSS)
-  URI - multiple regular expressions, needs host validation routines factored
-    out for mailto scheme, IPv6 validation is broken (fringe), unintuitive
-    variable overwriting, missing validation for query, fragment and path,
+  Class - doesn't support Unicode characters (fringe); uses regular
+    expressions
+  Lang - code duplication; premature optimization; doesn't consult official
+    lists (fringe)
+  Length - easily mistaken for CSSLength
+  URI - multiple regular expressions; needs host validation routines factored
+    out for mailto scheme; missing validation for query; fragment and path,
     no percent-encode fixing
   CSS - parser doesn't accept advanced CSS (fringe)
   Number - constructor interface is inconsistent with Integer
-AttrTransform - doesn't accept AttrContext, non-validating
-ChildDef - not-allowed nodes translated to text, likely invalid handling
+AttrTransform - doesn't accept AttrContext
 Config - "load configuration" hooks missing, rich set* accessors missing
+ConfigSchema - redefinition is a mess
 Strategy
   FixNesting - cannot bubble nodes out of structures
   MakeWellFormed - insufficient automatic closing definitions (check HTML
-    spec for optional end tags).
+    spec for optional end tags, also, closing based on type (block/inline)
+    might be efficient).
   RemoveForeignElements - should be run in parallel with MakeWellFormed
 URIScheme - needs to have callable generic checks
   ftp - missing typecode check
diff --git a/docs/optimization.txt b/docs/optimization.txt
index 84c49c85..49a51794 100644
--- a/docs/optimization.txt
+++ b/docs/optimization.txt
@@ -2,7 +2,8 @@
 Optimization
 
 Here are some possible optimization techniques we can apply to code sections if
-they turn out to be slow. Be sure not to prematurely optimize though!
+they turn out to be slow. Be sure not to prematurely optimize: if you get
+that itch, put it here!
 
 - Make Tokens Flyweights (may prove problematic, probably not worth it)
 - Rewrite regexps into PHP code
diff --git a/docs/security.txt b/docs/security.txt
index d5b71295..695853d5 100644
--- a/docs/security.txt
+++ b/docs/security.txt
@@ -6,30 +6,39 @@ through negligence of people. This class will do its job: no more, no less,
 and it's up to you to provide it the proper information and proper context
 to be effective. Things to remember:
 
-1. UTF-8. Currently, the parser runs under the assumption that it is dealing
+1. Character Encoding: UTF-8.
+Currently, the parser runs under the assumption that it is dealing
 with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no
 character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as
-your character encoding, you should switch. Now. Make sure any input is
-properly converted to UTF-8, or the parser will mangle it badly
-(though it won't be a security risk if you're outputting it as UTF-8 though).
+your character encoding, make sure you configure HTML Purifier or switch
+to UTF-8. Now. Also, make sure any input is properly converted to UTF-8, or
+the parser will mangle it badly (though it won't be a security risk if you're
+outputting it as UTF-8 though). Character encoding is, in general, a knotty
+issue, but do yourself a favor and learn about it:
+
 
-2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
+2. Doctype: XHTML 1.0 Transitional
+This is what the parser is outputting. For the most
 part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
 that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
 has waaaay too many quirks for a little parser to handle. We did not select
 strict in order to prevent ourselves from being too draconic on users, but
-this may be configurable in the future.
+this may be configurable in the future. Do you want standards compliance?
+The doctype is a good place to start.
 
-3. IDs. They need to be unique, but without some knowledge of the
+3. IDs
+They need to be unique, but without some knowledge of the
 rest of the document, it's difficult to know what's unique. %Attr.IDBlacklist
 needs to be set: we may want to consider disallowing IDs by default to save
 lazy programmers.
 
-4. [PROJECTED] Links. We're not going to try for spam protection (although
+4. [PROJECTED] Links
+We're not going to try for spam protection (although
 some hooks for such a module might be nice) but we may offer the ability to
 only accept relative URLs. Pick the one that's right for you.
 
-5. CSS. While we can prevent the most flagrant cases from affecting your
+5. CSS
+While we can prevent the most flagrant cases from affecting your
 layout (such as absolutely positioned elements), no amount of code is going
 to protect your pages from being attacked by garish colors and plain old
 bad taste. A neat feature would be the ability to define acceptable colors
diff --git a/library/HTMLPurifier/AttrContext.php b/library/HTMLPurifier/AttrContext.php
index c3316737..5f0c1679 100644
--- a/library/HTMLPurifier/AttrContext.php
+++ b/library/HTMLPurifier/AttrContext.php
@@ -3,15 +3,15 @@
 /**
  * Internal data-structure used in attribute validation to accumulate state.
  *
- * All it is is a data-structure that holds objects that accumulate state, like
- * HTMLPurifier_IDAccumulator.
+ * This is a data-structure that holds objects that accumulate state, like
+ * HTMLPurifier_IDAccumulator. It's better than using globals!
  *
- * @param Many functions that accept this object have it as a mandatory
- *        parameter, even when there is no use for it. Though this is
- *        for the same reasons as why HTMLPurifier_Config is a mandatory
- *        parameter, it is also because you cannot assign a default value
- *        to a parameter passed by reference (passing by reference is essential
- *        for context to work in PHP 4).
+ * @note Many functions that accept this object have it as a mandatory
+ *       parameter, even when there is no use for it. Though this is
+ *       for the same reasons as why HTMLPurifier_Config is a mandatory
+ *       parameter, it is also because you cannot assign a default value
+ *       to a parameter passed by reference (passing by reference is essential
+ *       for context to work in PHP 4).
  */
 
 class HTMLPurifier_AttrContext
diff --git a/library/HTMLPurifier/ConfigSchema.php b/library/HTMLPurifier/ConfigSchema.php
index 270fd8fd..99806bd9 100644
--- a/library/HTMLPurifier/ConfigSchema.php
+++ b/library/HTMLPurifier/ConfigSchema.php
@@ -2,7 +2,6 @@
 
 /**
  * Configuration definition, defines directives and their defaults.
- * @todo Build documentation generation capabilities.
  * @todo The ability to define things multiple times is confusing and should
  *       be factored out to its own function named registerDependency() or
  *       addNote(), where only the namespace.name and an extra descriptions
@@ -39,7 +38,6 @@ class HTMLPurifier_ConfigSchema {
 
     /**
      * Lookup table of allowed types.
-     * @todo Add descriptions
      */
     var $types = array(
         'string' => 'String',
@@ -82,9 +80,6 @@ class HTMLPurifier_ConfigSchema {
     /**
      * Defines a directive for configuration
      * @warning Will fail of directive's namespace is defined
-     * @todo Collect information on description and allow redefinition
-     *       so that multiple files can register a dependency on a
-     *       configuration directive.
      * @param $namespace Namespace the directive is in
      * @param $name Key of directive
      * @param $default Default value of directive
diff --git a/library/HTMLPurifier/EntityParser.php b/library/HTMLPurifier/EntityParser.php
index 83593f7a..478b6ba2 100644
--- a/library/HTMLPurifier/EntityParser.php
+++ b/library/HTMLPurifier/EntityParser.php
@@ -88,7 +88,6 @@ class HTMLPurifier_EntityParser
      * either index 1, 2 or 3 set with a hex value, dec value,
      * or string (respectively).
      * @returns Replacement string.
-     * @todo Implement string translations
      */
 
     // +----------+----------+----------+----------+
diff --git a/library/HTMLPurifier/Lexer/DOMLex.php b/library/HTMLPurifier/Lexer/DOMLex.php
index 2df01f52..a24d1014 100644
--- a/library/HTMLPurifier/Lexer/DOMLex.php
+++ b/library/HTMLPurifier/Lexer/DOMLex.php
@@ -12,15 +12,12 @@ require_once 'HTMLPurifier/TokenFactory.php';
  * documents, it performs twenty times faster than
  * HTMLPurifier_Lexer_DirectLex,and is the default choice for PHP 5.
  *
- * @notice
- * Any empty elements will have empty tokens associated with them, even if
+ * @note Any empty elements will have empty tokens associated with them, even if
  * this is prohibited by the spec. This is cannot be fixed until the spec
  * comes into play.
  *
- * @todo Determine DOM's entity parsing behavior, point to local entity files
- *       if necessary.
- * @todo Make div access less fragile, and refrain from preprocessing when
- *       HTML tag and friends are already present.
+ * @note PHP's DOM extension does not actually parse any entities, we use
+ *       our own function to do that.
  */
 
 class HTMLPurifier_Lexer_DOMLex extends HTMLPurifier_Lexer