mirror of https://github.com/ezyang/htmlpurifier.git
synced 2025-01-23 13:51:54 +00:00
Merged r434:436 from trunk/ to branches/1.1

- Update documentation.
- Update TODO.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/branches/1.1@437 48356398-32a2-884e-a903-53898d9a118a
parent 64d8ca9831
commit 30d75c999d
3
TODO
@@ -9,6 +9,7 @@ Ongoing
  - Additional support for poorly written HTML
  - Implement all non-essential attribute transforms
  - Microsoft Word HTML cleaning (i.e. MsoNormal)
+ - Error logging for filtering and cleanup procedures

  1.3 release
  - Formatters for plaintext
@@ -41,6 +42,8 @@ Unknown release (on a scratch-an-itch basis)
  - Pretty-printing HTML (adds dependency of Generator to HTMLDefinition)
  - Non-lossy dumb alternate character encoding transformations, achieved by
  numerically encoding all non-ASCII characters
+ - Preservation of indentation in tables (tricky since the contents can be
+ shuffled around)

  Wontfix
  - Non-lossy smart alternate character encoding transformations
@@ -11,24 +11,24 @@ profiling.
  Here we go:

  AttrDef
- Class - doesn't support Unicode characters, uses regular expressions
- Lang - code duplication, premature optimization, doesn't consult official
- lists
- Pixels/Length/MultiLength - implemented according to HTML spec (excludes
- code reuse in CSS)
- URI - multiple regular expressions, needs host validation routines factored
- out for mailto scheme, IPv6 validation is broken (fringe), unintuitive
- variable overwriting, missing validation for query, fragment and path,
+ Class - doesn't support Unicode characters (fringe); uses regular
+ expressions
+ Lang - code duplication; premature optimization; doesn't consult official
+ lists (fringe)
+ Length - easily mistaken for CSSLength
+ URI - multiple regular expressions; needs host validation routines factored
+ out for mailto scheme; missing validation for query; fragment and path,
  no percent-encode fixing
  CSS - parser doesn't accept advanced CSS (fringe)
  Number - constructor interface is inconsistent with Integer
- AttrTransform - doesn't accept AttrContext, non-validating
- ChildDef - not-allowed nodes translated to text, likely invalid handling
+ AttrTransform - doesn't accept AttrContext
  Config - "load configuration" hooks missing, rich set* accessors missing
+ ConfigSchema - redefinition is a mess
  Strategy
  FixNesting - cannot bubble nodes out of structures
  MakeWellFormed - insufficient automatic closing definitions (check HTML
- spec for optional end tags).
+ spec for optional end tags, also, closing based on type (block/inline)
+ might be efficient).
  RemoveForeignElements - should be run in parallel with MakeWellFormed
  URIScheme - needs to have callable generic checks
  ftp - missing typecode check
@@ -2,7 +2,8 @@
  Optimization

  Here are some possible optimization techniques we can apply to code sections if
- they turn out to be slow. Be sure not to prematurely optimize though!
+ they turn out to be slow. Be sure not to prematurely optimize: if you get
+ that itch, put it here!

  - Make Tokens Flyweights (may prove problematic, probably not worth it)
  - Rewrite regexps into PHP code
@@ -6,30 +6,39 @@ through negligence of people. This class will do its job: no more, no less,
  and it's up to you to provide it the proper information and proper context
  to be effective. Things to remember:

- 1. UTF-8. Currently, the parser runs under the assumption that it is dealing
+ 1. Character Encoding: UTF-8.
+ Currently, the parser runs under the assumption that it is dealing
  with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no
  character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as
- your character encoding, you should switch. Now. Make sure any input is
- properly converted to UTF-8, or the parser will mangle it badly
- (though it won't be a security risk if you're outputting it as UTF-8 though).
+ your character encoding, make sure you configure HTML Purifier or switch
+ to UTF-8. Now. Also, make sure any input is properly converted to UTF-8, or
+ the parser will mangle it badly (though it won't be a security risk if you're
+ outputting it as UTF-8 though). Character encoding is, in general, a knotty
+ issue, but do yourself a favor and learn about it:
+ <http://www.joelonsoftware.com/articles/Unicode.html>

- 2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
+ 2. Doctype: XHTML 1.0 Transitional
+ This is what the parser is outputting. For the most
  part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
  that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
  has waaaay too many quirks for a little parser to handle. We did not select
  strict in order to prevent ourselves from being too draconic on users, but
- this may be configurable in the future.
+ this may be configurable in the future. Do you want standards compliance?
+ The doctype is a good place to start.

- 3. IDs. They need to be unique, but without some knowledge of the
+ 3. IDs
+ They need to be unique, but without some knowledge of the
  rest of the document, it's difficult to know what's unique. %Attr.IDBlacklist
  needs to be set: we may want to consider disallowing IDs by default to
  save lazy programmers.

- 4. [PROJECTED] Links. We're not going to try for spam protection (although
+ 4. [PROJECTED] Links
+ We're not going to try for spam protection (although
  some hooks for such a module might be nice) but we may offer the ability to
  only accept relative URLs. Pick the one that's right for you.

- 5. CSS. While we can prevent the most flagrant cases from affecting your
+ 5. CSS
+ While we can prevent the most flagrant cases from affecting your
  layout (such as absolutely positioned elements), no amount of code is going
  to protect your pages from being attacked by garish colors and plain old
  bad taste. A neat feature would be the ability to define acceptable colors
@@ -3,15 +3,15 @@
  /**
  * Internal data-structure used in attribute validation to accumulate state.
  *
- * All it is is a data-structure that holds objects that accumulate state, like
- * HTMLPurifier_IDAccumulator.
+ * This is a data-structure that holds objects that accumulate state, like
+ * HTMLPurifier_IDAccumulator. It's better than using globals!
  *
- * @param Many functions that accept this object have it as a mandatory
+ * @note Many functions that accept this object have it as a mandatory
  * parameter, even when there is no use for it. Though this is
  * for the same reasons as why HTMLPurifier_Config is a mandatory
  * parameter, it is also because you cannot assign a default value
  * to a parameter passed by reference (passing by reference is essential
  * for context to work in PHP 4).
  */

  class HTMLPurifier_AttrContext
@@ -2,7 +2,6 @@

  /**
  * Configuration definition, defines directives and their defaults.
- * @todo Build documentation generation capabilities.
  * @todo The ability to define things multiple times is confusing and should
  * be factored out to its own function named registerDependency() or
  * addNote(), where only the namespace.name and an extra descriptions
@@ -39,7 +38,6 @@ class HTMLPurifier_ConfigSchema {

  /**
  * Lookup table of allowed types.
- * @todo Add descriptions
  */
  var $types = array(
  'string' => 'String',
@@ -82,9 +80,6 @@ class HTMLPurifier_ConfigSchema {
  /**
  * Defines a directive for configuration
  * @warning Will fail of directive's namespace is defined
- * @todo Collect information on description and allow redefinition
- * so that multiple files can register a dependency on a
- * configuration directive.
  * @param $namespace Namespace the directive is in
  * @param $name Key of directive
  * @param $default Default value of directive
@@ -88,7 +88,6 @@ class HTMLPurifier_EntityParser
  * either index 1, 2 or 3 set with a hex value, dec value,
  * or string (respectively).
  * @returns Replacement string.
- * @todo Implement string translations
  */

  // +----------+----------+----------+----------+
@@ -12,15 +12,12 @@ require_once 'HTMLPurifier/TokenFactory.php';
  * documents, it performs twenty times faster than
  * HTMLPurifier_Lexer_DirectLex,and is the default choice for PHP 5.
  *
- * @notice
- * Any empty elements will have empty tokens associated with them, even if
+ * @note Any empty elements will have empty tokens associated with them, even if
  * this is prohibited by the spec. This is cannot be fixed until the spec
  * comes into play.
  *
- * @todo Determine DOM's entity parsing behavior, point to local entity files
- * if necessary.
- * @todo Make div access less fragile, and refrain from preprocessing when
- * HTML tag and friends are already present.
+ * @note PHP's DOM extension does not actually parse any entities, we use
+ * our own function to do that.
  */

  class HTMLPurifier_Lexer_DOMLex extends HTMLPurifier_Lexer