mirror of https://github.com/ezyang/htmlpurifier.git synced 2025-01-18 11:41:52 +00:00

Update documentation.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@319 48356398-32a2-884e-a903-53898d9a118a
Edward Z. Yang 2006-08-25 03:01:16 +00:00
parent dcec92e7b3
commit ca1453401f
6 changed files with 36 additions and 19 deletions

TODO

@@ -3,23 +3,32 @@ Todo List
Core: Core:
- Finish table and shorthand CSS attributes - Finish table and shorthand CSS attributes
- border-collapse, caption-side, empty-cells, table-layout, vertical-align - border-collapse, caption-side, empty-cells, table-layout, vertical-align
- background - background (and friends)
- border, border-* - border, border-*
- font - font
- list-style - list-style
- Implement all non-essential attribute transforms - Implement all non-essential attribute transforms
- Microsoft Word HTML cleaning - Microsoft Word HTML cleaning
- Plugins for major CMSes - Plugins for major CMSes
- Rewrite *Definition and Config relationship, add various "levels" of cleaning
- Support other character encodings out-of-the-box
- Allow strict HTML 4.01, loose HTML 4.01 and strict XHTML 1.0 output
Code issues: Code issues:
- Massive profiling, make it faster! - Massive profiling, make it faster!
- Make URI validation routines tighter (especially mailto) - Make URI validation routines tighter (especially mailto)
- Distinguish between different types of URIs, for instance, a mailto URI - Distinguish between different types of URIs, for instance, a mailto URI
in IMG SRC is nonsensical in IMG SRC is nonsensical
- Factor out Host validation to its own AttrDef - Rewrite table's child definition to be faster, smart, and regexp free
- Rewrite table's child definition - Silently drop content inbetween SCRIPT tags (can be generalized to allow
- Silently drop content inbetween SCRIPT tags specification of elements that, when detected as foreign, trigger removal
of children, although unbalanced tags could wreck havoc (or at least delete
the rest of the document).
Enhancements: Enhancements:
- Do fixes for Firefox's inability to handle COL alignment props (Bug 915) - Fixes for Firefox's inability to handle COL alignment props (Bug 915)
- Pretty-printing HTML - Pretty-printing HTML
- Hooks for adding custom processors to custom namespaced tags and attributes,
offer default implementation
- Auto-paragraphing (be sure to leverage fact that we know when things
shouldn't be paragraphed, such as lists and tables).
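
The SCRIPT item in the hunk above describes dropping an element together with its children once it is detected as foreign, and warns that unbalanced tags could delete the rest of the document. A minimal Python sketch of that idea follows. It is not HTML Purifier's code: the token shape, the function name `drop_foreign_children` and the `drop_with_children` parameter are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical token shape for this sketch:
# ("start", name), ("end", name) or ("text", data).
Token = Tuple[str, str]


def drop_foreign_children(tokens: Iterable[Token],
                          drop_with_children: Set[str]) -> List[Token]:
    """Remove each listed element together with everything nested inside it."""
    kept: List[Token] = []
    skipping: Optional[str] = None  # name of the element being discarded
    depth = 0                       # nesting depth of that same element name
    for kind, value in tokens:
        if skipping is None:
            if kind == "start" and value in drop_with_children:
                skipping, depth = value, 1
            else:
                kept.append((kind, value))
        elif kind == "start" and value == skipping:
            depth += 1
        elif kind == "end" and value == skipping:
            depth -= 1
            if depth == 0:
                skipping = None
        # Any other token seen while skipping is silently dropped.
    return kept


balanced = [("text", "a"), ("start", "script"), ("text", "alert(1)"),
            ("end", "script"), ("text", "b")]
unbalanced = [("text", "a"), ("start", "script"), ("text", "x"), ("text", "b")]

print(drop_foreign_children(balanced, {"script"}))
print(drop_foreign_children(unbalanced, {"script"}))
```

In the unbalanced case the end tag never arrives, so the filter discards everything after the opening tag. That is the hazard the TODO entry mentions.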


@@ -21,13 +21,15 @@ AttrDef
variable overwriting, missing validation for query, fragment and path, variable overwriting, missing validation for query, fragment and path,
no percent-encode fixing no percent-encode fixing
CSS - parser doesn't accept advanced CSS (fringe) CSS - parser doesn't accept advanced CSS (fringe)
Number - constructor interface is inconsistent with Integer
AttrTransform - doesn't accept AttrContext, non-validating AttrTransform - doesn't accept AttrContext, non-validating
Lang - invalid xml:lang value can overwrite valid lang value (fringe)
ChildDef - not-allowed nodes translated to text, likely invalid handling ChildDef - not-allowed nodes translated to text, likely invalid handling
Config - "load configuration" hooks missing, rich set* accessors missing Config - "load configuration" hooks missing, rich set* accessors missing,
needs redefined relationship with the definitions
Strategy Strategy
FixNesting - cannot bubble nodes out of structures FixNesting - cannot bubble nodes out of structures
MakeWellFormed - insufficient automatic closing definitions MakeWellFormed - insufficient automatic closing definitions (check HTML
spec for optional end tags).
RemoveForeignElements - should be run in parallel with MakeWellFormed RemoveForeignElements - should be run in parallel with MakeWellFormed
URIScheme - needs to have callable generic checks URIScheme - needs to have callable generic checks
ftp - missing typecode check ftp - missing typecode check


@@ -28,6 +28,7 @@ time. Note the naming convention: %Namespace.Directive
%Attr.MaxWidth, %Attr.MaxWidth,
%Attr.MaxHeight - caps for width and height related checks. %Attr.MaxHeight - caps for width and height related checks.
(a hack in Pixels for an image crashing attack could be replaced by this)
%URI.Munge - will munge all URIs to a different URI, which should redirect %URI.Munge - will munge all URIs to a different URI, which should redirect
the user to the applicable page. A urlencoded version of the URI the user to the applicable page. A urlencoded version of the URI


@@ -17,6 +17,8 @@ are passed. These classes are: HTMLPurifier::*, Generator::generateFromTokens
and Lexer::tokenizeHTML. However, whenever a valid configuration object and Lexer::tokenizeHTML. However, whenever a valid configuration object
is defined, that object should be used. is defined, that object should be used.
-- the following is projected changes to the configuration system --
In relation to HTMLDefinition and CSSDefinition, there are going to be some In relation to HTMLDefinition and CSSDefinition, there are going to be some
major structural changes to enable the easy configuration of these objects. major structural changes to enable the easy configuration of these objects.
Due to the intricacy of these objects, it's not feasible to ask an average Due to the intricacy of these objects, it's not feasible to ask an average


@@ -9,11 +9,11 @@ to be effective. Things to remember:
1. UTF-8. Currently, the parser runs under the assumption that it is dealing 1. UTF-8. Currently, the parser runs under the assumption that it is dealing
with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no
character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as
your character encoding, you should switch. Now. (in future versions, however, your character encoding, you should switch. Now. Make sure any input is
I may make the character encoding configurable, but there's only so much I properly converted to UTF-8, or the parser will mangle it badly
can do). Make sure any input is properly converted to UTF-8, or the parser (though it won't be a security risk if you're outputting it as UTF-8 though).
will mangle it badly (though it won't be a security risk if you're outputting We will be adding out-of-the-box support for the other major character
it as UTF-8 though). encodings shortly.
2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most 2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
part, it's compatible with HTML 4.01, but XHTML enforces some very nice things part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
@@ -23,8 +23,9 @@ strict in order to prevent ourselves from being too draconic on users, but
this may be configurable in the future. this may be configurable in the future.
3. IDs. They need to be unique, but without some knowledge of the 3. IDs. They need to be unique, but without some knowledge of the
rest of the document, it's difficult to know what's unique. Without setting rest of the document, it's difficult to know what's unique. %Attr.IDBlacklist
%Attr.IDBlacklist to the proper needs to be set: we may want to consider disallowing IDs by default to
save lazy programmers.
4. [PROJECTED] Links. We're not going to try for spam protection (although 4. [PROJECTED] Links. We're not going to try for spam protection (although
some hooks for such a module might be nice) but we may offer the ability to some hooks for such a module might be nice) but we may offer the ability to
@@ -36,4 +37,4 @@ to protect your pages from being attacked by garish colors and plain old
bad taste. A neat feature would be the ability to define acceptable colors bad taste. A neat feature would be the ability to define acceptable colors
in a document, but that's not likely to be implemented for a while. In the in a document, but that's not likely to be implemented for a while. In the
meantime, be sure to make sure that floated elements (permitted, since they meantime, be sure to make sure that floated elements (permitted, since they
can be quite useful) cna't mess up your layout. can be quite useful) can't mess up your layout.
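
Item 1 above asks callers to convert input to UTF-8 before it reaches the parser. A minimal Python sketch of such a conversion step follows, assuming the caller knows (or guesses) the source encoding. The helper name `to_utf8` and its default encoding are hypothetical and not part of HTML Purifier. Treating input that already decodes as UTF-8 as UTF-8 is a heuristic, not a guarantee.

```python
def to_utf8(raw: bytes, source_encoding: str = "windows-1252") -> bytes:
    """Return raw re-encoded as UTF-8.

    source_encoding is the caller's best knowledge of the input encoding
    (an assumption of this sketch); it is only used when raw is not
    already valid UTF-8.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, pass through unchanged
    except UnicodeDecodeError:
        pass
    # Undecodable bytes become U+FFFD instead of raising an error.
    return raw.decode(source_encoding, errors="replace").encode("utf-8")


latin1_input = b"caf\xe9"      # "cafe" with e-acute in ISO-8859-1
utf8_input = b"caf\xc3\xa9"    # the same text in UTF-8

print(to_utf8(latin1_input, "iso-8859-1"))
print(to_utf8(utf8_input))
```

Running the conversion before purification keeps the parser's UTF-8 assumption true for every input, whatever encoding the form or database delivered.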


@@ -29,7 +29,8 @@ output is valid XHTML or send the HTML through a draconic XML parser (and yet
still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a> still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a>
tags from being nested within each other). tags from being nested within each other).
This document seeks to detail the inner workings of HTML Purifier. The first This document no longer is a detailed description of how HTMLPurifier works,
as those descriptions have been moved to the appropriate code. The first
draft was drawn up after two rough code sketches and the implementation of a draft was drawn up after two rough code sketches and the implementation of a
forgiving lexer. You may also be interested in the unit tests located in the forgiving lexer. You may also be interested in the unit tests located in the
tests/ folder, which provide a living document on how exactly the filter deals tests/ folder, which provide a living document on how exactly the filter deals
@@ -52,4 +53,5 @@ In summary:
HTML Purifier is best suited for documents that require a rich array of HTML Purifier is best suited for documents that require a rich array of
HTML tags. Things like blog comments are, in all likelihood, most appropriately HTML tags. Things like blog comments are, in all likelihood, most appropriately
written in an extremely restrictive set of markup that doesn't require written in an extremely restrictive set of markup that doesn't require
all this functionality (or not written in HTML at all). all this functionality (or not written in HTML at all), although this may
be changing in the future.