Update documentation.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@319 48356398-32a2-884e-a903-53898d9a118a
2024-12-22 16:31:53 +00:00 · 2006-08-25 03:01:16 +00:00 · ca1453401f
commit ca1453401f
parent dcec92e7b3
6 changed files with 36 additions and 19 deletions
--- a/19
+++ b/19
@@ -3,23 +3,32 @@ Todo List
 Core:
 - Finish table and shorthand CSS attributes
    - border-collapse, caption-side, empty-cells, table-layout, vertical-align
-    - background
+    - background (and friends)
    - border, border-*
    - font
    - list-style
 - Implement all non-essential attribute transforms
 - Microsoft Word HTML cleaning
 - Plugins for major CMSes
+ - Rewrite *Definition and Config relationship, add various "levels" of cleaning
+ - Support other character encodings out-of-the-box
+ - Allow strict HTML 4.01, loose HTML 4.01 and strict XHTML 1.0 output

 Code issues:
 - Massive profiling, make it faster!
 - Make URI validation routines tighter (especially mailto)
 - Distinguish between different types of URIs, for instance, a mailto URI
   in IMG SRC is nonsensical
- - Factor out Host validation to its own AttrDef
- - Rewrite table's child definition
- - Silently drop content inbetween SCRIPT tags
+ - Rewrite table's child definition to be faster, smarter, and regexp-free
+ - Silently drop content in between SCRIPT tags (can be generalized to allow
+   specification of elements that, when detected as foreign, trigger removal
+   of children, although unbalanced tags could wreak havoc, or at least delete
+   the rest of the document).

 Enhancements:
- - Do fixes for Firefox's inability to handle COL alignment props (Bug 915)
+ - Fixes for Firefox's inability to handle COL alignment props (Bug 915)
 - Pretty-printing HTML
+ - Hooks for adding custom processors to custom namespaced tags and attributes,
+   offer default implementation
+ - Auto-paragraphing (be sure to leverage fact that we know when things
+   shouldn't be paragraphed, such as lists and tables).
--- a/docs/code-quality.txt
+++ b/docs/code-quality.txt
@@ -21,13 +21,15 @@ AttrDef
        variable overwriting, missing validation for query, fragment and path,
        no percent-encode fixing
    CSS - parser doesn't accept advanced CSS (fringe)
+    Number - constructor interface is inconsistent with Integer
 AttrTransform - doesn't accept AttrContext, non-validating
-    Lang - invalid xml:lang value can overwrite valid lang value (fringe)
 ChildDef - not-allowed nodes translated to text, likely invalid handling
-Config - "load configuration" hooks missing, rich set* accessors missing
+Config - "load configuration" hooks missing, rich set* accessors missing,
+    needs redefined relationship with the definitions
 Strategy
    FixNesting - cannot bubble nodes out of structures
-    MakeWellFormed - insufficient automatic closing definitions
+    MakeWellFormed - insufficient automatic closing definitions (check HTML
+        spec for optional end tags).
    RemoveForeignElements - should be run in parallel with MakeWellFormed
 URIScheme - needs to have callable generic checks
    ftp - missing typecode check
--- a/docs/config-ideas.txt
+++ b/docs/config-ideas.txt
@@ -28,6 +28,7 @@ time.  Note the naming convention: %Namespace.Directive

 %Attr.MaxWidth, 
 %Attr.MaxHeight - caps for width and height related checks.
+    (a hack in Pixels for an image crashing attack could be replaced by this)

 %URI.Munge - will munge all URIs to a different URI, which should redirect
    the user to the applicable page. A urlencoded version of the URI
--- a/docs/config.txt
+++ b/docs/config.txt
@@ -17,6 +17,8 @@ are passed.  These classes are: HTMLPurifier::*, Generator::generateFromTokens
 and Lexer::tokenizeHTML.  However, whenever a valid configuration object
 is defined, that object should be used.

+-- the following are projected changes to the configuration system --
+
 In relation to HTMLDefinition and CSSDefinition, there are going to be some
 major structural changes to enable the easy configuration of these objects.
 Due to the intricacy of these objects, it's not feasible to ask an average
--- a/docs/security.txt
+++ b/docs/security.txt
@@ -9,11 +9,11 @@ to be effective. Things to remember:
 1. UTF-8. Currently, the parser runs under the assumption that it is dealing
 with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no
 character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as
-your character encoding, you should switch. Now. (in future versions, however,
-I may make the character encoding configurable, but there's only so much I
-can do). Make sure any input is properly converted to UTF-8, or the parser
-will mangle it badly (though it won't be a security risk if you're outputting
-it as UTF-8 though).
+your character encoding, you should switch. Now. Make sure any input is
+properly converted to UTF-8, or the parser will mangle it badly
+(though it won't be a security risk if you're outputting it as UTF-8).
+We will be adding out-of-the-box support for the other major character
+encodings shortly.

 2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
 part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
@@ -23,8 +23,9 @@ strict in order to prevent ourselves from being too draconic on users, but
 this may be configurable in the future.

 3. IDs. They need to be unique, but without some knowledge of the
-rest of the document, it's difficult to know what's unique. Without setting
-%Attr.IDBlacklist to the proper 
+rest of the document, it's difficult to know what's unique. %Attr.IDBlacklist
+needs to be set: we may want to consider disallowing IDs by default to
+save lazy programmers.

 4. [PROJECTED] Links. We're not going to try for spam protection (although
 some hooks for such a module might be nice) but we may offer the ability to
@@ -36,4 +37,4 @@ to protect your pages from being attacked by garish colors and plain old
 bad taste.  A neat feature would be the ability to define acceptable colors
 in a document, but that's not likely to be implemented for a while.  In the
 meantime, be sure to make sure that floated elements (permitted, since they
-can be quite useful) cna't mess up your layout.
+can be quite useful) can't mess up your layout.
--- a/docs/spec.txt
+++ b/docs/spec.txt
@@ -29,7 +29,8 @@ output is valid XHTML or send the HTML through a draconic XML parser (and yet
 still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a>
 tags from being nested within each other).

-This document seeks to detail the inner workings of HTML Purifier.  The first
+This document is no longer a detailed description of how HTMLPurifier works,
+as those descriptions have been moved to the appropriate code.  The first
 draft was drawn up after two rough code sketches and the implementation of a
 forgiving lexer.  You may also be interested in the unit tests located in the
 tests/ folder, which provide a living document on how exactly the filter deals
@@ -52,4 +53,5 @@ In summary:
 HTML Purifier is best suited for documents that require a rich array of
 HTML tags.  Things like blog comments are, in all likelihood, most appropriately
 written in an extremely restrictive set of markup that doesn't require
-all this functionality (or not written in HTML at all).
+all this functionality (or not written in HTML at all), although this may
+be changing in the future.
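
Note on the UTF-8 advice in docs/security.txt above: the change tells callers to convert input to UTF-8 before purifying, but does not show how. A minimal sketch of that conversion step follows. It is written in Python purely as an illustration; HTML Purifier itself is PHP, and the function name `to_utf8` is hypothetical and not part of the library. It assumes the caller already knows the source encoding.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode raw bytes from source_encoding to UTF-8.

    Hypothetical helper, not part of HTML Purifier. Decoding is strict,
    so bytes that are invalid in source_encoding raise UnicodeDecodeError
    instead of being passed on and mangled later.
    """
    text = raw.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


if __name__ == "__main__":
    legacy = b"caf\xe9"  # "café" encoded as Windows-1252
    print(to_utf8(legacy, "windows-1252"))
```

Strict decoding is a deliberate choice here: rejecting input that does not match its declared encoding is usually safer than guessing, though a caller could choose to substitute replacement characters instead.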