mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-12-22 08:21:52 +00:00
Update documentation.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@319 48356398-32a2-884e-a903-53898d9a118a
This commit is contained in:
parent
dcec92e7b3
commit
ca1453401f
21
TODO
21
TODO
@ -3,23 +3,32 @@ Todo List
|
||||
Core:
|
||||
- Finish table and shorthand CSS attributes
|
||||
- border-collapse, caption-side, empty-cells, table-layout, vertical-align
|
||||
- background
|
||||
- background (and friends)
|
||||
- border, border-*
|
||||
- font
|
||||
- list-style
|
||||
- Implement all non-essential attribute transforms
|
||||
- Microsoft Word HTML cleaning
|
||||
- Plugins for major CMSes
|
||||
- Rewrite *Definition and Config relationship, add various "levels" of cleaning
|
||||
- Support other character encodings out-of-the-box
|
||||
- Allow strict HTML 4.01, loose HTML 4.01 and strict XHTML 1.0 output
|
||||
|
||||
Code issues:
|
||||
- Massive profiling, make it faster!
|
||||
- Make URI validation routines tighter (especially mailto)
|
||||
- Distinguish between different types of URIs, for instance, a mailto URI
|
||||
in IMG SRC is nonsensical
|
||||
- Factor out Host validation to its own AttrDef
|
||||
- Rewrite table's child definition
|
||||
- Silently drop content inbetween SCRIPT tags
|
||||
- Rewrite table's child definition to be faster, smart, and regexp free
|
||||
- Silently drop content inbetween SCRIPT tags (can be generalized to allow
|
||||
specification of elements that, when detected as foreign, trigger removal
|
||||
of children, although unbalanced tags could wreck havoc (or at least delete
|
||||
the rest of the document).
|
||||
|
||||
Enhancements:
|
||||
- Do fixes for Firefox's inability to handle COL alignment props (Bug 915)
|
||||
- Pretty-printing HTML
|
||||
- Fixes for Firefox's inability to handle COL alignment props (Bug 915)
|
||||
- Pretty-printing HTML
|
||||
- Hooks for adding custom processors to custom namespaced tags and attributes,
|
||||
offer default implementation
|
||||
- Auto-paragraphing (be sure to leverage fact that we know when things
|
||||
shouldn't be paragraphed, such as lists and tables).
|
||||
|
@ -21,13 +21,15 @@ AttrDef
|
||||
variable overwriting, missing validation for query, fragment and path,
|
||||
no percent-encode fixing
|
||||
CSS - parser doesn't accept advanced CSS (fringe)
|
||||
Number - constructor interface is inconsistent with Integer
|
||||
AttrTransform - doesn't accept AttrContext, non-validating
|
||||
Lang - invalid xml:lang value can overwrite valid lang value (fringe)
|
||||
ChildDef - not-allowed nodes translated to text, likely invalid handling
|
||||
Config - "load configuration" hooks missing, rich set* accessors missing
|
||||
Config - "load configuration" hooks missing, rich set* accessors missing,
|
||||
needs redefined relationship with the definitions
|
||||
Strategy
|
||||
FixNesting - cannot bubble nodes out of structures
|
||||
MakeWellFormed - insufficient automatic closing definitions
|
||||
MakeWellFormed - insufficient automatic closing definitions (check HTML
|
||||
spec for optional end tags).
|
||||
RemoveForeignElements - should be run in parallel with MakeWellFormed
|
||||
URIScheme - needs to have callable generic checks
|
||||
ftp - missing typecode check
|
||||
|
@ -28,6 +28,7 @@ time. Note the naming convention: %Namespace.Directive
|
||||
|
||||
%Attr.MaxWidth,
|
||||
%Attr.MaxHeight - caps for width and height related checks.
|
||||
(a hack in Pixels for an image crashing attack could be replaced by this)
|
||||
|
||||
%URI.Munge - will munge all URIs to a different URI, which should redirect
|
||||
the user to the applicable page. A urlencoded version of the URI
|
||||
|
@ -17,6 +17,8 @@ are passed. These classes are: HTMLPurifier::*, Generator::generateFromTokens
|
||||
and Lexer::tokenizeHTML. However, whenever a valid configuration object
|
||||
is defined, that object should be used.
|
||||
|
||||
-- the following is projected changes to the configuration system --
|
||||
|
||||
In relation to HTMLDefinition and CSSDefinition, there are going to be some
|
||||
major structural changes to enable the easy configuration of these objects.
|
||||
Due to the intricacy of these objects, it's not feasible to ask an average
|
||||
|
@ -9,11 +9,11 @@ to be effective. Things to remember:
|
||||
1. UTF-8. Currently, the parser runs under the assumption that it is dealing
|
||||
with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no
|
||||
character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as
|
||||
your character encoding, you should switch. Now. (in future versions, however,
|
||||
I may make the character encoding configurable, but there's only so much I
|
||||
can do). Make sure any input is properly converted to UTF-8, or the parser
|
||||
will mangle it badly (though it won't be a security risk if you're outputting
|
||||
it as UTF-8 though).
|
||||
your character encoding, you should switch. Now. Make sure any input is
|
||||
properly converted to UTF-8, or the parser will mangle it badly
|
||||
(though it won't be a security risk if you're outputting it as UTF-8 though).
|
||||
We will be adding out-of-the-box support for the other major character
|
||||
encodings shortly.
|
||||
|
||||
2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
|
||||
part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
|
||||
@ -23,8 +23,9 @@ strict in order to prevent ourselves from being too draconic on users, but
|
||||
this may be configurable in the future.
|
||||
|
||||
3. IDs. They need to be unique, but without some knowledge of the
|
||||
rest of the document, it's difficult to know what's unique. Without setting
|
||||
%Attr.IDBlacklist to the proper
|
||||
rest of the document, it's difficult to know what's unique. %Attr.IDBlacklist
|
||||
needs to be set: we may want to consider disallowing IDs by default to
|
||||
save lazy programmers.
|
||||
|
||||
4. [PROJECTED] Links. We're not going to try for spam protection (although
|
||||
some hooks for such a module might be nice) but we may offer the ability to
|
||||
@ -36,4 +37,4 @@ to protect your pages from being attacked by garish colors and plain old
|
||||
bad taste. A neat feature would be the ability to define acceptable colors
|
||||
in a document, but that's not likely to be implemented for a while. In the
|
||||
meantime, be sure to make sure that floated elements (permitted, since they
|
||||
can be quite useful) cna't mess up your layout.
|
||||
can be quite useful) can't mess up your layout.
|
||||
|
@ -29,7 +29,8 @@ output is valid XHTML or send the HTML through a draconic XML parser (and yet
|
||||
still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a>
|
||||
tags from being nested within each other).
|
||||
|
||||
This document seeks to detail the inner workings of HTML Purifier. The first
|
||||
This document no longer is a detailed description of how HTMLPurifier works,
|
||||
as those descriptions have been moved to the appropriate code. The first
|
||||
draft was drawn up after two rough code sketches and the implementation of a
|
||||
forgiving lexer. You may also be interested in the unit tests located in the
|
||||
tests/ folder, which provide a living document on how exactly the filter deals
|
||||
@ -52,4 +53,5 @@ In summary:
|
||||
HTML Purifier is best suited for documents that require a rich array of
|
||||
HTML tags. Things like blog comments are, in all likelihood, most appropriately
|
||||
written in an extremely restrictive set of markup that doesn't require
|
||||
all this functionality (or not written in HTML at all).
|
||||
all this functionality (or not written in HTML at all), although this may
|
||||
be changing in the future.
|
||||
|
Loading…
Reference in New Issue
Block a user