mirror of https://github.com/ezyang/htmlpurifier.git synced 2025-01-18 11:41:52 +00:00

Update docs.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@254 48356398-32a2-884e-a903-53898d9a118a
Edward Z. Yang 2006-08-14 21:21:54 +00:00
parent 218eb67167
commit 4ef26bbd31
2 changed files with 14 additions and 114 deletions


@ -13,24 +13,27 @@ your character encoding, you should switch. Now. (in future versions, however,
I may make the character encoding configurable, but there's only so much I
can do). Make sure any input is properly converted to UTF-8, or the parser
will mangle it badly (though it won't be a security risk if you're outputting
it as UTF-8).
it as UTF-8 though).
2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
has waaaay too many quirks for a little parser to handle. We did not select
strict in order to prevent ourselves from being too draconic on users.
strict in order to prevent ourselves from being too draconic on users, but
this may be configurable in the future.
3. [PROJECTED] IDs. They need to be unique, but without some knowledge of the
rest of the document, it's difficult to know what's unique. I project default
behavior being a customizable prefix to all ID declarations in the document,
so make sure you don't use that prefix. Might cause problems for multiple
instances of HTML escaped output too (especially when it comes to caching).
Best to just zap them completely, perhaps. This will be configurable, and you'll
have to pick the correct one.
3. IDs. They need to be unique, but without some knowledge of the
rest of the document, it's difficult to know what's unique. Without setting
%Attr.IDBlacklist to the proper
4. [PROJECTED] Links. We're not going to try for spam protection (although
some hooks for such a module might be nice) but we may offer the ability to
only accept relative URLs. Pick the one that's right for you.
5. [PROJECTED] CSS. What a knotty issue. Probably will have to be configurable.
5. CSS. While we can prevent the most flagrant cases from affecting your
layout (such as absolutely positioned elements), no amount of code is going
to protect your pages from being attacked by garish colors and plain old
bad taste. A neat feature would be the ability to define acceptable colors
in a document, but that's not likely to be implemented for a while. In the
meantime, make sure that floated elements (permitted, since they
can be quite useful) can't mess up your layout.


@ -53,106 +53,3 @@ HTML Purifier is best suited for documents that require a rich array of
HTML tags. Things like blog comments are, in all likelihood, most appropriately
written in an extremely restrictive set of markup that doesn't require
all this functionality (or not written in HTML at all).
The rest of this document is pending a move into the associated classes.
== STAGE 4 - check attributes ==
STATUS: F (currently implementing core/i18n)
While we're doing all this nesting hocus-pocus, attributes are also being
checked. This has to happen together with the nesting stuff because
if a REQUIRED attribute is not there, we might need to kill the tag (or
replace it with data). Fortunately, this is rare enough that we only have
to worry about it for certain things:
* ! bdo - dir > replace with span, preserve attributes
* ! img - src, alt > if only alt is missing, insert filename, else remove img
* basefont - size
* param - name
* applet - width, height
* map - id
* area - alt
* form - action
* optgroup - label
* textarea - rows, cols
As you can see, we would only remotely consider two of them for our simplified
tag set. But each has a different set of challenges. For the img tag, we'd
have to be careful about deleting it. If we do hit a snag, we can supply
a default "blank" image.
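To make the rules above concrete, here is a minimal sketch in Python (not the project's PHP, and not its actual code). The function name `fix_required` and the dict-of-attributes representation are made up for illustration. Only the bdo and img rules come from the list above; dropping the tag in every other case is an assumption.

```python
import posixpath
from urllib.parse import urlsplit

# Required attributes per element, copied from the list above.
REQUIRED = {
    "bdo": ["dir"],
    "img": ["src", "alt"],
    "basefont": ["size"],
    "param": ["name"],
    "applet": ["width", "height"],
    "map": ["id"],
    "area": ["alt"],
    "form": ["action"],
    "optgroup": ["label"],
    "textarea": ["rows", "cols"],
}


def fix_required(tag, attrs):
    """Return a repaired (tag, attrs) pair, or None if the tag should go."""
    missing = [name for name in REQUIRED.get(tag, []) if name not in attrs]
    if not missing:
        return tag, dict(attrs)
    if tag == "bdo":
        # bdo without dir: replace with span, preserve attributes.
        return "span", dict(attrs)
    if tag == "img" and missing == ["alt"]:
        # Only alt is missing: insert the filename taken from src.
        filename = posixpath.basename(urlsplit(attrs["src"]).path)
        return tag, {**attrs, "alt": filename}
    # Assumption: anything else with a missing required attribute is dropped.
    return None
```

Supplying a default "blank" image instead of returning None for img would be a one-line change in the last branch.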
So after that's all said and done, each of the different types of content
inside the attributes needs to be handled differently.
ContentType(s)  [RFC2045]
Charset(s)      [RFC2045]
LanguageCode    [RFC3066] (NMTOKEN)
Character       [XML][2.2] (a single character)
Number          /^\d+$/
LinkTypes       [HTML][6.12] <space>
MediaDesc       [HTML][6.13] <comma>
URI/UriList     [RFC2396] <space>
Datetime        (ISO date format)
Script          ...
StyleSheet      [CSS] (complex)
Text            CDATA
FrameTarget     NMTOKEN
Length          (pixel, percentage) (?:px suffix allowed?)
MultiLength     (pixel, percentage, or relative)
Pixels          (integer)
// map attributes omitted
ImgAlign        (top|middle|bottom|left|right)
Color           #NNNNNN, #NNN or color name (translate it
    Black   = #000000    Green  = #008000
    Silver  = #C0C0C0    Lime   = #00FF00
    Gray    = #808080    Olive  = #808000
    White   = #FFFFFF    Yellow = #FFFF00
    Maroon  = #800000    Navy   = #000080
    Red     = #FF0000    Blue   = #0000FF
    Purple  = #800080    Teal   = #008080
    Fuchsia = #FF00FF    Aqua   = #00FFFF
    // plus some directly in the spec
Everything else is either ID, or defined as a certain set of values.
Unless we use reflection (in which case we have to make sure the attribute exists),
we probably want to have a function like...
validate($type, $value) where $type is something like ContentType or Number,
and then pass it to a switch.
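A rough sketch of what such a function could look like, written in Python rather than PHP, with an if-chain standing in for the switch. Only a few of the types above are covered. Returning None for invalid values, matching ImgAlign and color names case-insensitively, and translating color names to hex are all assumptions, not decisions this document makes.

```python
import re

# The sixteen color names from the table above.
COLOR_NAMES = {
    "black": "#000000", "green": "#008000",
    "silver": "#C0C0C0", "lime": "#00FF00",
    "gray": "#808080", "olive": "#808000",
    "white": "#FFFFFF", "yellow": "#FFFF00",
    "maroon": "#800000", "navy": "#000080",
    "red": "#FF0000", "blue": "#0000FF",
    "purple": "#800080", "teal": "#008080",
    "fuchsia": "#FF00FF", "aqua": "#00FFFF",
}
IMG_ALIGN = ("top", "middle", "bottom", "left", "right")


def validate(type_name, value):
    """Return the cleaned value, or None if it fails validation."""
    value = value.strip()
    if type_name in ("Number", "Pixels"):
        # Equivalent of /^\d+$/, restricted to ASCII digits.
        return value if re.fullmatch(r"[0-9]+", value) else None
    if type_name == "ImgAlign":
        lowered = value.lower()
        return lowered if lowered in IMG_ALIGN else None
    if type_name == "Color":
        if value.lower() in COLOR_NAMES:
            return COLOR_NAMES[value.lower()]
        if re.fullmatch(r"#(?:[0-9A-Fa-f]{3}|[0-9A-Fa-f]{6})", value):
            return value
        return None
    # Types not sketched here (URI, StyleSheet, ...) are left out on purpose.
    raise ValueError("unhandled attribute type: " + type_name)
```

For example, `validate("Color", "Fuchsia")` gives back the hex form from the table, while `validate("Number", "4a")` gives None.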
The final problem is CSS. Get intimate with the syntax here:
http://www.w3.org/TR/CSS21/syndata.html and also note the "bad" CSS elements
that HTML_Safe defines to help determine a whitelist.
----
<!ENTITY % coreattrs
"id ID #IMPLIED
class CDATA #IMPLIED
style %StyleSheet; #IMPLIED
title %Text; #IMPLIED"
>
<!ENTITY % i18n
"lang %LanguageCode; #IMPLIED
xml:lang %LanguageCode; #IMPLIED
dir (ltr|rtl) #IMPLIED"
>
<!ENTITY % attrs "%coreattrs; %i18n;">
----
These are the elements that only have %attrs:
ul, dl, dt, dd, address, span, em, strong, dfn, code, samp, kbd, var,
cite, abbr, acronym, sub, sup, tt, i, b, big, small, u, s, strike
These are the elements that only have %attrs and need an alignment transform:
div, p, h1, h2, h3, h4, h5, h6
----
Prepend style transformations, as CSS takes precedence.
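One way to read that note, as a minimal Python sketch: an align attribute is turned into a text-align declaration and put in front of any existing style value, so a text-align the author wrote in style comes later and wins. The function name `transform_align`, the accepted alignment values, and silently dropping an invalid align are assumptions made for illustration.

```python
ALIGN_VALUES = ("left", "right", "center", "justify")


def transform_align(attrs):
    """Move a deprecated align attribute into the front of the style attribute."""
    attrs = dict(attrs)  # work on a copy
    align = attrs.pop("align", None)
    if align is None:
        return attrs
    align = align.strip().lower()
    if align not in ALIGN_VALUES:
        # Assumption: an unrecognized align value is simply discarded.
        return attrs
    # Prepend, so declarations already in style override the transformed one.
    attrs["style"] = "text-align:" + align + ";" + attrs.get("style", "")
    return attrs
```

With this, `<p align="center" style="text-align:left;">` ends up with `text-align:center;text-align:left;`, and the explicit CSS still decides the result.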