mirror of https://github.com/ezyang/htmlpurifier.git (synced 2025-01-18 11:41:52 +00:00)
Update docs.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@254 48356398-32a2-884e-a903-53898d9a118a
parent 218eb67167
commit 4ef26bbd31
@@ -13,24 +13,27 @@ your character encoding, you should switch. Now. (in future versions, however,
 I may make the character encoding configurable, but there's only so much I
 can do). Make sure any input is properly converted to UTF-8, or the parser
 will mangle it badly (though it won't be a security risk if you're outputting
-it as UTF-8).
+it as UTF-8 though).
 
 2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
 part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
 that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
 has waaaay too many quirks for a little parser to handle. We did not select
-strict in order to prevent ourselves from being too draconic on users.
+strict in order to prevent ourselves from being too draconic on users, but
+this may be configurable in the future.
 
-3. [PROJECTED] IDs. They need to be unique, but without some knowledge of the
-rest of the document, it's difficult to know what's unique. I project default
-behavior being a customizable prefix to all ID declarations in the document,
-so make sure you don't use that prefix. Might cause problems for multiple
-instances of HTML escaped output too (especially when it comes to caching).
-Best to just zap them completely, perhaps. This will be configurable, and you'll
-have to pick the correct one.
+3. IDs. They need to be unique, but without some knowledge of the
+rest of the document, it's difficult to know what's unique. Without setting
+%Attr.IDBlacklist to the proper
 
 4. [PROJECTED] Links. We're not going to try for spam protection (although
 some hooks for such a module might be nice) but we may offer the ability to
 only accept relative URLs. Pick the one that's right for you.
 
-5. [PROJECTED] CSS. What a knotty issue. Probably will have to be configurable.
+5. CSS. While we can prevent the most flagrant cases from affecting your
+layout (such as absolutely positioned elements), no amount of code is going
+to protect your pages from being attacked by garish colors and plain old
+bad taste. A neat feature would be the ability to define acceptable colors
+in a document, but that's not likely to be implemented for a while. In the
+meantime, be sure to make sure that floated elements (permitted, since they
+can be quite useful) can't mess up your layout.
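
Item 3 in the hunk above turns on keeping IDs unique when the purifier cannot see the rest of the page. The following is a minimal sketch of that idea only, written in Python rather than the library's PHP. The function name `filter_ids` and its `blacklist` and `prefix` parameters are hypothetical; how %Attr.IDBlacklist is actually consumed is an assumption here, not something the diff states.

```python
def filter_ids(ids, blacklist=(), prefix=""):
    """Return one entry per input ID: the cleaned ID, or None if it was dropped.

    Hypothetical helper, not HTML Purifier's API.
    blacklist: IDs assumed to be in use elsewhere on the host page.
    prefix:    optional string prepended to every user-supplied ID.
    """
    seen = set(blacklist)  # start with the IDs the host page already uses
    cleaned = []
    for raw in ids:
        candidate = prefix + raw
        if candidate in seen:
            # Duplicate or blacklisted: drop the attribute entirely.
            cleaned.append(None)
        else:
            seen.add(candidate)
            cleaned.append(candidate)
    return cleaned
```

With a blacklist of `["nav"]`, the input `["top", "nav", "top"]` keeps only the first `top`; adding a prefix such as `user-` moves every ID out of the host page's namespace, which is the approach the removed [PROJECTED] text describes.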
103 docs/spec.txt
@@ -53,106 +53,3 @@ HTML Purifier is best suited for documents that require a rich array of
 HTML tags. Things like blog comments are, in all likelihood, most appropriately
 written in an extremely restrictive set of markup that doesn't require
 all this functionality (or not written in HTML at all).
-
-The rest of this document is pending moving into their associated classes.
-
-== STAGE 4 - check attributes ==
-
-STATUS: F (currently implementing core/i18n)
-
-While we're doing all this nesting hocus-pocus, attributes are also being
-checked. The reason why we need this to be done with the nesting stuff
-is if a REQUIRED attribute is not there, we might need to kill the tag (or
-replace it with data). Fortunately, this is rare enough that we only have
-to worry about it for certain things:
-
-* ! bdo - dir > replace with span, preserve attributes
-* ! img - src, alt > if only alt is missing, insert filename, else remove img
-* basefont - size
-* param - name
-* applet - width, height
-* map - id
-* area - alt
-* form - action
-* optgroup - label
-* textarea - rows, cols
-
-As you can see, only two of them we would remotely consider for our simplified
-tag set. But each has a different set of challenges. For the img tag, we'd
-have to be careful about deleting it. If we do hit a snag, we can supply
-a default "blank" image.
-
-So after that's all said and done, each of the different types of content
-inside the attributes needs to be handled differently.
-
-ContentType(s)  [RFC2045]
-Charset(s)      [RFC2045]
-LanguageCode    [RFC3066] (NMTOKEN)
-Character       [XML][2.2] (a single character)
-Number          /^\d+$/
-LinkTypes       [HTML][6.12] <space>
-MediaDesc       [HTML][6.13] <comma>
-URI/UriList     [RFC2396] <space>
-Datetime        (ISO date format)
-Script          ...
-StyleSheet      [CSS] (complex)
-Text            CDATA
-FrameTarget     NMTOKEN
-Length          (pixel, percentage) (?:px suffix allowed?)
-MultiLength     (pixel, percentage, or relative)
-Pixels          (integer)
-// map attributes omitted
-ImgAlign        (top|middle|bottom|left|right)
-Color           #NNNNNN, #NNN or color name (translate it
-                Black   = #000000    Green  = #008000
-                Silver  = #C0C0C0    Lime   = #00FF00
-                Gray    = #808080    Olive  = #808000
-                White   = #FFFFFF    Yellow = #FFFF00
-                Maroon  = #800000    Navy   = #000080
-                Red     = #FF0000    Blue   = #0000FF
-                Purple  = #800080    Teal   = #008080
-                Fuchsia = #FF00FF    Aqua   = #00FFFF
-                // plus some directly in the spec
-
-Everything else is either ID, or defined as a certain set of values.
-
-Unless we use reflection (which then we have to make sure the attribute exists),
-we probably want to have a function like...
-
-    validate($type, $value) where $type is like ContentType or Number
-
-and then pass it to a switch.
-
-The final problem is CSS. Get intimate with the syntax here:
-http://www.w3.org/TR/CSS21/syndata.html and also note the "bad" CSS elements
-that HTML_Safe defines to help determine a whitelist.
-
-----
-
-<!ENTITY % coreattrs
- "id          ID             #IMPLIED
-  class       CDATA          #IMPLIED
-  style       %StyleSheet;   #IMPLIED
-  title       %Text;         #IMPLIED"
-  >
-
-<!ENTITY % i18n
- "lang        %LanguageCode; #IMPLIED
-  xml:lang    %LanguageCode; #IMPLIED
-  dir         (ltr|rtl)      #IMPLIED"
-  >
-
-<!ENTITY % attrs "%coreattrs; %i18n;">
-
-----
-
-These are the elements that only have %attrs:
-ul, dl, dt, dd, address, span, em, strong, dfn, code, samp, kbd, var,
-cite, abbr, acronym, sub, sup, tt, i, b, big, small, u, s, strike
-
-These are the elements that only have %attrs and need an alignment transform
-div, p, h1, h2, h3, h4, h5, h6
-
-----
-
-Prepend style transformations, as CSS takes precedence.
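
The removed spec text proposes a single `validate($type, $value)` entry point that switches on the attribute type. Below is a rough Python sketch of that dispatch for three of the listed types (Number, Color, ImgAlign). It is an illustration under assumptions, not the code that replaced this text: the function names are hypothetical, a lookup table stands in for the switch, and returning `None` for a rejected value is my own convention.

```python
import re

# The sixteen color keywords listed in the removed spec text.
COLOR_NAMES = {
    "black": "#000000", "green": "#008000",
    "silver": "#C0C0C0", "lime": "#00FF00",
    "gray": "#808080", "olive": "#808000",
    "white": "#FFFFFF", "yellow": "#FFFF00",
    "maroon": "#800000", "navy": "#000080",
    "red": "#FF0000", "blue": "#0000FF",
    "purple": "#800080", "teal": "#008080",
    "fuchsia": "#FF00FF", "aqua": "#00FFFF",
}

IMG_ALIGN_VALUES = {"top", "middle", "bottom", "left", "right"}


def validate_number(value):
    # Spec: /^\d+$/. [0-9] keeps this to ASCII digits; fullmatch anchors both ends.
    return value if re.fullmatch(r"[0-9]+", value) else None


def validate_color(value):
    value = value.strip()
    named = COLOR_NAMES.get(value.lower())
    if named is not None:
        return named  # translate the keyword to its hex form
    if re.fullmatch(r"#(?:[0-9A-Fa-f]{3}|[0-9A-Fa-f]{6})", value):
        return value  # already #NNN or #NNNNNN
    return None


def validate_img_align(value):
    value = value.lower()
    return value if value in IMG_ALIGN_VALUES else None


# Dispatch table standing in for the "switch" the spec mentions.
VALIDATORS = {
    "Number": validate_number,
    "Color": validate_color,
    "ImgAlign": validate_img_align,
}


def validate(type_name, value):
    """Return the normalized value, or None if it is not valid for the type."""
    validator = VALIDATORS.get(type_name)
    if validator is None:
        raise ValueError(f"no validator for attribute type {type_name!r}")
    return validator(value)
```

For example, `validate("Color", "Fuchsia")` translates the keyword to `#FF00FF`, while `validate("Number", "4.2")` is rejected. A table keyed by type name keeps each rule in its own small function, which fits the note that the rest of the spec is pending a move into associated classes.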