From 4ef26bbd310e9a1e9d484b170dbcf425150f89e0 Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
Date: Mon, 14 Aug 2006 21:21:54 +0000
Subject: [PATCH] Update docs.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@254 48356398-32a2-884e-a903-53898d9a118a
---
 docs/security.txt |  25 ++++++-----
 docs/spec.txt     | 103 ----------------------------------------------
 2 files changed, 14 insertions(+), 114 deletions(-)

diff --git a/docs/security.txt b/docs/security.txt
index 05810ade..f077691c 100644
--- a/docs/security.txt
+++ b/docs/security.txt
@@ -13,24 +13,27 @@ your character encoding, you should switch. Now. (in future versions, however,
 I may make the character encoding configurable, but there's only so much I can
 do). Make sure any input is properly converted to UTF-8, or the parser will
 mangle it badly (though it won't be a security risk if you're outputting
-it as UTF-8).
+it as UTF-8 though).
 
 2. XHTML 1.0 Transitional. This is what the parser is outputting. For the most
 part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
 that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
-has waaaay too many quirks for a little parser to handle. We did not select
-strict in order to prevent ourselves from being too draconic on users.
+has waaaay too many quirks for a little parser to handle. We did not select
+strict in order to prevent ourselves from being too draconic on users, but
+this may be configurable in the future.
 
-3. [PROJECTED] IDs. They need to be unique, but without some knowledge of the
-rest of the document, it's difficult to know what's unique. I project default
-behavior being a customizable prefix to all ID declarations in the document,
-so make sure you don't use that prefix. Might cause problems for multiple
-instances of HTML escaped output too (especially when it comes to caching).
-Best to just zap them completely, perhaps. This will be configurable, and you'll
-have to pick the correct one.
+3. IDs. They need to be unique, but without some knowledge of the
+rest of the document, it's difficult to know what's unique. Without setting
+%Attr.IDBlacklist to the proper
 
 4. [PROJECTED] Links. We're not going to try for spam protection (although
 some hooks for such a module might be nice) but we may offer the ability to
 only accept relative URLs. Pick the one that's right for you.
 
-5. [PROJECTED] CSS. What a knotty issue. Probably will have to be configurable.
\ No newline at end of file
+5. CSS. While we can prevent the most flagrant cases from affecting your
+layout (such as absolutely positioned elements), no amount of code is going
+to protect your pages from being attacked by garish colors and plain old
+bad taste. A neat feature would be the ability to define acceptable colors
+in a document, but that's not likely to be implemented for a while. In the
+meantime, be sure to make sure that floated elements (permitted, since they
+can be quite useful) cna't mess up your layout.
diff --git a/docs/spec.txt b/docs/spec.txt
index 71b435ba..c51848ad 100644
--- a/docs/spec.txt
+++ b/docs/spec.txt
@@ -53,106 +53,3 @@ HTML Purifier is best suited for documents that require a rich array of
 HTML tags. Things like blog comments are, in all likelihood, most appropriately
 written in an extremely restrictive set of markup that doesn't require
 all this functionality (or not written in HTML at all).
-
-The rest of this document is pending moving into their associated classes.
-
-== STAGE 4 - check attributes ==
-
-    STATUS: F (currently implementing core/i18n)
-
-While we're doing all this nesting hocus-pocus, attributes are also being
-checked. The reason why we need this to be done with the nesting stuff
-is if a REQUIRED attribute is not there, we might need to kill the tag (or
-replace it with data). Fortunantely, this is rare enough that we only have
-to worry about it for certain things:
-
-* ! bdo - dir > replace with span, preserve attributes
-* ! img - src, alt > if only alt is missing, insert filename, else remove img
-* basefont - size
-* param - name
-* applet - width, height
-* map - id
-* area - alt
-* form - action
-* optgroup - label
-* textarea - rows, cols
-
-As you can see, only two of them we would remotely consider for our simplified
-tag set. But each has a different set of challenges. For the img tag, we'd
-have to be careful about deleting it. If we do hit a snag, we can supply
-a default "blank" image.
-
-So after that's all said and done, each of the different types of content
-inside the attributes needs to be handled differently.
-
-ContentType(s)  [RFC2045]
-Charset(s)      [RFC2045]
-LanguageCode    [RFC3066] (NMTOKEN)
-Character       [XML][2.2] (a single character)
-Number          /^\d+$/
-LinkTypes       [HTML][6.12]
-MediaDesc       [HTML][6.13]
-URI/UriList     [RFC2396]
-Datetime        (ISO date format)
-Script          ...
-StyleSheet      [CSS] (complex)
-Text            CDATA
-FrameTarget     NMTOKEN
-Length          (pixel, percentage) (?:px suffix allowed?)
-MultiLength     (pixel, percentage, or relative)
-Pixels          (integer)
-// map attributes omitted
-ImgAlign        (top|middle|bottom|left|right)
-Color           #NNNNNN, #NNN or color name (translate it
-    Black  = #000000    Green  = #008000
-    Silver = #C0C0C0    Lime   = #00FF00
-    Gray   = #808080    Olive  = #808000
-    White  = #FFFFFF    Yellow = #FFFF00
-    Maroon = #800000    Navy   = #000080
-    Red    = #FF0000    Blue   = #0000FF
-    Purple = #800080    Teal   = #008080
-    Fuchsia= #FF00FF    Aqua   = #00FFFF
-// plus some directly in the spec
-
-Everything else is either ID, or defined as a certain set of values.
-
-Unless we use reflection (which then we have to make sure the attribute exists),
-we probably want to have a function like...
-
-    validate($type, $value) where $type is like ContentType or Number
-
-and then pass it to a switch.
-
-The final problem is CSS. Get intimate with the syntax here:
-http://www.w3.org/TR/CSS21/syndata.html and also note the "bad" CSS elements
-that HTML_Safe defines to help determine a whitelist.
-
-----
-
-
-
-
-
-
-
-
-----
-
-These are the elements that only have %attrs:
-    ul, dl, dt, dd, address, span, em, strong, dfn, code, samp, kbd, var,
-    cite, abbr, acronym, sub, sup, tt, i, b, big, small, u, s, strike
-
-These are the elements that only have %attrs and need an alignment transform
-    div, p, h1, h2, h3, h4, h5, h6
-
-----
-
-Prepend style transformations, as CSS takes precedence.