Filter Levels
When one size *does not* fit all

The more I think about it, the less sense it makes to maintain one huge
monolithic HTMLDefinition class. There's simply so much variation that
could go into this definition: the set of HTML suitable for blog entries is
definitely too large for blog comments. Even going from Transitional to
Strict requires changes to the definition.

However, allowing users to specify their own whitelists was an idea I
rejected from the start. Simply put, the typical programmer is too lazy
to actually go through the trouble of investigating which tags, attributes
and properties to allow. HTMLDefinition is a big part of what HTMLPurifier
is.

The idea, then, is to set up fundamentally different sets of definitions,
which can be further customized using simpler configuration options.

Here are some fuzzy levels you could set:

1. Comments - WordPress recommends a, abbr, acronym, b, blockquote, cite,
   code, em, i, strike, strong; however, you could get away with only a, em
   and p; also having blockquote and pre tags would be helpful.
2. BBCode - Emulate the usual tagset for forums: b, i, img, a, blockquote,
   pre, div, span and h[2-6] (the last three are for specially formatted
   posts; div and span require associated classes or inline styling enabled
   to be useful)
3. Pages - As permissive as possible without allowing XSS. No protection
   against bad design sense, unfortunately. Suitable for wiki and page
   environments.
4. Lint - Accept everything in the spec, a Tidy wannabe. (This probably won't
   get implemented as it would require routines for things like <object>
   and friends to be implemented, which is a lot of work for not a lot of
   benefit)

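One way to picture these levels is as named base tag sets that a caller then
trims or extends with a couple of simple options. The Python sketch below is
only an illustration under that assumption: LEVELS, allowed_tags and the
extra/forbidden parameters are hypothetical names, not part of HTMLPurifier.
The tag sets are copied from the list above (Comments is the WordPress list
plus p and pre); Pages and Lint are left out because the list does not
enumerate them.

```python
# Hypothetical sketch: filter levels as named base tag sets.
# None of these names exist in HTMLPurifier; they only illustrate the idea.

LEVELS = {
    # WordPress-recommended comment tags, plus p and pre as suggested above.
    "comments": frozenset({
        "a", "abbr", "acronym", "b", "blockquote", "cite", "code",
        "em", "i", "strike", "strong", "p", "pre",
    }),
    # Usual forum tagset; h[2-6] is expanded into h2..h6.
    "bbcode": frozenset({
        "b", "i", "img", "a", "blockquote", "pre", "div", "span",
        "h2", "h3", "h4", "h5", "h6",
    }),
}


def allowed_tags(level, extra=(), forbidden=()):
    """Start from a level's base set, then apply simple per-site tweaks."""
    if level not in LEVELS:
        raise ValueError(f"unknown filter level: {level!r}")
    return (LEVELS[level] | set(extra)) - set(forbidden)


if __name__ == "__main__":
    # A comment form that uses the Comments level but disallows links.
    print(sorted(allowed_tags("comments", forbidden=["a"])))
```

The point of the design is that the user picks a level and makes one or two
adjustments, rather than assembling a whitelist from scratch.
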
One final note: when you start axing tags that are more commonly used, you
run the risk of accidentally destroying user data, especially if the data
is incoming from a WYSIWYG editor that hasn't been synced accordingly. This
may make forbidden-element-to-text transformations desirable (for example,
for images).


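To make the forbidden-element-to-text idea concrete, here is a rough Python
sketch built on the standard library's html.parser. It is an assumption about
what such a transformation could look like, not how HTMLPurifier does it: the
class ForbiddenToText, the function transform and the "[alt text]" output
format are all made up for illustration. Forbidden tags are dropped while
their text is kept, and a forbidden img is replaced by its alt text. It is not
a security-grade filter (attributes are simply discarded, and nothing is
validated).

```python
from html import escape
from html.parser import HTMLParser


class ForbiddenToText(HTMLParser):
    """Drop forbidden tags but keep their text; turn <img> into its alt text."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            # Attributes are dropped for brevity; a real filter would
            # validate them instead.
            self.out.append(f"<{tag}>")
        elif tag == "img":
            # Hypothetical text form for a forbidden image.
            alt = dict(attrs).get("alt") or "image"
            self.out.append(escape(f"[{alt}]"))

    def handle_startendtag(self, tag, attrs):
        if tag in self.allowed:
            self.out.append(f"<{tag} />")
        else:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))


def transform(html, allowed):
    """Return html with forbidden elements reduced to plain text."""
    parser = ForbiddenToText(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(transform('<p>Look: <img src="x.png" alt="my cat"> <b>cute</b></p>',
                    allowed={"p"}))
```

With only p allowed, the image survives as "[my cat]" and the bold text
survives as plain text, so the user's content is degraded rather than lost.
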
== Element Risk Analysis ==

Legend:
[danger level] - regular tags / uncommon tags ~ deprecated tags
[danger level]* - rare tags

1 - blockquote, code, em, i, p, tt / strong, sub, sup
1* - abbr, acronym, bdo, cite, dfn, kbd, q, samp
2 - b, br, del, div, pre, span / ins, s, strike ~ u
3 - h2, h3, h4, h5, h6 ~ center
4 - h1, big ~ font
5 - a
7 - area, map

These are special-use tags; they should be enabled on a blanket basis.

Lists - dd, dl, dt, li, ol, ul ~ menu, dir
Tables - caption, table, td, th, tr / col, colgroup, tbody, tfoot, thead

Forms - fieldset, form, input, label, legend, optgroup, option, select, textarea
XSS - noscript, object, script ~ applet
Meta - base, basefont, body, head, html, link, meta, style, title
Frames - frame, frameset, iframe

And tag-specific notes:

a - general problems involving linkspam
b - too much bold is bad; typographically speaking, bold is discouraged
br - often misused
center - use CSS instead, usually no legit use
del - only useful in an editing context
div - little meaning in certain contexts, e.g. blog comments
h1 - usually no legit use, as the header is already set by the application
h* - not needed in blog comments
hr - usually not necessary in blog comments
img - could be extremely undesirable if linking to external pics (CSRF, goatse)
pre - could use formatting, only useful in code contexts
q - very little support
s - transform into span with styling or del?
small - technically presentational
span - depends on attribute allowances
sub, sup - specialized
u - little legit use, prefer class with text-decoration

Based on the riskiness of the items, we may want to offer a
%HTML.DisableImages directive and put URI filtering higher up on the
priority list.


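The danger levels and blanket groups above lend themselves to a
threshold-style whitelist: pick the highest danger level you tolerate, then
switch whole groups on. The Python sketch below assumes exactly that reading.
DANGER, GROUPS and tags_up_to are hypothetical names; the data is copied from
the tables above, with the regular/uncommon/deprecated/rare distinctions
flattened, and only the Lists and Tables groups included.

```python
# Danger levels copied from the element table above. The "/", "~" and "*"
# distinctions (uncommon, deprecated, rare) are flattened for brevity.
DANGER = {
    1: "blockquote code em i p tt strong sub sup "
       "abbr acronym bdo cite dfn kbd q samp",
    2: "b br del div pre span ins s strike u",
    3: "h2 h3 h4 h5 h6 center",
    4: "h1 big font",
    5: "a",
    7: "area map",
}

# Special-use groups that are enabled or disabled as a whole.
GROUPS = {
    "lists": "dd dl dt li ol ul menu dir",
    "tables": "caption table td th tr col colgroup tbody tfoot thead",
}


def tags_up_to(max_danger, groups=()):
    """Collect every tag at or below max_danger, plus the named groups."""
    tags = set()
    for level, names in DANGER.items():
        if level <= max_danger:
            tags.update(names.split())
    for group in groups:
        tags.update(GROUPS[group].split())
    return tags


if __name__ == "__main__":
    # Blog comments: nothing riskier than level 2, no lists or tables.
    print(sorted(tags_up_to(2)))
```

A threshold of 4 keeps links out entirely, which is one blunt way to sidestep
the linkspam problem noted for the a tag.
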
== Attribute Risk Analysis ==

We actually have a surprisingly small assortment of allowed attributes (the
rest are deprecated in Strict, and thus we opted not to allow them, even
though our output is XHTML Transitional by default).

Required URI - img.alt, img.src, a.href
Medium risk - *.class, *.dir
High risk - img.height, img.width, *.id, *.style

Table - colgroup/col.span, td/th.rowspan, td/th.colspan
Uncommon - *.title, *.lang, *.xml:lang
Rare - td/th.abbr, table.summary, {table}.charoff
Rare URI - del.cite, ins.cite, blockquote.cite, q.cite, img.longdesc
Presentational - {table}.align, {table}.valign, table.frame, table.rules,
    table.border
Partially presentational - table.cellpadding, table.cellspacing,
    table.width, col.width, colgroup.width


== CSS Risk Analysis ==

There are certain CSS properties that are extremely useful inline, but as
you get to more presentation-oriented styling, it may not always be
appropriate to inline them.

Useful - clear, float, border-collapse, caption-side

The following CSS properties can break layouts if used improperly. We have
excluded any CSS properties that are not currently implemented (such as
position).

Dangerous, can go outside container - float
Easy to abuse - font-size, font-family (font), width
Colored - background-color (background), border-color (border), color
Dramatic - border, list-style-position (list-style), margin, padding,
    text-align, text-indent, text-transform, vertical-align, line-height

Dramatic properties substantially change the look of text in ways that
should probably have been reserved for other areas.
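
The property groups above could be applied to inline style attributes in much
the same way as tag levels: enable whole groups, then forbid individual
properties. The Python sketch below is an illustration under that assumption.
CSS_GROUPS, filter_style and its parameters are hypothetical names, and the
parsing is deliberately naive: it splits on ";" and does not validate values,
which a real filter would have to do.

```python
# Property groups copied from the lists above, shorthands included.
CSS_GROUPS = {
    "useful": {"clear", "float", "border-collapse", "caption-side"},
    "abusable": {"font-size", "font-family", "font", "width"},
    "colored": {"background-color", "background", "border-color",
                "border", "color"},
    "dramatic": {"border", "list-style-position", "list-style", "margin",
                 "padding", "text-align", "text-indent", "text-transform",
                 "vertical-align", "line-height"},
}


def filter_style(style, groups=("useful",), forbidden=()):
    """Keep only declarations whose property is in an enabled group."""
    allowed = set().union(*(CSS_GROUPS[g] for g in groups)) - set(forbidden)
    kept = []
    # Naive split: values that themselves contain ";" are not handled.
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name, value = name.strip().lower(), value.strip()
        if sep and value and name in allowed:
            kept.append(f"{name}: {value}")
    return "; ".join(kept)


if __name__ == "__main__":
    print(filter_style("float: left; color: red; clear:both"))
```

Because float is both useful and able to escape its container, the forbidden
parameter lets a site keep the "useful" group while still dropping float.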