mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2025-01-21 13:01:53 +00:00
Sync 1.1 branch as much as possible with trunk.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/branches/1.1@476 48356398-32a2-884e-a903-53898d9a118a
This commit is contained in:
parent
8104145580
commit
728e6c5b44
10
NEWS
10
NEWS
@ -24,6 +24,16 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
|
||||
. Refactored parseData() to general Lexer class
|
||||
. Tester named "HTML Purifier" not "HTMLPurifier"
|
||||
|
||||
1.1.1, released 2006-09-24
|
||||
! Configuration option to optionally Tidy up output for indentation to make up
|
||||
for dropped whitespace by DOMLex (pretty-printing for the entire application
|
||||
should be done by a page-wide Tidy)
|
||||
- Various documentation updates
|
||||
- Fixed parse error in configuration documentation script
|
||||
- Fixed fatal error in benchmark scripts, slightly augmented
|
||||
- As far as possible, whitespace is preserved in-between table children
|
||||
- Sample test-settings.php file included
|
||||
|
||||
1.1.0, released 2006-09-16
|
||||
! Directive documentation generation using XSLT
|
||||
! XHTML can now be turned off, output becomes <br>
|
||||
|
23
docs/colors.txt
Normal file
23
docs/colors.txt
Normal file
@ -0,0 +1,23 @@
|
||||
|
||||
Colors
|
||||
Hammering some sense into those content-makers
|
||||
|
||||
Your website probably has a color-scheme. Green on white, purple on yellow,
|
||||
whatever. When you give users the ability to style their content, you may
|
||||
want them to keep in line with your styling. If you're website is all
|
||||
about light colors, you don't want a user to come in and vandalize your
|
||||
page with a deep maroon.
|
||||
|
||||
This is an extremely silly feature proposal, but I'm writing it down anyway.
|
||||
|
||||
What if the user could constrain the colors specified in inline styles? You
|
||||
are only allowed to use these shades of dark green for text and these shades
|
||||
of light yellow for the background. At the very least, you could ensure
|
||||
that we did not have pale yellow on white text.
|
||||
|
||||
Implementation issues:
|
||||
1. Requires the color attribute definition to know, currently, what the text
|
||||
and background colors are. This becomes difficult when classes are thrown
|
||||
into the mix.
|
||||
2. The user still has to define the permissible colors, how does one do
|
||||
something like that?
|
@ -20,15 +20,32 @@ can further be customized using simpler configuration options.
|
||||
Here are some fuzzy levels you could set:
|
||||
|
||||
1. Comments - Wordpress recommends a, abbr, acronym, b, blockquote, cite,
|
||||
code, em, i, strike, strong; however, you could get away with only a, b and
|
||||
i; also having p and pre tags would be helpful.
|
||||
2. Pages - As permissive as possible without allowing XSS. No protection
|
||||
code, em, i, strike, strong; however, you could get away with only a, em and
|
||||
p; also having blockquote and pre tags would be helpful.
|
||||
2. BBCode - Emulate the usual tagset for forums: b, i, img, a, blockquote,
|
||||
pre, div, span and h[2-6] (the last three are for specially formatted
|
||||
posts, div and span require associated classes or inline styling enabled
|
||||
to be useful)
|
||||
3. Pages - As permissive as possible without allowing XSS. No protection
|
||||
against bad design sense, unfortunantely. Suitable for wiki and page
|
||||
environments.
|
||||
3. Lint - Accept everything in the spec, a Tidy wannabe.
|
||||
4. Lint - Accept everything in the spec, a Tidy wannabe. (This probably won't
|
||||
get implemented as it would require routines for things like <object>
|
||||
and friends to be implemented, which is a lot of work for not a lot of
|
||||
benefit)
|
||||
|
||||
I've also decomposed tags into risk levels. An asterisk indicates that no one
|
||||
really uses that tag, tilde indicates it's deprecated.
|
||||
One final note: when you start axing tags that are more commonly used, you
|
||||
run the risk of accidentally destroying user data, especially if the data
|
||||
is incoming from a WYSIWYG eidtor that hasn't been synced accordingly. This may
|
||||
make forbidden element to text transformations desirable (for example, images).
|
||||
|
||||
|
||||
|
||||
== Element Risk Analysis ==
|
||||
|
||||
Legend:
|
||||
[danger level] - regular tags / uncommon tags ~ deprecated tags
|
||||
[danger level]* - rare tags
|
||||
|
||||
1 - blockquote, code, em, i, p, tt / strong, sub, sup
|
||||
1* - abbr, acronym, bdo, cite, dfn, kbd, q, samp
|
||||
@ -38,30 +55,76 @@ really uses that tag, tilde indicates it's deprecated.
|
||||
5 - a
|
||||
7 - area, map
|
||||
|
||||
These are special use tags, they should be enabled on a blanket basis.
|
||||
|
||||
Lists - dd, dl, dt, li, ol, ul ~ menu, dir
|
||||
Tables - caption, table, td, th, tr / col, colgroup, tbody, tfoot, thead
|
||||
|
||||
Forms - fieldset, form, input, lable, legend, optgroup, option, select, textarea
|
||||
XSS - noscript, object, script ~ applet
|
||||
|
||||
Meta - base, basefont, body, head, html, link, meta, style, title
|
||||
Frames - frame, frameset, iframe
|
||||
|
||||
And tag specific notes:
|
||||
|
||||
a - general problems involving linkspam
|
||||
b - too much bold is bad, typographically speaking bold is discouraged
|
||||
br - often misused
|
||||
a - general problems involving linkspam
|
||||
b - too much bold is bad, typographically speaking bold is discouraged
|
||||
br - often misused
|
||||
center - CSS, usually no legit use
|
||||
del - only useful in editing context
|
||||
div - little meaning in certain contexts i.e. blog comment
|
||||
h1 - usually no legit use, as header is already set by application
|
||||
h* - not needed in blog comments
|
||||
hr - usually not necessary in blog comments
|
||||
img - could be extremely undesirable if linking to external pics
|
||||
h1 - usually no legit use, as header is already set by application
|
||||
h* - not needed in blog comments
|
||||
hr - usually not necessary in blog comments
|
||||
img - could be extremely undesirable if linking to external pics (CSRF, goatse)
|
||||
pre - could use formatting, only useful in code contexts
|
||||
q - very little support
|
||||
s - transform into span with styling or del?
|
||||
q - very little support
|
||||
s - transform into span with styling or del?
|
||||
small - technically presentational
|
||||
span - depends on attribute allowances
|
||||
sub, sup - specialized
|
||||
u - little legit use, prefer class with text-decoration
|
||||
u - little legit use, prefer class with text-decoration
|
||||
|
||||
Based on the riskiness of the items, we may want to offer %HTML.DisableImages
|
||||
attribute and put URI filtering higher up on the priority list.
|
||||
|
||||
|
||||
== Attribute Risk Analysis ==
|
||||
|
||||
We actually have a suprisingly small assortment of allowed attributes (the
|
||||
rest are deprecated in strict, and thus we opted not to allow them, even
|
||||
though our output is XHTML Transitional by default.)
|
||||
|
||||
Required URI - img.alt, img.src, a.href
|
||||
Medium risk - *.class, *.dir
|
||||
High risk - img.height, img.width, *.id, *.style
|
||||
|
||||
Table - colgroup/col.span, td/th.rowspan, td/th.colspan
|
||||
Uncommon - *.title, *.lang, *.xml:lang
|
||||
Rare - td/th.abbr, table.summary, {table}.charoff
|
||||
Rare URI - del.cite, ins.cite, blockquote.cite, q.cite, img.longdesc
|
||||
Presentational - {table}.align, {table}.valign, table.frame, table.rules,
|
||||
table.border
|
||||
Partially presentational - table.cellpadding, table.cellspacing,
|
||||
table.width, col.width, colgroup.width
|
||||
|
||||
|
||||
== CSS Risk Analysis ==
|
||||
|
||||
There are certain CSS elements that are extremely useful inline, but then
|
||||
as you get to more presentation oriented styling it may not always be
|
||||
appropriate to inline them.
|
||||
|
||||
Useful - clear, float, border-collapse, caption-side
|
||||
|
||||
These CSS properties can break layouts if used improperly. We have excluded
|
||||
any CSS properties that are not currently implemented (such as position).
|
||||
|
||||
Dangerous, can go outside container - float
|
||||
Easy to abuse - font-size, font-family (font), width
|
||||
Colored - background-color (background), border-color (border), color
|
||||
Dramatic - border, list-style-position (list-style), margin, padding,
|
||||
text-align, text-indent, text-transform, vertical-align, line-height
|
||||
|
||||
Dramatic elements substnatially change the look of text in ways that should
|
||||
probably have been reserved to other areas.
|
||||
|
25
docs/strictness.txt
Normal file
25
docs/strictness.txt
Normal file
@ -0,0 +1,25 @@
|
||||
|
||||
Is HTML Purifier Strict or Transitional?
|
||||
A little bit of helpful guidance
|
||||
|
||||
Despite the fact that HTML Purifier professes only to support transitional
|
||||
HTML, it rejects a lot of attributes and elements that are actually, indeed,
|
||||
valid. You can investigate progress.html to find out precisely what we
|
||||
are doing to these *deprecated* attributes.
|
||||
|
||||
However, users have found that Strict HTML imposes some quite unreasonable
|
||||
restrictions on certain things. The start and value attributes in ol and
|
||||
li (respectively) perhaps are the most contested. There's is currently no
|
||||
widely supported browser method short of JavaScript that can replace these
|
||||
two deprecated elements. HTML Purifier does not currently support them, but
|
||||
it might behoove us to do so while our output is still transitional.
|
||||
|
||||
Fortunantely, that's the only real bugger case. The others have near-perfect
|
||||
CSS equivalents, and were presentational anyway. However, the other question
|
||||
pops up: should we always convert these to the CSS forms when 1. the spec
|
||||
allows them anyway and 2. older browsers support them better? After all, the
|
||||
whole point about CSS is to seperate styling from content, so inline styling
|
||||
doesn't solve that problem.
|
||||
|
||||
It's an icky question, and we'll have to deal with it as more and more
|
||||
transforms get implemented.
|
@ -56,6 +56,7 @@ class HTMLPurifier_HTMLDefinition
|
||||
|
||||
/**
|
||||
* String name of parent element HTML will be going into.
|
||||
* @todo Allow this to be overloaded by user config
|
||||
* @public
|
||||
*/
|
||||
var $info_parent = 'div';
|
||||
@ -111,12 +112,19 @@ class HTMLPurifier_HTMLDefinition
|
||||
//////////////////////////////////////////////////////////////////////
|
||||
// info[]->child : defines allowed children for elements
|
||||
|
||||
// entities: prefixed with e_ and _ replaces .
|
||||
// entities: prefixed with e_ and _ replaces . from DTD
|
||||
// double underlines are entities we made up
|
||||
|
||||
// we don't use an array because that complicates interpolation
|
||||
// strings are used instead of arrays because if you use arrays,
|
||||
// you have to do some hideous manipulation with array_merge()
|
||||
|
||||
// todo: determine whether or not having allowed children
|
||||
// that aren't allowed globally affects security (it shouldn't)
|
||||
// if above works out, extend children definitions to include all
|
||||
// possible elements (allowed elements will dictate which ones
|
||||
// get dropped
|
||||
|
||||
$e_special_extra = 'img';
|
||||
$e_special_basic = 'br | span | bdo';
|
||||
$e_special = "$e_special_basic | $e_special_extra";
|
||||
@ -142,16 +150,18 @@ class HTMLPurifier_HTMLDefinition
|
||||
$e_block = "p | $e_heading | div | $e_lists | $e_blocktext | table";
|
||||
$e__flow = "#PCDATA | $e_block | $e_inline | $e_misc";
|
||||
$e_Flow = new HTMLPurifier_ChildDef_Optional($e__flow);
|
||||
$e_a_content = new HTMLPurifier_ChildDef_Optional("#PCDATA | $e_special".
|
||||
" | $e_fontstyle | $e_phrase | $e_inline_forms | $e_misc_inline");
|
||||
$e_a_content = new HTMLPurifier_ChildDef_Optional("#PCDATA".
|
||||
" | $e_special | $e_fontstyle | $e_phrase | $e_inline_forms".
|
||||
" | $e_misc_inline");
|
||||
$e_pre_content = new HTMLPurifier_ChildDef_Optional("#PCDATA | a".
|
||||
" | $e_special_basic | $e_fontstyle_basic | $e_phrase_basic".
|
||||
" | $e_inline_forms | $e_misc_inline");
|
||||
$e_form_content = new HTMLPurifier_ChildDef_Optional(''); //unused
|
||||
$e_form_button_content = new HTMLPurifier_ChildDef_Optional(''); // unused
|
||||
$e_form_content = new HTMLPurifier_ChildDef_Optional('');//unused
|
||||
$e_form_button_content = new HTMLPurifier_ChildDef_Optional('');//unused
|
||||
|
||||
$this->info['ins']->child =
|
||||
$this->info['del']->child = new HTMLPurifier_ChildDef_Chameleon($e__inline, $e__flow);
|
||||
$this->info['del']->child =
|
||||
new HTMLPurifier_ChildDef_Chameleon($e__inline, $e__flow);
|
||||
|
||||
$this->info['blockquote']->child=
|
||||
$this->info['dd']->child =
|
||||
@ -225,7 +235,7 @@ class HTMLPurifier_HTMLDefinition
|
||||
//////////////////////////////////////////////////////////////////////
|
||||
// info[]->type : defines the type of the element (block or inline)
|
||||
|
||||
// reuses $e_Inline and $e_block
|
||||
// reuses $e_Inline and $e_Block
|
||||
|
||||
foreach ($e_Inline->elements as $name) {
|
||||
$this->info[$name]->type = 'inline';
|
||||
@ -243,7 +253,7 @@ class HTMLPurifier_HTMLDefinition
|
||||
|
||||
$this->info['a']->excludes = array('a' => true);
|
||||
$this->info['pre']->excludes = array_flip(array('img', 'big', 'small',
|
||||
// technically in spec, but we don't allow em anyway
|
||||
// technically useless, but good to be indepth
|
||||
'object', 'applet', 'font', 'basefont'));
|
||||
|
||||
//////////////////////////////////////////////////////////////////////
|
||||
@ -253,6 +263,8 @@ class HTMLPurifier_HTMLDefinition
|
||||
// by the transform classes. It will, however, do simple and slightly
|
||||
// complex attribute value substitution
|
||||
|
||||
// the question of varying allowed attributes is more entangling.
|
||||
|
||||
$e_Text = new HTMLPurifier_AttrDef_Text();
|
||||
|
||||
// attrs, included in almost every single one except for a few,
|
||||
@ -297,7 +309,8 @@ class HTMLPurifier_HTMLDefinition
|
||||
|
||||
$this->info['table']->attr['summary'] = $e_Text;
|
||||
|
||||
$this->info['table']->attr['border'] = new HTMLPurifier_AttrDef_Pixels();
|
||||
$this->info['table']->attr['border'] =
|
||||
new HTMLPurifier_AttrDef_Pixels();
|
||||
|
||||
$e_Length = new HTMLPurifier_AttrDef_Length();
|
||||
$this->info['table']->attr['cellpadding'] =
|
||||
@ -329,7 +342,7 @@ class HTMLPurifier_HTMLDefinition
|
||||
$this->info['q']->attr['cite'] = $e_URI;
|
||||
|
||||
//////////////////////////////////////////////////////////////////////
|
||||
// UNIMP : info_tag_transform : transformations of tags
|
||||
// info_tag_transform : transformations of tags
|
||||
|
||||
$this->info_tag_transform['font'] = new HTMLPurifier_TagTransform_Font();
|
||||
$this->info_tag_transform['menu'] = new HTMLPurifier_TagTransform_Simple('ul');
|
||||
@ -339,6 +352,9 @@ class HTMLPurifier_HTMLDefinition
|
||||
//////////////////////////////////////////////////////////////////////
|
||||
// info[]->auto_close : tags that automatically close another
|
||||
|
||||
// todo: determine whether or not SGML-like modeling based on
|
||||
// mandatory/optional end tags would be a better policy
|
||||
|
||||
// make sure you test using isset() not !empty()
|
||||
|
||||
// these are all block elements: blocks aren't allowed in P
|
||||
|
Loading…
Reference in New Issue
Block a user