mirror of https://github.com/ezyang/htmlpurifier.git synced 2024-12-22 16:31:53 +00:00

Update txt docs.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1134 48356398-32a2-884e-a903-53898d9a118a
Edward Z. Yang 2007-06-09 14:53:21 +00:00
parent 8d15d1ce13
commit 58f00105c8
6 changed files with 42 additions and 92 deletions

@@ -8,15 +8,11 @@ to be effective. Things to remember:
 1. Character Encoding: see enduser-utf8.html for more info.
-2. Doctype: document pending feature completion
-Not strictly necessary, actually. More in-depth discussion once we figure
-out how to get strict loose mode working.
-3. IDs: see enduser-id.html for more info
-4. Links: document pending feature completion
+2. IDs: see enduser-id.html for more info
+3. Links: document pending feature completion
 Rudimentary blacklisting, we should also allow only relative URIs. We
 need a doc to explain the stuff.
-5. CSS: document pending
+4. CSS: document pending
 Explain which CSS styles we blocked and why.

@@ -141,12 +141,6 @@ the code. They may be upgraded to HTML files or stay as TXT scratchpads.</p>
 <td>List of vendor-specific tags we may want to transform to W3C compliant markup.</td>
 </tr>
-<tr>
-<td>Reference</td>
-<td><a href="ref-strictness.txt">Strictness</a></td>
-<td>Short essay on how loose definition isn't really loose.</td>
-</tr>
 <tr>
 <td>Reference</td>
 <td><a href="ref-html-modularization.txt">Modularization of HTMLDefinition</a></td>

@@ -1,6 +1,5 @@
 Configuration
-[needs updating]
 Configuration is documented on a per-use case: if a class uses a certain
 value from the configuration object, it has to define its name and what the
@@ -13,29 +12,10 @@ the documentation in ConfigDef for more information on these namespaces.
 Since configuration is dependant on context, internal classes require a
 configuration object to be passed as a parameter. (They also require a
-Context object).
+Context object). A majority of classes do not need the config object,
+but for those who do, it is a lifesaver.
-In relation to HTMLDefinition and CSSDefinition, there could be a special class
-of directives that influence the *construction* of the Definition object.
-A theoretical call pattern would look like:
-1. Client calls Config->getHTMLDefinition()
-2. Config calls HTMLDefinition->createNew(this)
-3. HTMLDefinition constructs itself with base configuration
-4. HTMLDefinition calls Config->get('HTML')
-5. Config returns array of directives
-6. HTMLDefinition performs operations and changes specified by directives
-7. HTMLPurifier returns constructed definition
-8. Config caches definition so it doesn't have to be generated again
-9. Config returns definition
-You could also override Config's copy of the definition with your own
-custom copy, which OVERRIDES all directives. Only the base, vanilla copy
-is the Singleton, the object actually interfaced with is a operated-upon
-clone of that object. Also, if an update to the directives would update
-the definition, you'd have to force reconstruction.
-In practice, the pulling directives from the config object are
-solely need-based, and the flex points are littered throughout the
-setup() function. Some sort of refactoring is likely in order. See
-ref-xhtml-1.1.txt for more info.
+Definition objects are complex datatypes influenced by their respective
+directive namespaces (HTMLDefinition with HTML and CSSDefinition with CSS).
+If any of these directives is updated, HTML Purifier forces the definition
+to be regenerated.
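
Editor's sketch: the Configuration text in this diff says a definition is regenerated whenever a directive in its namespace is updated. The snippet below is a minimal, language-neutral illustration of that caching rule in Python. It is not HTML Purifier's PHP API: the `Config` class, the `get_definition` method, the `builds` counter, and the `Strict` directive name are all hypothetical and exist only for this example.

```python
class Config:
    """Hypothetical config object (illustration only, not HTML Purifier's API)."""

    def __init__(self):
        # Directives are grouped by namespace, e.g. "HTML" and "CSS".
        self._directives = {"HTML": {}, "CSS": {}}
        # namespace -> cached definition built from that namespace's directives
        self._definitions = {}
        # Counts how many times a definition was built (for demonstration).
        self.builds = 0

    def set(self, namespace, name, value):
        self._directives.setdefault(namespace, {})[name] = value
        # Updating any directive in a namespace drops its cached definition,
        # so the next lookup has to regenerate it.
        self._definitions.pop(namespace, None)

    def get_definition(self, namespace):
        if namespace not in self._definitions:
            self.builds += 1
            # Stand-in for the real setup work: snapshot the directives.
            self._definitions[namespace] = dict(self._directives.get(namespace, {}))
        return self._definitions[namespace]


config = Config()
config.set("HTML", "Strict", True)        # "Strict" is a made-up directive name
first = config.get_definition("HTML")     # built on first request
again = config.get_definition("HTML")     # served from the cache
config.set("CSS", "Strict", True)         # other namespace: HTML cache untouched
still = config.get_definition("HTML")
config.set("HTML", "Strict", False)       # same namespace: cache invalidated
rebuilt = config.get_definition("HTML")   # regenerated with the new value
```

Keying the cache by namespace means a CSS change does not throw away the HTML definition, which is one plausible reading of "their respective directive namespaces".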

@@ -2,23 +2,16 @@
 Filter Levels
 When one size *does not* fit all
-The more I think about it, the less sense it makes for maintaining one huge
-monolithic HTMLDefinition class. There's simply so much variation that
-could go into this definition: the set of HTML good for blog entries is
-definitely too large for HTML that would be allowed in blog comments. Going
-from Transitional to Strict requires changes to the definition.
-Allowing users to specify their own whitelists is one step (implemented, btw),
-but I have doubts on only doing this. Simply put, the typical programmer is too
-lazy to actually go through the trouble of investigating which tags, attributes
-and properties to allow. HTMLDefinition makes a big part of what HTMLPurifier
-is.
-The idea, then, is to setup fundamentally different set of definitions, which
-can further be customized using simpler configuration options. Alternatively,
-they could be implemented as configuration profiles, which simply load
-a set of recommended directives to acheive a desired affect (no simpler
-config options though).
+It makes little sense to constrain users to one set of HTML elements and
+attributes and tell them that they are not allowed to mold this in
+any fashion. Many users demand to be able to custom-select which elements
+and attributes they want. This is fine: because HTML Purifier keeps close
+track of what elements are safe to use, there is no way for them to
+accidentally allow an XSS-able tag.
+However, combing through the HTML spec to make your own whitelist can
+be a daunting task. HTML Purifier ought to offer pre-canned filter levels
+that amateur users can select based on what they think is their use-case.
 Here are some fuzzy levels you could set:
@@ -46,6 +39,10 @@ make forbidden element to text transformations desirable (for example, images).
 == Element Risk Analysis ==
+Although none of the currently supported elements presents a security
+threat per se, some can cause problems for page layouts or be
+extremely complicated.
 Legend:
 [danger level] - regular tags / uncommon tags ~ deprecated tags
 [danger level]* - rare tags
@@ -130,6 +127,7 @@ any CSS properties that are not currently implemented (such as position).
 Dangerous, can go outside container - float
 Easy to abuse - font-size, font-family (font), width
 Colored - background-color (background), border-color (border), color
+(see proposal-colors.html)
 Dramatic - border, list-style-position (list-style), margin, padding,
 text-align, text-indent, text-transform, vertical-align, line-height
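
Editor's sketch: the Filter Levels text in this diff makes two claims. A user-supplied whitelist cannot accidentally admit an XSS-able tag, because the purifier tracks which elements are safe. Pre-canned levels would also spare users from reading the HTML spec. The Python snippet below illustrates both ideas under loud assumptions: the element sets, the level names (`comments`, `articles`), and the `allowed_elements` helper are invented for this example and are not taken from HTML Purifier.

```python
# Hypothetical data: neither the safe set nor the level names come from
# HTML Purifier; they only illustrate the idea.
SAFE_ELEMENTS = {"p", "b", "i", "a", "ul", "ol", "li", "blockquote", "img"}

FILTER_LEVELS = {
    "comments": {"p", "b", "i", "a"},
    "articles": {"p", "b", "i", "a", "ul", "ol", "li", "blockquote", "img"},
}


def allowed_elements(requested=None, level=None):
    """Return the elements to permit, never exceeding SAFE_ELEMENTS."""
    if requested is None:
        # Fall back to a pre-canned level, or to everything known to be safe.
        requested = FILTER_LEVELS[level] if level is not None else SAFE_ELEMENTS
    # Intersecting with the safe set silently drops anything unsupported,
    # such as <script>, no matter what the user asked for.
    return set(requested) & SAFE_ELEMENTS


custom = allowed_elements(requested={"p", "a", "script"})  # "script" is dropped
canned = allowed_elements(level="comments")                # pre-canned level
```

The intersection is the design point: a custom whitelist can only narrow the safe set and can never widen it.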

@@ -1,33 +0,0 @@
-Is HTML Purifier Strict or Transitional?
-[rename/deprecation pending]
-Despite the fact that HTML Purifier professes to support both transitional and
-strict HTML, it rejects a lot of attributes and elements that are actually, indeed,
-valid. You can investigate progress.html to find out precisely what we
-are doing to these *deprecated* attributes.
-However, users have found that Strict HTML imposes some quite unreasonable
-restrictions on certain things. The start and value attributes in ol and
-li (respectively) perhaps are the most contested. There's is currently no
-widely supported browser method short of JavaScript that can replace these
-two deprecated elements. It behooves us to allow these deprecated
-attributes when the output is transitional.
-Fortunantely, that's the only real bugger case. The others have near-perfect
-CSS equivalents, and were presentational anyway. However, the other question
-pops up: should we always convert these to the CSS forms when 1. the spec
-allows them anyway and 2. older browsers support them better? After all, the
-whole point about CSS is to seperate styling from content, so inline styling
-doesn't solve that problem.
-[new material]
-HTML Purifier 1.7 creates a new organizational system for deprecated attribute/
-element transformations. They will be unified under the title of "Tidy", which
-is what they are: cleaning up after deprecated user markup into standards-compliant
-versions. There will also be a change in the default behavior (athough, to the
-end user not inspecting the HTML, there will be no change: in fact, it may
-work even better).
-Consult the Advanced API for more details.

@@ -2,8 +2,23 @@
 Web Hypertext Application Technology Working Group
 WHATWG
-I don't think we need to worry about them. Untrusted users shouldn't be
-submitting applications, eh? But if some interesting attribute pops up in
-their spec, and might be worth supporting, stick it here.
-HTML 5!!!
+== HTML 5 ==
+URL: http://www.whatwg.org/specs/web-apps/current-work/
+HTML 5 defines a kaboodle of new elements and attributes, as well as
+some well-defined, "quirks mode" HTML parsing. Although WHATWG professes
+to be targeted towards web applications, many of their semantic additions
+would be quite useful in regular documents. Eventually, HTML
+Purifier will need to audit their lists and figure out what changes need
+to be made. This process is complicated by the fact that the WHATWG
+doesn't buy into W3C's modularization of XHTML 1.1: we may need
+to remodularize HTML 5 (probably done by section name). No sense in
+committing ourselves till the spec stabilizes, though.
+More immediately relevant, however, is the well-defined parsing
+behavior that HTML 5 adds. While I have little interest in writing
+another DirectLex parser, other parsers like ph5p
+<http://jero.net/lab/ph5p/> can be adapted to DOMLex to support much more
+flexible HTML parsing (a cool feature I've seen is how they resolve
+<b>bold<i>both</b>italic</i>).