mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-12-22 16:31:53 +00:00
Update txt docs.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1134 48356398-32a2-884e-a903-53898d9a118a
parent
8d15d1ce13
commit
58f00105c8
@ -8,15 +8,11 @@ to be effective. Things to remember:
|
|||||||
|
|
||||||
1. Character Encoding: see enduser-utf8.html for more info.
|
1. Character Encoding: see enduser-utf8.html for more info.
|
||||||
|
|
||||||
2. Doctype: document pending feature completion
|
2. IDs: see enduser-id.html for more info
|
||||||
Not strictly necessary, actually. More in-depth discussion once we figure
|
|
||||||
out how to get strict loose mode working.
|
|
||||||
|
|
||||||
3. IDs: see enduser-id.html for more info
|
3. Links: document pending feature completion
|
||||||
|
|
||||||
4. Links: document pending feature completion
|
|
||||||
Rudimentary blacklisting, we should also allow only relative URIs. We
|
Rudimentary blacklisting, we should also allow only relative URIs. We
|
||||||
need a doc to explain the stuff.
|
need a doc to explain the stuff.
|
||||||
|
|
||||||
5. CSS: document pending
|
4. CSS: document pending
|
||||||
Explain which CSS styles we blocked and why.
|
Explain which CSS styles we blocked and why.
|
||||||
|
@ -141,12 +141,6 @@ the code. They may be upgraded to HTML files or stay as TXT scratchpads.</p>
|
|||||||
<td>List of vendor-specific tags we may want to transform to W3C compliant markup.</td>
|
<td>List of vendor-specific tags we may want to transform to W3C compliant markup.</td>
|
||||||
</tr>
|
</tr>
|
||||||
|
|
||||||
<tr>
|
|
||||||
<td>Reference</td>
|
|
||||||
<td><a href="ref-strictness.txt">Strictness</a></td>
|
|
||||||
<td>Short essay on how loose definition isn't really loose.</td>
|
|
||||||
</tr>
|
|
||||||
|
|
||||||
<tr>
|
<tr>
|
||||||
<td>Reference</td>
|
<td>Reference</td>
|
||||||
<td><a href="ref-html-modularization.txt">Modularization of HTMLDefinition</a></td>
|
<td><a href="ref-html-modularization.txt">Modularization of HTMLDefinition</a></td>
|
||||||
|
@ -1,6 +1,5 @@
|
|||||||
|
|
||||||
Configuration
|
Configuration
|
||||||
[needs updating]
|
|
||||||
|
|
||||||
Configuration is documented on a per-use case: if a class uses a certain
|
Configuration is documented on a per-use case: if a class uses a certain
|
||||||
value from the configuration object, it has to define its name and what the
|
value from the configuration object, it has to define its name and what the
|
||||||
@ -13,29 +12,10 @@ the documentation in ConfigDef for more information on these namespaces.
|
|||||||
|
|
||||||
Since configuration is dependent on context, internal classes require a
|
Since configuration is dependent on context, internal classes require a
|
||||||
configuration object to be passed as a parameter. (They also require a
|
configuration object to be passed as a parameter. (They also require a
|
||||||
Context object).
|
Context object). A majority of classes do not need the config object,
|
||||||
|
but for those that do, it is a lifesaver.
|
||||||
|
|
||||||
In relation to HTMLDefinition and CSSDefinition, there could be a special class
|
Definition objects are complex datatypes influenced by their respective
|
||||||
of directives that influence the *construction* of the Definition object.
|
directive namespaces (HTMLDefinition with HTML and CSSDefinition with CSS).
|
||||||
A theoretical call pattern would look like:
|
If any of these directives is updated, HTML Purifier forces the definition
|
||||||
|
to be regenerated.
|
||||||
1. Client calls Config->getHTMLDefinition()
|
|
||||||
2. Config calls HTMLDefinition->createNew(this)
|
|
||||||
3. HTMLDefinition constructs itself with base configuration
|
|
||||||
4. HTMLDefinition calls Config->get('HTML')
|
|
||||||
5. Config returns array of directives
|
|
||||||
6. HTMLDefinition performs operations and changes specified by directives
|
|
||||||
7. HTMLPurifier returns constructed definition
|
|
||||||
8. Config caches definition so it doesn't have to be generated again
|
|
||||||
9. Config returns definition
|
|
||||||
|
|
||||||
You could also override Config's copy of the definition with your own
|
|
||||||
custom copy, which OVERRIDES all directives. Only the base, vanilla copy
|
|
||||||
is the Singleton; the object actually interfaced with is an operated-upon
|
|
||||||
clone of that object. Also, if an update to the directives would update
|
|
||||||
the definition, you'd have to force reconstruction.
|
|
||||||
|
|
||||||
In practice, the pulling directives from the config object are
|
|
||||||
solely need-based, and the flex points are littered throughout the
|
|
||||||
setup() function. Some sort of refactoring is likely in order. See
|
|
||||||
ref-xhtml-1.1.txt for more info.
|
|
||||||
|
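The rule stated in the new text of configuration.txt (a definition is regenerated whenever a directive in its namespace changes) can be sketched roughly as below. This is an illustration in Python, not HTML Purifier's code: the `Config` class, its `set` and `get_definition` methods, and the `builds` counter are hypothetical names chosen for the example, and the "definition" is just a dictionary standing in for a real HTMLDefinition or CSSDefinition.

```python
class Config:
    """Hypothetical stand-in for a config object that caches definitions."""

    def __init__(self):
        self._directives = {}   # (namespace, name) -> value
        self._definitions = {}  # namespace -> cached definition
        self.builds = 0         # how many times a definition was generated

    def set(self, namespace, name, value):
        self._directives[(namespace, name)] = value
        # Any change inside a namespace invalidates that namespace's
        # cached definition; other namespaces are left alone.
        self._definitions.pop(namespace, None)

    def get_definition(self, namespace):
        if namespace not in self._definitions:
            self.builds += 1
            # Assumption: the definition is derived only from the
            # directives of its own namespace.
            self._definitions[namespace] = {
                name: value
                for (ns, name), value in self._directives.items()
                if ns == namespace
            }
        return self._definitions[namespace]


config = Config()
config.set("HTML", "Strict", True)
html_def = config.get_definition("HTML")   # generated on first request
same_def = config.get_definition("HTML")   # served from the cache
config.set("HTML", "Strict", False)        # forces regeneration next time
new_def = config.get_definition("HTML")
```

Dropping the cached copy on every write keeps the sketch simple; a real implementation could instead compare the old and new values and skip regeneration when nothing changed.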
@ -2,23 +2,16 @@
|
|||||||
Filter Levels
|
Filter Levels
|
||||||
When one size *does not* fit all
|
When one size *does not* fit all
|
||||||
|
|
||||||
The more I think about it, the less sense it makes for maintaining one huge
|
It makes little sense to constrain users to one set of HTML elements and
|
||||||
monolithic HTMLDefinition class. There's simply so much variation that
|
attributes and tell them that they are not allowed to mold this in
|
||||||
could go into this definition: the set of HTML good for blog entries is
|
any fashion. Many users demand to be able to custom-select which elements
|
||||||
definitely too large for HTML that would be allowed in blog comments. Going
|
and attributes they want. This is fine: because HTML Purifier keeps close
|
||||||
from Transitional to Strict requires changes to the definition.
|
track of what elements are safe to use, there is no way for them to
|
||||||
|
accidentally allow an XSS-able tag.
|
||||||
|
|
||||||
Allowing users to specify their own whitelists is one step (implemented, btw),
|
However, combing through the HTML spec to make your own whitelist can
|
||||||
but I have doubts on only doing this. Simply put, the typical programmer is too
|
be a daunting task. HTML Purifier ought to offer pre-canned filter levels
|
||||||
lazy to actually go through the trouble of investigating which tags, attributes
|
that amateur users can select based on what they think is their use-case.
|
||||||
and properties to allow. HTMLDefinition makes a big part of what HTMLPurifier
|
|
||||||
is.
|
|
||||||
|
|
||||||
The idea, then, is to setup fundamentally different set of definitions, which
|
|
||||||
can further be customized using simpler configuration options. Alternatively,
|
|
||||||
they could be implemented as configuration profiles, which simply load
|
|
||||||
a set of recommended directives to achieve a desired effect (no simpler
|
|
||||||
config options though).
|
|
||||||
|
|
||||||
Here are some fuzzy levels you could set:
|
Here are some fuzzy levels you could set:
|
||||||
|
|
||||||
@ -46,6 +39,10 @@ make forbidden element to text transformations desirable (for example, images).
|
|||||||
|
|
||||||
== Element Risk Analysis ==
|
== Element Risk Analysis ==
|
||||||
|
|
||||||
|
Although none of the currently supported elements presents a security
|
||||||
|
threat per se, some can cause problems for page layouts or be
|
||||||
|
extremely complicated.
|
||||||
|
|
||||||
Legend:
|
Legend:
|
||||||
[danger level] - regular tags / uncommon tags ~ deprecated tags
|
[danger level] - regular tags / uncommon tags ~ deprecated tags
|
||||||
[danger level]* - rare tags
|
[danger level]* - rare tags
|
||||||
@ -130,6 +127,7 @@ any CSS properties that are not currently implemented (such as position).
|
|||||||
Dangerous, can go outside container - float
|
Dangerous, can go outside container - float
|
||||||
Easy to abuse - font-size, font-family (font), width
|
Easy to abuse - font-size, font-family (font), width
|
||||||
Colored - background-color (background), border-color (border), color
|
Colored - background-color (background), border-color (border), color
|
||||||
|
(see proposal-colors.html)
|
||||||
Dramatic - border, list-style-position (list-style), margin, padding,
|
Dramatic - border, list-style-position (list-style), margin, padding,
|
||||||
text-align, text-indent, text-transform, vertical-align, line-height
|
text-align, text-indent, text-transform, vertical-align, line-height
|
||||||
|
|
||||||
|
@ -1,33 +0,0 @@
|
|||||||
|
|
||||||
Is HTML Purifier Strict or Transitional?
|
|
||||||
[rename/deprecation pending]
|
|
||||||
|
|
||||||
Despite the fact that HTML Purifier professes to support both transitional and
|
|
||||||
strict HTML, it rejects a lot of attributes and elements that are actually, indeed,
|
|
||||||
valid. You can investigate progress.html to find out precisely what we
|
|
||||||
are doing to these *deprecated* attributes.
|
|
||||||
|
|
||||||
However, users have found that Strict HTML imposes some quite unreasonable
|
|
||||||
restrictions on certain things. The start and value attributes in ol and
|
|
||||||
li (respectively) perhaps are the most contested. There is currently no
|
|
||||||
widely supported browser method short of JavaScript that can replace these
|
|
||||||
two deprecated attributes. It behooves us to allow these deprecated
|
|
||||||
attributes when the output is transitional.
|
|
||||||
|
|
||||||
Fortunately, that's the only real bugger case. The others have near-perfect
|
|
||||||
CSS equivalents, and were presentational anyway. However, the other question
|
|
||||||
pops up: should we always convert these to the CSS forms when 1. the spec
|
|
||||||
allows them anyway and 2. older browsers support them better? After all, the
|
|
||||||
whole point about CSS is to separate styling from content, so inline styling
|
|
||||||
doesn't solve that problem.
|
|
||||||
|
|
||||||
[new material]
|
|
||||||
|
|
||||||
HTML Purifier 1.7 creates a new organizational system for deprecated attribute/
|
|
||||||
element transformations. They will be unified under the title of "Tidy", which
|
|
||||||
is what they are: cleaning up after deprecated user markup into standards-compliant
|
|
||||||
versions. There will also be a change in the default behavior (although, to the
|
|
||||||
end user not inspecting the HTML, there will be no change: in fact, it may
|
|
||||||
work even better).
|
|
||||||
|
|
||||||
Consult the Advanced API for more details.
|
|
@ -2,8 +2,23 @@
|
|||||||
Web Hypertext Application Technology Working Group
|
Web Hypertext Application Technology Working Group
|
||||||
WHATWG
|
WHATWG
|
||||||
|
|
||||||
I don't think we need to worry about them. Untrusted users shouldn't be
|
== HTML 5 ==
|
||||||
submitting applications, eh? But if some interesting attribute pops up in
|
|
||||||
their spec, and might be worth supporting, stick it here.
|
|
||||||
|
|
||||||
HTML 5!!!
|
URL: http://www.whatwg.org/specs/web-apps/current-work/
|
||||||
|
|
||||||
|
HTML 5 defines a caboodle of new elements and attributes, as well as
|
||||||
|
some well-defined, "quirks mode" HTML parsing. Although WHATWG professes
|
||||||
|
to be targeted towards web applications, many of their semantic additions
|
||||||
|
would be quite useful in regular documents. Eventually, HTML
|
||||||
|
Purifier will need to audit their lists and figure out what changes need
|
||||||
|
to be made. This process is complicated by the fact that the WHATWG
|
||||||
|
doesn't buy into W3C's modularization of XHTML 1.1: we may need
|
||||||
|
to remodularize HTML 5 (probably done by section name). No sense in
|
||||||
|
committing ourselves till the spec stabilizes, though.
|
||||||
|
|
||||||
|
Of more immediate interest, however, is the well-defined parsing
|
||||||
|
behavior that HTML 5 adds. While I have little interest in writing
|
||||||
|
another DirectLex parser, other parsers like ph5p
|
||||||
|
<http://jero.net/lab/ph5p/> can be adapted to DOMLex to support much more
|
||||||
|
flexible HTML parsing (a cool feature I've seen is how they resolve
|
||||||
|
<b>bold<i>both</b>italic</i>).
|
||||||
|
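To make the mis-nested example <b>bold<i>both</b>italic</i> concrete, here is a minimal sketch in Python of one way such markup can be repaired: when a closing tag arrives out of order, close the elements opened after it, close the target, then reopen the ones that were interrupted. This is a simplified close-and-reopen strategy for illustration only; it is not the HTML 5 parsing algorithm, not ph5p's code, and not anything HTML Purifier ships. The function name `fix_nesting` is hypothetical, and the tokenizer only understands attribute-less tags.

```python
import re

# Matches an attribute-less open/close tag, or a run of text.
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)")


def fix_nesting(html):
    """Repair mis-nested tags by closing and reopening inner elements."""
    out = []
    stack = []  # names of currently open elements, outermost first
    for match in TOKEN.finditer(html):
        closing, name, text = match.groups()
        if text is not None:
            out.append(text)
        elif not closing:
            stack.append(name)
            out.append(f"<{name}>")
        elif name in stack:
            # Close everything opened after `name`, remembering the order.
            interrupted = []
            while stack[-1] != name:
                inner = stack.pop()
                out.append(f"</{inner}>")
                interrupted.append(inner)
            stack.pop()
            out.append(f"</{name}>")
            # Reopen the interrupted elements in their original order.
            for inner in reversed(interrupted):
                stack.append(inner)
                out.append(f"<{inner}>")
        # A closing tag with no matching open element is dropped.
    # Close anything still open at the end of input.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)


print(fix_nesting("<b>bold<i>both</b>italic</i>"))
# prints <b>bold<i>both</i></b><i>italic</i>
```

The word "both" stays bold and italic while "italic" stays italic only, so the visible formatting of the mis-nested input is preserved in well-formed output.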