Mirror of https://github.com/ezyang/htmlpurifier.git (synced 2024-12-22 16:31:53 +00:00)
Update docs.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@700 48356398-32a2-884e-a903-53898d9a118a
parent 01c85b71d2
commit be264a4b20
@@ -36,7 +36,7 @@ forgiving lexer. You may also be interested in the unit tests located in the
 tests/ folder, which provide a living document on how exactly the filter deals
 with malformed input.

-In summary:
+In summary (see corresponding classes for more details):

 1. Parse document into an array of tag and text tokens (Lexer)
 2. Remove all elements not on whitelist and transform certain other elements
@@ -6,45 +6,17 @@ through negligence of people. This class will do its job: no more, no less,
 and it's up to you to provide it the proper information and proper context
 to be effective. Things to remember:

-1. Character Encoding: UTF-8.
-This segment will soon be obsoleted by enduser-utf8.html
-Currently, the parser runs under the assumption that it is dealing
-with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no
-character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as
-your character encoding, make sure you configure HTML Purifier or switch
-to UTF-8. Now. Also, make sure any input is properly converted to UTF-8, or
-the parser will mangle it badly (though it won't be a security risk if you're
-outputting it as UTF-8 though). Character encoding is, in general, a knotty
-issue, but do yourself a favor and learn about it:
-<http://www.joelonsoftware.com/articles/Unicode.html>
+1. Character Encoding: see enduser-utf8.html for more info.

-2. Doctype: XHTML 1.0 Transitional
-This is what the parser is outputting. For the most
-part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
-that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
-has waaaay too many quirks for a little parser to handle. We did not select
-strict in order to prevent ourselves from being too draconic on users, but
-this may be configurable in the future. Do you want standards compliance?
-The doctype is a good place to start.
+2. Doctype: document pending feature completion
+Not strictly necessary, actually. More in-depth discussion once we figure
+out how to get strict loose mode working.

-3. IDs
-This segment is obsoleted by enduser-id.html
-They need to be unique, but without some knowledge of the
-rest of the document, it's difficult to know what's unique. %Attr.IDBlacklist
-needs to be set: we may want to consider disallowing IDs by default to
-save lazy programmers.
+3. IDs: see enduser-id.html for more info

-4. [PROJECTED] Links
-We're not going to try for spam protection (although
-some hooks for such a module might be nice) but we may offer the ability to
-only accept relative URLs. Pick the one that's right for you.
+4. Links: document pending feature completion
+Rudimentary blacklisting, we should also allow only relative URIs. We
+need a doc to explain the stuff.

-5. CSS
-While we can prevent the most flagrant cases from affecting your
-layout (such as absolutely positioned elements), no amount of code is going
-to protect your pages from being attacked by garish colors and plain old
-bad taste. A neat feature would be the ability to define acceptable colors
-in a document, but that's not likely to be implemented for a while. In the
-meantime, be sure to make sure that floated elements (permitted, since they
-can be quite useful) can't mess up your layout. Once again, we may want to
-disable this by default to protect lazy developers.
+5. CSS: document pending
+Explain which CSS styles we blocked and why.