mirror of https://github.com/ezyang/htmlpurifier.git synced 2024-12-22 16:31:53 +00:00

Update docs.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@700 48356398-32a2-884e-a903-53898d9a118a
Edward Z. Yang 2007-01-29 17:53:54 +00:00
parent 01c85b71d2
commit be264a4b20
2 changed files with 11 additions and 39 deletions

@@ -36,7 +36,7 @@ forgiving lexer. You may also be interested in the unit tests located in the
 tests/ folder, which provide a living document on how exactly the filter deals
 with malformed input.
-In summary:
+In summary (see corresponding classes for more details):
 1. Parse document into an array of tag and text tokens (Lexer)
 2. Remove all elements not on whitelist and transform certain other elements

@@ -6,45 +6,17 @@ through negligence of people. This class will do its job: no more, no less,
 and it's up to you to provide it the proper information and proper context
 to be effective. Things to remember:
-1. Character Encoding: UTF-8.
-This segment will soon be obsoleted by enduser-utf8.html
-Currently, the parser runs under the assumption that it is dealing
-with UTF-8. Not ISO-8859-1 or Windows-1252, UTF-8. And definitely not "no
-character encoding explicitly stated" or UTF-7. If you're not using UTF-8 as
-your character encoding, make sure you configure HTML Purifier or switch
-to UTF-8. Now. Also, make sure any input is properly converted to UTF-8, or
-the parser will mangle it badly (though it won't be a security risk if you're
-outputting it as UTF-8 though). Character encoding is, in general, a knotty
-issue, but do yourself a favor and learn about it:
-<http://www.joelonsoftware.com/articles/Unicode.html>
+1. Character Encoding: see enduser-utf8.html for more info.
-2. Doctype: XHTML 1.0 Transitional
-This is what the parser is outputting. For the most
-part, it's compatible with HTML 4.01, but XHTML enforces some very nice things
-that all web developers should use. Regardless, NO DOCTYPE is a NO. Quirks mode
-has waaaay too many quirks for a little parser to handle. We did not select
-strict in order to prevent ourselves from being too draconic on users, but
-this may be configurable in the future. Do you want standards compliance?
-The doctype is a good place to start.
+2. Doctype: document pending feature completion
+Not strictly necessary, actually. More in-depth discussion once we figure
+out how to get strict loose mode working.
-3. IDs
-This segment is obsoleted by enduser-id.html
-They need to be unique, but without some knowledge of the
-rest of the document, it's difficult to know what's unique. %Attr.IDBlacklist
-needs to be set: we may want to consider disallowing IDs by default to
-save lazy programmers.
+3. IDs: see enduser-id.html for more info
-4. [PROJECTED] Links
-We're not going to try for spam protection (although
-some hooks for such a module might be nice) but we may offer the ability to
-only accept relative URLs. Pick the one that's right for you.
+4. Links: document pending feature completion
+Rudimentary blacklisting, we should also allow only relative URIs. We
+need a doc to explain the stuff.
-5. CSS
-While we can prevent the most flagrant cases from affecting your
-layout (such as absolutely positioned elements), no amount of code is going
-to protect your pages from being attacked by garish colors and plain old
-bad taste. A neat feature would be the ability to define acceptable colors
-in a document, but that's not likely to be implemented for a while. In the
-meantime, be sure to make sure that floated elements (permitted, since they
-can be quite useful) can't mess up your layout. Once again, we may want to
-disable this by default to protect lazy developers.
+5. CSS: document pending
+Explain which CSS styles we blocked and why.