diff --git a/Doxyfile b/Doxyfile index c906929e..da12ad93 100644 --- a/Doxyfile +++ b/Doxyfile @@ -4,7 +4,7 @@ # Project related configuration options #--------------------------------------------------------------------------- PROJECT_NAME = HTML Purifier -PROJECT_NUMBER = 1.5.0 +PROJECT_NUMBER = 1.6.0 OUTPUT_DIRECTORY = "C:/Documents and Settings/Edward/My Documents/My Webs/htmlpurifier/docs/doxygen" CREATE_SUBDIRS = NO OUTPUT_LANGUAGE = English diff --git a/INSTALL b/INSTALL index 0013705c..5f41cfba 100644 --- a/INSTALL +++ b/INSTALL @@ -47,7 +47,9 @@ HTML Purifier is all about web-standards, so accordingly your webpages should be standards compliant. HTML Purifier can deal with these doctypes: * XHTML 1.0 Transitional (default) +* XHTML 1.0 Strict * HTML 4.01 Transitional +* HTML 4.01 Strict ...and these character encodings: @@ -87,7 +89,7 @@ into configuring things just for the heck of it, skip to 4.3). * Am I using UTF-8? * Am I using XHTML 1.0 Transitional? -If you answered yes to any of these questions, instantiate a configuration +If you answered no to any of these questions, instantiate a configuration object and read on: $config = HTMLPurifier_Config::createDefault(); diff --git a/INSTALL.fr.utf8 b/INSTALL.fr.utf8 new file mode 100644 index 00000000..625078e2 --- /dev/null +++ b/INSTALL.fr.utf8 @@ -0,0 +1,71 @@ + +Installation + Comment installer HTML Purifier + +Attention: Ce document a encode en UTF-8. Si les lettres avec les accents +est essoreuse, prenez un mieux editeur de texte. + +À L'Aide: Je ne suis pas un diseur natif de français. Si vous trouvez une +erreur dans ce document, racontez-moi! Merci. + + +L'installation de HTML Purifier est trés simple, parce qu'il ne doit pas +la configuration. Dans le pied de de document, les utilisateurs +impatient peuvent trouver le code, mais je recommande que vous lisez +ce document pour quelques choses. + + +1. Compatibilité + +HTML Purifier fonctionne dans PHP 4 et PHP 5. 
PHP 4.3.9 est le dernier +version que je le testais. Il ne dépend de les autre librairies. + +Les extensions optionnel est iconv (en général déjà installer) et +tidy (répandu aussi). Si vous utilisez UTF-8 et ne voulez pas +l'indentation, vous pouvez utiliser HTML Purifier sans ces extensions. + + +2. Inclure la librarie + +Utilisez: + + require_once '/path/to/library/HTMLPurifier.auto.php'; + +...quand vous devez utiliser HTML Purifier (ne inclure pas quand vous +ne devez pas, parce que HTML Purifier est trés grand.) + +Si vous n'aime pas que HTML Purifier change vos include_path, on peut +change vos include_path, et: + + require_once 'HTMLPurifier.php'; + +Seuleument les contents dans library/ est essentiel; vous peut enlever +les autre fichiers quand vous est dans une atmosphère professionnel. + + +[En cours de construction] + + +6. Installation vite + +Si votre site web est en UTF-8 et XHTML Transitional, utilisez: + +purify($html_salle); +?> + +Sinon, utilisez: + +set('Core', 'Encoding', 'ISO-8859-1'); //remplacez avec votre encoding + $config->set('Core', 'XHTML', true); //remplacez avec false si HTML 4.01 + $purificateur = new HTMLPurifier($config); + + $html_propre = $purificateur->purify($html_salle); +?> \ No newline at end of file diff --git a/NEWS b/NEWS index 9bd45a99..089922f0 100644 --- a/NEWS +++ b/NEWS @@ -9,6 +9,25 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier . Internal change ========================== +1.6.0, released 2007-04-01 +! Support for most common deprecated attributes via transformations: + + bgcolor in td, th, tr and table + + border in img + + name in a and img + + width in td, th and hr + + height in td, th +! Support for CSS attribute 'height' added +! 
Support for rel and rev attributes in a tags added, use %Attr.AllowedRel + and %Attr.AllowedRev to activate +- You can define ID blacklists using regular expressions via + %Attr.IDBlacklistRegexp +- Error messages are emitted when you attempt to "allow" elements or + attributes that HTML Purifier does not support + +1.5.1, unknown release date +- Fix segfault in unit test. The problem is not very reproduceable and + I don't know what causes it, but a six line patch fixed it. + 1.5.0, released 2007-03-23 ! Added a rudimentary I18N and L10N system modeled off MediaWiki. It doesn't actually do anything yet, but keep your eyes peeled. diff --git a/TODO b/TODO index 436f4fd9..9901a429 100644 --- a/TODO +++ b/TODO @@ -4,33 +4,35 @@ TODO List = KEY ==================== # Flagship - Regular - ? At-risk + ? Maybe I'll Do It ========================== -1.6 release - # Implement all non-essential attribute transforms, configurable +1.7 release [Advanced API] + # Complete advanced API, and fully document it + # Implement all edge-case attribute transforms + # Implement all deprecated tags and attributes + - Parse TinyMCE-style whitelist into our %HTML.Allow* whitelists (possibly + do this earlier) + +1.8 release [Refactor, refactor!] # URI validation routines tighter (see docs/dev-code-quality.html) (COMPLEX) # Advanced URI filtering schemes (see docs/proposal-new-directives.txt) + - Configuration profiles: predefined directives set with one func call + - Implement IDREF support (harder than it seems, since you cannot have + IDREFs to non-existent IDs) + - Allow non-ASCII characters in font names + +1.9 release [Error'ed] # Error logging for filtering/cleanup procedures - Requires I18N facilities to be created first (COMPLEX) - ? 
Configuration profiles: sets of directives that get set with one func call - XSS-attempt detection - - Implement IDREF support - -1.7 release - # Add pre-packaged "levels" of cleaning (custom behavior already done) - More fine-grained control over escaping behavior - Silently drop content inbetween SCRIPT tags (can be generalized to allow specification of elements that, when detected as foreign, trigger removal of children, although unbalanced tags could wreck havoc (or at least delete the rest of the document)). - - Allow specifying global attributes on a tag-by-tag basis in - %HTML.AllowAttributes - ? More user-friendly warnings when %HTML.Allow* attempts to specify a - tag or attribute that is not supported - - Parse TinyMCE whitelist into our %HTML.Allow* whitelists -1.8 release +1.10 release [Do What I Mean, Not What I Say] # Additional support for poorly written HTML - Microsoft Word HTML cleaning (i.e. MsoNormal, but research essential!) - Friendly strict handling of
(block ->It makes no sense to adopt a one-size-fits-all
approach to
-filtersets: therefore, users must be able to define their own sets of
-allowed
elements, as well as switch in-between doctypes of HTML.
HTML Purifier currently natively supports only a subset of HTML's +allowed elements, attributes, and behavior. This is by design, +but as the user is always right, they'll need some method to overload +these behaviors.
Our goals are to let the user:
@@ -26,20 +27,18 @@ filtersets: therefore, users must be able to define their own sets ofFor basic use, the user will have to specify some basic parameters. This +is not strictly necessary, as HTML Purifier's default setting will always +output safe code, but is required for standards-compliant output.
+By default, users will use a doctype-based, permissive but secure -whitelist. They must define a doctype, and this serves -as the first method of determining a filterset.
+The first thing to select is the doctype. This +is essential for standards-compliant output.
This identifier is based on the name the W3C has given to the document type and not @@ -61,117 +63,131 @@ the DTD identifier.
$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');-
However, selecting this doctype doesn't mean much, because if we -adhered exactly to the definition we would be letting XSS and other -nasties through. HTML Purifier must, in its filterset, allow a subset -of the doctype, which we shall call a filterset.
- -By default, HTML Purifier will use the Rich -filterset, which allows as many elements as possible with untrusted -sources. Other possible filtersets could be:
- -Extension-authors would be able to define custom filtersets for -other users to use.
- -A possible call to select a filterset would be:
- -$config->set('HTML', 'Filterset', 'Rich');+
Due to historical reasons, the default doctype is XHTML 1.0 +Transitional; however, we really shouldn't be guessing what the user's +doctype is. Fortunately, people who can't be bothered to set this won't +be bothered when their pages stop validating.
Within filtersets, there are various modes of operation. +
Within doctypes, there are various modes of operation. These indicate variant behaviors that, while not strictly changing the -allowed set of elements and attributes, will definitely affect the output. +allowed set of elements and attributes, definitely affect the output. Currently, we have two modes, which may be used together:
center
- tag would be turned into a div
with the CSS property
+ Deprecated elements and attributes will be transformed into + standards-compliant alternatives when explicitly disallowed.
+For example, in the XHTML 1.0 Strict doctype, a center
+ element would be turned into a div
with the CSS property
text-align:center;
, but in XHTML 1.0 Transitional
- the tag would be preserved. This mode is on by default.
center
tag would
- be transformed in both cases. However, tags without a
+ the element would be preserved.
+ This mode is on by default.
+Deprecated elements and attributes will be transformed into + standards-compliant alternatives whenever possible. + It may have various levels of operation.
+Referring back to the previous example, the center
element would
+ be transformed in both cases. However, elements without a
reasonable standards-compliant alternative will be preserved
- in their form. This mode is on by default. It may have
- various levels of operation.
A user may want to correct certain deprecated attributes, but
+ not others. For example, the bgcolor
attribute may be
+ acceptable, but the center
element not; also, possibly,
+ an HTML Purifier transformation may be buggy, so the user wants
+ to forgo it. Thus, correctional accepts an array defining which
+    elements and attributes to clean up, or no parameter at all, which
+ means everything gets corrected. This also means that each
+ correction needs to be given a unique ID that can be referenced
+ in this manner. (We may also allow globbing, like *.name or a.*
+ for mass-enabling correction, and subtractive mode, where things
+ specified stop correction.) This array gets passed into the
+ constructor of the mode's module.
This mode is on by default.
+A possible call to select modes would be:
$config->set('HTML', 'Mode', array('correctional', 'lenient'));-
If modes have extra parameters, a hash might work well:
+If modes have extra parameters, a hash is necessary:
$config->set('HTML', 'Mode', array( - 'correctional' => 9, // strongest level + 'correctional' => 'center,a.name', 'lenient' => true // this one's just boolean ));-
Modes may possibly be wrapped up with the filterset declaration:
+Modes may be specified along with the doctype declaration (we may want +to get a better set of separator characters):
-$config->set('HTML', 'Filterset', 'Rich: correctional, lenient');+
$config->setDoctype('XHTML 1.0 Transitional', '+correctional[center,a.name] -lenient');-
Further investigation in this field is necessary.
+
+With regard to the various levels of operation conjectured in the
+Correctional mode, this is prompted by the fact that a user may want to
+correct certain problems but not others, for example, fix the center
+element but not the u
element, both of which are deprecated.
+Having an integer level
will not work very well for such fine
+grained tweaking, but an array of specific settings might.
If this cookie cutter approach doesn't appeal to a user, they may -decide to roll their own filterset by selecting modules, tags and +decide to roll their own filterset by selecting modules, elements and attributes to allow.
This would make use of the same facilities
as a filterset author would use, except that it would go under an
anonymous
filterset that would be auto-selected if any of the
-relevant module/tag/attribute selection configuration directives were
+relevant module/element/attribute selection configuration directives were
non-null.
On the highest level, a user will usually be most interested in -directly specifying which elements and attributes are desired. For -example:
+In practice, this is the most commonly demanded feature. Most users are +perfectly happy defining a filterset that looks like:
-$config->set('HTML', 'AllowedElements', 'a,b,em,p,blockquote,code,i');+
$config->setAllowedHTML('a[href,title];em;p;blockquote');-
Attribute declarations could be merged into this declaration as such:
+The directive %HTML.Allowed is a convenience function +that may be fully expressed with the legacy interface, and thus is +given its own setter.
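One way to see how the combined form maps onto the legacy interface is to split the string into the two lists that %HTML.AllowedElements and %HTML.AllowedAttributes expect. The following is only a minimal sketch, not HTML Purifier's actual parser: the function name split_allowed_html is hypothetical, and the sketch assumes well-formed input (no nested or unbalanced brackets).

```php
<?php
// Hypothetical helper, not part of HTML Purifier: splits a combined
// allowed-HTML string such as 'a[href,title];em' into an element list
// and an attribute list in the "element.attribute" form.
function split_allowed_html($allowed)
{
    $elements = array();
    $attributes = array();
    foreach (explode(';', $allowed) as $chunk) {
        $chunk = trim($chunk);
        if ($chunk === '') {
            continue; // tolerate a trailing separator
        }
        // "a[href,title]" -> element "a", attribute list "href,title"
        if (preg_match('/^([^\[\]]+)\[([^\[\]]*)\]$/', $chunk, $m)) {
            $element = trim($m[1]);
            foreach (explode(',', $m[2]) as $attr) {
                $attr = trim($attr);
                if ($attr !== '') {
                    $attributes[] = $element . '.' . $attr;
                }
            }
        } else {
            $element = $chunk; // bare element with no attribute list
        }
        $elements[] = $element;
    }
    return array(implode(',', $elements), implode(',', $attributes));
}

list($elements, $attributes) = split_allowed_html('a[href,title];em;p;blockquote');
// The two strings could then be handed to the separated directives, e.g.
// $config->set('HTML', 'AllowedElements', $elements);
// $config->set('HTML', 'AllowedAttributes', $attributes);
```

Malformed chunks (for example a missing closing bracket) are passed through as element names here; a real implementation would presumably report them instead.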
-$config->set('HTML', 'Allowed', 'a[href,title],b,em,p[class],blockquote[cite],code,i');+
We currently support a separated interface, which also must be preserved:
-...or be kept separate:
+$config->set('HTML', 'AllowedElements', 'a,em,p,blockquote'); +$config->set('HTML', 'AllowedAttributes', 'a.href,a.title');-
$config->set('HTML', 'AllowedAttributes', 'a.href,a.title,p.class,blockquote.cite');+
A user may also choose to allow modules:
+ +$config->set('HTML', 'AllowedModules', 'Hypertext,Text,Lists'); // or +$config->setAllowedHTML('Hypertext,Text,Lists');+ +
But it is not expected that this feature will be widely used.
+ +The granularity of these modules is too coarse for
+the average user (for example, the core module loads everything from
+the essential p
element to the not-so-safe h1
+element). How do we make this still a viable solution? Possible answers
+may be sub-modules or module parameters. This may not even be a problem,
+considering that most people won't be selecting modules.
Modules are distinguished from regular elements by the +case of their first letter. While XML distinguishes between and allows +lower and uppercase letters in element names, most well-known XML +languages use only lower-case +element names for sake of consistency.
Considering that, internally speaking, as mandated by the XHTML 1.1 Modularization specification, we have organized our elements around modules, considerable gymnastics will be needed to get this sort of functionality working.
-A user may also specify a module to load a class of elements and attributes -into their filterest:
- -$config->set('HTML', 'Allowed', 'Hypertext,Core');- -
The granularity of these modules is too coarse for
-the average user (for example, the core module loads everything from
-the essential p
tag to the not-so-safe h1
-tag). How do we make this still a viable solution?
Because selecting each and every one of these configuration options @@ -183,6 +199,89 @@ for selecting a filterset. Possibility:
...which is simply a light wrapper over the individual configuration calls. A custom config file format or text format could also be adopted.
By reviewing topic posts in the support forum, we determined that +there were two customization features people wanted most: +to add an attribute to an existing element, and to add an element. +Thus, we'll want to create convenience functions for these common +use-cases.
+ +Note that the functions described here are only available if
+a raw copy of HTMLPurifier_HTMLDefinition
was retrieved.
+addAttribute
may work on a processed copy, but for
+consistency's sake we will mandate this for everything.
An attribute is bound to an element by a name and has a specific
+AttrDef
that validates it. Thus, the interface should
+be:
function addAttribute($element, $attribute, $attribute_def);+ +
With a use-case that looks like:
+ +$def->addAttribute('a', 'rel', new HTMLPurifier_AttrDef_Enum(array('nofollow')));+ +
The $attribute_def
value can be a little flexible,
+to make things simpler. We'll let it also be:
HTMLPurifier_AttrDef_Anonymous
+ class with that function registered as a callback.HTMLPurifier_AttrTypes
+ enum(
: We'll explode it and stuff it in an
+ HTMLPurifier_AttrDef_Enum
 for you. This allows the previous example to be written as:
+ +$def->addAttribute('a', 'rel', 'enum(nofollow)');+ +
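The "explode it and stuff it in an HTMLPurifier_AttrDef_Enum" step for the enum( shorthand could look roughly like the sketch below. The helper name parse_enum_spec is an assumption made for illustration, as is the choice to return null for anything that is not an enum( string, so that other forms of $attribute_def can be handled elsewhere.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: returns the list of
// allowed values when $spec has the form "enum(a,b,c)", or null when the
// string is some other kind of attribute definition.
function parse_enum_spec($spec)
{
    if (!preg_match('/^enum\((.*)\)$/', $spec, $m)) {
        return null;
    }
    $values = array();
    foreach (explode(',', $m[1]) as $value) {
        $value = trim($value);
        if ($value !== '') {
            $values[] = $value; // skip empty entries such as "enum(a,,b)"
        }
    }
    return $values;
}

$values = parse_enum_spec('enum(nofollow)');
// Under the proposal, the values would then be wrapped in the enum validator:
// new HTMLPurifier_AttrDef_Enum($values)
```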
An element requires certain information as specified by
+HTMLPurifier_ElementDef
. However, not all of it is necessary;
+the usual things required are:
This suggests an API like this:
+ +function addElement($element, $type, $content_model, $attributes = array());+ +
Each parameter explained in depth:
+ +$element
$type
$content_model
HTMLPurifier_ElementDef
's member variables
+ $content_model
and $content_model_type
,
+ where the form is Type: Model, ex. 'Optional: Inline'.
$attributes
A possible usage:
+ +$def->addElement('font', 'Inline', 'Optional: Inline', + array(0 => array('Common'), 'color' => 'Color'));+ +
We may want the Common attribute collection to be included +by default.
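The 'Type: Model' form accepted by $content_model has to be separated into the two member variables it stands for. A minimal sketch of that split follows; the function name split_content_model is hypothetical, and returning null for a string without a colon is an assumption of this sketch rather than documented behavior.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: splits a string such as
// 'Optional: Inline' into the content model type and the content model.
function split_content_model($spec)
{
    // Limit of 2 so that any colon inside the model itself is left untouched.
    $parts = explode(':', $spec, 2);
    if (count($parts) !== 2) {
        return null; // assumption: a missing colon is treated as invalid
    }
    return array(
        'content_model_type' => trim($parts[0]),
        'content_model'      => trim($parts[1]),
    );
}

$model = split_content_model('Optional: Inline');
// $model['content_model_type'] holds 'Optional',
// $model['content_model'] holds 'Inline'.
```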
+