mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-12-23 00:41:52 +00:00
Update documentation.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@147 48356398-32a2-884e-a903-53898d9a118a
parent 26733183b7
commit f0deae1fc0

docs/spec.txt | 113
@@ -56,111 +56,6 @@ all this functionality (or not written in HTML at all).
 
 The rest of this document is pending moving into their associated classes.
 
-
-
-
-
-
-
-
-
-
-
-
-
-== STAGE 4 - check nesting ==
-
-Status: B (table custom definition needs to be implemented)
-
-We know that the document is now well formed. The tokenizer should now take
-things in nodes: when you hit a start tag, keep on going until you get its
-ending tag, and then handle everything inside there. Fortunately, no
-fancy recursion is necessary as going to the next node is as simple as
-scrolling to the next start tag.
-
-Suppose we have a node and encounter a problem with one of its children.
-Depending on the complexity of the rule, we will either delete the children,
-or delete the entire node itself.
-
-The simplest type of rule is zero or more valid elements, denoted like:
-
-( el1 | el2 | el3 )*
-
-The next simplest is with one or more valid elements:
-
-( li )+
-
-And then you have complex cases:
-
-table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))
-map ((%block; | form | %misc;)+ | area+)
-html (head, body)
-head (%head.misc;,
-((title, %head.misc;, (base, %head.misc;)?) |
-(base, %head.misc;, (title, %head.misc;))))
-
-Each of these has to be dealt with. Case 1 is a joy, because you can zap
-as many as you want, but you'll never actually have to kill the node. Two
-and three need the entire node to be killed if you have a problem. This
-can be problematic, as the missing node might cause its parent node to now
-be incorrect. Granted, it's unlikely, and I'm fairly certain that HTML, let
-alone the simplified set I'm allowing, won't have this problem, but it's worth
-checking for.
-
-The way, I suppose, one would check for it, is whenever a node is removed,
-scroll to its parent start, and re-evaluate it. Make sure you're able to do
-that with minimal code repetition.
-
-The most complex case can probably be done by using some fancy regular
-expressions and transformations. However, it doesn't seem right that, say,
-a stray <b> in a <table> can cause the entire table to be removed. Depending
-on how much work we want to do, this will at least need a custom child
-definition, and at most require extra element bubbling capabilities to be
-added.
-
---
-
-So, the way we define these cases should work like this:
-
-class ChildDef with validateChildren($children_tags)
-
-The function needs to parse into nodes, then into the regex array.
-It can result in one of three actions: the removal of the entire parent node,
-replacement of all of the original child tags with a new set of child
-tags which it returns, or no changes. They shall be denoted as, respectively,
-
-Remove entire parent node = false
-Replace child tags with this = array of tags
-No changes = true
-
-If we remove the entire parent node, we must scroll back to the parent of the
-parent.
-
---
-
-Also, what do we do with elements if they're not allowed somewhere? We need
-some sort of default behavior. I reckon that we should be allowed to:
-
-1. Delete the node
-2. Translate it into text (not okay for areas that don't allow #PCDATA)
-3. Move the node to somewhere where it is okay
-
-What complicates the matter is that Firefox has the ability to construct
-DOMs and render invalid nestings of elements (like <b><div>asdf</div></b>).
-This means that behavior for stray pcdata in ul/ol is undefined. Stray
-data in a table gets bubbled to the start of the table (assuming
-that we actually custom-make the table child validation class).
-
-So... I say delete the node when PCDATA isn't allowed (or the regex is too
-complicated to determine where PCDATA could be inserted), and translate the node
-to text when PCDATA is allowed.
-
---
-
-ins/del are allowed in block and inline content, but it is
-inappropriate to include block content within an ins element
-occurring in inline content. How would we fix this?
-
 == STAGE 4 - check attributes ==
 
 STATUS: F (currently implementing core/i18n)
@@ -261,11 +156,3 @@ These are the elements that only have %attrs and need an alignment transform
 ----
 
 Prepend style transformations, as CSS takes precedence.
-
-== PART 5 - stringify ==
-
-Status: A+ (done completely!)
-
-Should be fairly simple as long as we delegate to appropriate functions.
-It's probably too much trouble to indent the stuff properly, so just output
-stuff raw.
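
The three-way return contract described in the removed spec text (false = remove the parent node, array = replace the child tags, true = no changes) can be illustrated with a small sketch. This is not the code from the repository: only the names ChildDef and validateChildren($children_tags) come from the spec; the subclasses ChildDefOptional and ChildDefRequired, and the choice to pass plain tag names instead of tokens, are assumptions made for illustration. It covers only the two simple content models, ( el1 | el2 | el3 )* and ( li )+.

```php
<?php
// Hypothetical sketch. ChildDef and validateChildren() are named in the
// spec; the subclasses and the use of plain tag-name strings are assumptions.

abstract class ChildDef
{
    /**
     * @param string[] $children_tags names of the direct child elements
     * @return bool|string[] true  = no changes,
     *                       false = remove the entire parent node,
     *                       array = replacement list of child tags
     */
    abstract public function validateChildren(array $children_tags);
}

// Content model ( el1 | el2 | ... )* : invalid children are dropped,
// but the parent node itself never has to be removed.
class ChildDefOptional extends ChildDef
{
    protected $allowed;

    public function __construct(array $allowed)
    {
        // Flip into a lookup table so membership is a simple isset().
        $this->allowed = array_flip($allowed);
    }

    public function validateChildren(array $children_tags)
    {
        $result = array();
        foreach ($children_tags as $tag) {
            if (isset($this->allowed[$tag])) {
                $result[] = $tag;
            }
        }
        if (count($result) === count($children_tags)) {
            return true; // nothing was dropped
        }
        return $result;
    }
}

// Content model ( el1 | ... )+ : same filtering, but if no valid child
// is left the parent node has to go.
class ChildDefRequired extends ChildDefOptional
{
    public function validateChildren(array $children_tags)
    {
        $result = parent::validateChildren($children_tags);
        if ($result === true) {
            // Every child is valid, but there must be at least one.
            return count($children_tags) > 0;
        }
        return count($result) > 0 ? $result : false;
    }
}

$inline = new ChildDefOptional(array('b', 'i', 'em'));
$list   = new ChildDefRequired(array('li'));

$a = $inline->validateChildren(array('b', 'div', 'em')); // array('b', 'em')
$b = $list->validateChildren(array('li', 'li'));         // true
$c = $list->validateChildren(array('b'));                // false
```

The complex cases from the spec (table, map, head) do not fit this shape; as the spec itself notes, they would need a custom child definition.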
@@ -3,8 +3,33 @@
 require_once 'HTMLPurifier/Strategy.php';
 require_once 'HTMLPurifier/Definition.php';
 
-// EXTRA: provide a mechanism for elements to be bubbled OUT of a node
-// or "Replace Nodes while including the parent nodes too"
+/**
+ * Takes a well formed list of tokens and fixes their nesting.
+ *
+ * HTML elements dictate which elements are allowed to be their children,
+ * for example, you can't have a p tag in a span tag. Other elements have
+ * much more rigorous definitions: tables, for instance, require a specific
+ * order for their elements. There are also constraints not expressible by
+ * document type definitions, such as the chameleon nature of ins/del
+ * tags and global child exclusions.
+ *
+ * The first major objective of this strategy is to iterate through all the
+ * nodes (not tokens) of the list of tokens and determine whether or not
+ * their children conform to the element's definition. If they do not, the
+ * child definition may optionally supply an amended list of elements that
+ * is valid or require that the entire node be deleted (and the previous
+ * node rescanned).
+ *
+ * The second objective is to ensure that explicitly excluded elements of
+ * an element do not appear in its children. Code that accomplishes this
+ * task is pervasive through the strategy, though the two are distinct tasks
+ * and could, theoretically, be separated (although it's not recommended).
+ *
+ * @note Whether or not unrecognized children are silently dropped or
+ *       translated into text depends on the child definitions.
+ *
+ * @todo Enable nodes to be bubbled out of the structure.
+ */
 
 class HTMLPurifier_Strategy_FixNesting extends HTMLPurifier_Strategy
 {
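
The first objective in the docblock above (check every node's children against its definition, and remove a node whose children cannot be repaired) can be sketched roughly as follows. This is a simplified illustration, not the strategy's actual implementation: it works on a nested array tree instead of a token list, ignores text nodes and exclusions, and the names fixNode(), 'allowed' and 'required' are hypothetical. It also validates bottom-up, so a parent is only checked after its children are settled, which sidesteps the explicit "rescan the previous node" step rather than reproducing it.

```php
<?php
// Hypothetical sketch: the tree shape, fixNode() and the $defs layout are
// assumptions, not part of HTML Purifier.

/**
 * @param array $node array('name' => string, 'children' => array of nodes)
 * @param array $defs element name => array('allowed' => string[],
 *                                          'required' => bool)
 * @return array|null the repaired node, or null if it must be removed
 */
function fixNode(array $node, array $defs)
{
    // Elements without a definition are treated as unknown and removed.
    if (!isset($defs[$node['name']])) {
        return null;
    }
    $def = $defs[$node['name']];

    $kept = array();
    foreach ($node['children'] as $child) {
        // Children that are not allowed here are dropped outright.
        if (!in_array($child['name'], $def['allowed'], true)) {
            continue;
        }
        // Repair the child's own subtree first; null means it was removed.
        $fixed = fixNode($child, $defs);
        if ($fixed !== null) {
            $kept[] = $fixed;
        }
    }

    // A "one or more" content model with nothing left kills the node.
    if ($def['required'] && count($kept) === 0) {
        return null;
    }
    $node['children'] = $kept;
    return $node;
}

$defs = array(
    'div' => array('allowed' => array('ul', 'b'), 'required' => false),
    'ul'  => array('allowed' => array('li'),      'required' => true),
    'li'  => array('allowed' => array('b'),       'required' => false),
    'b'   => array('allowed' => array(),          'required' => false),
);

// <div><ul><b></b></ul><b></b></div>
$tree = array('name' => 'div', 'children' => array(
    array('name' => 'ul', 'children' => array(
        array('name' => 'b', 'children' => array()),
    )),
    array('name' => 'b', 'children' => array()),
));

// The ul loses its stray b, is left without any li, and is removed;
// the div keeps only its own b child.
$fixed = fixNode($tree, $defs);
```

A token-based version would have to locate each node's start and end tags in the flat list before doing the same check, which is where the rescanning described in the docblock comes in.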