mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-12-22 16:31:53 +00:00
Update documentation.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@147 48356398-32a2-884e-a903-53898d9a118a
This commit is contained in:
parent
26733183b7
commit
f0deae1fc0
113
docs/spec.txt
@@ -56,111 +56,6 @@ all this functionality (or not written in HTML at all).

The rest of this document is pending moving into their associated classes.

== STAGE 4 - check nesting ==

Status: B (table custom definition needs to be implemented)

We know that the document is now well formed. The tokenizer should now take
things in nodes: when you hit a start tag, keep on going until you get its
ending tag, and then handle everything inside there. Fortunately, no
fancy recursion is necessary as going to the next node is as simple as
scrolling to the next start tag.
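
The node-at-a-time walk described above can be sketched roughly as follows.
This is only an illustration: the token shape (plain arrays with a 'type'
key) and the function names findNodeEnd and nextStartTag are assumptions made
for the example, not the library's actual token classes or API.

```php
<?php
// Hypothetical token shape, for illustration only:
// ['type' => 'start'|'end'|'text', 'name' => ..., 'data' => ...]

/**
 * Returns the index of the end tag that closes the start tag at $start.
 * Assumes the token list is already well formed, so a simple depth
 * counter is enough and no recursion is needed.
 */
function findNodeEnd(array $tokens, int $start): int
{
    $depth = 0;
    for ($i = $start; $i < count($tokens); $i++) {
        if ($tokens[$i]['type'] === 'start') {
            $depth++;
        } elseif ($tokens[$i]['type'] === 'end') {
            $depth--;
            if ($depth === 0) {
                return $i;
            }
        }
    }
    throw new InvalidArgumentException('Token list is not well formed');
}

/** Returns the index of the next start tag after $index, or null if none. */
function nextStartTag(array $tokens, int $index): ?int
{
    for ($i = $index + 1; $i < count($tokens); $i++) {
        if ($tokens[$i]['type'] === 'start') {
            return $i;
        }
    }
    return null;
}

$tokens = [
    ['type' => 'start', 'name' => 'ul'],   // index 0
    ['type' => 'start', 'name' => 'li'],   // index 1
    ['type' => 'text',  'data' => 'item'], // index 2
    ['type' => 'end',   'name' => 'li'],   // index 3
    ['type' => 'end',   'name' => 'ul'],   // index 4
];
```

With this list, the ul node spans indexes 0 to 4 and the li node spans 1 to 3;
moving on to the next node is just a forward scan for the next start tag.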

Suppose we have a node and encounter a problem with one of its children.
Depending on the complexity of the rule, we will either delete the children,
or delete the entire node itself.

The simplest type of rule is zero or more valid elements, denoted like:

    ( el1 | el2 | el3 )*

The next simplest is with one or more valid elements:

    ( li )+

And then you have complex cases:

    table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))
    map ((%block; | form | %misc;)+ | area+)
    html (head, body)
    head (%head.misc;,
         ((title, %head.misc;, (base, %head.misc;)?) |
          (base, %head.misc;, (title, %head.misc;))))

Each of these has to be dealt with. Case 1 is a joy, because you can zap
as many as you want, but you'll never actually have to kill the node. Two
and three need the entire node to be killed if you have a problem. This
can be problematic, as the missing node might cause its parent node to now
be incorrect. Granted, it's unlikely, and I'm fairly certain that neither
HTML nor the simplified set I'm allowing will have this problem, but it's
worth checking for.

The way, I suppose, one would check for it, is whenever a node is removed,
scroll to its parent start, and re-evaluate it. Make sure you're able to do
that with minimal code repetition.

The most complex case can probably be done by using some fancy regular
expressions and transformations. However, it doesn't seem right that, say,
a stray <b> in a <table> can cause the entire table to be removed. Depending
on how much work we want to do, this will at least need a custom child
definition, and at most require extra element bubbling capabilities to be
added.

--

So, the way we define these cases should work like this:

    class ChildDef with validateChildren($children_tags)

The function needs to parse into nodes, then into the regex array.
It can result in one of three actions: the removal of the entire parent node,
replacement of all of the original child tags with a new set of child
tags which it returns, or no changes. They shall be denoted as, respectively,

    Remove entire parent node = false
    Replace child tags with this = array of tags
    No changes = true
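
A minimal sketch of that three-way return convention, for the "( li )+" style
of rule. The concrete class RequiredChildDef is hypothetical, and for brevity
the children are assumed to be already reduced to a flat list of tag names
rather than real tokens; only the ChildDef / validateChildren naming and the
false / array / true convention come from the text above.

```php
<?php
// Hypothetical sketch of the ChildDef contract described above.

abstract class ChildDef
{
    /**
     * @return bool|array false = remove the entire parent node,
     *                    array = replacement child tags,
     *                    true  = no changes
     */
    abstract public function validateChildren(array $children_tags);
}

/** Implements ( el1 | el2 | ... )+ over a flat list of child tag names. */
class RequiredChildDef extends ChildDef
{
    private $allowed;

    public function __construct(array $allowed)
    {
        $this->allowed = $allowed;
    }

    public function validateChildren(array $children_tags)
    {
        // Keep only the children this rule allows, preserving order.
        $kept = [];
        foreach ($children_tags as $name) {
            if (in_array($name, $this->allowed, true)) {
                $kept[] = $name;
            }
        }
        if (count($kept) === 0) {
            return false; // "one or more" is unsatisfiable: remove the parent
        }
        if (count($kept) === count($children_tags)) {
            return true;  // nothing was dropped: no changes
        }
        return $kept;     // replacement set of child tags
    }
}

$def = new RequiredChildDef(['li']);
```

A caller would branch on the result with strict comparisons (=== false,
=== true), since a non-empty array is also truthy in PHP.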

If we remove the entire parent node, we must scroll back to the parent of the
parent.

--

Also, what do we do with elements if they're not allowed somewhere? We need
some sort of default behavior. I reckon that we should be allowed to:

1. Delete the node
2. Translate it into text (not okay for areas that don't allow #PCDATA)
3. Move the node to somewhere where it is okay

What complicates the matter is that Firefox has the ability to construct
DOMs and render invalid nestings of elements (like <b><div>asdf</div></b>).
This means that behavior for stray pcdata in ul/ol is undefined. Data in a
table gets bubbled to the start of the table (assuming that we actually
custom-make the table child validation class).

So... I say delete the node when PCDATA isn't allowed (or the regex is too
complicated to determine where PCDATA could be inserted), and translate the node
to text when PCDATA is allowed.
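
That default policy is small enough to sketch directly. The function name
handleDisallowedNode and the array token shape are assumptions for the
example; the node is represented here simply by its raw markup.

```php
<?php
/**
 * Hypothetical helper: decides what replaces a node that is not allowed
 * where it appears. Returns null to delete the node, or a text token
 * carrying the node's raw markup when the parent allows #PCDATA.
 */
function handleDisallowedNode(string $raw_markup, bool $pcdata_allowed): ?array
{
    if (!$pcdata_allowed) {
        return null; // e.g. directly inside ul/ol: delete the node
    }
    // The markup becomes plain character data; it is escaped on output,
    // so it shows up as visible text instead of being parsed as a tag.
    return ['type' => 'text', 'data' => $raw_markup];
}
```

For instance, a stray <b>x</b> directly inside a ul would be dropped, while
the same node inside an element that allows #PCDATA but not b would survive
as literal text.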

--

ins/del are allowed in block and inline content, but it is
inappropriate to include block content within an ins element
occurring in inline content. How would we fix this?

== STAGE 4 - check attributes ==

STATUS: F (currently implementing core/i18n)

@@ -261,11 +156,3 @@ These are the elements that only have %attrs and need an alignment transform
----

Prepend style transformations, as CSS takes precedence.

== PART 5 - stringify ==

Status: A+ (done completely!)

Should be fairly simple as long as we delegate to appropriate functions.
It's probably too much trouble to indent the stuff properly, so just output
stuff raw.
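
As a rough illustration of "delegate and output raw", the sketch below turns
a flat token list back into a string with no indentation or pretty-printing.
The token shape and the function names stringifyTokens and
stringifyAttributes are assumptions for the example, not the actual
generator.

```php
<?php
/** Hypothetical helper: renders an attribute map as ' key="value"' pairs. */
function stringifyAttributes(array $attr): string
{
    $out = '';
    foreach ($attr as $key => $value) {
        $out .= ' ' . $key . '="' . htmlspecialchars($value, ENT_COMPAT, 'UTF-8') . '"';
    }
    return $out;
}

/** Hypothetical helper: concatenates tokens as-is, with no indentation. */
function stringifyTokens(array $tokens): string
{
    $html = '';
    foreach ($tokens as $token) {
        switch ($token['type']) {
            case 'start':
                $html .= '<' . $token['name']
                       . stringifyAttributes($token['attr'] ?? []) . '>';
                break;
            case 'end':
                $html .= '</' . $token['name'] . '>';
                break;
            case 'text':
                // Character data is escaped so it cannot reintroduce markup.
                $html .= htmlspecialchars($token['data'], ENT_COMPAT, 'UTF-8');
                break;
        }
    }
    return $html;
}

$tokens = [
    ['type' => 'start', 'name' => 'a', 'attr' => ['href' => '?a=1&b=2']],
    ['type' => 'text',  'data' => 'x < y'],
    ['type' => 'end',   'name' => 'a'],
];
echo stringifyTokens($tokens), "\n";
```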
@@ -3,8 +3,33 @@
require_once 'HTMLPurifier/Strategy.php';
require_once 'HTMLPurifier/Definition.php';

// EXTRA: provide a mechanism for elements to be bubbled OUT of a node
// or "Replace Nodes while including the parent nodes too"
/**
 * Takes a well formed list of tokens and fixes their nesting.
 *
 * HTML elements dictate which elements are allowed to be their children,
 * for example, you can't have a p tag in a span tag. Other elements have
 * much more rigorous definitions: tables, for instance, require a specific
 * order for their elements. There are also constraints not expressible by
 * document type definitions, such as the chameleon nature of ins/del
 * tags and global child exclusions.
 *
 * The first major objective of this strategy is to iterate through all the
 * nodes (not tokens) of the list of tokens and determine whether or not
 * their children conform to the element's definition. If they do not, the
 * child definition may optionally supply an amended list of elements that
 * is valid or require that the entire node be deleted (and the previous
 * node rescanned).
 *
 * The second objective is to ensure that explicitly excluded elements of
 * an element do not appear in its children. Code that accomplishes this
 * task is pervasive throughout the strategy, though the two are distinct
 * tasks and could, theoretically, be separated (although it's not
 * recommended).
 *
 * @note Whether or not unrecognized children are silently dropped or
 *       translated into text depends on the child definitions.
 *
 * @todo Enable nodes to be bubbled out of the structure.
 */

class HTMLPurifier_Strategy_FixNesting extends HTMLPurifier_Strategy
{
