Update documentation.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@147 48356398-32a2-884e-a903-53898d9a118a
2024-12-22 16:31:53 +00:00 · 2006-08-03 01:37:28 +00:00 · 2006-08-03 01:37:28 +00:00 · f0deae1fc0
commit f0deae1fc0
parent 26733183b7
2 changed files with 27 additions and 115 deletions
--- a/docs/spec.txt
+++ b/docs/spec.txt
@@ -56,111 +56,6 @@ all this functionality (or not written in HTML at all).
 The rest of this document is pending moving into their associated classes.
 == STAGE 4 - check nesting ==
    Status: B (table custom definition needs to be implemented)
 We know that the document is now well formed. The tokenizer should now take
 things in nodes: when you hit a start tag, keep on going until you get its
 ending tag, and then handle everything inside there. Fortunately, no
 fancy recursion is necessary as going to the next node is as simple as
 scrolling to the next start tag.
 Suppose we have a node and encounter a problem with one of its children.
 Depending on the complexity of the rule, we will either delete the children,
 or delete the entire node itself.
 The simplest type of rule is zero or more valid elements, denoted like:
  ( el1 | el2 | el3 )*
 The next simplest is with one or more valid elements:
  ( li )+
 And then you have complex cases:
 table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))
 map ((%block; | form | %misc;)+ | area+)
 html (head, body)
 head (%head.misc;,
     ((title, %head.misc;, (base, %head.misc;)?) |
      (base, %head.misc;, (title, %head.misc;))))
 Each of these has to be dealt with. Case 1 is a joy, because you can zap
 as many as you want, but you'll never actually have to kill the node. Two
 and three need the entire node to be killed if you have a problem. This
 can be problematic, as the missing node might cause its parent node to now
 be incorrect. Granted, it's unlikely, and I'm fairly certain that neither HTML
 nor the simplified set I'm allowing will have this problem, but it's worth
 checking for.
 The way, I suppose, one would check for it, is whenever a node is removed,
 scroll to its parent start, and re-evaluate it. Make sure you're able to do
 that with minimal code repetition.
 The most complex case can probably be done by using some fancy regular
 expressions and transformations. However, it doesn't seem right that, say,
 a stray <b> in a <table> can cause the entire table to be removed. Depending
 on how much work we want to do, this will at least need a custom child
 definition, and at most require extra element bubbling capabilities to be
 added.
 --
 So, the way we define these cases should work like this:
 class ChildDef with validateChildren($children_tags)
 The function needs to parse into nodes, then into the regex array.
 It can result in one of three actions: the removal of the entire parent node,
 replacement of all of the original child tags with a new set of child
 tags which it returns, or no changes. They shall be denoted as, respectively,
 Remove entire parent node    = false
 Replace child tags with this = array of tags
 No changes                   = true
 If we remove the entire parent node, we must scroll back to the parent of the
 parent.
 --
 Also, what do we do with elements if they're not allowed somewhere? We need
 some sort of default behavior. I reckon that we should be allowed to:
 1. Delete the node
 2. Translate it into text (not okay for areas that don't allow #PCDATA)
 3. Move the node to somewhere where it is okay
 What complicates the matter is that Firefox has the ability to construct
 DOMs and render invalid nestings of elements (like <b><div>asdf</div></b>).
 This means that behavior for stray PCDATA in ul/ol is undefined. Stray data
 in a table gets bubbled to the start of the table (assuming
 that we actually custom-make the table child validation class).
 So... I say delete the node when PCDATA isn't allowed (or the regex is too
 complicated to determine where PCDATA could be inserted), and translate the node
 to text when PCDATA is allowed.
 --
 ins/del are allowed in block and inline content, but it is
 inappropriate to include block content within an ins element
 occurring in inline content. How would we fix this?
 == STAGE 4 - check attributes ==
    STATUS: F (currently implementing core/i18n)
@@ -261,11 +156,3 @@ These are the elements that only have %attrs and need an alignment transform
 ----
 Prepend style transformations, as CSS takes precedence.
 == PART 5 - stringify ==
    Status: A+ (done completely!)
 Should be fairly simple as long as we delegate to appropriate functions.
 It's probably too much trouble to indent the stuff properly, so just output
 stuff raw.
--- a/library/HTMLPurifier/Strategy/FixNesting.php
+++ b/library/HTMLPurifier/Strategy/FixNesting.php
@ -3,8 +3,33 @@
 require_once 'HTMLPurifier/Strategy.php';
 require_once 'HTMLPurifier/Definition.php';
-// EXTRA: provide a mechanism for elements to be bubbled OUT of a node
-// or "Replace Nodes while including the parent nodes too"
+/**
+ * Takes a well formed list of tokens and fixes their nesting.
+ * 
+ * HTML elements dictate which elements are allowed to be their children,
+ * for example, you can't have a p tag in a span tag.  Other elements have
+ * much more rigorous definitions: tables, for instance, require a specific
+ * order for their elements.  There are also constraints not expressible by
+ * document type definitions, such as the chameleon nature of ins/del
+ * tags and global child exclusions.
+ * 
+ * The first major objective of this strategy is to iterate through all the
+ * nodes (not tokens) of the list of tokens and determine whether or not
+ * their children conform to the element's definition.  If they do not, the
+ * child definition may optionally supply an amended list of elements that
+ * is valid or require that the entire node be deleted (and the previous
+ * node rescanned).
+ * 
+ * The second objective is to ensure that explicitly excluded elements of
+ * an element do not appear in its children.  Code that accomplishes this
+ * task is pervasive throughout the strategy, though the two are distinct tasks
+ * and could, theoretically, be separated (although it's not recommended).
+ * 
+ * @note Whether or not unrecognized children are silently dropped or
+ *       translated into text depends on the child definitions.
+ * 
+ * @todo Enable nodes to be bubbled out of the structure.
+ */
 class HTMLPurifier_Strategy_FixNesting extends HTMLPurifier_Strategy
 {