From f0deae1fc0db0108cb97c08572d9cf42f5e257e0 Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
Date: Thu, 3 Aug 2006 01:37:28 +0000
Subject: [PATCH] Update documentation.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@147 48356398-32a2-884e-a903-53898d9a118a
---
 docs/spec.txt                                | 113 -------------------
 library/HTMLPurifier/Strategy/FixNesting.php |  29 ++++-
 2 files changed, 27 insertions(+), 115 deletions(-)

diff --git a/docs/spec.txt b/docs/spec.txt
index fa95246a..71b435ba 100644
--- a/docs/spec.txt
+++ b/docs/spec.txt
@@ -56,111 +56,6 @@ all this functionality (or not written in HTML at all).
 
 The rest of this document is pending moving into their associated classes.
 
-
-
-
-
-
-
-
-
-
-
-
-
-== STAGE 4 - check nesting ==
-
- Status: B (table custom definition needs to be implemented)
-
-We know that the document is now well formed. The tokenizer should now take
-things in nodes: when you hit a start tag, keep on going until you get its
-ending tag, and then handle everything inside there. Fortunantely, no
-fancy recursion is necessary as going to the next node is as simple as
-scrolling to the next start tag.
-
-Suppose we have a node and encounter a problem with one of its children.
-Depending on the complexity of the rule, we will either delete the children,
-or delete the entire node itself.
-
-The simplest type of rule is zero or more valid elements, denoted like:
-
- ( el1 | el2 | el3 )*
-
-The next simplest is with one or more valid elements:
-
- ( li )+
-
-And then you have complex cases:
-
- table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))
- map ((%block; | form | %misc;)+ | area+)
- html (head, body)
- head (%head.misc;,
- ((title, %head.misc;, (base, %head.misc;)?) |
- (base, %head.misc;, (title, %head.misc;))))
-
-Each of these has to be dealt with. Case 1 is a joy, because you can zap
-as many as you want, but you'll never actually have to kill the node. Two
-and three need the entire node to be killed if you have a problem. This
-can be problematic, as the missing node might cause its parent node to now
-be incorrect. Granted, it's unlikely, and I'm fairly certain that HTML, let
-alone the simplified set I'm allowing will have this problem, but it's worth
-checking for.
-
-The way, I suppose, one would check for it, is whenever a node is removed,
-scroll to it's parent start, and re-evaluate it. Make sure you're able to do
-that with minimal code repetition.
-
-The most complex case can probably be done by using some fancy regexp
-expressions and transformations. However, it doesn't seem right that, say,
-a stray in a can cause the entire table to be removed. Depending
-on how much work we want to do, this will at least need a custom child
-definition, and at most require extra element bubbling capabilities to be
-added.
-
---
-
-So, the way we define these cases should work like this:
-
-class ChildDef with validateChildren($children_tags)
-
-The function needs to parse into nodes, then into the regex array.
-It can result in one of three actions: the removal of the entire parent node,
-replacement of all of the original child tags with a new set of child
-tags which it returns, or no changes. They shall be denoted as, respectively,
-
-Remove entire parent node = false
-Replace child tags with this = array of tags
-No changes = true
-
-If we remove the entire parent node, we must scroll back to the parent of the
-parent.
-
---
-
-Also, what do we do with elements if they're not allowed somewhere? We need
-some sort of default behavior. I reckon that we should be allowed to:
-
-1. Delete the node
-2. Translate it into text (not okay for areas that don't allow #PCDATA)
-3. Move the node to somewhere where it is okay
-
-What complicates the matter is that Firefox has the ability to construct
-DOMs and render invalid nestings of elements (like asdf).
-This means that behavior for stray pcdata in ul/ol is undefined. Behavior
-with data in a table gets bubbled to the start of the table (assuming
-that we actually custom-make the table child validation class).
-
-So... I say delete the node when PCDATA isn't allowed (or the regex is too
-complicated to determine where PCDATA could be inserted), and translate the node
-to text when PCDATA is allowed.
-
---
-
-ins/del are allowed in block and inline content, but it is
-inappropriate to include block content within an ins element
-occurring in inline content. How would we fix this?
-
 == STAGE 4 - check attributes ==
 
  STATUS: F (currently implementing core/i18n)
@@ -261,11 +156,3 @@ These are the elements that only have %attrs and need an alignment transform
 ----
 
 Prepend style transformations, as CSS takes precedence.
-
-== PART 5 - stringify ==
-
- Status: A+ (done completely!)
-
-Should be fairly simple as long as we delegate to appropriate functions.
-It's probably too much trouble to indent the stuff properly, so just output
-stuff raw.
diff --git a/library/HTMLPurifier/Strategy/FixNesting.php b/library/HTMLPurifier/Strategy/FixNesting.php
index 504c1eff..5c55f886 100644
--- a/library/HTMLPurifier/Strategy/FixNesting.php
+++ b/library/HTMLPurifier/Strategy/FixNesting.php
@@ -3,8 +3,33 @@
 require_once 'HTMLPurifier/Strategy.php';
 require_once 'HTMLPurifier/Definition.php';
 
-// EXTRA: provide a mechanism for elements to be bubbled OUT of a node
-// or "Replace Nodes while including the parent nodes too"
+/**
+ * Takes a well formed list of tokens and fixes their nesting.
+ *
+ * HTML elements dictate which elements are allowed to be their children,
+ * for example, you can't have a p tag in a span tag. Other elements have
+ * much more rigorous definitions: tables, for instance, require a specific
+ * order for their elements. There are also constraints not expressible by
+ * document type definitions, such as the chameleon nature of ins/del
+ * tags and global child exclusions.
+ *
+ * The first major objective of this strategy is to iterate through all the
+ * nodes (not tokens) of the list of tokens and determine whether or not
+ * their children conform to the element's definition. If they do not, the
+ * child definition may optionally supply an amended list of elements that
+ * is valid or require that the entire node be deleted (and the previous
+ * node rescanned).
+ *
+ * The second objective is to ensure that explicitly excluded elements of
+ * an element do not appear in its children. Code that accomplishes this
+ * task is pervasive through the strategy, though the two are distinct tasks
+ * and could, theoretically, be separated (although it's not recommended).
+ *
+ * @note Whether or not unrecognized children are silently dropped or
+ * translated into text depends on the child definitions.
+ *
+ * @todo Enable nodes to be bubbled out of the structure.
+ */
 class HTMLPurifier_Strategy_FixNesting extends HTMLPurifier_Strategy
 {