Update docs. Delineate XHTML 1.1 revamping of HTMLDefinition.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@705 48356398-32a2-884e-a903-53898d9a118a
2024-12-31 12:01:51 +00:00 · 2007-02-03 17:03:04 +00:00 · 2007-02-03 17:03:04 +00:00 · f90eef7f1f
commit f90eef7f1f
parent 06867e14b6
4 changed files with 137 additions and 49 deletions
--- a/docs/proposal-config.txt
+++ b/docs/proposal-config.txt
@@ -36,4 +36,5 @@ the definition, you'd have to force reconstruction.
 In practice, pulling directives from the config object is
 solely need-based, and the flex points are littered throughout the
-setup() function.  Some sort of refactoring is likely in order.
+setup() function.  Some sort of refactoring is likely in order. See
+ref-xhtml-1.1.txt for more info.
--- a/docs/proposal-language.txt
+++ b/docs/proposal-language.txt
@@ -1,42 +1,6 @@
 We are going to model our I18N/L10N off of MediaWiki's system.  Theirs is
 obviously quite complicated, so we're going to simplify it a bit for our needs.
 == Structure ==
 First, you have a Language object.  This object contains all the localisable
 message strings, as well as other important language-specific settings and
 custom behavior (uppercasing, lowercasing, printing dates, formatting
 numbers, etc.)
 The object is constructed from two sources: subclassed versions of itself
 (classes) and Message files (messages).
 == General use ==
 You load a language object by calling the Language::factory() function.
 This function loads the class file for the object (taking into account fallback
 languages by using the fallback language's object but overloading the
 language key) and returns that object. Nothing else happens.
 When a message/etc is requested, a lazy load initializer is called.  Now the
 real work starts.  We're first going to take the scenario that the language
 is not cached.  The system loads the Messages file by:
    require( $filename );
    $cache = compact( self::$mLocalisationKeys );
 ...where self::$mLocalisationKeys lists the names of the variables that could be used
 in the localization file. This lets you use things like:
    $fallback = false;
    $rtl = false;
 ...and easily siphon them into arrays.
 Then, we load the $fallback language (if not set, English) to fill in the gaps in
 the messages.  There is specialized behavior for certain keys, as they can be
 mergeable maps, lists or alias lists (not sure what the last one is).
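The load-then-fill-gaps flow described above can be sketched roughly as follows. This is a minimal illustration, not MediaWiki's code: loadMessagesFile(), fillFromFallback() and the three-entry key list are hypothetical, and the two temporary files stand in for real Messages files.

```php
<?php
// Sketch only: function names and the key list are assumptions for illustration.

function loadMessagesFile(string $filename, array $keys): array
{
    require $filename; // the Messages file sets plain variables in this scope
    // Keep only the recognised names that the file actually defined.
    $defined = array_intersect($keys, array_keys(get_defined_vars()));
    return compact($defined);
}

function fillFromFallback(array $data, array $fallbackData): array
{
    // Plain settings: keep our own value, otherwise take the fallback's.
    $merged = $data + $fallbackData;
    // Messages are a mergeable map: our strings win, gaps come from the fallback.
    $merged['messages'] = ($data['messages'] ?? []) + ($fallbackData['messages'] ?? []);
    return $merged;
}

$keys = ['fallback', 'rtl', 'messages'];

// Two throwaway Messages files standing in for the real ones.
$enFile = tempnam(sys_get_temp_dir(), 'msg');
file_put_contents($enFile, '<?php $fallback = false; $rtl = false; '
    . '$messages = ["hello" => "Hello", "bye" => "Goodbye"];');
$deFile = tempnam(sys_get_temp_dir(), 'msg');
file_put_contents($deFile, '<?php $fallback = "en"; $messages = ["hello" => "Hallo"];');

$english = loadMessagesFile($enFile, $keys);
$german  = fillFromFallback(loadMessagesFile($deFile, $keys), $english);

unlink($enFile);
unlink($deFile);
```

Here the German data keeps its own "hello" string and inherits "bye" and the rtl setting from English. Keys that need list or alias-list merging would need their own rules, which this sketch does not attempt.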
 == Caching ==
 MediaWiki has lots of caching mechanisms built in, which make the code somewhat
--- a/docs/ref-loose-vs-strict.txt
+++ b/docs/ref-loose-vs-strict.txt
@@ -32,6 +32,6 @@ A tag's attribute 'target' (for selecting frames) cut
    current behavior: no substitute, just delete when in strict, allow in loose
 Attribute 'name' deprecated in favor of 'id'
    current behavior: dropped silently
-    projected behavior: create proper AttrTransform (currently not allowed at all)
+    projected behavior: create proper AttrTransform
 [done] PRE tag allows SUB/SUP? (strict dtd comment vs syntax, loose disallows)
    current behavior: disallow as usual
--- a/docs/ref-xhtml-1.1.txt
+++ b/docs/ref-xhtml-1.1.txt
@@ -7,15 +7,138 @@ It's quite simple, according to <http://www.w3.org/TR/xhtml11/changes.html>
 2. Scratch name entirely in favor of id (partially-done)
 3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>
-...but that's only an informative section. More things to do:
+...but that's only an informative section. The true power of XHTML 1.1
-
+is its modularization, defined at:
 1. Scratch style attribute (it's deprecated)
 2. Be module-aware (this might entail intelligent grouping in the definition
   and allowing users to specifically remove certain modules (see 5))
 3. Cross-reference minimal content models with existing DTDs and determine
   changes (todo)
 4. Watch out for the Legacy Module
 <http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/abstract_modules.html#s_legacymodule>
 5. Let users specify their own custom modules
 6. Study Modularization document
 <http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/>
 Modularization may very well be the next-generation implementation
 of HTMLDefinition. The current, XHTML 1.0 DTD-based approach is
 extremely brittle and doesn't lend itself well to extension by others, but
 modularization fixes all that. The modules W3C defines that we
 should take a look at are:
    * 5.1. Attribute Collections
    * 5.2. Core Modules
          o 5.2.2. Text Module
          o 5.2.3. Hypertext Module
          o 5.2.4. List Module
    * 5.4. Text Extension Modules
          o 5.4.1. Presentation Module
          o 5.4.2. Edit Module
          o 5.4.3. Bi-directional Text Module
    * 5.6. Table Modules
          o 5.6.1. Basic Tables Module [?]
          o 5.6.2. Tables Module
    * 5.7. Image Module
    * 5.8. Client-side Image Map Module [?]
    * 5.9. Server-side Image Map Module [?]
    * 5.12. Target Module [?]
    * 5.18. Style Attribute Module
    * 5.21. Name Identification Module [deprecated]
    * 5.22. Legacy Module [deprecated]
 We exclude these modules because they are dangerous or do not apply
 to an XHTML fragment:
    * 5.2. Core Modules
          o 5.2.1. Structure Module
    * 5.3. Applet Module
    * 5.5. Forms Modules
          o 5.5.1. Basic Forms Module
          o 5.5.2. Forms Module
    * 5.10. Object Module
    * 5.11. Frames Module
    * 5.13. Iframe Module
    * 5.14. Intrinsic Events Module
    * 5.15. Metainformation Module
    * 5.16. Scripting Module
    * 5.17. Style Sheet Module
    * 5.19. Link Module
    * 5.20. Base Module
 Modularization also defines content sets:
    * Heading
    * Block
    * Inline
    * Flow {Heading | Block | Inline}
    * List
    * Form [x]
    * Formctrl [x]
 Which may have elements dynamically added to them as more modules get
 added.
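One way to picture content sets growing as modules are loaded is the sketch below. The addModule() helper and the element lists are illustrative assumptions (a small subset, not the full W3C definitions), and nothing here is existing HTML Purifier API.

```php
<?php
// Sketch only: addModule() and the module arrays are hypothetical.

function addModule(array $contentSets, array $module): array
{
    foreach ($module as $setName => $elements) {
        $existing = $contentSets[$setName] ?? [];
        // Append the module's elements, dropping duplicates.
        $contentSets[$setName] = array_values(
            array_unique(array_merge($existing, $elements))
        );
    }
    return $contentSets;
}

// Each module lists the elements it contributes to each content set.
$text = [
    'Heading' => ['h1', 'h2'],
    'Block'   => ['p', 'div'],
    'Inline'  => ['span', 'em'],
];
$list = [
    'List'  => ['ul', 'ol', 'dl'],
    'Block' => ['ul', 'ol', 'dl'],
];

$sets = addModule([], $text);
$sets = addModule($sets, $list);

// Flow is derived, so it picks up whatever the loaded modules contributed.
$sets['Flow'] = array_merge($sets['Heading'], $sets['Block'], $sets['Inline']);
```

Because Flow is computed after all modules are loaded, adding another module only requires another addModule() call before that last step.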
 == Implementation Details ==
 We will not be using the XML Schemas or DTDs directly due to the lack
 of robust tools in the area.
 Since we will be performing a lot of abstracting, caching would be nice. Cache
 invalidation could be done by comparing the HTML and Attr config namespaces
 with a copy that was packaged along with the cached definition (we have no
 files to mtime).
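A minimal sketch of that invalidation check, assuming the config is available as a nested array keyed by namespace. relevantConfig(), getDefinition() and the directive names are hypothetical stand-ins, and the builder callback replaces the real definition setup.

```php
<?php
// Sketch only: all names here are assumptions for illustration.

function relevantConfig(array $config): array
{
    $subset = [];
    foreach (['HTML', 'Attr'] as $namespace) {
        $subset[$namespace] = $config[$namespace] ?? [];
        ksort($subset[$namespace]); // make the comparison order-independent
    }
    return $subset;
}

function getDefinition(array $config, ?array &$cache, callable $build)
{
    $current = relevantConfig($config);
    // Rebuild when there is no cache or the packaged config copy differs.
    if ($cache === null || $cache['config'] !== $current) {
        $cache = ['config' => $current, 'definition' => $build($config)];
    }
    return $cache['definition'];
}

$builds = 0;
$build = function (array $config) use (&$builds) {
    $builds++;
    return 'definition #' . $builds;
};

$cache = null;
$a = getDefinition(['HTML' => ['Strict' => false], 'Core' => ['Encoding' => 'utf-8']], $cache, $build);
// Only an unrelated namespace changed, so the cached definition is reused.
$b = getDefinition(['HTML' => ['Strict' => false], 'Core' => ['Encoding' => 'iso-8859-1']], $cache, $build);
// A directive in the HTML namespace changed, so the definition is rebuilt.
$c = getDefinition(['HTML' => ['Strict' => true]], $cache, $build);
```

Storing the config copy next to the definition keeps the check self-contained, which suits the "no files to mtime" constraint.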
 We also have the trouble of preserving the current interface, which is
 quite nice in terms of speed but not so good in terms of OO-ness. This
 is fine: we may need to have a two-tiered setup approach, that goes
 like this:
 1. When getHTMLDefinition() is initially called, we prepare the default
   environment of content-sets, loaded modules, and allowed content-sets.
   This, while good for developers seeking to customize the tagset, is
   unusable by HTML Purifier internals. It represents the XML schema.
 2. When HTMLPurifier needs to use the definition, it calls a second setup
   function, which now performs any substitutions needed and instantiates
   all the objects which the internals will use.
 In this manner, complicated observers are not necessary: you just
 specify a content module like:
    $flow = '%Heading | %Block | %Inline';
 And the second setup will perform the substitutions magically for you.
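The substitution step of the second setup might look something like this. It is a sketch under assumptions: expandContentModel() is a hypothetical helper, content sets are stored as strings, and the element lists are abbreviated.

```php
<?php
// Sketch only: expandContentModel() is a hypothetical helper.

function expandContentModel(string $model, array $contentSets, int $depth = 0): string
{
    if ($depth > 10) {
        throw new RuntimeException('Content sets nest too deeply (cycle?)');
    }
    // Replace every %Name placeholder with the text of that content set.
    $expanded = preg_replace_callback(
        '/%(\w+)/',
        function (array $match) use ($contentSets) {
            if (!isset($contentSets[$match[1]])) {
                throw new InvalidArgumentException("Unknown content set: {$match[1]}");
            }
            return $contentSets[$match[1]];
        },
        $model
    );
    // A content set may itself reference other sets, so repeat until none remain.
    return strpos($expanded, '%') === false
        ? $expanded
        : expandContentModel($expanded, $contentSets, $depth + 1);
}

// First tier: what a developer customizes.
$contentSets = [
    'Heading' => 'h1 | h2 | h3',
    'Block'   => 'p | div',
    'Inline'  => 'span | em | strong',
    'Flow'    => '%Heading | %Block | %Inline',
];

// Second tier: what the internals would consume.
$divModel = expandContentModel('(#PCDATA | %Flow)*', $contentSets);
```

Deferring expansion to the second tier means a developer can still edit $contentSets['Block'] after the defaults are prepared, and every model that mentions %Block or %Flow picks up the change.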
 A module will have certain properties:
 - Elements
    - Attributes
    - Content model
 - Content sets
 - Content set extensions
 - Content model extensions [x] (seen only on structural elements)
 - Attribute collection extensions
 In our case, the content model does a lot more than just define what the
 allowed children are: it also defines exclusions. Also, if we refrain
 from directly instantiating objects, we are posed with the problem of
 how to signify which ChildDef to use. Remember: our specialized cases of
 content models are proprietary optimizations that allow us to deal with
 elements that don't belong rather than spit them out. Possible solutions:
 1. Factory method that analyzes the definition and figures out who
   to defer to. It would also be responsible for parsing out omissions.
 2. Don't use their content model syntax, just enumerate items and give
   the class-name of which one to use. If a complex definition is truly
   needed, then use content model syntax. A definition, then, would
   be composed of multiple parts:
    - True content-model definition, OR
    - Simple content-model definition
        - List of items in the definition (may be multiple if dealing
          with Chameleon)
        - Name of the type (optional, required, etc)
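Solution 2 could be sketched as a small factory that picks a class from the type name, falling back to a content-model string only when one is given. The Sketch* classes and makeChildDef() are hypothetical stand-ins, not the existing ChildDef classes or their constructors.

```php
<?php
// Sketch only: these classes stand in for real ChildDef implementations.

abstract class SketchChildDef
{
    public $elements;

    public function __construct(array $elements)
    {
        // Lookup table of allowed child element names.
        $this->elements = array_fill_keys($elements, true);
    }
}

class SketchChildDefRequired extends SketchChildDef {}
class SketchChildDefOptional extends SketchChildDef {}

class SketchChildDefCustom
{
    public $model;

    public function __construct(string $model)
    {
        $this->model = $model;
    }
}

function makeChildDef(array $definition)
{
    // A true content-model definition takes precedence when present.
    if (isset($definition['model'])) {
        return new SketchChildDefCustom($definition['model']);
    }
    $classes = [
        'required' => 'SketchChildDefRequired',
        'optional' => 'SketchChildDefOptional',
    ];
    $type = $definition['type'];
    if (!isset($classes[$type])) {
        throw new InvalidArgumentException("Unknown ChildDef type: $type");
    }
    $class = $classes[$type];
    return new $class($definition['elements']);
}

$ul = makeChildDef(['type' => 'required', 'elements' => ['li']]);
$dl = makeChildDef(['model' => '(dt | dd)+']);
```

The definitions stay plain arrays until makeChildDef() runs, so no objects are instantiated before the second setup needs them.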
 Flexibility is absolutely essential, so the API of some of these
 ChildDefs may need to change to lend themselves better to uniform treatment.
 Attributes are somewhat easier to manage, because we would be using
 associative arrays of elements => attributes => AttrDef names, and there
 would be an Attribute Types lookup array to get the appropriate AttrDef
 (if the objects are stateful, they will need to be cloned). Attributes
 for just one element can be specifically overridden through some mechanism
 (probably a config lookup array as well as an internal counter, because
 of HTML 4.01 semantics concerns.) An attribute set will also optionally
 have an array at index 0 which defines what attribute collections to
 agglutinate on when parsing. This may allow us to get rid of global
 attributes, which were also a proprietary implementation detail.
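The attribute layout described above could be sketched like this. The type names, the two collections and resolveAttributes() are illustrative assumptions, and stdClass objects stand in for AttrDef instances.

```php
<?php
// Sketch only: names and data are hypothetical; stdClass stands in for AttrDef.

// Attribute Types lookup: type name => prototype object.
$attrTypes = [];
foreach (['Text', 'URI', 'ID', 'LanguageCode'] as $name) {
    $def = new stdClass();
    $def->type = $name;
    $attrTypes[$name] = $def;
}

// Attribute collections: collection name => attributes => type names.
$collections = [
    'Core' => ['id' => 'ID', 'title' => 'Text'],
    'I18N' => ['lang' => 'LanguageCode'],
];

// Elements => attributes => type names; index 0 lists collections to merge in.
$elements = [
    'a' => [0 => ['Core', 'I18N'], 'href' => 'URI'],
];

function resolveAttributes(array $attrs, array $collections, array $attrTypes): array
{
    $fromCollections = [];
    foreach ($attrs[0] ?? [] as $collection) {
        $fromCollections += $collections[$collection];
    }
    unset($attrs[0]);
    // Element-specific entries win over entries pulled in from collections.
    $names = $attrs + $fromCollections;

    $resolved = [];
    foreach ($names as $attr => $type) {
        // Clone so stateful definitions are not shared between elements.
        $resolved[$attr] = clone $attrTypes[$type];
    }
    return $resolved;
}

$anchorAttrs = resolveAttributes($elements['a'], $collections, $attrTypes);
```

With collections merged per element at this step, a separate notion of global attributes is not needed in the sketch.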
 Alright: let's get to work!