mirror of https://github.com/ezyang/htmlpurifier.git synced 2024-12-22 08:21:52 +00:00

Update docs. Delineate XHTML 1.1 revamping of HTMLDefinition.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@705 48356398-32a2-884e-a903-53898d9a118a
Edward Z. Yang 2007-02-03 17:03:04 +00:00
parent 06867e14b6
commit f90eef7f1f
4 changed files with 137 additions and 49 deletions


@ -36,4 +36,5 @@ the definition, you'd have to force reconstruction.
In practice, directives are pulled from the config object only as
needed, and the flex points are littered throughout the
setup() function. Some sort of refactoring is likely in order.
setup() function. Some sort of refactoring is likely in order. See
ref-xhtml-1.1.txt for more info.


@ -1,42 +1,6 @@
We are going to model our I18N/L10N off of MediaWiki's system. Theirs is
obviously quite complicated, so we're going to simplify it a bit for our needs.
== Structure ==
First, you have a Language object. This object contains all the localisable
message strings, as well as other important language-specific settings and
custom behavior (uppercasing, lowercasing, printing dates, formatting
numbers, etc.)
The object is constructed from two sources: subclassed versions of itself
(classes) and Message files (messages).
== General use ==
You load a language object by calling the Language::factory() function.
This function loads the class file for the object (taking into account fallback
languages by using the fallback language's object but overloading the
language key) and returns that object. Nothing else happens.
When a message/etc is requested, a lazy load initializer is called. Now the
real work starts. We're first going to take the scenario that the language
is not cached. The system loads the Messages file by:
require( $filename );
$cache = compact( self::$mLocalisationKeys );
...where self::$mLocalisationKeys is the list of variable names that could be
used in the localization file. This lets you use things like:
$fallback = false;
$rtl = false;
...and easily siphon them into arrays.
Then, we load the $fallback language (if not set, English) to fill in the gaps in
the messages. There is specialized behavior for certain keys, as they can be
mergeable maps, lists or alias lists (not sure what the last one is).
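The load-then-fallback steps above can be sketched roughly as follows. This is a minimal illustration, not MediaWiki's code: the function names, the key list and the demo file contents are all made up, and the special mergeable-map and list keys are ignored. It uses get_defined_vars() filtered by the allowed names instead of compact(), since newer PHP versions complain when compact() is given a name that was never set.

```php
<?php
// Hypothetical stand-in for self::$mLocalisationKeys: the variable names
// a messages file is allowed to define.
$localisationKeys = array('fallback', 'rtl', 'messages');

// Load one messages file and siphon its known variables into an array.
function loadMessagesFile($file, array $allowed)
{
    require $file; // the file assigns plain variables such as $rtl
    // Same idea as compact($allowed), but skips names the file never set.
    return array_intersect_key(get_defined_vars(), array_flip($allowed));
}

// Fill gaps in a language's data from its fallback. This is a plain
// key-wise merge; messages are merged one level deeper.
function mergeWithFallback(array $lang, array $fallback)
{
    $own  = isset($lang['messages']) ? $lang['messages'] : array();
    $base = isset($fallback['messages']) ? $fallback['messages'] : array();
    $lang['messages'] = $own + $base; // the language's own strings win
    return $lang + $fallback;
}

// Demo with two throwaway files (contents invented for illustration).
$en = tempnam(sys_get_temp_dir(), 'msg');
$de = tempnam(sys_get_temp_dir(), 'msg');
file_put_contents($en, '<?php $rtl = false; $messages = array("ok" => "OK", "cancel" => "Cancel");');
file_put_contents($de, '<?php $fallback = "en"; $messages = array("cancel" => "Abbrechen");');

$german = mergeWithFallback(
    loadMessagesFile($de, $localisationKeys),
    loadMessagesFile($en, $localisationKeys)
);
unlink($en);
unlink($de);
```

Here $german keeps its own "cancel" string and picks up "ok" and $rtl from the English data.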
== Caching ==
MediaWiki has lots of caching mechanisms built in, which make the code somewhat


@ -32,6 +32,6 @@ A tag's attribute 'target' (for selecting frames) cut
current behavior: no substitute, just delete when in strict, allow in loose
Attribute 'name' deprecated in favor of 'id'
current behavior: dropped silently
projected behavior: create proper AttrTransform (currently not allowed at all)
projected behavior: create proper AttrTransform
[done] PRE tag allows SUB/SUP? (strict dtd comment vs syntax, loose disallows)
current behavior: disallow as usual


@ -7,15 +7,138 @@ It's quite simple, according to <http://www.w3.org/TR/xhtml11/changes.html>
2. Scratch name entirely in favor of id (partially-done)
3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>
...but that's only an informative section. More things to do:
1. Scratch style attribute (it's deprecated)
2. Be module-aware (this might entail intelligent grouping in the definition
and allowing users to specifically remove certain modules (see 5))
3. Cross-reference minimal content models with existing DTDs and determine
changes (todo)
4. Watch out for the Legacy Module
<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/abstract_modules.html#s_legacymodule>
5. Let users specify their own custom modules
6. Study Modularization document
...but that's only an informative section. The true power of XHTML 1.1
is its modularization, defined at:
<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/>
Modularization may very well be the next-generation implementation
of HTMLDefinition. The current, XHTML 1.0 DTD-based approach is
extremely brittle and doesn't lend itself well to extension by others,
but modularization fixes all that. The W3C-defined modules that we
should take a look at are:
* 5.1. Attribute Collections
* 5.2. Core Modules
  o 5.2.2. Text Module
  o 5.2.3. Hypertext Module
  o 5.2.4. List Module
* 5.4. Text Extension Modules
  o 5.4.1. Presentation Module
  o 5.4.2. Edit Module
  o 5.4.3. Bi-directional Text Module
* 5.6. Table Modules
  o 5.6.1. Basic Tables Module [?]
  o 5.6.2. Tables Module
* 5.7. Image Module
* 5.8. Client-side Image Map Module [?]
* 5.9. Server-side Image Map Module [?]
* 5.12. Target Module [?]
* 5.18. Style Attribute Module
* 5.21. Name Identification Module [deprecated]
* 5.22. Legacy Module [deprecated]
We exclude these modules because they are dangerous or inapplicable
to an XHTML fragment:
* 5.2. Core Modules
  o 5.2.1. Structure Module
* 5.3. Applet Module
* 5.5. Forms Modules
  o 5.5.1. Basic Forms Module
  o 5.5.2. Forms Module
* 5.10. Object Module
* 5.11. Frames Module
* 5.13. Iframe Module
* 5.14. Intrinsic Events Module
* 5.15. Metainformation Module
* 5.16. Scripting Module
* 5.17. Style Sheet Module
* 5.19. Link Module
* 5.20. Base Module
Modularization also defines content sets:
* Heading
* Block
* Inline
* Flow {Heading | Block | Inline}
* List
* Form [x]
* Formctrl [x]
These sets may have elements dynamically added to them as more modules
get added.
== Implementation Details ==
We will not be using the XML Schemas or DTDs directly due to the lack
of robust tools in the area.
Since we will be performing a lot of abstracting, caching would be nice. Cache
invalidation could be done by comparing the HTML and Attr config namespaces
with a copy packaged along with the cache (we have no files whose mtime we
could check).
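One way this invalidation check might look is sketched here. It assumes the config can be viewed as a plain namespace => directive => value array; the function names and the directive names in the demo are hypothetical, not part of the existing interface.

```php
<?php
// Reduce a config to the parts that influence the definition.
// Assumes a plain array layout: namespace => directive => value.
function configSignature(array $config)
{
    $relevant = array(
        'HTML' => isset($config['HTML']) ? $config['HTML'] : array(),
        'Attr' => isset($config['Attr']) ? $config['Attr'] : array(),
    );
    // Sort so that directive order does not cause spurious mismatches.
    ksort($relevant['HTML']);
    ksort($relevant['Attr']);
    return $relevant;
}

// Package the definition together with the config copy it was built from.
function packDefinition($definition, array $config)
{
    return serialize(array(
        'signature'  => configSignature($config),
        'definition' => $definition,
    ));
}

// Return the cached definition, or null when the relevant config changed.
function unpackDefinition($blob, array $config)
{
    $cached = unserialize($blob);
    if ($cached['signature'] !== configSignature($config)) {
        return null; // stale: the caller rebuilds the definition
    }
    return $cached['definition'];
}

// Demo with made-up directives; a string stands in for the definition.
$config = array(
    'HTML' => array('Strict' => true),
    'Attr' => array(),
    'Core' => array('Encoding' => 'utf-8'),
);
$blob = packDefinition('definition-object', $config);
```

Changing a directive outside the HTML and Attr namespaces leaves the cache valid, while changing one inside them makes unpackDefinition() return null.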
We also have the trouble of preserving the current interface, which is
quite nice in terms of speed but not so good in terms of OO-ness. This
is fine: we may need to have a two-tiered setup approach, that goes
like this:
1. When getHTMLDefinition() is initially called, we prepare the default
   environment of content-sets, loaded modules, and allowed content-sets.
   This, while good for developers seeking to customize the tagset, is
   unusable by HTML Purifier internals. It represents the XML schema.
2. When HTMLPurifier needs to use the definition, it calls a second setup
   function, which now performs any substitutions needed and instantiates
   all the objects which the internals will use.
In this manner, complicated observers are not necessary; you just
specify a content module like:
$flow = '%Heading | %Block | %Inline';
And the second setup will perform the substitutions magically for you.
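The substitution step of the second setup could look something like this. It is only a sketch: the content sets are tiny invented examples, and expandContentSet() is a hypothetical helper name.

```php
<?php
// Hypothetical content sets, written as '|'-separated element lists.
// A token starting with '%' refers to another content set.
$contentSets = array(
    'Heading' => 'h1 | h2',
    'Block'   => 'p | div',
    'Inline'  => 'em | a',
    'Flow'    => '%Heading | %Block | %Inline',
);

// Expand a specification into a flat list of element names.
function expandContentSet($spec, array $sets, array $seen = array())
{
    $elements = array();
    foreach (explode('|', $spec) as $token) {
        $token = trim($token);
        if ($token === '') {
            continue;
        }
        if ($token[0] !== '%') {
            $elements[] = $token; // a plain element name
            continue;
        }
        $name = substr($token, 1);
        if (!isset($sets[$name])) {
            throw new InvalidArgumentException("Unknown content set: $name");
        }
        if (isset($seen[$name])) {
            throw new InvalidArgumentException("Circular content set: $name");
        }
        // Recurse so that sets may themselves reference other sets.
        $nested = expandContentSet($sets[$name], $sets, $seen + array($name => true));
        $elements = array_merge($elements, $nested);
    }
    return array_values(array_unique($elements));
}

$flow = expandContentSet('%Heading | %Block | %Inline', $contentSets);
```

Because expansion happens late, a module loaded after $flow was written can still add elements to Inline and have them show up in the result.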
A module will have certain properties:
- Elements
- Attributes
- Content model
- Content sets
- Content set extensions
- Content model extensions [x] (seen only on structural elements)
- Attribute collection extensions
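As a rough illustration of these properties, a module might be described as plain data, with content set extensions appended when modules are combined. The array layout, field names and collectContentSets() are assumptions made for this sketch, and only a subset of the properties is shown.

```php
<?php
// Hypothetical module descriptions as plain arrays.
$textModule = array(
    'elements'       => array('h1', 'p', 'em'),
    'content_models' => array('h1' => '%Inline', 'p' => '%Inline', 'em' => '%Inline'),
    'content_sets'   => array('Heading' => 'h1', 'Block' => 'p', 'Inline' => 'em'),
);
$hypertextModule = array(
    'elements'       => array('a'),
    'content_models' => array('a' => '%Inline'),
    // A content set extension: adds to Inline rather than redefining it.
    'content_sets'   => array('Inline' => 'a'),
);

// Merge the content sets of all loaded modules, appending extensions
// in load order.
function collectContentSets(array $modules)
{
    $sets = array();
    foreach ($modules as $module) {
        foreach ($module['content_sets'] as $name => $spec) {
            $sets[$name] = isset($sets[$name])
                ? $sets[$name] . ' | ' . $spec
                : $spec;
        }
    }
    return $sets;
}

$sets = collectContentSets(array($textModule, $hypertextModule));
```

Removing a module from the list passed to collectContentSets() removes its elements from every content set it extended.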
In our case, content models do a lot more than just define which
children are allowed: they also define exclusions. Also, if we refrain
from directly instantiating objects, we are posed with the problem of
how to signify which ChildDef to use. Remember: our specialized cases of
content models are proprietary optimizations that allow us to deal with
elements that don't belong rather than spit them out. Possible solutions:
1. Factory method that analyzes the definition and figures out who
   to defer to. It would also be responsible for parsing out omissions.
2. Don't use their content model syntax, just enumerate items and give
   the class-name of which one to use. If a complex definition is truly
   needed, then use content model syntax. A definition, then, would
   be composed of multiple parts:
   - True content-model definition, OR
   - Simple content-model definition
     - List of items in the definition (may be multiple if dealing
       with Chameleon)
     - Name of the type (optional, required, etc)
Flexibility is absolutely essential, so the API of some of these
ChildDefs may need to change to lend them better to uniform treatment.
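A sketch of the second solution, deferring instantiation until setup, might look like this. The Sketch* classes are stand-ins invented for the example; the real ChildDef classes and their constructors may differ, and exclusions are not modeled.

```php
<?php
// Stand-in classes for illustration only.
class SketchChildDefRequired
{
    public $elements;
    public function __construct(array $elements)
    {
        $this->elements = $elements;
    }
}
class SketchChildDefOptional extends SketchChildDefRequired
{
}
class SketchChildDefEmpty
{
}
class SketchChildDefCustom
{
    public $model;
    public function __construct($model)
    {
        $this->model = $model;
    }
}

// Turn a declarative definition into the object the internals would use.
// A definition is either array('model' => ...) for a true content model,
// or array('type' => ..., 'items' => array(...)) for the simple form.
function makeChildDef(array $def)
{
    if (isset($def['model'])) {
        return new SketchChildDefCustom($def['model']);
    }
    $items = isset($def['items']) ? $def['items'] : array();
    switch ($def['type']) {
        case 'required':
            return new SketchChildDefRequired($items);
        case 'optional':
            return new SketchChildDefOptional($items);
        case 'empty':
            return new SketchChildDefEmpty();
    }
    throw new InvalidArgumentException('Unknown type: ' . $def['type']);
}

$ul = makeChildDef(array('type' => 'required', 'items' => array('li')));
$br = makeChildDef(array('type' => 'empty'));
$dl = makeChildDef(array('model' => '(dt | dd)+'));
```

Keeping the definitions as plain arrays until this point means they stay cheap to cache and easy for users to override.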
Attributes are somewhat easier to manage, because we would be using
associative arrays of elements => attributes => AttrDef names, and there
would be an Attribute Types lookup array to get the appropriate AttrDef
(if the objects are stateful, they will need to be cloned). Attributes
for just one element can be specifically overridden through some mechanism
(probably a config lookup array as well as an internal counter, because
of HTML 4.01 semantics concerns). An attribute set will also optionally
have an array at index 0 which defines what attribute collections to
agglutinate on when parsing. This may allow us to get rid of global
attributes, which were also a proprietary implementation detail.
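The attribute layout described above might be resolved along these lines. Everything here is illustrative: the collection names, the type names, SketchAttrDef and resolveAttributes() are assumptions for the sketch, not existing code.

```php
<?php
// Stand-in for an AttrDef object; only carries its type name.
class SketchAttrDef
{
    public $type;
    public function __construct($type)
    {
        $this->type = $type;
    }
}

// Attribute collections: collection name => attribute => type name.
$attrCollections = array(
    'Core' => array('id' => 'ID', 'class' => 'NMTOKENS', 'title' => 'Text'),
    'I18N' => array('lang' => 'LanguageCode', 'dir' => 'Dir'),
);

// Elements => attributes => type names. Index 0 optionally lists the
// collections to agglutinate onto the element.
$elementAttrs = array(
    'a'   => array(0 => array('Core', 'I18N'), 'href' => 'URI'),
    'img' => array(0 => array('Core'), 'src' => 'URI', 'alt' => 'Text'),
);

// Attribute Types lookup: type name => prototype object.
$attrTypes = array();
foreach (array('ID', 'NMTOKENS', 'Text', 'LanguageCode', 'Dir', 'URI') as $type) {
    $attrTypes[$type] = new SketchAttrDef($type);
}

// Build the attribute => object map for one element.
function resolveAttributes(array $spec, array $collections, array $types)
{
    $names = array();
    if (isset($spec[0])) {
        foreach ($spec[0] as $collection) {
            $names += $collections[$collection];
        }
        unset($spec[0]);
    }
    // Element-specific attributes override collection entries.
    $names = $spec + $names;
    $defs = array();
    foreach ($names as $attr => $type) {
        // Clone in case the objects are stateful.
        $defs[$attr] = clone $types[$type];
    }
    return $defs;
}

$anchorAttrs = resolveAttributes($elementAttrs['a'], $attrCollections, $attrTypes);
```

With collections resolved per element like this, there is no separate notion of global attributes left to special-case.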
Alright: let's get to work!