mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-12-22 16:31:53 +00:00
Update docs. Delineate XHTML 1.1 revamping of HTMLDefinition.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@705 48356398-32a2-884e-a903-53898d9a118a
This commit is contained in:
parent
06867e14b6
commit
f90eef7f1f
@@ -36,4 +36,5 @@ the definition, you'd have to force reconstruction.

In practice, the directives pulled from the config object are
solely need-based, and the flex points are littered throughout the
setup() function. Some sort of refactoring is likely in order.
setup() function. Some sort of refactoring is likely in order. See
ref-xhtml-1.1.txt for more info.

@@ -1,42 +1,6 @@
We are going to model our I18N/L10N off of MediaWiki's system. Theirs is
obviously quite complicated, so we're going to simplify it a bit for our needs.

== Structure ==

First, you have a Language object. This object contains all the localisable
message strings, as well as other important language-specific settings and
custom behavior (uppercasing, lowercasing, printing dates, formatting
numbers, etc.)

The object is constructed from two sources: subclassed versions of itself
(classes) and Message files (messages).

== General use ==

You load a language object by calling the Language::factory() function.
This function loads the class file for the object (taking into account fallback
languages by using the fallback language's object but overloading the
language key) and returns that object. Nothing else happens.

When a message/etc is requested, a lazy load initializer is called. Now the
real work starts. We're first going to take the scenario that the language
is not cached. The system loads the Messages file by:

    require( $filename );
    $cache = compact( self::$mLocalisationKeys );

...where self::$mLocalisationKeys is the list of variable names that could be used
in the localization file. This lets you use things like:

    $fallback = false;
    $rtl = false;

...and easily siphon them into arrays.

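As a rough illustration of the siphoning step, the sketch below simulates a Messages file inside a function instead of pulling in a real file with require(). The function name loadLocalisation() and the sample variables are assumptions made for this example, not MediaWiki's actual code; only compact() itself is standard PHP.

```php
<?php
// Hypothetical stand-in for a Messages file. A real loader would call
// require( $filename ), which defines these variables in the local scope.
function loadLocalisation(array $keys): array
{
    // These assignments simulate the body of the required file.
    $fallback = false;
    $rtl = false;
    $messages = ['greeting' => 'Hello'];

    // compact() builds an associative array from the named local variables.
    return compact($keys);
}

// The key list plays the role of self::$mLocalisationKeys.
$cache = loadLocalisation(['fallback', 'rtl', 'messages']);
```

Because the file is evaluated inside a function scope, its variables never leak into the global namespace; only the keys that were asked for end up in the cache array.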
Then, we load the $fallback language (if not set, English) to fill in the gaps in
the messages. There is specialized behavior for certain keys, as they can be
mergeable maps, lists or alias lists (not sure what the last one is).

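One way the gap-filling could work is sketched below. The function fillFromFallback(), the sample arrays, and the choice of which keys count as mergeable maps are all assumptions for illustration; the handling of lists and alias lists is left out because the text above does not describe it.

```php
<?php
// Illustrative data only; real arrays would come from the Messages files.
$english = ['messages' => ['save' => 'Save', 'cancel' => 'Cancel'], 'rtl' => false];
$german  = ['messages' => ['save' => 'Speichern']];

function fillFromFallback(array $cache, array $fallback, array $mergeableMaps): array
{
    foreach ($fallback as $key => $value) {
        if (!array_key_exists($key, $cache)) {
            // Key is missing entirely: take the fallback value as-is.
            $cache[$key] = $value;
        } elseif (in_array($key, $mergeableMaps, true)) {
            // Array union keeps existing entries and adds only missing ones.
            $cache[$key] = $cache[$key] + $value;
        }
    }
    return $cache;
}

$merged = fillFromFallback($german, $english, ['messages']);
```

Here the German 'save' message survives, while 'cancel' and 'rtl' are filled in from English.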
== Caching ==

MediaWiki has lots of caching mechanisms built in, which make the code somewhat

@@ -32,6 +32,6 @@ A tag's attribute 'target' (for selecting frames) cut
    current behavior: no substitute, just delete when in strict, allow in loose
Attribute 'name' deprecated in favor of 'id'
    current behavior: dropped silently
    projected behavior: create proper AttrTransform (currently not allowed at all)
    projected behavior: create proper AttrTransform
[done] PRE tag allows SUB/SUP? (strict dtd comment vs syntax, loose disallows)
    current behavior: disallow as usual

@@ -7,15 +7,138 @@ It's quite simple, according to <http://www.w3.org/TR/xhtml11/changes.html>
2. Scratch name entirely in favor of id (partially-done)
3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>

...but that's only an informative section. More things to do:

1. Scratch style attribute (it's deprecated)
2. Be module-aware (this might entail intelligent grouping in the definition
   and allowing users to specifically remove certain modules (see 5))
3. Cross-reference minimal content models with existing DTDs and determine
   changes (todo)
4. Watch out for the Legacy Module
   <http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/abstract_modules.html#s_legacymodule>
5. Let users specify their own custom modules
6. Study Modularization document
...but that's only an informative section. The true power of XHTML 1.1
is its modularization, defined at:
<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/>

Modularization may very well be the next-generation implementation
of HTMLDefinition. The current, XHTML 1.0 DTD-based approach is
extremely brittle and doesn't lend itself well to extension by others, but
modularization fixes all that. The modules W3C defines that we
should take a look at are:

* 5.1. Attribute Collections
* 5.2. Core Modules
    o 5.2.2. Text Module
    o 5.2.3. Hypertext Module
    o 5.2.4. List Module
* 5.4. Text Extension Modules
    o 5.4.1. Presentation Module
    o 5.4.2. Edit Module
    o 5.4.3. Bi-directional Text Module
* 5.6. Table Modules
    o 5.6.1. Basic Tables Module [?]
    o 5.6.2. Tables Module
* 5.7. Image Module
* 5.8. Client-side Image Map Module [?]
* 5.9. Server-side Image Map Module [?]
* 5.12. Target Module [?]
* 5.18. Style Attribute Module
* 5.21. Name Identification Module [deprecated]
* 5.22. Legacy Module [deprecated]

We exclude these modules because they are dangerous or inapplicable
in an XHTML fragment:

* 5.2. Core Modules
    o 5.2.1. Structure Module
* 5.3. Applet Module
* 5.5. Forms Modules
    o 5.5.1. Basic Forms Module
    o 5.5.2. Forms Module
* 5.10. Object Module
* 5.11. Frames Module
* 5.13. Iframe Module
* 5.14. Intrinsic Events Module
* 5.15. Metainformation Module
* 5.16. Scripting Module
* 5.17. Style Sheet Module
* 5.19. Link Module
* 5.20. Base Module

Modularization also defines content sets:

* Heading
* Block
* Inline
* Flow {Heading | Block | Inline}
* List
* Form [x]
* Formctrl [x]

These may have elements dynamically added to them as more modules get
added.

== Implementation Details ==

We will not be using the XML Schemas or DTDs directly due to the lack
of robust tools in the area.

Since we will be performing a lot of abstracting, caching would be nice. Cache
invalidation could be done by comparing the HTML and Attr config namespaces
with a copy that was packaged along with the cached definition (we have no
files to mtime).

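A minimal sketch of that invalidation check follows, assuming the config is available as a nested array keyed by namespace. The helper names relevantConfig() and cacheIsFresh(), the cache entry layout, and the sample directive names are hypothetical and only serve the illustration.

```php
<?php
// Extract only the namespaces that influence the definition.
function relevantConfig(array $config): array
{
    $html = $config['HTML'] ?? [];
    $attr = $config['Attr'] ?? [];
    // Sort keys so directive order does not affect the comparison.
    ksort($html);
    ksort($attr);
    return ['HTML' => $html, 'Attr' => $attr];
}

// The cache entry carries the config copy it was built from.
function cacheIsFresh(array $cacheEntry, array $config): bool
{
    return $cacheEntry['config'] === relevantConfig($config);
}

// Illustrative directive names, not a statement about the real config schema.
$config = [
    'HTML' => ['Strict' => true],
    'Attr' => ['EnableID' => false],
    'Core' => ['Encoding' => 'utf-8'],
];
$entry = ['config' => relevantConfig($config), 'definition' => null];
```

Changes outside the HTML and Attr namespaces leave the cache valid, while any change inside them forces a rebuild.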
We also have the trouble of preserving the current interface, which is
quite nice in terms of speed but not so good in terms of OO-ness. This
is fine: we may need to have a two-tiered setup approach that goes
like this:

1. When getHTMLDefinition() is initially called, we prepare the default
   environment of content-sets, loaded modules, and allowed content-sets.
   This, while good for developers seeking to customize the tagset, is
   unusable by HTML Purifier internals. It represents the XML schema.
2. When HTMLPurifier needs to use the definition, it calls a second setup
   function, which now performs any substitutions needed and instantiates
   all the objects which the internals will use.

In this manner, complicated observers are not necessary; you just
specify a content module like:

    $flow = '%Heading | %Block | %Inline';

And the second setup will perform the substitutions magically for you.

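The substitution done by the second setup might look roughly like the sketch below. The function expandContentSets() is a hypothetical name, and the element lists are abridged samples rather than the full XHTML content sets.

```php
<?php
// Replace every %Name placeholder with the content set of that name.
function expandContentSets(string $model, array $contentSets): string
{
    return preg_replace_callback(
        '/%(\w+)/',
        function (array $match) use ($contentSets): string {
            $name = $match[1];
            if (!isset($contentSets[$name])) {
                throw new InvalidArgumentException("Unknown content set: $name");
            }
            return $contentSets[$name];
        },
        $model
    );
}

// Abridged, illustrative content sets.
$contentSets = [
    'Heading' => 'h1 | h2 | h3',
    'Block'   => 'p | div | pre',
    'Inline'  => 'span | em | strong',
];

$flow = '%Heading | %Block | %Inline';
$expanded = expandContentSets($flow, $contentSets);
```

Since the placeholders are only resolved at this late stage, a module loaded after $flow was written can still add elements to Block or Inline and have them picked up.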
A module will have certain properties:

- Elements
- Attributes
- Content model
- Content sets
- Content set extensions
- Content model extensions [x] (seen only on structural elements)
- Attribute collection extensions

In our case, the content model does a lot more than just define what
allowed children are: it also defines exclusions. Also, if we refrain
from directly instantiating objects, we face the problem of
how to signify which ChildDef to use. Remember: our specialized cases of
content models are proprietary optimizations that allow us to deal with
elements that don't belong rather than spit them out. Possible solutions:

1. Factory method that analyzes the definition and figures out which
   ChildDef to defer to. It would also be responsible for parsing out
   omissions.
2. Don't use their content model syntax, just enumerate items and give
   the class-name of which one to use. If a complex definition is truly
   needed, then use content model syntax. A definition, then, would
   be composed of multiple parts:
    - True content-model definition, OR
    - Simple content-model definition
        - List of items in the definition (may be multiple if dealing
          with Chameleon)
        - Name of the type (optional, required, etc)

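To make option 2 more concrete, here is a small sketch of a simple content-model definition being turned into an object. SketchChildDef and childDefFromSimple() are hypothetical stand-ins invented for this example, not the library's ChildDef classes, and only the two type names mentioned above are accepted.

```php
<?php
// Hypothetical value object standing in for a real ChildDef.
final class SketchChildDef
{
    public string $type;
    public array $elements;

    public function __construct(string $type, array $elements)
    {
        $this->type = $type;
        $this->elements = $elements;
    }
}

// Build a definition from a type name and a '|'-separated item list.
function childDefFromSimple(string $type, string $items): SketchChildDef
{
    if (!in_array($type, ['optional', 'required'], true)) {
        throw new InvalidArgumentException("Unsupported type: $type");
    }
    // Split on '|', trim whitespace, and drop empty fragments.
    $elements = array_values(array_filter(array_map('trim', explode('|', $items)), 'strlen'));
    return new SketchChildDef($type, $elements);
}

$def = childDefFromSimple('optional', 'a | em | strong');
```

A real factory would map the type name to a concrete ChildDef class and fall back to full content-model parsing when the simple form is not enough.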
Flexibility is absolutely essential, so the API of some of these
ChildDefs may need to change to make them better suited to uniform treatment.

Attributes are somewhat easier to manage, because we would be using
associative arrays of elements => attributes => AttrDef names, and there
would be an Attribute Types lookup array to get the appropriate AttrDef
(if the objects are stateful, they will need to be cloned). Attributes
for just one element can be specifically overridden through some mechanism
(probably a config lookup array as well as an internal counter, because
of HTML 4.01 semantics concerns). An attribute set will also optionally
have an array at index 0 which defines what attribute collections to
agglutinate on when parsing. This may allow us to get rid of global
attributes, which were also a proprietary implementation detail.

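The index-0 convention could be resolved along the lines of the sketch below. The function resolveAttributes(), the collection contents, and the AttrDef type names are illustrative assumptions, not the final data format.

```php
<?php
// Illustrative attribute collections: attribute name => AttrDef type name.
$attrCollections = [
    'Core' => ['id' => 'ID', 'class' => 'NMTOKENS', 'title' => 'Text'],
    'I18N' => ['lang' => 'LanguageCode', 'dir' => 'Enum#ltr,rtl'],
];

// elements => attributes => AttrDef names, with collections listed at index 0.
$elementAttrs = [
    'a'  => [0 => ['Core', 'I18N'], 'href' => 'URI'],
    'br' => [0 => ['Core']],
];

function resolveAttributes(array $attrs, array $collections): array
{
    $includes = $attrs[0] ?? [];
    unset($attrs[0]);
    foreach ($includes as $name) {
        // Array union: element-specific entries win over collection entries.
        $attrs += $collections[$name];
    }
    return $attrs;
}

$anchorAttrs = resolveAttributes($elementAttrs['a'], $attrCollections);
```

With collections merged per element like this, there is no need for a separate global attribute list.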
Alright: let's get to work!