mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2024-11-14 01:08:41 +00:00
Update docs. Delineate XHTML 1.1 revamping of HTMLDefinition.
git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@705 48356398-32a2-884e-a903-53898d9a118a
parent
06867e14b6
commit
f90eef7f1f
@@ -36,4 +36,5 @@ the definition, you'd have to force reconstruction.
|
|||||||
|
|
||||||
In practice, pulling directives from the config object is
|
In practice, pulling directives from the config object is
|
||||||
solely need-based, and the flex points are littered throughout the
|
solely need-based, and the flex points are littered throughout the
|
||||||
setup() function. Some sort of refactoring is likely in order.
|
setup() function. Some sort of refactoring is likely in order. See
|
||||||
|
ref-xhtml-1.1.txt for more info.
|
||||||
|
@@ -1,42 +1,6 @@
|
|||||||
We are going to model our I18N/L10N off of MediaWiki's system. Theirs is
|
We are going to model our I18N/L10N off of MediaWiki's system. Theirs is
|
||||||
obviously quite complicated, so we're going to simplify it a bit for our needs.
|
obviously quite complicated, so we're going to simplify it a bit for our needs.
|
||||||
|
|
||||||
== Structure ==
|
|
||||||
|
|
||||||
First, you have a Language object. This object contains all the localisable
|
|
||||||
message strings, as well as other important language-specific settings and
|
|
||||||
custom behavior (uppercasing, lowercasing, printing dates, formatting
|
|
||||||
numbers, etc.)
|
|
||||||
|
|
||||||
The object is constructed from two sources: subclassed versions of itself
|
|
||||||
(classes) and Message files (messages).
|
|
||||||
|
|
||||||
== General use ==
|
|
||||||
|
|
||||||
You load a language object by calling the Language::factory() function.
|
|
||||||
This function loads the class file for the object (taking into account fallback
|
|
||||||
languages by using the fallback language's object but overloading the
|
|
||||||
language key) and returns that object. Nothing else happens.
|
|
||||||
|
|
||||||
When a message/etc is requested, a lazy load initializer is called. Now the
|
|
||||||
real work starts. We're first going to take the scenario that the language
|
|
||||||
is not cached. The system loads the Messages file by:
|
|
||||||
|
|
||||||
require( $filename );
|
|
||||||
$cache = compact( self::$mLocalisationKeys );
|
|
||||||
|
|
||||||
...where self::$mLocalisationKeys is the list of variable names that could be used
|
|
||||||
in the localization file. This lets you use things like:
|
|
||||||
|
|
||||||
$fallback = false;
|
|
||||||
$rtl = false;
|
|
||||||
|
|
||||||
...and easily siphon them into arrays.
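
The siphoning step can be sketched as follows. This is a minimal illustration, not MediaWiki's actual loader: the function name load_messages_sketch() is hypothetical, and inline assignments stand in for the require( $filename ) call so the example is self-contained.

```php
<?php
// Hypothetical sketch: siphon whitelisted local variables into an array.
function load_messages_sketch(array $localisationKeys): array
{
    // Stand-in for `require( $filename );`. A real messages file would
    // define these variables in the function's local scope.
    $fallback = false;
    $rtl = false;
    $unrelated = 'ignored'; // not in the key list, so compact() skips it

    // compact() builds an array from the named variables in the current scope.
    return compact($localisationKeys);
}

$cache = load_messages_sketch(['fallback', 'rtl']);
```

Because only the listed names are collected, stray variables in a messages file never leak into the cache.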
|
|
||||||
|
|
||||||
Then, we load the $fallback language (if not set, English) to fill in the gaps in
|
|
||||||
the messages. There is specialized behavior for certain keys, as they can be
|
|
||||||
mergeable maps, lists or alias lists (not sure what the last one is).
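
For plain message maps, filling the gaps from a fallback language might look like the sketch below. The function name and the sample messages are assumptions made for illustration; the specialized handling of lists and alias lists is not shown.

```php
<?php
// Hypothetical sketch: fill gaps in a language's messages from its fallback.
function merge_with_fallback(array $messages, array $fallbackMessages): array
{
    // The array union operator keeps the left-hand value for duplicate keys,
    // so only missing messages are taken from the fallback.
    return $messages + $fallbackMessages;
}

$english = ['greeting' => 'Hello', 'farewell' => 'Goodbye'];
$german  = ['greeting' => 'Hallo'];
$merged  = merge_with_fallback($german, $english);
```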
|
|
||||||
|
|
||||||
== Caching ==
|
== Caching ==
|
||||||
|
|
||||||
MediaWiki has lots of caching mechanisms built in, which make the code somewhat
|
MediaWiki has lots of caching mechanisms built in, which make the code somewhat
|
||||||
|
@@ -32,6 +32,6 @@ A tag's attribute 'target' (for selecting frames) cut
|
|||||||
current behavior: no substitute, just delete when in strict, allow in loose
|
current behavior: no substitute, just delete when in strict, allow in loose
|
||||||
Attribute 'name' deprecated in favor of 'id'
|
Attribute 'name' deprecated in favor of 'id'
|
||||||
current behavior: dropped silently
|
current behavior: dropped silently
|
||||||
projected behavior: create proper AttrTransform (currently not allowed at all)
|
projected behavior: create proper AttrTransform
|
||||||
[done] PRE tag allows SUB/SUP? (strict dtd comment vs syntax, loose disallows)
|
[done] PRE tag allows SUB/SUP? (strict dtd comment vs syntax, loose disallows)
|
||||||
current behavior: disallow as usual
|
current behavior: disallow as usual
|
||||||
|
@@ -7,15 +7,138 @@ It's quite simple, according to <http://www.w3.org/TR/xhtml11/changes.html>
|
|||||||
2. Scratch name entirely in favor of id (partially-done)
|
2. Scratch name entirely in favor of id (partially-done)
|
||||||
3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>
|
3. Support Ruby <http://www.w3.org/TR/2001/REC-ruby-20010531/>
|
||||||
|
|
||||||
...but that's only an informative section. More things to do:
|
...but that's only an informative section. The true power of XHTML 1.1
|
||||||
|
is its modularization, defined at:
|
||||||
1. Scratch style attribute (it's deprecated)
|
|
||||||
2. Be module-aware (this might entail intelligent grouping in the definition
|
|
||||||
and allowing users to specifically remove certain modules (see 5))
|
|
||||||
3. Cross-reference minimal content models with existing DTDs and determine
|
|
||||||
changes (todo)
|
|
||||||
4. Watch out for the Legacy Module
|
|
||||||
<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/abstract_modules.html#s_legacymodule>
|
|
||||||
5. Let users specify their own custom modules
|
|
||||||
6. Study Modularization document
|
|
||||||
<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/>
|
<http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/>
|
||||||
|
|
||||||
|
Modularization may very well be the next-generation implementation
|
||||||
|
of HTMLDefinition. The current, XHTML 1.0 DTD-based approach is
|
||||||
|
extremely brittle and doesn't lend itself well to extension by others, but
|
||||||
|
modularization fixes all that. The modules W3C defines that we
|
||||||
|
should take a look at are:
|
||||||
|
|
||||||
|
* 5.1. Attribute Collections
|
||||||
|
* 5.2. Core Modules
|
||||||
|
o 5.2.2. Text Module
|
||||||
|
o 5.2.3. Hypertext Module
|
||||||
|
o 5.2.4. List Module
|
||||||
|
* 5.4. Text Extension Modules
|
||||||
|
o 5.4.1. Presentation Module
|
||||||
|
o 5.4.2. Edit Module
|
||||||
|
o 5.4.3. Bi-directional Text Module
|
||||||
|
* 5.6. Table Modules
|
||||||
|
o 5.6.1. Basic Tables Module [?]
|
||||||
|
o 5.6.2. Tables Module
|
||||||
|
* 5.7. Image Module
|
||||||
|
* 5.8. Client-side Image Map Module [?]
|
||||||
|
* 5.9. Server-side Image Map Module [?]
|
||||||
|
* 5.12. Target Module [?]
|
||||||
|
* 5.18. Style Attribute Module
|
||||||
|
* 5.21. Name Identification Module [deprecated]
|
||||||
|
* 5.22. Legacy Module [deprecated]
|
||||||
|
|
||||||
|
We exclude these modules due to their dangerousness or inapplicability
|
||||||
|
to an XHTML fragment:
|
||||||
|
|
||||||
|
* 5.2. Core Modules
|
||||||
|
o 5.2.1. Structure Module
|
||||||
|
* 5.3. Applet Module
|
||||||
|
* 5.5. Forms Modules
|
||||||
|
o 5.5.1. Basic Forms Module
|
||||||
|
o 5.5.2. Forms Module
|
||||||
|
* 5.10. Object Module
|
||||||
|
* 5.11. Frames Module
|
||||||
|
* 5.13. Iframe Module
|
||||||
|
* 5.14. Intrinsic Events Module
|
||||||
|
* 5.15. Metainformation Module
|
||||||
|
* 5.16. Scripting Module
|
||||||
|
* 5.17. Style Sheet Module
|
||||||
|
* 5.19. Link Module
|
||||||
|
* 5.20. Base Module
|
||||||
|
|
||||||
|
Modularization also defines content sets:
|
||||||
|
|
||||||
|
* Heading
|
||||||
|
* Block
|
||||||
|
* Inline
|
||||||
|
* Flow {Heading | Block | Inline}
|
||||||
|
* List
|
||||||
|
* Form [x]
|
||||||
|
* Formctrl [x]
|
||||||
|
|
||||||
|
Which may have elements dynamically added to them as more modules get
|
||||||
|
added.
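
One way to picture content sets growing as modules load is the sketch below. The array layout and variable names are assumptions, and the element lists are abbreviated subsets of what the Text and Hypertext modules contribute.

```php
<?php
// Hypothetical sketch: content sets that grow as modules are loaded.
$contentSets = [
    'Heading' => [],
    'Block'   => [],
    'Inline'  => [],
];

// Each (illustrative) module lists the elements it contributes per set.
$modules = [
    'Text'      => ['Heading' => ['h1', 'h2'], 'Block' => ['p'], 'Inline' => ['em']],
    'Hypertext' => ['Inline' => ['a']],
];

foreach ($modules as $contributions) {
    foreach ($contributions as $set => $elements) {
        // Append the module's elements to the named content set.
        $contentSets[$set] = array_merge($contentSets[$set], $elements);
    }
}
```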
|
||||||
|
|
||||||
|
== Implementation Details ==
|
||||||
|
|
||||||
|
We will not be using the XML Schemas or DTDs directly due to the lack
|
||||||
|
of robust tools in the area.
|
||||||
|
|
||||||
|
Since we will be performing a lot of abstracting, caching would be nice. Cache
|
||||||
|
invalidation could be done by comparing the HTML and Attr config namespaces
|
||||||
|
with a copy that was packaged along with this (we have no files to mtime).
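
A minimal sketch of that invalidation check, assuming the config is available as a nested array keyed by namespace. The function names, the cache entry layout, and the sample directives are illustrative only.

```php
<?php
// Hypothetical sketch: invalidate a cached definition by comparing the
// HTML and Attr config namespaces with the copy stored alongside it.
function relevant_namespaces(array $config): array
{
    return [
        'HTML' => $config['HTML'] ?? [],
        'Attr' => $config['Attr'] ?? [],
    ];
}

function cache_is_fresh(array $cacheEntry, array $config): bool
{
    // Loose array comparison: same keys and values, regardless of order.
    return $cacheEntry['config'] == relevant_namespaces($config);
}

$config = [
    'HTML' => ['Strict' => false],
    'Attr' => ['EnableID' => false],
    'Core' => ['Encoding' => 'utf-8'],
];
// Package a copy of the relevant namespaces with the cached definition.
$entry = ['config' => relevant_namespaces($config), 'definition' => 'placeholder'];
```

Changes to other namespaces leave the cache valid, since they are never part of the stored copy.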
|
||||||
|
|
||||||
|
We also have the trouble of preserving the current interface, which is
|
||||||
|
quite nice in terms of speed but not so good in terms of OO-ness. This
|
||||||
|
is fine: we may need to have a two-tiered setup approach, that goes
|
||||||
|
like this:
|
||||||
|
|
||||||
|
1. When getHTMLDefinition() is initially called, we prepare the default
|
||||||
|
environment of content-sets, loaded modules, and allowed content-sets.
|
||||||
|
This, while good for developers seeking to customize the tagset, is
|
||||||
|
unusable by HTML Purifier internals. It represents the XML schema.
|
||||||
|
2. When HTMLPurifier needs to use the definition, it calls a second setup
|
||||||
|
function, which now performs any substitutions needed and instantiates
|
||||||
|
all the objects which the internals will use.
|
||||||
|
|
||||||
|
In this manner, complicated observers are not necessary; you just
|
||||||
|
specify a content set like:
|
||||||
|
|
||||||
|
$flow = '%Heading | %Block | %Inline';
|
||||||
|
|
||||||
|
And the second setup will perform the substitutions magically for you.
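
The substitution itself could be as simple as the sketch below, which assumes content sets are stored as pipe-separated strings. The function name expand_content_sets() and the abbreviated element lists are hypothetical.

```php
<?php
// Hypothetical sketch: expand %Name placeholders into content set members.
function expand_content_sets(string $model, array $contentSets): string
{
    return preg_replace_callback(
        '/%(\w+)/',
        function (array $m) use ($contentSets): string {
            // Leave unknown placeholders untouched so mistakes stay visible.
            return $contentSets[$m[1]] ?? $m[0];
        },
        $model
    );
}

$contentSets = [
    'Heading' => 'h1 | h2',
    'Block'   => 'p | div',
    'Inline'  => 'em | a',
];
$flow = '%Heading | %Block | %Inline';
$expanded = expand_content_sets($flow, $contentSets);
```

Deferring the expansion to the second setup means a module loaded later can still add elements to a set before anything is resolved.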
|
||||||
|
|
||||||
|
A module will have certain properties:
|
||||||
|
|
||||||
|
- Elements
|
||||||
|
- Attributes
|
||||||
|
- Content model
|
||||||
|
- Content sets
|
||||||
|
- Content set extensions
|
||||||
|
- Content model extensions [x] (seen only on structural elements)
|
||||||
|
- Attribute collection extensions
|
||||||
|
|
||||||
|
In our case, the content model does a lot more than just define what
|
||||||
|
allowed children are: it also defines exclusions. Also, if we refrain
|
||||||
|
from directly instantiating objects, we are posed with the problem of
|
||||||
|
how to signify which ChildDef to use. Remember: our specialized cases of
|
||||||
|
content models are proprietary optimizations that allow us to deal with
|
||||||
|
elements that don't belong rather than spit them out. Possible solutions:
|
||||||
|
|
||||||
|
1. Factory method that analyzes the definition and figures out who
|
||||||
|
to defer to. It would also be responsible for parsing out omissions.
|
||||||
|
2. Don't use their content model syntax, just enumerate items and give
|
||||||
|
the class-name of which one to use. If a complex definition is truly
|
||||||
|
needed, then use content model syntax. A definition, then, would
|
||||||
|
be composed of multiple parts:
|
||||||
|
- True content-model definition, OR
|
||||||
|
- Simple content-model definition
|
||||||
|
- List of items in the definition (may be multiple if dealing
|
||||||
|
with Chameleon)
|
||||||
|
- Name of the type (optional, required, etc)
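
The second option might be sketched like this. The array keys ('model', 'type', 'items'), the handler names, and the function choose_child_def() are assumptions for illustration, not the existing ChildDef API.

```php
<?php
// Hypothetical sketch: pick a ChildDef handler from a definition that is
// either a true content model or a simple list of items plus a type name.
function choose_child_def(array $def): array
{
    if (isset($def['model'])) {
        // A true content-model string needs the general-purpose handler.
        return ['handler' => 'custom', 'arg' => $def['model']];
    }
    // Otherwise: an enumerated list of items plus the name of the type.
    return ['handler' => $def['type'], 'arg' => $def['items']];
}

// Simple definition: a list element must contain list items.
$simple = choose_child_def(['type' => 'required', 'items' => ['li']]);

// Complex definition: fall back to content model syntax.
$complex = choose_child_def([
    'model' => '(caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))',
]);
```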
|
||||||
|
|
||||||
|
Flexibility is absolutely essential, so the API of some of these
|
||||||
|
ChildDefs may need to change to make them more amenable to uniform treatment.
|
||||||
|
|
||||||
|
Attributes are somewhat easier to manage, because we would be using
|
||||||
|
associative arrays of elements => attributes => AttrDef names, and there
|
||||||
|
would be an Attribute Types lookup array to get the appropriate AttrDef
|
||||||
|
(if the objects are stateful, they will need to be cloned). Attributes
|
||||||
|
for just one element can be specifically overridden through some mechanism
|
||||||
|
(probably a config lookup array as well as an internal counter, because
|
||||||
|
of HTML 4.01 semantics concerns). An attribute set will also optionally
|
||||||
|
have an array at index 0 which defines what attribute collections to
|
||||||
|
agglutinate on when parsing. This may allow us to get rid of global
|
||||||
|
attributes, which were also a proprietary implementation detail.
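
Resolving an attribute set with collections at index 0 could look roughly like the sketch below. The function name, the collection contents, and the use of plain type-name strings in place of AttrDef objects are all assumptions.

```php
<?php
// Hypothetical sketch: resolve an attribute set whose index 0 names the
// attribute collections to merge in.
function resolve_attributes(array $attrSet, array $collections): array
{
    $resolved = [];
    foreach ($attrSet[0] ?? [] as $name) {
        // Earlier collections win if two collections define the same attribute.
        $resolved += ($collections[$name] ?? []);
    }
    unset($attrSet[0]);
    // Element-specific attributes take precedence over collection ones.
    return $attrSet + $resolved;
}

$collections = [
    'Core' => ['class' => 'NMTOKENS', 'id' => 'ID', 'title' => 'CDATA'],
    'I18N' => ['xml:lang' => 'NMTOKEN'],
];
// Attribute set for one element: two collections plus one attribute of its own.
$anchorAttrs = [0 => ['Core', 'I18N'], 'href' => 'URI'];
$resolved = resolve_attributes($anchorAttrs, $collections);
```

The resulting type names would then be looked up in the Attribute Types array to obtain the actual AttrDef objects.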
|
||||||
|
|
||||||
|
Alright: let's get to work!
|