htmlpurifier/docs/proposal-plists.txt

THE UNIVERSAL DESIGN PATTERN: PROPERTIES
Steve Yegge

Implementation:
    get(name)
    put(name, value)
    has(name)
    remove(name)
    iteration, with filtering [this will be our namespaces]
    parent
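
A minimal sketch of the interface above, to make the parent lookup concrete.
The class name PlistSketch is hypothetical and is not HTML Purifier's actual
implementation. Reads walk up the parent chain; writes and removals only ever
touch the local plist.

```php
<?php
// Hypothetical sketch of a property list with a parent link.
class PlistSketch
{
    private $data = array(); // own properties, in insertion order
    private $parent;         // plist consulted on a miss, or null

    public function __construct($parent = null)
    {
        $this->parent = $parent;
    }

    // Walk up the parent chain iteratively until the key is found.
    public function get($name)
    {
        for ($p = $this; $p !== null; $p = $p->parent) {
            if (array_key_exists($name, $p->data)) {
                return $p->data[$name];
            }
        }
        return null; // use has() to tell "missing" apart from a stored null
    }

    // Writes never touch the parent.
    public function put($name, $value)
    {
        $this->data[$name] = $value;
    }

    public function has($name)
    {
        for ($p = $this; $p !== null; $p = $p->parent) {
            if (array_key_exists($name, $p->data)) {
                return true;
            }
        }
        return false;
    }

    // Only removes the local value; an inherited value shows through again.
    public function remove($name)
    {
        unset($this->data[$name]);
    }
}

$defaults = new PlistSketch();
$defaults->put('Namespace.Directive', 'val1');
$config = new PlistSketch($defaults);
$config->put('Namespace.Directive', 'override');
```

Note that remove() here re-exposes the parent's value, which is exactly the
deletion problem discussed further down.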

Representations:
    - Keys are strings
    - It's nice to not need to quote keys (if we formulate our own language,
      consider this)
    - Property not present representation (key missing)
    - If properties are frequently removed and re-added, storing null instead
      of removing may help. If null is a valid value, use another marker.
      (PHP semantics are weird here)

Data structures:
    - LinkedHashMap is wonderful (O(1) access and maintains order)
    - Using a special property that points to the parent is usual
    - Multiple inheritance possible, need rules for which to lookup first
    - Iterative inheritance is best
    - Consider performance!

Deletion
    - Tricky problem with inheritance
    - Distinguish between "not found" and "look in my parent for the property"
    [Maybe HTML Purifier won't allow deletion]
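
One way to draw that distinction, sketched under the assumption that we did
allow deletion: store a unique marker object meaning "deleted here", so the
lookup stops instead of falling through to the parent. TombstonePlist and its
members are hypothetical names.

```php
<?php
// Hypothetical sketch: a tombstone marks a property as deleted locally.
class TombstonePlist
{
    private $data = array();
    private $parent;
    private static $deleted; // shared marker object, compared by identity

    public function __construct($parent = null)
    {
        $this->parent = $parent;
        if (self::$deleted === null) {
            self::$deleted = new stdClass();
        }
    }

    public function put($name, $value)
    {
        $this->data[$name] = $value;
    }

    // Record a tombstone rather than unsetting the key, since unsetting
    // would re-expose the parent's value.
    public function remove($name)
    {
        $this->data[$name] = self::$deleted;
    }

    public function has($name)
    {
        for ($p = $this; $p !== null; $p = $p->parent) {
            if (array_key_exists($name, $p->data)) {
                // The first plist that mentions the key decides the answer.
                return $p->data[$name] !== self::$deleted;
            }
        }
        return false;
    }
}

$parent = new TombstonePlist();
$parent->put('Namespace.Directive', 'val1');
$child = new TombstonePlist($parent);
$child->remove('Namespace.Directive');
```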

Read/write asymmetry (it's correct!)

Read-only plists
    - Allow ability to freeze [this is what we have already]
    - Don't overuse it

Performance:
    - Intern strings (PHP does this already)
    - Don't be case-insensitive
    - If all properties in a plist are known a priori, you can use a "perfect"
      hash function. Often overkill.
    - Copy-on-read caching "plundering" reduces lookup, but uses memory and can
      grow stale. Use as last resort.
    - Refactoring to fields. Watch for API compatibility, system complexity,
      and lack of flexibility.
    - Refrigerator: external data-structure to hold plists

Transient properties:
    [Don't need to worry about this]
    - Use a separate plist for transient properties
    - Non-numeric override; numeric should ADD
    - Deletion: removeTransientProperty() and transientlyRemoveProperty()

Persistence:
    - XML/JSON are good
    - Text-based is good for readability, maintainability and bootstrapping
    - Compressed binary format for network transport [not necessary]
    - RDBMS or XML database

Querying: [not relevant]
    - XML database is nice for XPath/XQuery
    - jQuery for JSON
    - Just load it all into a program

Backfills/Data integrity:
    - Use usual methods
    - Lazy backfill is a nice hack

Type systems:
    - Flags: ReadOnly, Permanent, DontEnum
    - Typed properties aren't that useful [It's also Not-PHP]
    - Separate meta-list of directive properties IS useful
    - Duck typing is useful for systems designed fully around properties pattern

Trade-off:
    + Flexibility
    + Extensibility
    + Unit-testing/prototype-speed
    - Performance
    - Data integrity
    - Navigability/Query-ability
    - Reversibility (hard to go back)

HTML Purifier

We are not happy with our current system of defining configuration directives,
because it has become clear that things will get a lot nicer if we allow
multiple namespaces, and there are some features that naturally lend themselves
to inheritance, which we do not really support well.

One of the considered implementation changes would be to go from a structure
like:

array(
    'Namespace' => array(
        'Directive' => 'val1',
        'Directive2' => 'val2',
    )
)

to:

array(
    'Namespace.Directive' => 'val1',
    'Namespace.Directive2' => 'val2',
)

The latter implementation takes more memory, however, and it makes it a bit
complicated to grab all values from a namespace.
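
With the flat layout, grabbing a namespace amounts to a prefix scan over every
key. A rough sketch, where getBatchSketch is a hypothetical helper and not the
existing getBatch():

```php
<?php
// Hypothetical helper: return every directive in one namespace from a flat
// 'Namespace.Directive' => value array, using a prefix comparison.
function getBatchSketch(array $flat, $namespace)
{
    $prefix = $namespace . '.';
    $length = strlen($prefix);
    $batch = array();
    foreach ($flat as $key => $value) {
        // strncmp() returns 0 when the first $length bytes are equal.
        if (strncmp($key, $prefix, $length) === 0) {
            $batch[substr($key, $length)] = $value;
        }
    }
    return $batch;
}

$flat = array(
    'Namespace.Directive' => 'val1',
    'Namespace.Directive2' => 'val2',
    'Other.Directive' => 'val3',
);
$batch = getBatchSketch($flat, 'Namespace');
```

The scan is linear in the total number of directives, not in the size of the
namespace, which is the cost being weighed here.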

The alternate implementation choice is to allow nested plists. This keeps
iteration easy, but is problematic for inheritance (it would be difficult
to distinguish a plist from an array) and retrieval (when specifying multiple
namespaces we would need some multiple de-referencing).

----

We can take the performance hit, and just do iteration with a filter
(the strncmp call should be relatively cheap). Then, users should be able
to optimize by doing something like:

$config = HTMLPurifier_Config::createDefault();
if (!file_exists('config.php')) {
    // set up $config
    $config->save('config.php');
} else {
    $config->load('config.php');
}

Or maybe memcache, or something. This means that "// set up $config" must
not have any dynamic parts, or the user has to invalidate the cache when
they do update it. We have to think about this a little more carefully; the
file call might be more expensive.
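
As a rough idea of what a file-backed cache could look like: the save() and
load() methods above are only a proposal, so this sketch uses two hypothetical
functions over a plain array. It assumes the values are scalars or arrays,
which var_export() can write out as PHP source.

```php
<?php
// Hypothetical helpers standing in for the proposed save()/load().
function savePlistSketch(array $plist, $file)
{
    // var_export() emits valid PHP source for arrays of scalars.
    file_put_contents($file, '<?php return ' . var_export($plist, true) . ';');
}

function loadPlistSketch($file)
{
    // The included file's return statement yields the stored array.
    return include $file;
}

$file = tempnam(sys_get_temp_dir(), 'plist');
savePlistSketch(array('Namespace.Directive' => 'val1'), $file);
$loaded = loadPlistSketch($file);
unlink($file);
```

Whether the include is actually cheaper than rebuilding the configuration
would need measuring.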

----

This might get expensive, however, when we actually care about iterating
over the configuration and want the actual values. So what about nesting the
lists?

"ns.sub.directive" => values['ns']['sub']['directive']

We could distinguish between plists and arrays by using ArrayObjects for the
plists, and regular arrays for the arrays. Alternatively, use ArrayObjects
for the arrays, and regular arrays for the plists.
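
The multiple de-referencing for a dotted key might look like the following.
This is a sketch over plain nested arrays; getNestedSketch is a hypothetical
name, and it does nothing to solve the plist-versus-array ambiguity.

```php
<?php
// Hypothetical helper: resolve a dotted key against nested arrays.
// Returns array(found, value) so a stored null differs from a miss.
function getNestedSketch(array $values, $key)
{
    $node = $values;
    foreach (explode('.', $key) as $part) {
        if (!is_array($node) || !array_key_exists($part, $node)) {
            return array(false, null);
        }
        $node = $node[$part];
    }
    return array(true, $node);
}

$values = array('ns' => array('sub' => array('directive' => 'val1')));
$hit = getNestedSketch($values, 'ns.sub.directive');
$miss = getNestedSketch($values, 'ns.missing');
```

Asking for 'ns.sub' also succeeds and returns an array, and nothing in the
result says whether that array is a nested plist or a directive whose value
happens to be an array.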

----

Implementation demands, and what has caused them:

1. DefinitionCache, the HTML, CSS and URI namespaces have caches attached to them
   Results:
    - getBatchSerial()
        - getBatch() : in general, the ability to traverse just a namespace

2. AutoFormat/Filter, this is a plugin architecture, directives not hard-coded
    - getBatch()

3. Configuration form
    - Namespaces used to organize directives

Other than that, we have a pure plist. PERHAPS we should maintain separate things
for these different demands.

Issue 2: Directives for configuring the plugins are regular plists, but
when enabling them, while it's "plist-ish", what you're really doing is adding
them to an array of "autoformatters"/"filters" to enable. We can set up
magic BC as well as in the new interface, but there should also be an
add('AutoFormat', 'AutoParagraph'); which does the right thing.

One thing to consider is whether or not inheritance rules will apply to these.
I'd say yes. That means that they're still plisty; in fact, the underlying
implementation will probably be a plist. However, they will get their OWN
plists, and will NOT support nesting.

Issue 1: Our current implementation is generally not efficient; md5(serialize($foo))
is pretty expensive. So, I don't think there will be any problems if it
gets "less" efficient, as long as we give users a properly fast alternative;
DefinitionRev gives us a way to do this, by simply telling the user they must
update it whenever they update Configuration directives as well. (There are
obvious BC concerns here).

In such a case, we simply iterate over our plist (performing full retrievals
for each value), grab the entries we care about, and then serialize and hash.
It's going to be slow either way, due to the ability of plists to inherit.
If we ksort(), we don't have to traverse the entire array; however, the
cost of a ksort() call may not be worth it.
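
The iterate, filter, serialize and hash steps could be sketched as below.
getBatchSerialSketch is a hypothetical stand-in for getBatchSerial(), and it
assumes inheritance has already been resolved into one flat array. Note that
ksort() is used here for a different reason than the one above: it makes the
hash independent of insertion order.

```php
<?php
// Hypothetical stand-in for getBatchSerial(): hash one namespace's values.
function getBatchSerialSketch(array $flat, $namespace)
{
    $prefix = $namespace . '.';
    $length = strlen($prefix);
    $batch = array();
    foreach ($flat as $key => $value) {
        if (strncmp($key, $prefix, $length) === 0) {
            $batch[$key] = $value;
        }
    }
    ksort($batch); // same directives in a different order hash the same
    return md5(serialize($batch));
}

$a = array('HTML.A' => 1, 'HTML.B' => 2, 'CSS.A' => 3);
$b = array('CSS.A' => 99, 'HTML.B' => 2, 'HTML.A' => 1);
$serialA = getBatchSerialSketch($a, 'HTML');
$serialB = getBatchSerialSketch($b, 'HTML');
```

Changes to directives outside the namespace leave the hash alone, which is
what a per-namespace DefinitionCache would want.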

At this point, last time, I started worrying about the performance implications
of allowing inheritance, and wondering whether or not I wanted to squash
the plist. At first blush, our code might be under the assumption that
accessing properties is cheap; but actually we prefer to copy out the value
into a member variable if it's going to be used many times. With this in mind
I don't think CPU consumption from a few nested function calls is going to
be a problem. We *are* going to enforce a function-only interface.

The next issue at hand is how we're going to manage the "special" plists,
which should still be able to be inherited. Basically, it means that multiple
plists would be attached to the configuration object, which is not the
best for memory performance. The alternative is to keep them all in one
big plist, and then eat the one-time cost of traversing the entire plist
to grab the appropriate values.

I think at this point we can write the generic interface, and then set up separate
plists if that ends up being necessary for performance (it probably won't). Now
let's code our generic plist implementation.

----

Iterating over the plist presents some problems. The way we've chosen to solve
this is to squash all of the parents.
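
Squashing could look roughly like this. The sketch represents each plist as an
array with 'data' and 'parent' entries, which is an assumption made for the
example and not the real data structure.

```php
<?php
// Hypothetical sketch: flatten a plist and all of its ancestors into one
// array, with values from closer plists overriding more distant ones.
function squashSketch($plist)
{
    // Collect the chain from the plist itself up to the root.
    $chain = array();
    for ($p = $plist; $p !== null; $p = $p['parent']) {
        $chain[] = $p['data'];
    }
    // Apply the most distant ancestor first so that children override it.
    $flat = array();
    foreach (array_reverse($chain) as $data) {
        foreach ($data as $key => $value) {
            $flat[$key] = $value;
        }
    }
    return $flat;
}

$parent = array('data' => array('A.x' => 1, 'A.y' => 2), 'parent' => null);
$child = array('data' => array('A.y' => 3, 'B.z' => 4), 'parent' => $parent);
$flat = squashSketch($child);
```

The result is a plain array, so ordinary foreach iteration works on it, at the
cost of one pass over every ancestor.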

----

But I don't need iteration.

    vim: et sw=4 sts=4
Revamp configuration backend. Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com> 2009-02-07 07:53:20 +00:00
Style refresh: add/remove vimlines, fix minor factual errors. Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com> 2009-04-09 16:47:10 +00:00