From c43c0660f50cb9e56d51fc778ee10b6d218f2adc Mon Sep 17 00:00:00 2001 From: "Edward Z. Yang" Date: Tue, 15 Jan 2008 03:05:43 +0000 Subject: [PATCH] Update dev-includes.txt with our evil master plan. git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/trunk@1506 48356398-32a2-884e-a903-53898d9a118a --- TODO | 5 +- docs/dev-includes.txt | 215 +++++++++++++++++++++++++++++++++++++++++- 2 files changed, 217 insertions(+), 3 deletions(-) diff --git a/TODO b/TODO index dd764deb..2ebad1af 100644 --- a/TODO +++ b/TODO @@ -12,7 +12,10 @@ amount of effort to implement, it may get endlessly delayed. Do not be afraid to cast your vote for the next feature to be implemented! IMPORTANT - - Ensure all configuration goes through HTMLPurifier_Config + - Put our new configuration thing into effect + - Get everything into configuration objects (filters, I'm looking at you) + - Factor demo.php into a set of Printer classes, and then create a stub + file for users here 3.1 release [Error'ed] # Error logging for filtering/cleanup procedures diff --git a/docs/dev-includes.txt b/docs/dev-includes.txt index 2c6f6c89..f5bae133 100644 --- a/docs/dev-includes.txt +++ b/docs/dev-includes.txt @@ -20,7 +20,9 @@ when there are not. Unfortunately, these two goals seem contrary to each other. A peripheral issue is the performance of ConfigSchema, which has been -shown take a large, constant amount of initialization time. +shown take a large, constant amount of initialization time, and is +intricately linked to the issue of includes due to its pervasive use +in our plugin architecture. Pros and Cons ------------- @@ -47,13 +49,222 @@ Include it all: - Classes that inherit from external libraries will cause compile errors -Build an include stub: +Build an include stub (Let's do this!): Pros: - Only necessary code is included - Plays nicely with opcode caches and standalone version - require (without once) can be used, see above + - Could further extend as a compilation to one file Cons: - Not implemented yet - Requires user intervention and use of a command line script - Standalone script must be chained to this - More complex and compiled-language-like + - Requires a whole new class of system-wide configuration directives, + as configuration objects can be reused + - Determining what needs to be included can be complex (see above) + - No way of autodetecting dynamically instantiated classes + - Might be slow + +Include stubs +------------- + +This solution may be "just right" for users who are heavily oriented +towards performance. However, there are a number of picky implementation +details to work out beforehand. + +The number one concern is how to make the HTML Purifier files "work +out of the box", while still being able to easily get them into a form +that works with this setup. As the codebase stands right now, it would +be necessary to strip out all of the require_once calls. The only way +we could get rid of the require_once calls is to use __autoload or +use the stub for all cases (which might not be a bad idea). + + Aside + ----- + An important thing to remember, however, is that these require_once's + are valuable data about what classes a file needs. Unfortunately, there's + no distinction between whether or not the file is needed all the time, + or whether or not it is one of our "optional" files. Thus, it is + effectively useless. + + Deprecated + ---------- + One of the things I'd like to do is have the code search for any classes + that are explicitly mentioned in the code. If a class isn't mentioned, I + get to assume that it is "optional," i.e. included via introspection. + The choice is either to use PHP's tokenizer or use regexps; regexps would + be faster but a tokenizer would be more correct. If this ends up being + unfeasible, adding dependency comments isn't a bad idea. (This could + even be done automatically by search/replacing require_once, although + we'd have to manually inspect the results for the optional requires.) + + NOTE: This ends up not being necessary, as we're going to make the user + figure out all the extra classes they need, and only include the core + which is predetermined. + +Using the autoload framework with include stubs works nicely with +introspective classes: instead of having to have require_once inside +the function, we can let autoload do the work; we simply need to +new $class or accept the object straight from the caller. Handling filters +becomes a simple matter of ticking off configuration directives, and +if ConfigSchema spits out errors, adding the necessary includes. We could +also use the autoload framework as a fallback, in case the user forgets +to make the include, but doesn't really care about performance. + + Insight + ------- + All of this talk is merely a natural extension of what our current + standalone functionality does. However, instead of having our code + perform the includes, or attempting to inline everything that possibly + could be used, we boot the issue to the user, making them include + everything or setup the fallback autoload handler. + +Configuration Schema +-------------------- + +A common deficiency for all of the conditional include setups (including +the dynamically built include PHP stub) is that if one of this +conditionally included files includes a configuration directive, it +is not accessible to configdoc. A stopgap solution for this problem is +to have it piggy-back off of the data in the merge-library.php script +to figure out what extra files it needs to include, but if the file also +inherits classes that don't exist, we're in big trouble. + +I think it's high time we centralized the configuration documentation. +However, the type checking has been a great boon for the library, and +I'd like to keep that. The compromise is to use some other source, and +then parse it into the ConfigSchema internal format (sans all of those +nasty documentation strings which we really don't need at runtime) and +serialize that for future use. + +The next question is that of format. XML is very verbose, and the prospect +of setting defaults in it gives me willies. However, this may be necessary. +Splitting up the file into manageable chunks may alleviate this trouble, +and we may be even want to create our own format optimized for specifying +configuration. It might look like (based off the PHPT format, which is +nicely compact yet unambiguous and human-readable): + +Core.HiddenElements +TYPE: lookup +DEFAULT: array('script', 'style') // auto-converted during processing +--ALIASES-- +Core.InvisibleElements, Core.StupidElements +--DESCRIPTION-- +

+ Blah blah +

+ +The first line is the directive name, the lines after that prior to the +first --HEADER-- block are single-line values, and then after that +the multiline values are there. No value is restricted to a particular +format: DEFAULT could very well be multiline if that would be easier. +This would make it insanely easy, also, to add arbitrary extra parameters, +like: + +VERSION: 3.0.0 +ALLOWED: 'none', 'light', 'medium', 'heavy' // this is wrapped in array() +EXTERNAL: CSSTidy // this would be documented somewhere else with a URL + +The final loss would be that you wouldn't know what file the directive +was used in; with some clever regexps it should be possible to +figure out where $config->get($ns, $d); occurs. Reflective calls to +the configuration object is mitigated by the fact that getBatch is +used, so we can simply talk about that in the namespace definition page. +This might be slow, but it would only happen when we are creating +the documentation for consumption, and is sugar. + +We can put this in a schema/ directory, outside of HTML Purifier. The serialized +data gets treated like entities.ser. + +The final thing that needs to be handled is user defined configurations. +They can be added at runtime using ConfigSchema::registerDirectory() +which globs the directory and grabs all of the directives to be incorporated +in. Then, the result is saved. We may want to take advantage of the +DefinitionCache framework, although it is not altogether certain what +configuration directives would be used to generate our key (meta-directives!) + + Further thoughts + ---------------- + Our master configuration schema will only need to be updated once + every new version, so it's easily versionable. User specified + schema files are far more volatile, but it's far too expensive + to check the filemtimes of all the files, so a DefinitionRev style + mechanism works better. However, we can uniquely identify the + schema based on the directories they loaded, so there's no need + for a DefinitionId until we give them full programmatic control. + + These variables should be directly incorporated into ConfigSchema, + and ConfigSchema should handle serialization. Some refactoring will be + necessary for the DefinitionCache classes, as they are built with + Config in mind. If the user changes something, the cache file gets + rebuilt. If the version changes, the cache file gets rebuilt. Since + our unit tests flush the caches before we start, and the operation is + pretty fast, this will not negatively impact unit testing. + +One last thing: certain configuration directives require that files +get added. They may even be specified dynamically. It is not a good idea +for the HTMLPurifier_Config object to be used directly for such matters. +Instead, the userland code should explicitly perform the includes. We may +put in something like: + +REQUIRES: HTMLPurifier_Filter_ExtractStyleBlocks + +To indicate that if that class doesn't exist, and the user is attempting +to use the directive, we should fatally error out. The stub includes the core files, +and the user includes everything else. Any reflective things like new +$class would be required to tie in with the configuration. + +It would work very well with rarely used configuration options, but it +wouldn't be so good for "core" parts that can be disabled. In such cases +the core include file would need to be modified, and the only way +to properly do this is use the configuration object. Once again, our +ability to create cache keys saves the day again: we can create arbitrary +stub files for arbitrary configurations and include those. They could +even be the single file affairs. The only thing we'd need to include, +then, would be HTMLPurifier_Config! Then, the configuration object would +load the library. + + An aside... + ----------- + One questions, however, the wisdom of letting PHP files write other PHP + files. It seems like a recipe for disaster, or at least lots of headaches + in highly secured setups, where PHP does not have the ability to write + to its root. In such cases, we could use sticky bits or tell the user + to manually generate the file. + + The other troublesome bit is actually doing the calculations necessary. + For certain cases, it's simple (such as URIScheme), but for AttrDef + and HTMLModule the dependency trees are very complex in relation to + %HTML.Allowed and friends. I think that this idea should be shelved + and looked at a later, less insane date. + +An interesting dilemma presents itself when a configuration form is offered +to the user. Normally, the configuration object is not accessible without +editing PHP code; this facility changes thing. The sensible thing to do +is stipulate that all classes required by the directives you allow must +be included. + +Unit testing +------------ + +Setting up the parsing and translation into our existing format would not +be difficult to do. It might represent a good time for us to rethink our +tests for these facilities; as creative as they are, they are often hacky +and require public visibility for things that ought to be protected. +This is especially applicable for our DefinitionCache tests. + +Migration +--------- + +Because we are not *adding* anything essentially new, it should be trivial +to write a script to take our existing data and dump it into the new format. +Well, not trivial, but fairly easy to accomplish. Primary implementation +difficulties would probably involve formatting the file nicely. + +Backwards-compatibility +----------------------- + +I expect that the ConfigSchema methods should stick around for a little bit, +but display E_USER_NOTICE warnings that they are deprecated. This will +require documentation!