mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2025-01-24 22:31:52 +00:00
12b811d749
Signed-off-by: Edward Z. Yang <edwardzyang@thewritingpot.com>
INCLUDES, AUTOLOAD, BYTECODE CACHES and OPTIMIZATION

The Problem
-----------

HTML Purifier contains a number of extra components that are not used all
of the time, only if the user explicitly specifies that we should use
them.

Some of these optional components are conditionally included (Filter,
Language, Lexer, Printer), while others are included all the time
(Injector, URIFilter, HTMLModule, URIScheme). We will stipulate that these
are all developer specified: it is conceivable that certain Tokens are not
used, but this is user-dependent and should not be trusted.

We should come up with a consistent way to handle these things and ensure
that we get the maximum performance both when there are bytecode caches
and when there are not. Unfortunately, these two goals seem contrary to
each other.

A peripheral issue is the performance of ConfigSchema, which has been
shown to take a large, constant amount of initialization time, and is
intricately linked to the issue of includes due to its pervasive use
in our plugin architecture.

Pros and Cons
-------------

We will assume that users include their own extensions themselves.

Conditional includes:
  Pros:
    - User management is simplified; only a single directive needs to be set
    - Only necessary code is included
  Cons:
    - Doesn't play nicely with opcode caches
    - Adds complexity to standalone version
    - Optional configuration directives are not exposed without a little
      extra coaxing (not implemented yet)

Include it all:
  Pros:
    - User management is still simple
    - Plays nicely with opcode caches and standalone version
    - All configuration directives are present
  Cons:
    - Lots of (how much?) extra code is included
    - Classes that inherit from external libraries will cause compile
      errors

Build an include stub (Let's do this!):
  Pros:
    - Only necessary code is included
    - Plays nicely with opcode caches and standalone version
    - require (without once) can be used, see above
    - Could further extend as a compilation to one file
  Cons:
    - Not implemented yet
    - Requires user intervention and use of a command line script
    - Standalone script must be chained to this
    - More complex and compiled-language-like
    - Requires a whole new class of system-wide configuration directives,
      as configuration objects can be reused
    - Determining what needs to be included can be complex (see above)
    - No way of autodetecting dynamically instantiated classes
    - Might be slow

Include stubs
-------------

This solution may be "just right" for users who are heavily oriented
towards performance. However, there are a number of picky implementation
details to work out beforehand.

The number one concern is how to make the HTML Purifier files "work
out of the box", while still being able to easily get them into a form
that works with this setup. As the codebase stands right now, it would
be necessary to strip out all of the require_once calls. The only way
we could get rid of the require_once calls is to use __autoload or
use the stub for all cases (which might not be a bad idea).

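As a rough illustration of what building a stub could involve, the sketch
below turns a list of class names into a file of plain require statements.
The function name build_include_stub and the underscore-to-directory path
mapping are assumptions made for this example; nothing here is an existing
part of the library.

```php
<?php
// Hypothetical helper: turn a list of class names into the source of an
// include stub. The class-to-path mapping (underscores become directory
// separators) is an assumption for illustration.
function build_include_stub(array $classes, string $prefix = ''): string
{
    $lines = array('<?php', '// Generated include stub; do not edit by hand.');
    foreach (array_unique($classes) as $class) {
        $path = $prefix . str_replace('_', '/', $class) . '.php';
        // Plain require (not require_once): each file is listed exactly once.
        $lines[] = 'require ' . var_export($path, true) . ';';
    }
    return implode("\n", $lines) . "\n";
}

echo build_include_stub(array(
    'HTMLPurifier',
    'HTMLPurifier_Config',
    'HTMLPurifier_Filter_ExtractStyleBlocks',
));
```

Because duplicates are dropped before the stub is written, the generated
file can use require rather than require_once.
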
Aside
-----
An important thing to remember, however, is that these require_once's
are valuable data about what classes a file needs. Unfortunately, there's
no distinction between whether or not the file is needed all the time,
or whether or not it is one of our "optional" files. Thus, for our
purposes the data is effectively useless.

Deprecated
----------
One of the things I'd like to do is have the code search for any classes
that are explicitly mentioned in the code. If a class isn't mentioned, I
get to assume that it is "optional," i.e. included via introspection.
The choice is either to use PHP's tokenizer or use regexps; regexps would
be faster but a tokenizer would be more correct. If this ends up being
infeasible, adding dependency comments isn't a bad idea. (This could
even be done automatically by search/replacing require_once, although
we'd have to manually inspect the results for the optional requires.)

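To make the tokenizer option concrete, here is a minimal sketch that
collects identifiers starting with "HTMLPurifier" from a piece of source
using token_get_all(). The function name find_mentioned_classes and the
prefix test are assumptions for illustration.

```php
<?php
// Collect identifiers that look like HTML Purifier class names from a
// piece of PHP source, using the tokenizer rather than regexps.
function find_mentioned_classes(string $source): array
{
    $found = array();
    foreach (token_get_all($source) as $token) {
        // Single-character tokens are plain strings; skip them.
        if (!is_array($token)) continue;
        list($id, $text) = $token;
        // T_STRING covers bare identifiers; names inside string literals
        // and comments arrive as other token types and are ignored.
        if ($id === T_STRING && strpos($text, 'HTMLPurifier') === 0) {
            $found[$text] = true;
        }
    }
    return array_keys($found);
}

$source = '<?php
// HTMLPurifier_Lexer_PH5P is only mentioned in this comment
$lexer = new HTMLPurifier_Lexer_DirectLex();
$name  = "HTMLPurifier_Printer";
';
print_r(find_mentioned_classes($source));
```

Only the class used with new is reported; the names in the comment and in
the string literal are skipped, which is exactly the kind of case a naive
regexp would get wrong.
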
NOTE: This ends up not being necessary, as we're going to make the user
figure out all the extra classes they need, and only include the core
which is predetermined.

Using the autoload framework with include stubs works nicely with
introspective classes: instead of having to have require_once inside
the function, we can let autoload do the work; we simply need to
new $class or accept the object straight from the caller. Handling filters
becomes a simple matter of ticking off configuration directives, and
if ConfigSchema spits out errors, adding the necessary includes. We could
also use the autoload framework as a fallback, in case the user forgets
to make the include, but doesn't really care about performance.

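A fallback autoloader along these lines might look like the sketch below.
It uses spl_autoload_register(), since the __autoload() function was
removed in PHP 8.0. The helper name purifier_class_to_path and the
underscore-to-directory mapping are assumptions for illustration.

```php
<?php
// Map a class name to a relative file path. The underscore-to-directory
// convention is assumed here for illustration.
function purifier_class_to_path(string $class): ?string
{
    // Only handle our own classes; leave the rest to other autoloaders.
    if (strpos($class, 'HTMLPurifier') !== 0) return null;
    return str_replace('_', '/', $class) . '.php';
}

// Fallback autoloader: only does work when the user forgot an include.
spl_autoload_register(function ($class) {
    $path = purifier_class_to_path($class);
    if ($path === null) return;
    $file = __DIR__ . '/' . $path;
    if (is_file($file)) require $file;
});
```

With this registered, new $class resolves on demand, at the cost of a
filesystem check per missing class.
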
Insight
-------
All of this talk is merely a natural extension of what our current
standalone functionality does. However, instead of having our code
perform the includes, or attempting to inline everything that possibly
could be used, we boot the issue to the user, making them include
everything or set up the fallback autoload handler.

Configuration Schema
--------------------

A common deficiency for all of the conditional include setups (including
the dynamically built include PHP stub) is that if one of these
conditionally included files includes a configuration directive, it
is not accessible to configdoc. A stopgap solution for this problem is
to have it piggy-back off of the data in the merge-library.php script
to figure out what extra files it needs to include, but if the file also
inherits classes that don't exist, we're in big trouble.

I think it's high time we centralized the configuration documentation.
However, the type checking has been a great boon for the library, and
I'd like to keep that. The compromise is to use some other source, and
then parse it into the ConfigSchema internal format (sans all of those
nasty documentation strings which we really don't need at runtime) and
serialize that for future use.

The next question is that of format. XML is very verbose, and the prospect
of setting defaults in it gives me the willies. However, this may be
necessary. Splitting up the file into manageable chunks may alleviate this
trouble, and we may even want to create our own format optimized for
specifying configuration. It might look like (based off the PHPT format,
which is nicely compact yet unambiguous and human-readable):

    Core.HiddenElements
    TYPE: lookup
    DEFAULT: array('script', 'style') // auto-converted during processing
    --ALIASES--
    Core.InvisibleElements, Core.StupidElements
    --DESCRIPTION--
    <p>
      Blah blah
    </p>

The first line is the directive name. The lines after it, up to the
first --HEADER-- block, are single-line KEY: value pairs, and everything
after that is made up of multiline values, one per --HEADER-- block. No
value is restricted to a particular format: DEFAULT could very well be
multiline if that would be easier. This would make it insanely easy,
also, to add arbitrary extra parameters, like:

    VERSION: 3.0.0
    ALLOWED: 'none', 'light', 'medium', 'heavy' // this is wrapped in array()
    EXTERNAL: CSSTidy // this would be documented somewhere else with a URL

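The parsing rules above are simple enough to sketch. The following is a
minimal, hypothetical parser for a single directive file; the function
name parse_directive_file and the 'ID' key are made up for this example,
and trailing // annotations are left in the values untouched.

```php
<?php
// Hypothetical parser for one directive file in the PHPT-like format.
function parse_directive_file(string $text): array
{
    $lines = preg_split('/\r\n|\r|\n/', trim($text));
    // The first line is the directive name.
    $result = array('ID' => trim(array_shift($lines)));
    $block = null; // name of the multiline block we are inside, if any
    foreach ($lines as $line) {
        if (preg_match('/^--([A-Z]+)--$/', trim($line), $m)) {
            // A --HEADER-- line starts a new multiline value.
            $block = $m[1];
            $result[$block] = '';
        } elseif ($block !== null) {
            $result[$block] .= $line . "\n";
        } elseif (strpos($line, ':') !== false) {
            // Before the first header: single-line KEY: value pairs.
            list($key, $value) = explode(':', $line, 2);
            $result[trim($key)] = trim($value);
        }
    }
    foreach ($result as $key => $value) $result[$key] = trim($value);
    return $result;
}

$example = <<<'EOT'
Core.HiddenElements
TYPE: lookup
DEFAULT: array('script', 'style')
--ALIASES--
Core.InvisibleElements, Core.StupidElements
--DESCRIPTION--
<p>
  Blah blah
</p>
EOT;
print_r(parse_directive_file($example));
```

Unknown keys such as VERSION or EXTERNAL need no special handling: they
simply show up as extra entries in the returned array.
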
The final loss would be that you wouldn't know what file the directive
was used in; with some clever regexps it should be possible to
figure out where $config->get($ns, $d); occurs. Reflective calls to
the configuration object are mitigated by the fact that getBatch is
used, so we can simply talk about that in the namespace definition page.
This might be slow, but it would only happen when we are creating
the documentation for consumption, and is only sugar.

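One possible shape for such a regexp is sketched below. It only finds
calls whose namespace and directive are string literals and whose
receiver is literally named $config; both restrictions, and the function
name find_directive_uses, are assumptions for illustration.

```php
<?php
// Find literal $config->get('Namespace', 'Directive') calls in PHP source
// and return them as "Namespace.Directive" strings.
function find_directive_uses(string $source): array
{
    // Groups 1 and 3 capture the opening quote so the closing one matches.
    $pattern = '/\$config->get\(\s*([\'"])(\w+)\1\s*,\s*([\'"])(\w+)\3\s*\)/';
    preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);
    $uses = array();
    foreach ($matches as $m) {
        $uses[] = $m[2] . '.' . $m[4];
    }
    return array_values(array_unique($uses));
}

$code = 'if ($config->get("Core", "HiddenElements")) {}'
      . ' $config->get(\'HTML\', \'Allowed\');'
      . ' $config->get($ns, $d);';
print_r(find_directive_uses($code));
```

The reflective call with variable arguments is not matched, which is the
case the note about getBatch is meant to cover.
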
We can put this in a schema/ directory, outside of HTML Purifier. The
serialized data gets treated like entities.ser.

The final thing that needs to be handled is user defined configurations.
They can be added at runtime using ConfigSchema::registerDirectory()
which globs the directory and grabs all of the directives to be incorporated
in. Then, the result is saved. We may want to take advantage of the
DefinitionCache framework, although it is not altogether certain what
configuration directives would be used to generate our key (meta-directives!).

Further thoughts
----------------
Our master configuration schema will only need to be updated once
every new version, so it's easily versionable. User specified
schema files are far more volatile, but it's far too expensive
to check the filemtimes of all the files, so a DefinitionRev style
mechanism works better. However, we can uniquely identify the
schema based on the directories they loaded, so there's no need
for a DefinitionId until we give them full programmatic control.

These variables should be directly incorporated into ConfigSchema,
and ConfigSchema should handle serialization. Some refactoring will be
necessary for the DefinitionCache classes, as they are built with
Config in mind. If the user changes something, the cache file gets
rebuilt. If the version changes, the cache file gets rebuilt. Since
our unit tests flush the caches before we start, and the operation is
pretty fast, this will not negatively impact unit testing.

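One way to read this is that the cache key is a function of the library
version, the set of registered directories, and a user-bumped revision
number. The sketch below assumes exactly that; the function name
schema_cache_key, the key layout, and the choice to ignore registration
order are all assumptions, not a description of DefinitionCache.

```php
<?php
// Hypothetical cache key for a serialized schema: it changes when the
// library version, the set of schema directories, or the revision changes.
function schema_cache_key(string $version, array $directories, int $revision = 1): string
{
    $directories = array_values(array_unique($directories));
    // Assumption: registration order should not change the key.
    sort($directories);
    return $version . ',' . md5(serialize($directories)) . ',' . $revision;
}

echo schema_cache_key('3.0.0', array('/app/schema', '/lib/schema')), "\n";
```

Bumping the revision plays the role of DefinitionRev: it forces a rebuild
without having to stat every schema file.
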
One last thing: certain configuration directives require that files
get added. They may even be specified dynamically. It is not a good idea
for the HTMLPurifier_Config object to be used directly for such matters.
Instead, the userland code should explicitly perform the includes. We may
put in something like:

    REQUIRES: HTMLPurifier_Filter_ExtractStyleBlocks

This indicates that if that class doesn't exist, and the user is attempting
to use the directive, we should fatally error out. The stub includes the
core files, and the user includes everything else. Any reflective things
like new $class would be required to tie in with the configuration.

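A check of this kind could be as small as the sketch below. The function
name check_directive_requirements is hypothetical, and throwing an
exception stands in for "fatally error out"; the real mechanism is an
open design choice.

```php
<?php
// Hypothetical check for a REQUIRES entry: fail loudly if the user turned
// on a directive without including the class it depends on.
function check_directive_requirements(string $directive, array $requires): void
{
    foreach ($requires as $class) {
        // class_exists() triggers any registered autoloader by default,
        // so a fallback autoload handler still gets its chance.
        if (!class_exists($class)) {
            throw new RuntimeException(
                "Directive $directive requires class $class, which has not been included"
            );
        }
    }
}
```

The check would run only when the directive is actually set, so users who
never touch the directive pay nothing for it.
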
It would work very well with rarely used configuration options, but it
wouldn't be so good for "core" parts that can be disabled. In such cases
the core include file would need to be modified, and the only way
to properly do this is to use the configuration object. Once again, our
ability to create cache keys saves the day: we can create arbitrary
stub files for arbitrary configurations and include those. They could
even be the single file affairs. The only thing we'd need to include,
then, would be HTMLPurifier_Config! Then, the configuration object would
load the library.

An aside...
-----------
One questions, however, the wisdom of letting PHP files write other PHP
files. It seems like a recipe for disaster, or at least lots of headaches
in highly secured setups, where PHP does not have the ability to write
to its root. In such cases, we could use sticky bits or tell the user
to manually generate the file.

The other troublesome bit is actually doing the calculations necessary.
For certain cases, it's simple (such as URIScheme), but for AttrDef
and HTMLModule the dependency trees are very complex in relation to
%HTML.Allowed and friends. I think that this idea should be shelved
and revisited at a later, less insane date.

An interesting dilemma presents itself when a configuration form is offered
to the user. Normally, the configuration object is not accessible without
editing PHP code; this facility changes things. The sensible thing to do
is stipulate that all classes required by the directives you allow must
be included.

Unit testing
------------

Setting up the parsing and translation into our existing format would not
be difficult to do. It might represent a good time for us to rethink our
tests for these facilities; as creative as they are, they are often hacky
and require public visibility for things that ought to be protected.
This is especially applicable for our DefinitionCache tests.

Migration
---------

Because we are not *adding* anything essentially new, it should be trivial
to write a script to take our existing data and dump it into the new format.
Well, not trivial, but fairly easy to accomplish. Primary implementation
difficulties would probably involve formatting the file nicely.

Backwards-compatibility
-----------------------

I expect that the ConfigSchema methods should stick around for a little bit,
but display E_USER_NOTICE warnings that they are deprecated. This will
require documentation!

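A deprecated method kept for compatibility could raise its notice with
trigger_error(), roughly as sketched below. The class name
Sketch_ConfigSchema and the method signature are placeholders for this
example, not the actual ConfigSchema API.

```php
<?php
// Placeholder class standing in for the schema class; the method name
// and arguments are assumptions for illustration.
class Sketch_ConfigSchema
{
    public static function define($namespace, $name, $default, $type, $description)
    {
        // Warn, but keep working, so existing userland code does not break.
        trigger_error(
            'Sketch_ConfigSchema::define() is deprecated; '
                . 'describe the directive in a schema file instead',
            E_USER_NOTICE
        );
        // ... forward to the new schema-building code here ...
    }
}
```

Since E_USER_NOTICE is non-fatal, callers keep running and can migrate at
their own pace.
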
New stuff
---------

VERSION: Version number in which the directive was introduced
DEPRECATED-VERSION: If the directive was deprecated, when was it deprecated?
DEPRECATED-USE: If the directive was deprecated, what should the user use now?
REQUIRES: What classes does this configuration directive require, but are
    not part of the HTML Purifier core?

    vim: et sw=4 sts=4