+ This is a quick and dirty document to get you on your way to writing
+ custom URI filters for your own URL filtering needs. Why would you
+ want to write a URI filter? If you need URIs your users put into
+ HTML to magically change into a different URI, this is
+ exactly what you need!
+
+
+
Creating the class
+
+
+ Any URI filter you make will be a subclass of HTMLPurifier_URIFilter.
+ The scaffolding is thus:
+
+
+
class HTMLPurifier_URIFilter_NameOfFilter extends HTMLPurifier_URIFilter
+{
+ var $name = 'NameOfFilter';
+ function prepare($config) {}
+ function filter(&$uri, $config, &$context) {}
+}
+
+
+ Fill in the variable $name with the name of your filter, and
+ take a look at the two methods. prepare() is an initialization
+ method that is called only once, before any filtering has been done of the
+ HTML. Use it to perform any costly setup work that only needs to be done
+ once. filter() is the guts and innards of our filter:
+ it takes the URI and does whatever needs to be done to it.
+
+
+
+ If you've worked with HTML Purifier, you'll recognize the $config
+ and $context parameters. On the other hand, $uri
+ is something unique to this section of the application: it's a
+ HTMLPurifier_URI object. The interface is thus:
+
+
+
class HTMLPurifier_URI
+{
+ var $scheme, $userinfo, $host, $port, $path, $query, $fragment;
+ function HTMLPurifier_URI($scheme, $userinfo, $host, $port, $path, $query, $fragment);
+ function toString();
+ function copy();
+ function getSchemeObj($config, &$context);
+ function validate($config, &$context);
+}
+
+
+ The first three methods are fairly self-explanatory: you have a constructor,
+ a serializer, and a cloner. Generally, you won't be using them when
+ you are manipulating the URI objects themselves.
+ getSchemeObj() is a special purpose method that returns
+ a HTMLPurifier_URIScheme object corresponding to the specific
+ URI at hand. validate() performs general-purpose validation
+ on the internal components of a URI. Once again, you don't need to
+ worry about these: they've already been handled for you.
+
+
+
URI format
+
+
+ As a URIFilter, we're interested in the member variables of the URI object.
+
+
+
+
Scheme
The protocol for identifying (and possibly locating) a resource (http, ftp, https)
+
Userinfo
User information such as a username (bob)
+
Host
Domain name or IP address of the server (example.com, 127.0.0.1)
+
Port
Network port number for the server (80, 12345)
+
Path
Data that identifies the resource, possibly hierarchical (/path/to, ed@example.com)
+
Query
String of information to be interpreted by the resource (?q=search-term)
+
Fragment
Additional information for the resource after retrieval (#bookmark)
+
+
+
+ Because the URI is presented to us in this form, and not
+ http://bob@example.com:8080/foo.php?q=string#hash, it saves us
+ a lot of trouble in having to parse the URI every time we want to filter
+ it. For the record, the above URI has the following components:
+
+
+
+
Scheme
http
+
Userinfo
bob
+
Host
example.com
+
Port
8080
+
Path
/foo.php
+
Query
q=string
+
Fragment
hash
+
+
+
+ Note that there is no question mark or octothorpe in the query or
+ fragment: these get removed during parsing.
+
+
+
+ With this information, you can get straight to implementing your
+ filter() method. But one more thing...
+
+
+
Return value: Boolean, not URI
+
+
+ You may have noticed that the URI is being passed in by reference.
+ This means that whatever changes you make to it, those changes will
+ be reflected in the URI object the callee had. Do not
+ return the URI object: it is unnecessary and will cause bugs.
+ Instead, return a boolean value, true if the filtering was successful,
+ or false if the URI is beyond repair and needs to be axed.
+
+
+
+ Let's suppose I wanted to write a filter that de-internationalized domain
+ names by converting them to Punycode.
+ Assuming that punycode_encode($input) converts $input to
+ Punycode and returns false on failure:
+
+
+
class HTMLPurifier_URIFilter_ConvertIDNToPunycode extends HTMLPurifier_URIFilter
+{
+ var $name = 'ConvertIDNToPunycode';
+ function filter(&$uri, $config, &$context) {
+ if (is_null($uri->host)) return true;
+ if ($uri->host == utf8_decode($uri->host) {
+ // is ASCII, abort
+ return true;
+ }
+ $host = punycode_encode($uri->host);
+ if ($host === false) return false;
+ $uri->host = $host;
+ return true;
+ }
+}
+
+
+ Notice I did not return $uri;.
+
+
+
Activating your filter
+
+
+ Having a filter is all well and good, but you need to tell HTML Purifier
+ to use it. Fortunately, this part's simple:
+
+ If you want to be really fancy, you can define a configuration directive
+ for your filter and have HTML Purifier automatically manage whether or
+ not your filter gets loaded or not (this is how internal filters manage
+ things):
+
+
+
diff --git a/docs/index.html b/docs/index.html
index 437a8bfd..8d295dda 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -40,6 +40,9 @@ information for casual developers using HTML Purifier.