htmlpurifier/library/HTMLPurifier/URIParser.php

<?php

/**
 * Parses a URI into the components and fragment identifier as specified
 * by RFC 3986.
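 *
 * Usage sketch (an illustration, not part of the original file). It assumes
 * the HTML Purifier autoloader is already registered and that
 * HTMLPurifier_URI exposes its components as public properties named after
 * its constructor arguments:
 *
 *     $parser = new HTMLPurifier_URIParser();
 *     $uri = $parser->parse('http://user@example.com:8080/path?q=1#frag');
 *     // $uri->scheme   === 'http'
 *     // $uri->userinfo === 'user'
 *     // $uri->host     === 'example.com'
 *     // $uri->port     === 8080
 *     // $uri->path     === '/path'
 *     // $uri->query    === 'q=1'
 *     // $uri->fragment === 'frag'
 *
 * The result is only split into parts; callers still need to validate it.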
 */
class HTMLPurifier_URIParser
{

    /**
     * Instance of HTMLPurifier_PercentEncoder to do normalization with.
     */
    protected $percentEncoder;

    public function __construct()
    {
        $this->percentEncoder = new HTMLPurifier_PercentEncoder();
    }

    /**
     * Parses a URI.
     * @param string $uri URI to parse
     * @return HTMLPurifier_URI|bool Parsed representation of the URI, or
     *         false if the URI could not be matched at all. The representation
     *         has not been validated yet and may not conform to the RFC.
     */
    public function parse($uri)
    {
        $uri = $this->percentEncoder->normalize($uri);

        // Regexp is as per RFC 3986, Appendix B.
        // Note that ["<>] are an addition to the RFC's recommended
        // characters, because they represent external delimiters.
        $r_URI = '!'.
            '(([a-zA-Z0-9\.\+\-]+):)?'. // 2. Scheme
            '(//([^/?#"<>]*))?'. // 4. Authority
            '([^?#"<>]*)'.       // 5. Path
            '(\?([^#"<>]*))?'.   // 7. Query
            '(#([^"<>]*))?'.     // 8. Fragment
            '!';

        $matches = array();
        $result = preg_match($r_URI, $uri, $matches);

        if (!$result) {
            return false; // *really* invalid URI
        }

        // separate out parts
        $scheme     = !empty($matches[1]) ? $matches[2] : null;
        $authority  = !empty($matches[3]) ? $matches[4] : null;
        $path       = $matches[5]; // always present, can be empty
        $query      = !empty($matches[6]) ? $matches[7] : null;
        $fragment   = !empty($matches[8]) ? $matches[9] : null;

        // further parse authority
        if ($authority !== null) {
            $r_authority = "/^((.+?)@)?(\[[^\]]+\]|[^:]*)(:(\d*))?/";
            $matches = array();
            preg_match($r_authority, $authority, $matches);
            $userinfo   = !empty($matches[1]) ? $matches[2] : null;
            $host       = !empty($matches[3]) ? $matches[3] : '';
            $port       = !empty($matches[4]) ? (int) $matches[5] : null;
        } else {
            $port = $host = $userinfo = null;
        }

        return new HTMLPurifier_URI(
            $scheme, $userinfo, $host, $port, $path, $query, $fragment);
    }

}

// vim: et sw=4 sts=4