mirror of
https://github.com/ezyang/htmlpurifier.git
synced 2025-01-06 22:41:54 +00:00
Extra cleanup on cleanUTF8.
Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
parent 9195cb7a2e
commit 4047a6230b
3
NEWS
@@ -19,6 +19,9 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
 - Deleted some asserts to avoid linters from choking (#97)
 - Rework Serializer cache behavior to avoid chmod'ing if possible (#32)
 - Embedded semicolons in strings in CSS are now handled correctly!
+- We accidentally dropped certain Unicode characters if there was
+  one or more invalid characters. This has been fixed, thanks
+  to mpyw <ryosuke_i_628@yahoo.co.jp>
 # By default, when a link has a target attribute associated
   with it, we now also add rel="noopener" in order to
   prevent the new window from being able to overwrite
@@ -101,6 +101,14 @@ class HTMLPurifier_Encoder
  * It will parse according to UTF-8 and return a valid UTF8 string, with
  * non-SGML codepoints excluded.
  *
+ * Specifically, it will permit:
+ * \x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}
+ * Source: https://www.w3.org/TR/REC-xml/#NT-Char
+ * Arguably this function should be modernized to the HTML5 set
+ * of allowed characters:
+ * https://www.w3.org/TR/html5/syntax.html#preprocessing-the-input-stream
+ * which simultaneously expand and restrict the set of allowed characters.
+ *
  * @param string $str The string to clean
  * @param bool $force_php
  * @return string
@@ -122,15 +130,12 @@ class HTMLPurifier_Encoder
  * function that needs to be able to understand UTF-8 characters.
  * As of right now, only smart lossless character encoding converters
  * would need that, and I'm probably not going to implement them.
- * Once again, PHP 6 should solve all our problems.
  */
 public static function cleanUTF8($str, $force_php = false)
 {
 // UTF-8 validity is checked since PHP 4.3.5
 // This is an optimization: if the string is already valid UTF-8, no
 // need to do PHP stuff. 99% of the time, this will be the case.
-// The regexp matches the XML char production, as well as well as excluding
-// non-SGML codepoints U+007F to U+009F
 if (preg_match(
 '/^[\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du',
 $str
@@ -255,7 +260,8 @@ class HTMLPurifier_Encoder
 // 7F-9F is not strictly prohibited by XML,
 // but it is non-SGML, and thus we don't allow it
 (0xA0 <= $mUcs4 && 0xD7FF >= $mUcs4) ||
-(0xE000 <= $mUcs4 && 0x10FFFF >= $mUcs4)
+(0xE000 <= $mUcs4 && 0xFFFD >= $mUcs4) ||
+(0x10000 <= $mUcs4 && 0x10FFFF >= $mUcs4)
 )
 ) {
 $out .= $char;
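A note on the last hunk: splitting the 0xE000-0x10FFFF range appears to bring the PHP fallback in line with the character class used by the preg_match() fast path, so that U+FFFE and U+FFFF are rejected by both. The sketch below illustrates that set outside the library. The names is_permitted_codepoint(), is_clean_utf8() and PERMITTED_PATTERN are hypothetical and do not exist in HTML Purifier. The tab, newline, carriage return and 0x20-0x7E tests are assumed from the set listed in the docblock, since the hunk does not show those lines.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: a standalone version of
// the per-codepoint range test, using the set listed in the docblock.
function is_permitted_codepoint(int $cp): bool
{
    return $cp === 0x9 || $cp === 0xA || $cp === 0xD
        || (0x20 <= $cp && $cp <= 0x7E)
        // 0x7F-0x9F is skipped: allowed by XML, but non-SGML
        || (0xA0 <= $cp && $cp <= 0xD7FF)
        // stops at 0xFFFD, so U+FFFE and U+FFFF are excluded
        || (0xE000 <= $cp && $cp <= 0xFFFD)
        || (0x10000 <= $cp && $cp <= 0x10FFFF);
}

// Same character class as the preg_match() call shown in the diff.
const PERMITTED_PATTERN =
    '/^[\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du';

// Hypothetical wrapper around the fast path: true only when the string is
// valid UTF-8 and every codepoint is in the permitted set. preg_match()
// returns false on malformed UTF-8 under the /u modifier, so the strict
// comparison treats that case as "not clean".
function is_clean_utf8(string $str): bool
{
    return preg_match(PERMITTED_PATTERN, $str) === 1;
}
```

With these definitions, is_permitted_codepoint(0xFFFD) holds while is_permitted_codepoint(0xFFFE) does not, and is_clean_utf8("a\x7Fb") fails because U+007F falls in the excluded 0x7F-0x9F block.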