I have a large file, so I wrote a stream filter that strips invalid UTF-8 characters from XML.
class ValidUTF8XMLFilter extends php_user_filter {
    // Group 1 matches one valid UTF-8 character that is allowed in XML;
    // the trailing "." matches any other single byte, which is then dropped.
    protected static $pattern = '/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./x';

    function filter($in, $out, &$consumed, $closing)
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            // Keep only what group 1 captured; everything else is removed.
            $bucket->data = preg_replace(self::$pattern, '$1', $bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}
This filter removes characters that are invalid in XML as well as byte sequences that are invalid in UTF-8. The regex is taken from "Multilingual form encoding". The class is adapted from this answer: "How to skip invalid characters in XML file using PHP". The pattern in that answer won't work for invalid UTF-8 characters, e.g. 0x1D.
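To make the behaviour of the pattern concrete, here is a minimal sketch that applies it to a single string outside of any stream. The sample input is hypothetical and only meant to show which bytes survive.

```php
<?php
// Same pattern as in the filter class: group 1 captures one valid,
// XML-safe UTF-8 character; the trailing "." catches any other single byte.
$pattern = '/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./x';

// Hypothetical sample: a valid two-byte character (C3 A9), the control
// byte 0x1D (not allowed in XML 1.0), and 0xFF (never valid in UTF-8).
$dirty = "<a>caf\xC3\xA9\x1D\xFF</a>";

// Bytes matched only by "." leave group 1 empty, so '$1' deletes them.
$clean = preg_replace($pattern, '$1', $dirty);

var_dump($clean === "<a>caf\xC3\xA9</a>"); // bool(true)
```

The two-byte character is kept, while 0x1D and 0xFF are both removed.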
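Will this filter work correctly in the following case: a multi-byte character is split so that its bytes begin at the end of one buffer and end at the start of the next filter call, which makes each part look invalid on its own? Is that situation even possible?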
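The scenario can be simulated without streams. The sketch below applies the same pattern to one string as a whole and then to the same bytes cut into two hypothetical chunks in the middle of a two-byte character. It does not show whether PHP actually hands the filter buckets split this way; it only shows what the pattern does if that happens.

```php
<?php
// Same pattern as in the filter class.
$pattern = '/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./x';

// The character C3 A9 is cut between its two bytes (assumed split point).
$whole  = "caf\xC3\xA9!";
$chunk1 = "caf\xC3";   // ends with a lead byte that has no continuation
$chunk2 = "\xA9!";     // starts with a stray continuation byte

$filteredWhole  = preg_replace($pattern, '$1', $whole);
$filteredChunks = preg_replace($pattern, '$1', $chunk1)
                . preg_replace($pattern, '$1', $chunk2);

var_dump($filteredWhole === $whole);    // bool(true): the character is kept
var_dump($filteredChunks === "caf!");   // bool(true): both halves were dropped
```

If such a split can occur, one possible approach (an assumption, not something from the original class) would be to hold back an incomplete trailing sequence in a property of the filter object and prepend it to the next bucket's data before applying the pattern.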