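I have this code that decodes numeric HTML entities into their UTF-8 equivalent characters.
I am trying to convert this character: &#146;
which should output: ’
However, it simply disappears (no output). (I have checked the page source, and the page has the correct UTF-8 charset header/meta tag.)
Does anyone know what is wrong with the code?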
function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8") {
    $string = html_entity_decode($string, $quote_style, $charset);
    $string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
    $string = preg_replace('~&#([0-9]+);~e', 'chr_utf8("\\1")', $string);
    //this is another method, which also doesn't work..
    //$string = preg_replace_callback("/(&#[0-9]+;)/", "entity_decode_callback", $string);
    return $string;
}

function chr_utf8_callback($matches) {
    return chr_utf8(hexdec($matches[1]));
}

function chr_utf8($num) {
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}

function entity_decode_callback($m) {
    return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}

echo '=' . entity_decode('&#146;');
Solution:
html_entity_decode already does what you want:
$string = '&#146;';
echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');
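It will return the character:
’ binary hex: c292
That is the C1 control character PRIVATE USE TWO (U+0092). Since it is a private-use control code, your PHP configuration/version/build may not return it at all.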
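To see where the bytes c292 come from, here is a minimal sketch of the two-byte UTF-8 encoding step, using the same arithmetic as the second branch of chr_utf8 in the question. The function name utf8_two_byte is hypothetical and only covers code points from 128 to 2047.

```php
<?php
// Hypothetical helper (not from the original post): encodes a code point
// in the range 0x80..0x7FF as two UTF-8 bytes.
function utf8_two_byte(int $codePoint): string
{
    if ($codePoint < 0x80 || $codePoint > 0x7FF) {
        throw new InvalidArgumentException('Only code points 0x80..0x7FF are handled here.');
    }
    // Lead byte: 110xxxxx carries the top 5 bits.
    $lead = chr(($codePoint >> 6) + 0xC0);
    // Continuation byte: 10xxxxxx carries the low 6 bits.
    $continuation = chr(($codePoint & 0x3F) + 0x80);
    return $lead . $continuation;
}

// 146 decimal is 0x92, the code point named by &#146;
echo bin2hex(utf8_two_byte(146)), "\n"; // c292
```

So a decoder that follows Unicode literally turns &#146; into the invisible control character U+0092, not into a visible quotation mark.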
There are some more quirks:
But in HTML (other than XHTML, which uses XML rules), it’s a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
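If the goal is to get the character a browser would show, one possible approach is to remap numeric references in that range to their cp1252 meaning before handing the rest to html_entity_decode. The sketch below assumes PHP 7 or later (for the "\u{...}" string escapes); the function name decode_entities_like_browser is hypothetical, and the table lists the Windows-1252 overrides as I understand them from the HTML5 specification.

```php
<?php
// Hypothetical helper (not from the original post): treats &#128;..&#159;
// as Windows-1252 bytes, as browsers do, then decodes everything else.
function decode_entities_like_browser(string $html): string
{
    // Code points 129, 141, 143, 144 and 157 have no cp1252 mapping
    // and are left untouched.
    $cp1252 = [
        128 => "\u{20AC}", 130 => "\u{201A}", 131 => "\u{0192}", 132 => "\u{201E}",
        133 => "\u{2026}", 134 => "\u{2020}", 135 => "\u{2021}", 136 => "\u{02C6}",
        137 => "\u{2030}", 138 => "\u{0160}", 139 => "\u{2039}", 140 => "\u{0152}",
        142 => "\u{017D}", 145 => "\u{2018}", 146 => "\u{2019}", 147 => "\u{201C}",
        148 => "\u{201D}", 149 => "\u{2022}", 150 => "\u{2013}", 151 => "\u{2014}",
        152 => "\u{02DC}", 153 => "\u{2122}", 154 => "\u{0161}", 155 => "\u{203A}",
        156 => "\u{0153}", 158 => "\u{017E}", 159 => "\u{0178}",
    ];

    // Match hexadecimal (&#x92;) and decimal (&#146;) numeric references.
    $html = preg_replace_callback(
        '/&#(?:x([0-9a-f]+)|([0-9]+));/i',
        function (array $m) use ($cp1252) {
            // Group 1 is non-empty only for the hexadecimal form.
            $code = $m[1] !== '' ? (int) hexdec($m[1]) : (int) $m[2];
            // Replace only the remapped range; leave other references as they are.
            return $cp1252[$code] ?? $m[0];
        },
        $html
    );

    // Named entities and the remaining numeric references.
    return html_entity_decode($html, ENT_QUOTES, 'UTF-8');
}

echo decode_entities_like_browser('It&#146;s'), "\n"; // It’s
```

Doing the remap first means the result does not depend on how a particular PHP version treats code points 128 to 159 inside html_entity_decode.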
Tags: html, php, character-encoding, utf-8
Source: https://codeday.me/bug/20190902/1790609.html