How to write UTF-8 in HTML: how do I convert HTML character references (e.g. &#x5E3; for ף) to regular UTF-8?

Those are character references, which refer to a character in ISO 10646 by specifying that character's code point in decimal (&#n;) or hexadecimal (&#xn;) notation.

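You can use html_entity_decode, which decodes such character references as well as the entity references for the entities defined for HTML 4, so other references such as &lt;, &gt; and &amp; will be decoded too:

$str = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');

If you only want to decode the numeric character references, you can use this: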

function html_dereference($match) {
    // $match[1] is the part between "&#" and ";"
    if (strtolower($match[1][0]) === 'x') {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    // pack('N') yields a 32-bit big-endian integer, i.e. one UTF-32BE code unit
    return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
}

$str = preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', 'html_dereference', $str);
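
To make the difference between the two approaches concrete, here is a small sketch that runs both on the same input. The helper name decode_numeric_refs and the sample string are made up for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: the same idea as html_dereference, wrapped in a closure.
function decode_numeric_refs($str) {
    return preg_replace_callback(
        '/&#(x[0-9a-f]+|[0-9]+);/i',
        function ($match) {
            // Hexadecimal references start with "x" or "X"; the rest are decimal.
            $codepoint = (strtolower($match[1][0]) === 'x')
                ? intval(substr($match[1], 1), 16)
                : intval($match[1], 10);
            // One 32-bit big-endian integer is one UTF-32BE code unit.
            return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
        },
        $str
    );
}

$input = '&#x5E3; &#1507; &lt;b&gt;'; // made-up sample input

// Only the numeric references are resolved; &lt; and &gt; stay escaped.
$numericOnly = decode_numeric_refs($input);

// Numeric and named references are both resolved.
$everything = html_entity_decode($input, ENT_NOQUOTES, 'UTF-8');

echo $numericOnly, "\n";
echo $everything, "\n";
```

The regex-based variant is the safer choice when the result is written back into HTML, because the markup-significant characters remain escaped.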

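As YuriKolovsky and thirtydot have pointed out in another question, browser vendors seem to have 'silently' agreed on a character reference mapping that differs from the specification and is largely undocumented.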

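Some character references that would normally map onto the Latin-1 Supplement block are actually mapped onto different characters. The reason is that browsers resolve these numbers as if they were Windows-1252 bytes rather than ISO 8859-1 ones, even though ISO 8859-1 is the character set that Unicode is built on. Jukka Korpela wrote an extensive article on this topic.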
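
The mismatch can be reproduced with mb_convert_encoding: the value 0x96 (decimal 150) is a C1 control character when read as ISO 8859-1, but an en dash when read as Windows-1252. This is a minimal sketch that assumes the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
$byte = "\x96"; // the value that a reference such as &#150; is usually meant to denote

// Read as ISO 8859-1: code point U+0096, a C1 control character.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// Read as Windows-1252: code point U+2013, EN DASH.
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c296
echo bin2hex($asCp1252), "\n"; // e28093
```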

Here is an extension of the function above that handles this quirk:

function html_character_reference_decode($string, $encoding='UTF-8', $fixMappingBug=true) {
    $deref = function($match) use ($encoding, $fixMappingBug) {
        if (strtolower($match[1][0]) === "x") {
            $codepoint = intval(substr($match[1], 1), 16);
        } else {
            $codepoint = intval($match[1], 10);
        }
        // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
        // Remap 130..159 the way Windows-1252 assigns them; index 0 is code point 130.
        if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
            $mapping = array(
                8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
                338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
                8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
            $codepoint = $mapping[$codepoint-130];
        }
        return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
    };
    return preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', $deref, $string);
}

If anonymous functions are not available (they were introduced in PHP 5.3.0), the callback can be built with create_function instead. Note that create_function was deprecated in PHP 7.2.0 and removed in PHP 8.0.0, so this variant only applies to old installations:

$deref = create_function('$match', '
    $encoding = '.var_export($encoding, true).';
    $fixMappingBug = '.var_export($fixMappingBug, true).';
    if (strtolower($match[1][0]) === "x") {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
    if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
        $mapping = array(
            8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
            338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
            8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
        $codepoint = $mapping[$codepoint-130];
    }
    return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
');

Here is another function that tries to follow the HTML 5 behavior:

function html5_decode($string, $flags=ENT_COMPAT, $charset='UTF-8') {
    $deref = function($match) use ($flags, $charset) {
        if ($match[1][0] === '#') {
            // Numeric reference: "#x..." is hexadecimal, "#..." is decimal
            if (strtolower($match[1][1]) === 'x') {
                $codepoint = intval(substr($match[1], 2), 16);
            } else {
                $codepoint = intval(substr($match[1], 1), 10);
            }
            // HTML 5 specific behavior
            // @see http://dev.w3.org/html5/spec/tokenization.html#tokenizing-character-references
            // handle Windows-1252 mismapping
            // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
            // @see http://dev.w3.org/html5/spec/tokenization.html#table-charref-overrides
            $overrides = array(
                0x00=>0xFFFD,0x80=>0x20AC,0x82=>0x201A,0x83=>0x0192,0x84=>0x201E,
                0x85=>0x2026,0x86=>0x2020,0x87=>0x2021,0x88=>0x02C6,0x89=>0x2030,
                0x8A=>0x0160,0x8B=>0x2039,0x8C=>0x0152,0x8E=>0x017D,0x91=>0x2018,
                0x92=>0x2019,0x93=>0x201C,0x94=>0x201D,0x95=>0x2022,0x96=>0x2013,
                0x97=>0x2014,0x98=>0x02DC,0x99=>0x2122,0x9A=>0x0161,0x9B=>0x203A,
                0x9C=>0x0153,0x9E=>0x017E,0x9F=>0x0178);
            if (isset($overrides[$codepoint])) {
                $codepoint = $overrides[$codepoint];
            }
            // Surrogates and values beyond U+10FFFF become U+FFFD
            if (($codepoint >= 0xD800 && $codepoint <= 0xDFFF) || $codepoint > 0x10FFFF) {
                $codepoint = 0xFFFD;
            }
            // Control characters and noncharacters: the specification only flags these
            // as parse errors and still emits the character; replacing them with U+FFFD
            // is a stricter choice made by this function.
            if (($codepoint >= 0x0001 && $codepoint <= 0x0008) ||
                ($codepoint >= 0x000E && $codepoint <= 0x001F) ||
                ($codepoint >= 0x007F && $codepoint <= 0x009F) ||
                ($codepoint >= 0xFDD0 && $codepoint <= 0xFDEF) ||
                in_array($codepoint, array(
                    0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
                    0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
                    0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
                    0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
                    0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, 0x10FFFF))) {
                $codepoint = 0xFFFD;
            }
            return mb_convert_encoding(pack("N", $codepoint), $charset, "UTF-32BE");
        } else {
            // Named reference: let PHP resolve it
            return html_entity_decode($match[0], $flags, $charset);
        }
    };
    return preg_replace_callback('/&(#(?:x[0-9a-f]+|[0-9]+)|[A-Za-z0-9]+);/i', $deref, $string);
}

I have also noticed that in PHP 5.4.0 the html_entity_decode function gained another flag, ENT_HTML5, for HTML 5 behavior.
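
As a rough illustration of what the flag changes, the sketch below decodes &apos;, a named reference that HTML 4.01 does not define, under both document types. It assumes PHP 5.4.0 or later; the sample input and variable names are made up for this example.

```php
<?php
$input = '&apos;'; // defined for HTML 5 (and XML), but not for HTML 4.01

// Under the HTML 4.01 table the reference is unknown and is left untouched.
$html4 = html_entity_decode($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Under the HTML 5 table it resolves to a single quote.
$html5 = html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($html4 === '&apos;');
var_dump($html5 === "'");
```

Whether ENT_HTML5 also reproduces the Windows-1252 overrides for numeric references is not covered here, so the hand-written html5_decode may still be needed for that part.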
