Unicode 转UTF-8

最新推荐文章于 2024-04-10 10:43:38 发布

ZHOUTING_001

最新推荐文章于 2024-04-10 10:43:38 发布

阅读量220

点赞数

分类专栏： PHP 文章标签： C C++ C# PHP

本文链接：https://blog.csdn.net/ZHOUTING_001/article/details/83948887

版权

PHP 专栏收录该内容

8 篇文章 0 订阅

订阅专栏

基础知识
UTF-8 　　UTF-8以字节为单位对Unicode进行编码。从Unicode到UTF-8的编码方式如下：　　
Unicode编码(16进制)　║　UTF-8 字节流(二进制) 　
000000 - 00007F　║　0xxxxxxx 　　
000080 - 0007FF　║　110xxxxx 10xxxxxx 　　
000800 - 00FFFF　║　1110xxxx 10xxxxxx 10xxxxxx 　
010000 - 10FFFF　║　11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 　
　UTF-8的特点是对不同范围的字符使用不同长度的编码。对于0x00-0x7F之间的字符，UTF-8编码与ASCII编码完全相同。UTF-8编码的最大长度是4个字节。从上表可以看出，4字节模板有21个x，即可以容纳21位二进制数字。Unicode的最大码位0x10FFFF也只有21位。　　例1：“汉”字的Unicode编码是0x6C49。0x6C49在0x0800-0xFFFF之间，使用用3字节模板了：1110xxxx 10xxxxxx 10xxxxxx。将0x6C49写成二进制是：0110 1100 0100 1001，用这个比特流依次代替模板中的x，得到：11100110 10110001 10001001，即E6 B1 89。　　例2：Unicode编码0x20C30在0x010000-0x10FFFF之间，使用用4字节模板了：11110xxx 10xxxxxx 10xxxxxx 10xxxxxx。将0x20C30写成21位二进制数字（不足21位就在前面补0）：0 0010 0000 1100 0011 0000，用这个比特流依次代替模板中的x，得到：11110000 10100000 10110000 10110000，即F0 A0 B0 B0。
参考文章：
http://baike.baidu.com/view/40801.htm

//从Unicode转到UTF8 PHP
function uc2html($str) {

$ret = '';

for( $i=0; $i<strlen($str)/2; $i++ ) {

$charcode = ord($str[$i*2])+256*ord($str[$i*2+1]);//得到每个unicode

if($charcode<127)

$ret .=chr($charcode);//127以前ASCII,UTF-8一样

else

$ret .= $this->u2utf8($charcode);//转换成UTF-8

}

return $ret;

}

function u2utf8($c) {
$str="";
if ($c < 0x80) {
$str.=$c;
} else if ($c < 0x800) {
$str.=chr(0xC0 | $c>>6);
$str.=chr(0x80 | $c & 0x3F);
} else if ($c < 0x10000) {
$str.=chr(0xE0 | $c>>12);
$str.=chr(0x80 | $c>>6 & 0x3F);
$str.=chr(0x80 | $c & 0x3F);
} else if ($c < 0x200000) {
$str.=chr(0xF0 | $c>>18);
$str.=chr(0x80 | $c>>12 & 0x3F);
$str.=chr(0x80 | $c>>6 & 0x3F);
$str.=chr(0x80 | $c & 0x3F);
}
return $str;
}