For background on Unicode and UTF-8, see Ruan Yifeng's "Character Encoding Notes" (《字符编码笔记》).
Unicode to UTF-8 mapping table
Algorithm
Character to convert: '🤩'
Hexadecimal: 0x1F929
Binary: 00000001 11111001 00101001
Tier (number of UTF-8 bytes): 4
UTF-8 format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
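The tier follows from the size of the code point. The sketch below shows one way to look it up; `utf8Length` is a hypothetical helper name, and the ranges assume the modern definition of UTF-8, which stops at U+10FFFF.

```javascript
// Hypothetical helper: how many UTF-8 bytes a code point needs.
// Assumes the modern UTF-8 range (up to U+10FFFF).
function utf8Length(codePoint) {
    if (codePoint < 0 || codePoint > 0x10FFFF) {
        throw new RangeError("not a Unicode code point: " + codePoint);
    }
    if (codePoint <= 0x7F) return 1;   // 0xxxxxxx
    if (codePoint <= 0x7FF) return 2;  // 110xxxxx 10xxxxxx
    if (codePoint <= 0xFFFF) return 3; // 1110xxxx 10xxxxxx 10xxxxxx
    return 4;                          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
}

const cp = "🤩".codePointAt(0);
console.log(cp.toString(16).toUpperCase());    // 1F929
console.log(cp.toString(2).padStart(24, "0")); // 000000011111100100101001
console.log(utf8Length(cp));                   // 4
```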
Target format after the previous step (binary) | Remaining value after the previous step (binary) | Bits taken | Method |
---|---|---|---|
- | - | - | Start |
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 00000001 11111001 00101001 | 101001 | `0b10000000 \| (0x1F929 & 0b00111111)` |
11110xxx 10xxxxxx 10xxxxxx 10101001 | 00000001 11111001 00xxxxxx | 100100 | `0b10000000 \| ((0x1F929 >> 6) & 0b00111111)` |
11110xxx 10xxxxxx 10100100 10101001 | 00000001 1111xxxx xxxxxxxx | 011111 | `0b10000000 \| ((0x1F929 >> 12) & 0b00111111)` |
11110xxx 10011111 10100100 10101001 | 000000xx xxxxxxxx xxxxxxxx | 000 | `0b11110000 \| (0x1F929 >> 18)` |
11110000 10011111 10100100 10101001 | xxxxxxxx xxxxxxxx xxxxxxxx | - | End: applying the methods from bottom to top (leading byte first) yields the bytes in output order, which is the conversion algorithm |
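At each step, the "method" takes the "bits taken" out of the "remaining value after the previous step" and writes them into the "target format after the previous step". This repeats until the remaining value has been fully consumed.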
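The rows of the table can be replayed as a loop. This is a minimal sketch for the 4-byte tier only; `encodeFourByte` is a hypothetical name, and it assumes the input lies between 0x10000 and 0x10FFFF.

```javascript
// Hypothetical helper that replays the table for a 4-byte code point.
// Assumes 0x10000 <= codePoint <= 0x10FFFF.
function encodeFourByte(codePoint) {
    const bytes = [];
    let rest = codePoint; // the "remaining value" column
    for (let i = 0; i < 3; i++) {
        // Take the low 6 bits and put them behind the fixed "10" prefix.
        bytes.unshift(0b10000000 | (rest & 0b00111111));
        rest >>= 6; // drop the bits that were just consumed
    }
    // The last 3 bits go behind the "11110" prefix of the leading byte.
    bytes.unshift(0b11110000 | rest);
    return bytes;
}

const bytes = encodeFourByte(0x1F929);
console.log(bytes.map((b) => b.toString(2).padStart(8, "0")).join(" "));
// 11110000 10011111 10100100 10101001
```

Because each byte is inserted at the front with `unshift`, the array ends up in output order even though the bits are consumed from the low end first.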
Method analysis
An analysis of the methods in the table above, taking this one as an example:
0b10000000 | (0x1F929 & 0b00111111)
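- The expression splits at the OR operator (`|`) into two parts.
- Left part: `0b10000000` supplies the fixed format bits of the output byte (the `10` prefix of a continuation byte).
- Right part: `0x1F929 & 0b00111111` extracts the concrete bit values listed in the "bits taken" column.
- 0x1F929 & 0b00111111 = 00000001 11111001 00101001 & 00111111 = 00000000 00000000 00101001 = 101001
- left | right = 10000000 | 00101001 = 10101001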
So each method computes which bits of the code point fill in one byte of the UTF-8 output.
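The left/right split can be checked directly. In this sketch, `continuationByte` and `toBin` are hypothetical helper names introduced only for illustration.

```javascript
// Hypothetical helper: build one continuation byte (10xxxxxx) from the
// 6 bits of codePoint that start `shift` bits above the lowest bit.
function continuationByte(codePoint, shift) {
    const right = (codePoint >> shift) & 0b00111111; // right part: 6 payload bits
    return 0b10000000 | right;                       // left part: fixed "10" prefix
}

// Format a byte as an 8-digit binary string.
const toBin = (n) => n.toString(2).padStart(8, "0");

console.log(toBin(0x1F929 & 0b00111111));          // 00101001
console.log(toBin(continuationByte(0x1F929, 0)));  // 10101001
console.log(toBin(continuationByte(0x1F929, 6)));  // 10100100
console.log(toBin(continuationByte(0x1F929, 12))); // 10011111
```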
JS code example
/**
 * Encode a string as UTF-8. The result is a "byte string": each
 * character (0x00-0xFF) stands for one UTF-8 byte.
 * @param {string} unicode
 * @returns {string}
 */
function unicodeToUtf8(unicode) {
    if(!unicode)
    {
        return unicode;
    }
    let ret = "", char;
    // Iterate by code point rather than by UTF-16 code unit: charCodeAt()
    // never returns more than 0xFFFF, so '🤩' would arrive as two surrogates.
    for(const ch of unicode)
    {
        char = ch.codePointAt(0);
        if(0x00 <= char && char <= 0x7F)
        {
            // ASCII: 0xxxxxxx
            ret += ch;
        }
        else if(0x80 <= char && char <= 0x7FF)
        {
            // 110xxxxx 10xxxxxx
            ret += String.fromCharCode(0b11000000 | (char >> 6));
            ret += String.fromCharCode(0b10000000 | (char & 0b00111111));
        }
        else if(0x0800 <= char && char <= 0xFFFF)
        {
            // 1110xxxx 10xxxxxx 10xxxxxx
            ret += String.fromCharCode(0b11100000 | (char >> 12));
            ret += String.fromCharCode(0b10000000 | ((char >> 6) & 0b00111111));
            ret += String.fromCharCode(0b10000000 | (char & 0b00111111));
        }
        else if(0x010000 <= char && char <= 0x1FFFFF)
        {
            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (Unicode itself ends at 0x10FFFF)
            ret += String.fromCharCode(0b11110000 | (char >> 18));
            ret += String.fromCharCode(0b10000000 | ((char >> 12) & 0b00111111));
            ret += String.fromCharCode(0b10000000 | ((char >> 6) & 0b00111111));
            ret += String.fromCharCode(0b10000000 | (char & 0b00111111));
        }
        else if(0x200000 <= char && char <= 0x3FFFFFF)
        {
            // Legacy 5-byte form. Never reached here, because code points
            // read from a JS string stop at 0x10FFFF.
            ret += String.fromCharCode(0b11111000 | (char >> 24));
            ret += String.fromCharCode(0b10000000 | ((char >> 18) & 0b00111111));
            ret += String.fromCharCode(0b10000000 | ((char >> 12) & 0b00111111));
            ret += String.fromCharCode(0b10000000 | ((char >> 6) & 0b00111111));
            ret += String.fromCharCode(0b10000000 | (char & 0b00111111));
        }
        else if(0x4000000 <= char && char <= 0x7FFFFFFF)
        {
            // Legacy 6-byte form, likewise never reached.
            ret += String.fromCharCode(0b11111100 | (char >> 30));
            ret += String.fromCharCode(0b10000000 | ((char >> 24) & 0b00111111));
            ret += String.fromCharCode(0b10000000 | ((char >> 18) & 0b00111111));
            ret += String.fromCharCode(0b10000000 | ((char >> 12) & 0b00111111));
            ret += String.fromCharCode(0b10000000 | ((char >> 6) & 0b00111111));
            ret += String.fromCharCode(0b10000000 | (char & 0b00111111));
        }
    }
    return ret;
}
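One way to check a hand-written encoder is to compare its output with the platform's own UTF-8 encoder. The sketch below assumes an environment that provides the standard `TextEncoder` global (modern browsers and Node.js); `byteStringToHex` and `referenceUtf8Hex` are hypothetical helper names.

```javascript
// Format one byte as two uppercase hex digits.
const hex = (b) => b.toString(16).toUpperCase().padStart(2, "0");

// Hypothetical helper: hex dump of a "byte string", in which every
// character is assumed to stand for one byte (0x00-0xFF).
function byteStringToHex(byteString) {
    return Array.from(byteString, (c) => hex(c.charCodeAt(0))).join(" ");
}

// Hypothetical helper: hex dump of the UTF-8 bytes produced by the
// platform's TextEncoder, used as the reference result.
function referenceUtf8Hex(text) {
    return Array.from(new TextEncoder().encode(text), hex).join(" ");
}

console.log(referenceUtf8Hex("🤩"));               // F0 9F A4 A9
console.log(referenceUtf8Hex("严"));               // E4 B8 A5
console.log(byteStringToHex("\xF0\x9F\xA4\xA9")); // F0 9F A4 A9
```

Passing the return value of an encoder such as `unicodeToUtf8` through `byteStringToHex` and comparing it with `referenceUtf8Hex` for the same input is a quick consistency check.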
Done!
August 13, 2024: added the method analysis.