Converting a UTF-8 string to a Unicode string

matrixtrank

Article link: https://blog.csdn.net/matrixtrank/article/details/80608784

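Introduction to UTF-8 encoding

UTF-8 is a variable-length encoding: each code point takes one to four bytes. Code points from 0 to 0x7F are stored in a single byte, while most Chinese characters (those in the range U+0800 to U+FFFF) take three bytes.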

Number of bytes	Bits for code point	First code point	Last code point	Byte1	Byte2	Byte3	Byte4
1	7	U+0000	U+007F	0xxxxxxx
2	11	U+0080	U+07FF	110xxxxx	10xxxxxx
3	16	U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
4	21	U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
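
As a minimal sketch of how the ranges in the table map to byte counts, the helper below returns the number of bytes UTF-8 needs for a given code point. The name `utf8_length` is hypothetical and does not appear in the original code.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the original post): number of bytes
// UTF-8 needs for a code point, following the ranges in the table.
// Returns 0 for values beyond U+10FFFF, which UTF-8 cannot encode.
int utf8_length(std::uint32_t code_point)
{
    if (code_point <= 0x7F)     return 1;  // 0xxxxxxx
    if (code_point <= 0x7FF)    return 2;  // 110xxxxx 10xxxxxx
    if (code_point <= 0xFFFF)   return 3;  // 1110xxxx 10xxxxxx 10xxxxxx
    if (code_point <= 0x10FFFF) return 4;  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return 0;
}
```

For instance, `utf8_length(0x4E2D)` (the character 中) yields 3, which matches the three-byte row of the table.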

Below are encoding examples, all taken from Wikipedia.
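
To make the bit layouts concrete, here is a small sketch that encodes a single code point by hand. The function name `encode_utf8` is an assumption made for illustration; it only applies the byte patterns from the table and does not filter out surrogate code points.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from the original post): encode one code point
// as UTF-8 using the bit layouts in the table. Returns an empty string for
// values above U+10FFFF. Surrogates (U+D800..U+DFFF) are not filtered out.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return out;
}
```

With this sketch, U+00A3 (£) becomes the bytes C2 A3, U+20AC (€) becomes E2 82 AC, U+4E2D (中) becomes E4 B8 AD, and U+10348 becomes F0 90 8D 88.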

The following code converts a UTF-8 multi-byte string into a Unicode string. If you repost it, please credit the source.

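#include <cstdint>
#include <string>

// Count the leading 1 bits of a UTF-8 byte, capped at 4:
// 0 = ASCII, 1 = continuation byte, 2..4 = lead byte of a 2..4-byte sequence.
static int z_pos(uint8_t x)
{
    for (int i = 0; i < 5; i++, x <<= 1) {
        if ( (x & 0x80) == 0 )
            return i;
    }

    return 4;
}

// Convert a UTF-8 string to a wstring. Malformed input is not rejected:
// a stray continuation byte is passed through as its low 6 bits, and a
// truncated sequence is decoded from the bytes that remain.
std::wstring utf8_to_wstring(const std::string& str)
{
    std::wstring loc;
    // Payload mask of the first byte, indexed by the number of leading 1 bits.
    const uint8_t mask[5] = { 0x7f, 0x3f, 0x1f, 0x0f, 0x07 };

    for (size_t i = 0; i < str.length();) {
        int byte_cnt = z_pos(static_cast<uint8_t>(str[i]));
        // A 4-byte sequence carries 21 bits, so a 16-bit accumulator would overflow.
        uint32_t sum = static_cast<uint8_t>(str[i]) & mask[byte_cnt];

        // Each continuation byte contributes its low 6 bits; stop at the end of the input.
        for (size_t j = 1; j < static_cast<size_t>(byte_cnt) && i + j < str.length(); j++) {
            sum <<= 6;
            sum |= static_cast<uint8_t>(str[i + j]) & mask[1];
        }

        i += byte_cnt ? byte_cnt : 1;

        if (sizeof(wchar_t) == 2 && sum > 0xFFFF) {
            // 16-bit wchar_t (as on Windows): emit a UTF-16 surrogate pair.
            sum -= 0x10000;
            loc.push_back(static_cast<wchar_t>(0xD800 + (sum >> 10)));
            loc.push_back(static_cast<wchar_t>(0xDC00 + (sum & 0x3FF)));
        } else {
            loc.push_back(static_cast<wchar_t>(sum));
        }
    }

    return loc;
}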
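
The conversion function above does not validate its input, so malformed bytes are decoded into arbitrary characters. If rejecting bad input matters, a stricter decoder could look roughly like the sketch below. The name `utf8_to_u32string_strict`, the choice of `std::u32string` as the output type, and the use of `std::invalid_argument` are all assumptions made for this illustration, not part of the original code.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stricter decoder (not from the original post): decodes UTF-8
// into UTF-32 and throws std::invalid_argument on malformed input.
std::u32string utf8_to_u32string_strict(const std::string& str)
{
    std::u32string out;

    for (std::size_t i = 0; i < str.size();) {
        const std::uint8_t lead = static_cast<std::uint8_t>(str[i]);
        std::size_t len;
        std::uint32_t cp;
        std::uint32_t min_cp;  // smallest code point allowed for this length

        if (lead < 0x80)                { len = 1; cp = lead;        min_cp = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (len > str.size() - i)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t j = 1; j < len; j++) {
            const std::uint8_t cont = static_cast<std::uint8_t>(str[i + j]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }

        // Reject overlong encodings, UTF-16 surrogates and values past U+10FFFF.
        if (cp < min_cp || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            throw std::invalid_argument("invalid UTF-8 code point");

        out.push_back(static_cast<char32_t>(cp));
        i += len;
    }

    return out;
}
```

Decoding into `char32_t` sidesteps the question of whether `wchar_t` is 16 or 32 bits wide on the target platform. A caller that needs a `std::wstring` would still have to convert the result, producing surrogate pairs where `wchar_t` is 16 bits.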