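Assuming you are using UTF-8, the problem is that a single UTF-8 character can occupy 1 to 4 bytes (up to 6 in the original design; RFC 3629 restricts it to 4).
To iterate over the characters you need to work out the size of each one. The following code uses a simple lookup table indexed by the lead byte, but you could also get creative with bit manipulation: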
#include <iostream>
#include <string>
#include <vector>
// return individual utf-8 chars as a vector of strings
std::vector<std::string> utf8_split_chars(std::string const& s)
{
// table to get the size of a utf-8 character
static const char u8char_size[] =
{
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3
, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 0, 0
};
std::vector<std::string> utf8_chars;
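// advance the index i by the size of each utf-8 char
for(std::size_t i = 0; i < s.size();)
{
std::size_t n = u8char_size[(unsigned char)s[i]];
// a size of 0 marks a stray continuation byte or an invalid lead byte;
// emit it on its own so the loop always advances
if(n == 0) n = 1;
// do not read past the end if the last sequence is truncated
if(n > s.size() - i) n = s.size() - i;
utf8_chars.emplace_back(&s[i], n);
i += n;
}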
return utf8_chars;
}
int main()
{
std::string s = u8"建造 otoño κάτω";
std::cout << "s: " << s << " " << s.size() << " bytes" << '\n';
auto chars = utf8_split_chars(s);
for(auto const& c: chars)
std::cout << "c: " << c << '\n';
}
Output:
s: 建造 otoño κάτω 22 bytes
c: 建
c: 造
c:
c: o
c: t
c: o
c: ñ
c: o
c:
c: κ
c: ά
c: τ
c: ω
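
The bit-manipulation alternative mentioned above could look roughly like the sketch below. It derives the sequence length from the high bits of the lead byte instead of looking it up in a table. The names `utf8_char_size` and `utf8_split_chars_bits` are made up for this illustration, and the sketch assumes only the 1 to 4 byte forms allowed by RFC 3629; like the table version, it does not validate continuation bytes or reject overlong encodings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: length of a UTF-8 sequence judged from its lead byte.
// Returns 0 for a continuation byte or a lead byte that is not valid
// under RFC 3629 (the old 5- and 6-byte forms, 0xFE, 0xFF).
inline std::size_t utf8_char_size(unsigned char lead)
{
    if (lead < 0x80) return 1;           // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // 10xxxxxx or 11111xxx
}

// Hypothetical variant of utf8_split_chars built on the helper above.
std::vector<std::string> utf8_split_chars_bits(std::string const& s)
{
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < s.size();)
    {
        std::size_t n = utf8_char_size(static_cast<unsigned char>(s[i]));
        if (n == 0) n = 1;                      // emit an invalid byte on its own
        if (n > s.size() - i) n = s.size() - i; // truncated final sequence
        chars.emplace_back(s, i, n);            // substring of s: n bytes starting at i
        i += n;
    }
    return chars;
}
```

Calling `utf8_split_chars_bits(s)` in place of `utf8_split_chars(s)` in `main` should give the same split for well-formed input. Whether the branches or the 256-entry table is faster depends on the compiler and the data, so measure before choosing one for performance reasons.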