C++11 Standard Template Library (STL) localization library - facet categories - converting between character encodings, including UTF-8, UTF-16, UTF-32 (Part 7)

Latest recommended article published on 2024-06-21 13:59:32

繁星璀璨G


Article link: https://blog.csdn.net/qq_40788199/article/details/137748955

Localization library

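The locale facilities provide internationalization support for character classification and string collation, for numeric, monetary, and date/time formatting and parsing, and for message retrieval. Locale settings control the behavior of stream I/O, the regular expression library, and other components of the C++ standard library.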

Facet categories

Converting between character encodings, including UTF-8, UTF-16, UTF-32

std::codecvt

template<
    class InternT,
    class ExternT,
    class State
> class codecvt;

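The class std::codecvt encapsulates conversion of character strings, including wide and multibyte, from one encoding to another. All file I/O operations performed through std::basic_fstream<CharT> use the std::codecvt<CharT, char, std::mbstate_t> facet of the locale imbued in the stream.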
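As a rough illustration of the paragraph above, the facet a file stream relies on can be looked up through the locale imbued in that stream. The helper name stream_has_codecvt is made up for this sketch; it only inspects the locale and never opens a file.

```cpp
#include <cwchar>    // std::mbstate_t
#include <fstream>
#include <locale>

// Hypothetical helper (not part of the standard library): reports whether the
// locale imbued in a wide file stream provides the codecvt facet that the
// stream uses for its conversions.
bool stream_has_codecvt()
{
    std::wfstream file;               // never opened; only its locale is inspected
    std::locale loc = file.getloc();  // the locale currently imbued in the stream
    return std::has_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(loc);
}
```

Every locale object is expected to carry the standard facets, so the helper should return true for a default-constructed stream.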

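Inheritance diagram (image not preserved): std::codecvt derives from std::codecvt_base and std::locale::facet.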

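The standard library provides the following standalone (locale-independent) specializations:

Defined in header `<locale>`
std::codecvt<char, char, std::mbstate_t>	identity conversion
std::codecvt<char16_t, char, std::mbstate_t>	conversion between UTF-16 and UTF-8 (since C++11)(deprecated in C++20)
std::codecvt<char16_t, char8_t, std::mbstate_t>	conversion between UTF-16 and UTF-8 (since C++20)
std::codecvt<char32_t, char, std::mbstate_t>	conversion between UTF-32 and UTF-8 (since C++11)(deprecated in C++20)
std::codecvt<char32_t, char8_t, std::mbstate_t>	conversion between UTF-32 and UTF-8 (since C++20)
std::codecvt<wchar_t, char, std::mbstate_t>	conversion between the system's native wide and single-byte narrow character sets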

In addition, every locale object constructed in a C++ program implements its own (locale-specific) versions of these specializations.
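To make the "identity conversion" entry above concrete, the sketch below queries the char-to-char specialization from the classic "C" locale. The function name identity_never_converts is an assumption introduced only for this example.

```cpp
#include <cwchar>    // std::mbstate_t
#include <locale>

// Hypothetical helper: asks the identity specialization whether it ever
// performs a real conversion. always_noconv() returns true when in() and
// out() would report noconv for all valid arguments.
bool identity_never_converts()
{
    using Facet = std::codecvt<char, char, std::mbstate_t>;
    const Facet& cvt = std::use_facet<Facet>(std::locale::classic());
    return cvt.always_noconv();
}
```

For std::codecvt<char, char, std::mbstate_t> the standard requires always_noconv() to return true; the other specializations make no such promise.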

Member types

Member type	Definition
`intern_type`	`InternT`
`extern_type`	`ExternT`
`state_type`	`State`
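The member types in the table can be checked at compile time. The following is a minimal sketch using the wchar_t/char specialization; the alias Facet and the helper member_types_match are names chosen for this example only.

```cpp
#include <cwchar>       // std::mbstate_t
#include <locale>
#include <type_traits>

using Facet = std::codecvt<wchar_t, char, std::mbstate_t>;

// Each member type mirrors the corresponding template argument.
static_assert(std::is_same<Facet::intern_type, wchar_t>::value,
              "intern_type is InternT");
static_assert(std::is_same<Facet::extern_type, char>::value,
              "extern_type is ExternT");
static_assert(std::is_same<Facet::state_type, std::mbstate_t>::value,
              "state_type is State");

// Hypothetical helper so that the same facts can also be queried at run time.
constexpr bool member_types_match()
{
    return std::is_same<Facet::intern_type, wchar_t>::value
        && std::is_same<Facet::extern_type, char>::value
        && std::is_same<Facet::state_type, std::mbstate_t>::value;
}
```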

Invoking do_length and computing the length of the ExternT string that would be consumed by conversion into a given InternT buffer

std::codecvt<InternT,ExternT,State>::length, 
std::codecvt<InternT,ExternT,State>::do_length

public: int length( State& state, const ExternT* from, const ExternT* from_end, std::size_t max ) const;	(1)
protected: virtual int do_length( State& state, const ExternT* from, const ExternT* from_end, std::size_t max ) const;	(2)

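1) Public member function, calls the member function do_length of the most derived class.

2) Given the initial conversion state state, attempts to convert ExternT characters from the character array defined by [from, from_end) into at most max InternT characters, and returns the number of ExternT characters that such a conversion would consume. Modifies state as if by executing do_in(state, from, from_end, from, to, to+max, to) for some imaginary [to, to+max) output buffer.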
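Item 1) can be illustrated with a small derived facet whose do_length override records that it was reached through the public length(). This is only a sketch: counting_codecvt and calls_after_one_length are hypothetical names, and the override simply defers to the base class.

```cpp
#include <cstddef>
#include <cwchar>    // std::mbstate_t
#include <locale>

// Hypothetical facet for illustration: counts how often do_length is reached.
struct counting_codecvt : std::codecvt<char, char, std::mbstate_t>
{
    mutable int calls = 0;

protected:
    int do_length(std::mbstate_t& state, const char* from, const char* from_end,
                  std::size_t max) const override
    {
        ++calls;
        // Defer to the base class for the actual result.
        return std::codecvt<char, char, std::mbstate_t>::do_length(
            state, from, from_end, max);
    }
};

// Calls the public length() through a base-class reference and reports how
// many times the derived do_length was invoked.
int calls_after_one_length()
{
    counting_codecvt cvt;
    const std::codecvt<char, char, std::mbstate_t>& base = cvt;
    std::mbstate_t state = std::mbstate_t();
    const char text[] = "abc";
    base.length(state, text, text + 3, 3);
    return cvt.calls;
}
```

Because length() forwards to the virtual do_length, the call made through the base reference still lands in the derived override, so the counter ends at 1.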

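Return value

The number of ExternT characters that would be consumed if the input were converted by do_in() until all from_end-from characters were consumed, or max InternT characters were produced, or a conversion error occurred.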
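The relationship between length() and in() described here can be checked directly. The sketch below assumes plain ASCII input and the classic "C" locale, where each char maps to one wchar_t; length_matches_in is a hypothetical helper name, and max is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <cwchar>    // std::mbstate_t
#include <locale>
#include <vector>

// Hypothetical helper: returns true if length() predicts the same number of
// consumed chars as a real in() conversion into a buffer of max wchar_t.
bool length_matches_in(const char* from, const char* from_end, std::size_t max)
{
    using Facet = std::codecvt<wchar_t, char, std::mbstate_t>;
    const Facet& cvt = std::use_facet<Facet>(std::locale::classic());

    // Prediction: no output buffer is needed.
    std::mbstate_t len_state = std::mbstate_t();
    int predicted = cvt.length(len_state, from, from_end, max);

    // Actual conversion into a buffer that holds at most max wide characters.
    std::mbstate_t in_state = std::mbstate_t();
    std::vector<wchar_t> to(max);
    const char* from_next = from;
    wchar_t* to_next = to.data();
    cvt.in(in_state, from, from_end, from_next,
           to.data(), to.data() + max, to_next);

    return predicted == from_next - from;
}
```

A call such as length_matches_in(text, text + 5, 2) on the array "hello" compares the predicted count with the number of chars that in() actually consumed.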

The non-converting specialization std::codecvt<char, char, std::mbstate_t> returns std::min(max, from_end-from).
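The std::min rule for the non-converting specialization is easy to observe, because that specialization does not depend on the system's locale data. A minimal sketch, with identity_length as a made-up helper name:

```cpp
#include <cstddef>
#include <cwchar>    // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical helper: asks the identity facet how many chars of text would
// be consumed to produce at most max chars.
int identity_length(const std::string& text, std::size_t max)
{
    using Facet = std::codecvt<char, char, std::mbstate_t>;
    const Facet& cvt = std::use_facet<Facet>(std::locale::classic());
    std::mbstate_t state = std::mbstate_t();
    const char* from = text.data();
    return cvt.length(state, from, from + text.size(), max);
}
```

With the five-character input "hello", identity_length("hello", 3) yields 3 and identity_length("hello", 10) yields 5, the smaller of max and the input length in each case.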

Usage example (Linux)

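#include <cwchar>    // std::mbstate_t
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Narrow multibyte string: z (1 byte), U+00DF (2 bytes), U+6C34 (3 bytes)
    // and U+1D10B (4 bytes) when the execution character set is UTF-8
    std::string s = "z\u00df\u6c34\U0001d10b";
    std::mbstate_t mb = std::mbstate_t();
    // std::locale throws std::runtime_error if "en_US.utf8" is not installed
    std::cout << "Only the first " <<
              std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(
                  std::locale("en_US.utf8")
              ).length(mb, &s[0], &s[s.size()], 2)
              << " bytes out of " << s.size() << " would be consumed "
              " to produce the first 2 characters" << std::endl;

    return 0;
}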

Output

Only the first 3 bytes out of 10 would be consumed  to produce the first 2 characters
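The figure of 3 bytes follows from the UTF-8 encoding itself: z takes 1 byte and U+00DF takes 2. As a sketch that avoids relying on an installed system locale, the same counts can be reproduced with the UTF-32/UTF-8 specialization listed earlier (deprecated in C++20, so a compiler may warn about it). The helper utf8_bytes_for and the hand-written byte string are assumptions of this example.

```cpp
#include <cstddef>
#include <cwchar>    // std::mbstate_t
#include <locale>

// Hypothetical helper: number of UTF-8 bytes that would be consumed to
// produce at most max UTF-32 code points from the sample text.
int utf8_bytes_for(std::size_t max)
{
    // z, U+00DF, U+6C34 and U+1D10B spelled out as UTF-8 bytes (1 + 2 + 3 + 4)
    static const char text[] = "z\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b";
    const std::size_t size = sizeof(text) - 1;  // exclude the terminating null

    using Facet = std::codecvt<char32_t, char, std::mbstate_t>;
    const Facet& cvt = std::use_facet<Facet>(std::locale::classic());
    std::mbstate_t state = std::mbstate_t();
    return cvt.length(state, text, text + size, max);
}
```

Here utf8_bytes_for(2) corresponds to the 3 bytes reported above, and utf8_bytes_for(4) covers all 10 bytes of the string.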

Usage example (Windows)

#include <clocale>   // setlocale
#include <cstdlib>   // wcstombs
#include <cwchar>    // std::mbstate_t
#include <iostream>
#include <locale>
#include <string>
#include <vector>
#include <Windows.h>

// Names of the locales reported by EnumSystemLocalesEx
std::vector<std::wstring> locals;

// Callback invoked once per enumerated locale; returning TRUE continues the enumeration
BOOL CALLBACK MyFuncLocaleEx(LPWSTR pStr, DWORD dwFlags, LPARAM lparam)
{
    locals.push_back(pStr);
    return TRUE;
}

// Converts a wide string to a narrow multibyte string using the "chs" C locale
std::string stows(const std::wstring& ws)
{
    std::string curLocale = setlocale(LC_ALL, NULL); // curLocale = "C";
    setlocale(LC_ALL, "chs");
    // At most 2 bytes per wide character, plus the terminating null
    std::vector<char> dest(2 * ws.size() + 1, '\0');
    // Leave the last byte untouched so that the result stays null-terminated
    wcstombs(dest.data(), ws.c_str(), dest.size() - 1);
    std::string result = dest.data();
    setlocale(LC_ALL, curLocale.c_str()); // restore the previous locale
    return result;
}

int main()
{
    // Enumerate the locales that use alternate sort orders
    EnumSystemLocalesEx(MyFuncLocaleEx, LOCALE_ALTERNATE_SORTS, NULL, NULL);

    for (std::vector<std::wstring>::const_iterator str = locals.begin();
            str != locals.end(); ++str)
    {
        std::string str1 = "z\u00df\u6c34\U0001d10b";
        std::mbstate_t mbstate = std::mbstate_t();
        std::wcout << *str ;
        // std::locale throws std::runtime_error if the name is not recognized
        std::cout << "  Only the first " <<
                  std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(
                      std::locale(stows(*str))
                  ).length(mbstate, &str1[0], &str1[str1.size()], 2)
                  << " bytes out of " << str1.size() << " would be consumed "
                  " to produce the first 2 characters" << std::endl;
    }

    return 0;
}

Output

de-DE_phoneb  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
es-ES_tradnl  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
hu-HU_technl  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
ja-JP_radstr  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
ka-GE_modern  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
x-IV_mathan  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
zh-CN_phoneb  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
zh-CN_stroke  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
zh-HK_radstr  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
zh-MO_radstr  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
zh-MO_stroke  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
zh-SG_phoneb  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
zh-SG_stroke  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
zh-TW_pronun  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters
zh-TW_radstr  Only the first 2 bytes out of 6 would be consumed  to produce the first 2 characters