Working with UTF-8 in Visual C++

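 Microsoft's C/C++ CRT does not support a UTF-8 locale (at least not in the version quoted here), so the only option is to fall back on the OS API, which is awkward. The CRT source below shows the check that rejects it.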

//VC\crt\src\getqloc.c
    //  process codepage value
    iCodePage = ProcessCodePage(lpInStr ? lpInStr->szCodePage: NULL, _psetloc_data);

    //  verify codepage validity
    if (!iCodePage || iCodePage == CP_UTF7 || iCodePage == CP_UTF8 ||
        !IsValidCodePage((WORD)iCodePage))
        return FALSE;
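
To make the effect of that condition easier to see, here is a minimal sketch that mirrors it in portable C++. The function name `crt_accepts_codepage` and the `os_says_valid` parameter (standing in for the result of `IsValidCodePage`) are hypothetical and not part of the CRT; the two constants are the documented values of `CP_UTF7` and `CP_UTF8`.

```cpp
// Illustration only: mirrors the validity check quoted above.
const unsigned kCpUtf7 = 65000; // value of CP_UTF7
const unsigned kCpUtf8 = 65001; // value of CP_UTF8

// os_says_valid stands in for IsValidCodePage(cp) (assumption for this sketch).
bool crt_accepts_codepage(unsigned cp, bool os_says_valid)
{
    if (cp == 0 || cp == kCpUtf7 || cp == kCpUtf8 || !os_says_valid)
        return false; // UTF-7 and UTF-8 are rejected even when the OS knows them
    return true;
}
```

With this sketch, code page 936 passes when the OS reports it as valid, while 65001 is refused no matter what the OS reports.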


Workaround:

#include <iostream>
#include <locale>
#include <string>
#include <fstream>
#include <vector>
#include <stdexcept>
#include <windows.h>


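// Convert a UTF-16 wide string to a UTF-8 byte string.
std::string convert(wchar_t const *s, size_t len)
{
  // The API returns 0 for empty input, which is not an error here.
  if(len==0)
  {
    return std::string();
  }
  // First call: query the required buffer size in bytes.
  int n=WideCharToMultiByte(CP_UTF8,0,s,static_cast<int>(len),0,0,0,0);
  if(n<=0)
  {
    throw std::runtime_error("bad conv");
  }
  std::vector<char> buf(n);
  // Second call: perform the conversion and check its result as well.
  n=WideCharToMultiByte(CP_UTF8,0,s,static_cast<int>(len),&buf[0],n,0,0);
  if(n<=0)
  {
    throw std::runtime_error("bad conv");
  }
  return std::string(&buf[0],n);
}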

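// Convert a UTF-8 byte string to a UTF-16 wide string.
std::wstring convert(char const* s, size_t len)
{
  // The API returns 0 for empty input, which is not an error here.
  if(len==0)
  {
    return std::wstring();
  }
  // First call: query the required buffer size in wide characters.
  int n=MultiByteToWideChar(CP_UTF8,0,s,static_cast<int>(len),0,0);
  if(n<=0)
  {
    throw std::runtime_error("bad conv");
  }
  std::vector<wchar_t> buf(n);
  // Second call: perform the conversion and check its result as well.
  n=MultiByteToWideChar(CP_UTF8,0,s,static_cast<int>(len),&buf[0],n);
  if(n<=0)
  {
    throw std::runtime_error("bad conv");
  }
  return std::wstring(&buf[0],n);
}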

int main()
{
  // utf.txt is encoded as UTF-8 with a BOM
  std::ifstream ifs("utf.txt", std::ifstream::binary);
  char bom[3]; // UTF-8 BOM: 0xEF 0xBB 0xBF
  if(!ifs.read(bom, sizeof(bom)))
  {
    std::cerr << "cannot read utf.txt" << std::endl;
    return 1;
  }
  std::vector<char> content;
  char ch;
  while(ifs.read(&ch, 1)){
    content.push_back(ch);
  }
  // The console uses the ANSI/OEM GBK code page (936),
  // so convert the text to a wide string before printing it.
  std::locale::global(std::locale(".936"));
  // &content[0] is invalid on an empty vector, so handle that case separately.
  std::wstring wstr = content.empty() ? std::wstring() : convert(&content[0], content.size());
  std::wcout << wstr << std::endl;

  // Save it back as UTF-8.
  std::ofstream ofs("out.txt", std::ofstream::binary);
  ofs.write(bom, sizeof(bom));
  std::string str = wstr.empty() ? std::string() : convert(wstr.c_str(), wstr.length());
  ofs.write(str.c_str(), str.length());
  return 0;
}
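
The program above assumes the file always starts with a BOM and reads three bytes unconditionally. As a hedged alternative, the sketch below checks for the BOM before dropping it. The helper names `has_utf8_bom` and `strip_utf8_bom` are hypothetical and do not appear in the original code.

```cpp
#include <string>

// Hypothetical helper: true if the buffer starts with the UTF-8 BOM EF BB BF.
bool has_utf8_bom(const std::string& bytes)
{
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Hypothetical helper: returns the payload without a leading BOM, if any.
std::string strip_utf8_bom(const std::string& bytes)
{
    return has_utf8_bom(bytes) ? bytes.substr(3) : bytes;
}
```

A file without a BOM then passes through unchanged instead of losing its first three bytes.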


Because the CRT does not support UTF-8, do not use formatted I/O (printf, scanf, operator>>, operator<<, and so on) on the UTF-8 data.
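
One way to follow that advice is to read the file with unformatted calls only. The sketch below is an assumption about how that might look, not code from the original post: the helper name `read_all_bytes` and the 4096-byte chunk size are made up for illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads a whole file as raw bytes.
// Binary mode plus read() means no newline translation,
// no whitespace skipping, and no locale-dependent conversion.
std::string read_all_bytes(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open file");

    std::string bytes;
    char chunk[4096];
    // The last read() may fail at EOF yet still deliver a partial chunk,
    // so gcount() is checked as well.
    while (in.read(chunk, sizeof(chunk)) || in.gcount() > 0)
        bytes.append(chunk, static_cast<std::size_t>(in.gcount()));
    return bytes;
}
```

The returned bytes, BOM included if the file has one, can then be handed to a conversion routine such as `convert` above.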


Update: it turns out that MSVC's fopen lets you specify the file encoding, which makes this much easier.

#include <cstdio>
#include <cwchar>
#include <clocale>

int main()
{
  if (!setlocale(LC_ALL, ".936")) // code page 936 is GBK
  {
    fwprintf(stderr, L"Your system does not support GBK, please ask your administrator for help.\n");
    return -1;
  }
  FILE* stream = fopen("utf.txt", "rt,ccs=UTF-8");
  if (stream){
    wchar_t buf[1024];
    while (fgetws(buf, 1024, stream) != NULL)
    {   // By default stdout has no orientation;
        // the first I/O operation on it determines the orientation,
        // so this call makes stdout wide-oriented.
        // Pass buf as an argument, never as the format string.
        wprintf(L"%ls", buf);
    }
    fclose(stream);
  }
  return 0;
}


The relevant MSDN documentation:

Unicode Support

fopen supports Unicode file streams. To open a Unicode file, pass a ccs flag that specifies the desired encoding to fopen, as follows.

fp = fopen("newfile.txt", "rt+, ccs=encoding");

Allowed values of encoding are UNICODE, UTF-8, and UTF-16LE.

When a file is opened in Unicode mode, input functions translate the data that's read from the file into UTF-16 data stored as type wchar_t. Functions that write to a file opened in Unicode mode expect buffers that contain UTF-16 data stored as type wchar_t. If the file is encoded as UTF-8, then UTF-16 data is translated into UTF-8 when it is written, and the file's UTF-8-encoded content is translated into UTF-16 when it is read. An attempt to read or write an odd number of bytes in Unicode mode causes a parameter validation error. To read or write data that's stored in your program as UTF-8, use a text or binary file mode instead of a Unicode mode. You are responsible for any required encoding translation.
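
To illustrate what "UTF-16 data is translated into UTF-8" means at the byte level, here is a portable sketch of such a translation. It is not the CRT's implementation; the function name `utf16_to_utf8` is hypothetical, and `char16_t` is used instead of `wchar_t` so that the code unit is 16 bits on every platform.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes UTF-16 code units as UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::runtime_error("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }
        if (cp < 0x80) {                       // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For example, the single code unit U+4E2D becomes the three bytes E4 B8 AD, which is why a buffer of wide characters and the file on disk have different sizes.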

If the file already exists and is opened for reading or appending, the Byte Order Mark (BOM), if it's present in the file, determines the encoding. The BOM encoding takes precedence over the encoding that is specified by the ccs flag. The ccs encoding is only used when no BOM is present or the file is a new file.
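
The precedence rule can be sketched as a small decision function. This is only an illustration of the rule as described, not how the CRT is implemented: the names `Encoding` and `effective_encoding` are hypothetical, and the sketch assumes that only the UTF-8 and UTF-16LE byte order marks are recognized.

```cpp
#include <string>

enum class Encoding { Utf8, Utf16LE };

// Hypothetical helper: picks the encoding for an existing file.
// head holds the first bytes of the file; ccs is the encoding requested in the mode string.
Encoding effective_encoding(const std::string& head, Encoding ccs)
{
    const unsigned char b0 = head.size() > 0 ? static_cast<unsigned char>(head[0]) : 0;
    const unsigned char b1 = head.size() > 1 ? static_cast<unsigned char>(head[1]) : 0;
    const unsigned char b2 = head.size() > 2 ? static_cast<unsigned char>(head[2]) : 0;

    if (head.size() >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF)
        return Encoding::Utf8;     // UTF-8 BOM wins over ccs
    if (head.size() >= 2 && b0 == 0xFF && b1 == 0xFE)
        return Encoding::Utf16LE;  // UTF-16LE BOM wins over ccs
    return ccs;                    // no BOM (or new file): ccs decides
}
```

So a file that starts with FF FE is treated as UTF-16LE even when it is opened with `ccs=UTF-8`.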



