I have a .csv file with Chinese characters. I need to read in these Chinese characters and store them for further use in the program. I know that Chinese characters have to be processed in a Unicode format such as UTF-8, using wchar_t and the like, but I can't figure out exactly how this is done. Can anyone please help me out?
Solution
First of all, there is no unique way to encode Chinese characters. To be able to decode the file, you first have to know which encoding has been used.
The most common ones are UTF-8, UTF-16, Big5 and GB2312. GB2312 is for simplified characters and is mostly used in mainland China. Big5 is for traditional characters and is mostly used in Taiwan and Hong Kong. Most international companies use UTF-8 or UTF-16. UTF-8 is a variable-length encoding with a code unit of 1 byte. It is typically more efficient for text that contains a lot of ASCII characters, since these take up only one byte each. UTF-16 has a code unit of 2 bytes and is also variable-length: a character takes either one or two units.
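To make the variable length concrete: the character 中 (U+4E2D) is stored as the three bytes E4 B8 AD in UTF-8, while an ASCII letter takes one byte. The following is a minimal sketch, not a library function; the name `utf8_code_point_count` is made up for illustration, and it assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the code points in a UTF-8 string.
// Every code point starts with exactly one byte that is NOT a
// continuation byte (continuation bytes look like 10xxxxxx).
// Assumption: the input is well-formed UTF-8; nothing is validated.
std::size_t utf8_code_point_count(const std::string& bytes)
{
    std::size_t count = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a string holding "a;" followed by 中, `size()` reports 5 bytes but this function reports 3 characters, which is why byte offsets and character positions must not be mixed up.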
It is also worthwhile to read Joel Spolsky's article on Unicode: http://www.joelonsoftware.com/articles/Unicode.html
Let's suppose the csv file is encoded in UTF-8.
So you have to specify the encoding.
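If you don't know the encoding, the first few bytes of the file can sometimes give a hint, because some tools write a byte order mark (BOM) at the start. The sketch below is only a heuristic under that assumption; the function name `guess_encoding_from_bom` is hypothetical, and a file without a BOM (which is common for UTF-8 and normal for Big5 and GB2312) simply yields "unknown".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: guess an encoding from a leading byte order mark.
// `head` holds the first bytes of the file, read in binary mode.
// Limitation: a UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE.
std::string guess_encoding_from_bom(const std::string& head)
{
    auto starts_with = [&head](const char* bom, std::size_t n) {
        return head.size() >= n && head.compare(0, n, bom, n) == 0;
    };
    if (starts_with("\xEF\xBB\xBF", 3)) return "UTF-8";
    if (starts_with("\xFF\xFE", 2)) return "UTF-16LE";
    if (starts_with("\xFE\xFF", 2)) return "UTF-16BE";
    return "unknown";  // no BOM: ask whoever produced the file
}
```

If a UTF-8 file does start with a BOM, keep in mind that it may show up as the character U+FEFF at the beginning of the first line you read, unless the conversion is set up to skip it.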
Using the following, the file is interpreted as UTF-8 and converted to wchar_t, which has a fixed size (2 bytes on Windows and 4 bytes on Linux):
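// The locale takes ownership of the facet and deletes it when done.
const std::locale utf8_locale
    = std::locale(std::locale(), new std::codecvt_utf8<wchar_t>());
std::wifstream file("filename");
// Imbue before the first read so every byte is decoded as UTF-8.
file.imbue(utf8_locale);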
You can then read and process the file, for example like this:
std::wstring s;
while (std::getline(file, s))
{
    // Do something with the string,
    // e.g. find the end of the first ';'-separated field
    auto end1 = s.find_first_of(L';');
    ...
}
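What goes into the loop body depends on your data. As one possible sketch, assuming the fields are separated by ';' as in the snippet above and never contain quoted separators, each line could be split into fields like this. The helper name `split_fields` is made up for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split one line into fields at every `delim`.
// Assumption: plain fields only, no quoting or escaping as in full CSV.
std::vector<std::wstring> split_fields(const std::wstring& line,
                                       wchar_t delim = L';')
{
    std::vector<std::wstring> fields;
    std::wstring::size_type start = 0;
    for (;;) {
        const auto end = line.find(delim, start);
        if (end == std::wstring::npos) {
            fields.push_back(line.substr(start));  // last field
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;  // continue after the separator
    }
    return fields;
}
```

Inside the loop you would then call `split_fields(s)` and store the resulting vector. Because the separator is searched for in wide characters, it cannot be confused with a byte inside a multi-byte Chinese character.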
For this you'll need these header files:
#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>
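Note that std::codecvt_utf8 and the codecvt header are deprecated since C++17. If that is a concern, one alternative is to read the file as raw bytes into a std::string and decode the UTF-8 yourself. The following is a rough sketch of that idea, not a replacement for a proper Unicode library: the function name `decode_utf8` is hypothetical, and it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Throws std::runtime_error on a bad lead byte, a bad continuation
// byte, or a sequence cut off at the end of the input.
// Limitation: overlong forms and surrogates are not rejected.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont =
                static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

A std::u32string holds one element per code point on every platform, so it also avoids the difference in wchar_t size between Windows and Linux.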