Solving the Character-Count Problem Caused by UTF-8-Encoded Chinese Text - CSDN Blog

Article link: https://blog.csdn.net/zhizichina/article/details/7642614

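Some errors came up while I was running the experiment below, and they have not been fixed yet. I therefore recommend not following this approach for now. If you want to try UTF-8 conversion, first use the method in the linked post.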

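Recently, while counting the characters in Weibo posts, I used a Java program to write the fetched Weibo data out as UTF-8, and this caused me a great deal of trouble in my later programs. UTF-8 is a variable-length encoding: an English (ASCII) character takes one byte, while a Chinese character usually takes three bytes. This made counting characters very difficult. With the help of the method described at http://blog.csdn.net/chrisniu1984/article/details/7359908, however, I still managed to solve the problem.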
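
To make the difference between bytes and characters concrete, here is a minimal C++ sketch. It is not taken from the linked post. The helper names `utf8_sequence_length`, `count_code_points` and `demo_byte_lengths` are hypothetical, and the code assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes a UTF-8 sequence occupies, judged from its first byte.
// Returns 0 for a continuation byte (0x80-0xBF) or an invalid lead byte.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead <= 0x7F) return 1;                 // ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2; // two-byte sequence
    if (lead >= 0xE0 && lead <= 0xEF) return 3; // three-byte sequence, e.g. most Chinese characters
    if (lead >= 0xF0 && lead <= 0xF4) return 4; // four-byte sequence
    return 0;
}

// Count characters (code points) by counting every byte that is not a
// continuation byte. Assumes the string is valid UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Small demonstration: "a" followed by U+4E2D, which is encoded as E4 B8 AD.
void demo_byte_lengths() {
    const std::string text = "a" "\xE4\xB8\xAD";
    assert(text.size() == 4);               // four bytes in memory
    assert(count_code_points(text) == 2);   // but only two characters
    assert(utf8_sequence_length(0xE4) == 3);
}
```

`std::string::size()` reports bytes, so a character count has to be derived from the byte pattern, as `count_code_points` does here.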

The program is as follows:

#include <iostream>
#include <fstream>
#include <string>
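// Classify a single byte of a UTF-8 stream.
// ASCII byte (0x00-0x7F): a complete one-byte character.
#define utf8_asc(byte) (((unsigned char)(byte)>=0x00)&&((unsigned char)(byte)<=0x7f))
// Lead byte of a multi-byte sequence. The range 0xC0-0xFD follows the older
// UTF-8 definition; valid UTF-8 today only uses 0xC2-0xF4 as lead bytes.
#define utf8_first(byte) (((unsigned char)(byte)>=0xc0)&&((unsigned char)(byte)<=0xfd))
// Continuation byte (0x80-0xBF) inside a multi-byte sequence.
#define utf8_other(byte) (((unsigned char)(byte)>=0x80)&&((unsigned char)(byte)<=0xbf))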
using namespace std;
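// Returns num / 2 rounded up, so that two ASCII bytes count as one character.
int ceil(int num){
	// Bitwise & tests the lowest bit; the original && was true for any nonzero num.
	if (num & 0x01) return (num>>1) + 1;
	else return (num >> 1);
}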
void count_text(string &file_name, long *count){
	ifstream fin;
	fin.open(file_name.c_str(), ios::in);

	if (!fin){
		cout << "file open error" << endl;
		return ;
	}
	else {
		cout << "file open ok" << endl;
	}
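	string s;
	getline(fin, s);	// skip the first line of the file
	while (getline(fin, s)){
		getline(fin, s);	// only this second line of each three-line group is counted
		int co = 0, co2 = 0;	// co: multi-byte characters, co2: ASCII bytes
		for (size_t i = 0; i < s.size(); ++i){
			if (utf8_first(s[i]))
				co++;
			if (utf8_asc(s[i]))
				co2++;
		}
		int len = co + ceil(co2);
		if (len < 560)	// count has 560 buckets in main(); ignore longer lines
			count[len] ++;
		getline(fin, s);	// skip the third line of the group
	}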
	fin.close();
}

int main(){
	
	string file_name2 = "e://weibodata_2.txt";
	string file_name1 = "e://weibodata_5.txt";
	string file_name = "e://datatest.txt";
	long count[560] = {0};
//	count_text(file_name2, count);
	count_text(file_name, count);
	ofstream fout("e://count_number.txt");
	for (int i = 0; i < 560; ++i)
		fout << count[i] << endl;
	fout.close();


	return 0;
}
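
The counting rule inside the loop can also be written as a small standalone function that is easier to test. The following is only a sketch: the names `weighted_length` and `add_to_histogram` are hypothetical, and the rule (each multi-byte character counts as one, every two ASCII bytes count as one, rounded up) is inferred from the expression `co + ceil(co2)` in the listing above rather than stated anywhere in the post.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: weighted length of one UTF-8 line.
// Each multi-byte character counts as 1; every two ASCII bytes count as 1,
// rounded up. Assumes the line is valid UTF-8.
std::size_t weighted_length(const std::string& line) {
    std::size_t multibyte = 0, ascii = 0;
    for (unsigned char c : line) {
        if (c <= 0x7F) {
            ++ascii;                       // single-byte (ASCII) character
        } else if ((c & 0xC0) != 0x80) {
            ++multibyte;                   // lead byte of a multi-byte sequence
        }                                  // continuation bytes are ignored
    }
    return multibyte + (ascii + 1) / 2;    // (ascii + 1) / 2 rounds up
}

// Hypothetical helper: add one line to a histogram of weighted lengths.
// Lengths beyond the last bucket are ignored instead of writing out of bounds.
void add_to_histogram(std::vector<long>& histogram, const std::string& line) {
    const std::size_t len = weighted_length(line);
    if (len < histogram.size()) ++histogram[len];
}

// Small demonstration: U+4E2D (E4 B8 AD) followed by "abc" has weight 1 + 2.
void demo_weighted_length() {
    assert(weighted_length("\xE4\xB8\xAD" "abc") == 3);
}
```

Using a `std::vector<long>` with a bounds check avoids the out-of-range write that a fixed 560-element array would suffer if a line were longer than expected.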

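I do not know whether this program is correct. Before reading the file data it has to read one line ahead, which baffles me; I do not know why this is necessary. I wasted a lot of time on endless debugging, and in hindsight this kind of work was really not worth doing, but since I had run into the problem I wanted to solve it properly. I am in a bad mood and cannot settle down.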
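
One possible explanation for the read-ahead, which this post does not confirm, is a UTF-8 byte order mark (the bytes EF BB BF) at the start of the file. The byte 0xEF falls inside the range that the `utf8_first` macro accepts, so a byte order mark would add one extra character to the count of the first line. Assuming that is the cause, the mark could be stripped instead of discarding the whole line. The helper name `strip_utf8_bom` is hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: remove a leading UTF-8 byte order mark (EF BB BF)
// from a line, if one is present. Lines without the mark are left unchanged.
void strip_utf8_bom(std::string& line) {
    if (line.size() >= 3 &&
        static_cast<unsigned char>(line[0]) == 0xEF &&
        static_cast<unsigned char>(line[1]) == 0xBB &&
        static_cast<unsigned char>(line[2]) == 0xBF) {
        line.erase(0, 3);
    }
}

// Small demonstration: the mark is removed and the rest of the line is kept.
void demo_strip_bom() {
    std::string first_line = "\xEF\xBB\xBF" "hello";
    strip_utf8_bom(first_line);
    assert(first_line == "hello");
}
```

It would be called once, on the first line read from the file, before the bytes are classified.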