Chinese Character Frequency Counting in C++

Angel_EN

Last modified: 2023-04-14 23:58:08

First published: 2022-12-23 18:30:00

Article link: https://blog.csdn.net/Angel_EN/article/details/128413264

Tags: C++, data structures, programming languages, sorting algorithms, hash tables

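Design of a Character Frequency Counting System for Chinese Text

I. Experiment Description

[Problem statement]

Character frequency statistics are widely used in Chinese information processing. Design a small tool that counts how often each Chinese character occurs in a text.

[Basic requirements]

(1) Read a Chinese text as input.

(2) Output the results in three orderings: ① character frequencies in order of first appearance, ② character frequencies in pinyin (area-position code) order, ③ character frequencies in descending order of frequency (ties are broken by area-position code order).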

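II. Data Structure Design

The class character_of_chinese stores, for each Chinese character that is read, its internal code (the two-byte string that encodes the character), its area-position code (区位码, quwei code), and its occurrence count (used when sorting by frequency). It has the following member variables:

string chara: the internal code (the two-byte string that encodes the character).

int qwm: the area-position code.

int counts: the occurrence count (used when sorting by frequency).

A dictionary specialized for Chinese characters, with keys of type string, is implemented as a hash table. Its only member variable is a pointer that is used as an array holding the frequency of each character.

The dictionary collects the Chinese characters found in the input and counts how often each one occurs.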

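III. Algorithm Design

1. Read the text file and store its contents in the variable strin.
2. Traverse strin and skip everything that is not a Chinese character. Store each character and its running count in the dictionary chara_dict, which also serves to deduplicate: the first time a character is seen, construct a character_of_chinese object for it (the constructor computes the area-position code into the member qwm and initializes counts to 0) and append the object, in text order, to three sequential lists (chara_lst_by_ori, chara_lst_by_Pinyin, chara_lst_by_count).
3. Traverse chara_lst_by_count and use the lookup of chara_dict (see note 1) to set the counts member of every element, because this list is sorted by frequency later.
4. Use merge sort (see note 2) to sort chara_lst_by_count in descending order of frequency, breaking ties by area-position code.
5. Use merge sort to sort chara_lst_by_Pinyin in pinyin (area-position code) order.
6. For each character in chara_lst_by_ori, in list order, look up its frequency in chara_dict; print each character with its frequency and save the output to a file.
7. For each character in chara_lst_by_Pinyin, in list order, look up its frequency in chara_dict; print each character with its frequency and save the output to a file.
8. Print each character in chara_lst_by_count together with the frequency stored in the list element, and save the output to a file.
9. Enter the query loop (implemented with the lookup of chara_dict); the program exits when the input is q.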
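The three output orderings described in the steps above can be sketched with the standard library. This is a minimal illustration under stated assumptions, not the article's implementation: it swaps in std::sort for the hand-written merge sort, and HanziCount, make_record, by_qwm and by_count are hypothetical names introduced here. Ties in the frequency ordering are broken in ascending area-position code order, as the basic requirements state.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record type for this sketch (not the article's class).
struct HanziCount {
    std::string bytes; // the two GB2312 bytes of the character
    int qwm;           // area-position code: area * 100 + position
    int count;         // number of occurrences
};

// Build a record from the two GB2312 bytes and a known count.
inline HanziCount make_record(unsigned char hi, unsigned char lo, int count) {
    assert(hi >= 0xA1 && hi <= 0xFE && lo >= 0xA1 && lo <= 0xFE);
    return HanziCount{std::string{static_cast<char>(hi), static_cast<char>(lo)},
                      (hi - 0xA0) * 100 + (lo - 0xA0), count};
}

// Ordering 2: ascending area-position code. Returns a sorted copy.
inline std::vector<HanziCount> by_qwm(std::vector<HanziCount> v) {
    std::sort(v.begin(), v.end(),
              [](const HanziCount& a, const HanziCount& b) { return a.qwm < b.qwm; });
    return v;
}

// Ordering 3: descending frequency, ties in ascending area-position code order.
inline std::vector<HanziCount> by_count(std::vector<HanziCount> v) {
    std::sort(v.begin(), v.end(), [](const HanziCount& a, const HanziCount& b) {
        if (a.count != b.count) return a.count > b.count;
        return a.qwm < b.qwm;
    });
    return v;
}
```

Because both helpers take the list by value and return a sorted copy, the caller's list keeps the order of first appearance (ordering 1) without a third sort.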

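Detailed explanation of the core algorithms:

1. Lookup in the dictionary chara_dict:

The area-position code is computed from the bytes of a single Chinese character, and the storage index follows area-position code order. The formula is: index = (decimal area code - 16) * 94 + (decimal position code) - 1.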
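As a worked illustration of the index formula, the sketch below maps the two GB2312 bytes of a character to its slot. The function name gb2312_index is an assumption made for this example. It relies on the usual GB2312 convention that each byte equals the corresponding area or position code plus 0xA0.

```cpp
#include <cassert>

// Map the two GB2312 bytes of a Chinese character (areas 16..87) to a
// zero-based table index: (area - 16) * 94 + position - 1.
inline int gb2312_index(unsigned char hi, unsigned char lo) {
    int area = hi - 0xA0;     // area code, 16..87 for Chinese characters
    int position = lo - 0xA0; // position code, 1..94
    assert(area >= 16 && area <= 87 && position >= 1 && position <= 94);
    return (area - 16) * 94 + position - 1;
}
```

For example, 啊 (bytes 0xB0 0xA1, area-position code 1601) maps to index 0, 中 (bytes 0xD6 0xD0, code 5448) maps to (54 - 16) * 94 + 48 - 1 = 3619, and the last slot, area 87 position 94, maps to index 6767.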

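2. Merge sort

The sequence is recursively split into a left half and a right half until it can be split no further. At each level of the recursion, once the left and right halves and all of their subsequences have been split and sorted, the two halves are merged into one sorted sequence.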
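The recursion described above can be sketched generically as follows. This is an illustrative version, not the article's code: it works on any element type, uses half-open ranges [low, high) instead of the closed ranges in the listing, and the names merge_halves, merge_sort_range and merge_sort_sketch are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Merge the sorted halves [low, mid) and [mid, high) of v into one sorted range.
template <class T, class Less>
void merge_halves(std::vector<T>& v, std::size_t low, std::size_t mid,
                  std::size_t high, Less less) {
    assert(low <= mid && mid <= high && high <= v.size());
    std::vector<T> merged;
    merged.reserve(high - low);
    std::size_t i = low, j = mid;
    while (i < mid && j < high) {
        // Take from the right half only when it is strictly smaller,
        // so equal elements keep their original order (stable merge).
        if (less(v[j], v[i])) merged.push_back(v[j++]);
        else merged.push_back(v[i++]);
    }
    while (i < mid) merged.push_back(v[i++]);
    while (j < high) merged.push_back(v[j++]);
    for (std::size_t k = 0; k < merged.size(); ++k) v[low + k] = merged[k];
}

// Split [low, high) in half, sort each half recursively, then merge.
template <class T, class Less>
void merge_sort_range(std::vector<T>& v, std::size_t low, std::size_t high, Less less) {
    if (high - low < 2) return; // zero or one element: already sorted
    std::size_t mid = low + (high - low) / 2;
    merge_sort_range(v, low, mid, less);
    merge_sort_range(v, mid, high, less);
    merge_halves(v, low, mid, high, less);
}

// Entry point; sorts ascending unless another comparator is passed.
template <class T, class Less = std::less<T>>
void merge_sort_sketch(std::vector<T>& v, Less less = Less()) {
    merge_sort_range(v, 0, v.size(), less);
}
```

Passing the comparator as a parameter avoids the global function pointer used in the listing, and the half-open ranges make the empty-vector case safe without a special check.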

IV. Test Cases and Results

Omitted.

V. Problems Encountered During the Experiment and Their Solutions

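How can C++ tell Chinese characters from non-Chinese ones in a text that mixes both?

Approach: traverse the string in steps of two bytes. If the current char is positive (an ASCII byte), skip only that one char; otherwise compute the area-position code from the two chars and check whether it falls within the range assigned to Chinese characters.

Result: the problem was fully solved.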
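The scanning approach above can be sketched as a standalone function. This is an illustration under the assumption that the input is GB2312-encoded; the name extract_hanzi is hypothetical. It compares bytes as unsigned char, so it does not depend on whether plain char is signed on the target platform.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return every GB2312 Chinese character (areas 16..87) found in `text`,
// in order of appearance, each as a two-byte string.
inline std::vector<std::string> extract_hanzi(const std::string& text) {
    std::vector<std::string> found;
    std::size_t i = 0;
    while (i < text.size()) {
        unsigned char hi = static_cast<unsigned char>(text[i]);
        // ASCII byte, or a dangling lead byte at the very end: skip one byte.
        if (hi < 0x80 || i + 1 >= text.size()) {
            ++i;
            continue;
        }
        unsigned char lo = static_cast<unsigned char>(text[i + 1]);
        int area = hi - 0xA0;     // may be out of range for symbols or invalid bytes
        int position = lo - 0xA0;
        if (area >= 16 && area <= 87 && position >= 1 && position <= 94)
            found.push_back(text.substr(i, 2));
        i += 2; // a non-ASCII lead byte always starts a two-byte unit
    }
    return found;
}
```

Two-byte punctuation such as 。 (area 1) is consumed as a unit but not reported, because its area code lies outside 16..87.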

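Which data structure can store each recognized Chinese character together with its frequency and still allow the frequency of a given character to be looked up quickly?

Approach: store the characters and their frequencies in a dictionary implemented as a hash table. The hash function first converts the internal code of the character to its area code and position code, then computes the storage position as "(area code - 16) * 94 + (position code) - 1". This mapping produces no collisions, so no collision resolution is needed and a lookup takes O(1) time. A plain array is enough to hold the data.

Result: the problem was solved well, although memory use is relatively high, because the array reserves a slot for every possible character whether or not it appears in the text.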
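A collision-free table of this kind can be sketched as below. It is not the article's ch_dict: the class name HanziTable and its members are assumptions for illustration, and it stores the counters in a std::vector so that no manual memory management is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Direct-address frequency table keyed by a two-byte GB2312 character.
class HanziTable {
public:
    static constexpr int kSlots = 72 * 94; // areas 16..87, 94 positions each

    HanziTable() : counts_(kSlots) {} // all counters start at zero

    // True when `key` is exactly two bytes that encode an area 16..87 character.
    static bool is_valid(const std::string& key) {
        if (key.size() != 2) return false;
        int area = static_cast<unsigned char>(key[0]) - 0xA0;
        int position = static_cast<unsigned char>(key[1]) - 0xA0;
        return area >= 16 && area <= 87 && position >= 1 && position <= 94;
    }

    // Reference to the counter for `key`; each key has its own slot, so no probing.
    int& operator[](const std::string& key) {
        assert(is_valid(key));
        int area = static_cast<unsigned char>(key[0]) - 0xA0;
        int position = static_cast<unsigned char>(key[1]) - 0xA0;
        return counts_[static_cast<std::size_t>((area - 16) * 94 + position - 1)];
    }

private:
    std::vector<int> counts_;
};
```

With a 4-byte int, the 6768 slots occupy about 27 KB regardless of how many distinct characters the text contains, which is the fixed cost the result above refers to.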

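A template class is declared in a .h file and implemented in a .cpp file. The IDE reports no error, but the build fails with "undefined reference to `***<int>::***(***)'".

Approach: implement the template class in the .h file.

Result: the problem was fully solved.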
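The header-only layout can be sketched as follows. SlotArray and the file name slot_array.h are hypothetical, chosen only to show where the member definitions go.

```cpp
// slot_array.h (hypothetical file name): declaration and definitions together
#ifndef SLOT_ARRAY_H
#define SLOT_ARRAY_H

#include <cassert>
#include <cstddef>
#include <vector>

template <class T>
class SlotArray {
public:
    explicit SlotArray(std::size_t size);
    T& at(std::size_t index);

private:
    std::vector<T> slots_;
};

// The member definitions stay in the header, so every source file that
// uses SlotArray<int> can see them and generate the code it needs.
template <class T>
SlotArray<T>::SlotArray(std::size_t size) : slots_(size) {}

template <class T>
T& SlotArray<T>::at(std::size_t index) {
    assert(index < slots_.size());
    return slots_[index];
}

#endif // SLOT_ARRAY_H
```

An alternative, when the set of element types is known in advance, is to keep the definitions in a .cpp file and end that file with an explicit instantiation such as `template class SlotArray<int>;`, which makes the compiler emit the members for that type there.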

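VI. Self-Assessment and Summary

Completing this course project on Chinese character frequency counting taught me a great deal. In practice I found where my weak points lie, such as unfamiliarity with C++ file operations and an insufficiently deep grasp of C++ syntax.

I ran into many problems while designing the program, mainly the following:

1. Not understanding how C++ handles Chinese characters: from material found online I learned that, in C++, the internal code of a Chinese character occupies two bytes, which show up as two consecutive negative char values (or values greater than 127 for unsigned char). This corresponds to ANSI encoding, and on a Simplified Chinese system ANSI encoding means GB2312, which can be converted directly to the area-position code. This applies to text stored in that encoding, not to UTF-8 text.

2. How to design an efficient dictionary (hash table) keyed by Chinese characters: after studying area-position codes and internal codes, I learned that Chinese characters and area-position codes correspond one to one. I therefore convert each character to its area-position code and then to an array index (0 to 6767, in area-position code order) used to read and write the array.

3. When a template class is declared in a .h file and implemented in a .cpp file, the IDE shows no error or warning while editing, but the build fails. When the compiler meets a use of the template with a specific type, for example int, it must be able to see the template's implementation; otherwise it does not know how to generate the member functions of that class. If the implementation is placed in a separate source file (.cpp), the compiler cannot find it while compiling main.cpp. Including the header alone is not enough, because the header only tells the compiler how to lay out the object's data and how to form calls to the member functions, not how to build the member functions themselves. The compiler does not complain either: it assumes the functions are provided elsewhere and leaves it to the linker to find them. At link time the linker cannot find their implementations and reports the error.

4. Using memcpy to copy all the data of one vector (vc1) into another vector (vc2), both holding character_of_chinese objects, appeared to copy the data successfully, but the program crashed on reaching return 0 at the end of main (I checked that no array index was out of bounds). The class contains a std::string, so it is not trivially copyable; copying it byte by byte with memcpy is undefined behavior and can corrupt the strings' internal state, which typically surfaces when the objects are destroyed. After consulting material online, I replaced memcpy with std::copy().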
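The difference between the two copy methods in item 4 can be made explicit with a compile-time check. The sketch below uses a hypothetical Record type that, like character_of_chinese, owns a std::string; copy_into is also a name introduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical element type that owns a std::string.
struct Record {
    std::string chara;
    int counts;
};

// memcpy is only defined for trivially copyable types; Record is not one.
static_assert(!std::is_trivially_copyable<Record>::value,
              "Record must be copied element by element");

// Copy all of src over dst starting at dst[offset], using each element's
// copy assignment so that every std::string manages its own storage.
inline void copy_into(const std::vector<Record>& src, std::vector<Record>& dst,
                      std::size_t offset) {
    assert(offset + src.size() <= dst.size());
    std::copy(src.begin(), src.end(),
              dst.begin() + static_cast<std::ptrdiff_t>(offset));
}
```

A similar static_assert placed next to any memcpy call is one way to catch this class of bug at compile time rather than at program exit.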

Code listing:

main.cpp

#include<iostream>
#include<iomanip>
#include<string>
#include<fstream>
#include<sstream>
#include<vector>
#include<cstring>
#include<algorithm> // std::copy
#include<cstdlib>   // exit


#include"ch_dict.h"



#define coc character_of_chinese

using namespace std;

inline string load_file(const string &file_path)
{
	ifstream fin(file_path); // open the requested file instead of a hard-coded path
	if (!fin) {
		cout << "File not found!";
		exit(1);
	}
	stringstream f;
	f << fin.rdbuf(); // read the whole file into memory
	fin.close();

	return f.str();
}


class character_of_chinese {
public:

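	string chara; // internal code: the two bytes that encode the character
	int qwm;      // area-position code: area code * 100 + position code
	int counts;   // number of occurrences
	
	character_of_chinese(string c):chara(c) 
	{
		unsigned char QWM[2];
		QWM[0] = chara[1] - 0xa0; // position code (second byte)
		QWM[1] = chara[0] - 0xa0; // area code (first byte)
		qwm = QWM[1] * 100 + QWM[0];
		counts = 0;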
	}
	character_of_chinese(const character_of_chinese &c2)
	{
		chara = c2.chara;
		qwm = c2.qwm;
		counts = c2.counts;
	}
	character_of_chinese& operator=(const character_of_chinese& c2) // return a reference, not a copy
	{
		chara = c2.chara;
		qwm = c2.qwm;
		counts = c2.counts;
		return *this;
	}


	friend bool operator<(coc c1,coc c2) {
		return c1.qwm < c2.qwm;
	}

	friend bool operator>(coc c1, coc c2) {
		return c1.qwm > c2.qwm;
	}

};

inline bool save_file(const string& file_path,const vector<coc> &vc)
{
	ofstream fout;
	fout.open(file_path, std::ios::out);
	if (!fout) return false;
	for (int i = 0; i < vc.size(); i++) {
		/*fout << vc[i].chara << " " << vc[i].counts << "     " << vc[i].qwm << endl;
		cout << vc[i].chara << " " << vc[i].counts << "     " << vc[i].qwm << endl;*/
		fout << vc[i].chara << " " << vc[i].counts << endl;
		cout << vc[i].chara << " " << vc[i].counts << endl;

	}
	fout.close();
	return true;
}

inline bool save_file(const string& file_path, const vector<coc>& vc, ch_dict<int>& mp)
{
	ofstream fout;
	fout.open(file_path, std::ios::out);
	if (!fout) return false;
	for (int i = 0; i < vc.size(); i++) {
		/*fout << vc[i].chara << " " << mp[vc[i].chara] << "     " << vc[i].qwm << endl;
		cout << vc[i].chara << " " << mp[vc[i].chara] << "     " << vc[i].qwm << endl;*/
		fout << vc[i].chara << " " << mp[vc[i].chara] << endl;
		cout << vc[i].chara << " " << mp[vc[i].chara] << endl;
	}
	fout.close();
	return true;

}

bool cmp4coc_counts(character_of_chinese c1, character_of_chinese c2) 
{
	if (c1.counts == c2.counts) return c1.qwm < c2.qwm; // equal frequency: ascending area-position code, as the requirements state

	return c1.counts > c2.counts; // otherwise: descending frequency
}

bool cmp4coc_Pinyin(character_of_chinese c1, character_of_chinese c2) 
{
	return c1 < c2;
} 

bool (*Maincmp)(coc, coc) = NULL;

void merge_array(vector<coc>& arr, int low, int mid, int high)
{
	int i=low, j=mid+1;
	vector<coc> temp;
	while (i <= mid && j <= high) 
	{
		if (Maincmp(arr[i], arr[j])) 
			temp.push_back(arr[i++]);
		else
			temp.push_back(arr[j++]);
	}
	while (i <= mid)
	{
		temp.push_back(arr[i++]);
	}
	while (j <= high)
	{
		temp.push_back(arr[j++]);
	}

	//memcpy(&arr[low], &temp[0], temp.size() * sizeof(coc));
	copy(temp.begin(), temp.end(), arr.begin()+low);

	temp.clear();
}

void merge_sort_do(vector<coc>& arr, int low, int high)
{
	if (low < high) 
	{
		int mid = (high + low) / 2;
		merge_sort_do(arr, low, mid);// recursively split the left half
		merge_sort_do(arr, mid + 1, high);// recursively split the right half
		merge_array(arr, low, mid, high);// merge the two sorted halves into one sorted range
	}

}

void merge_sort(vector<coc>& arr, bool (*cmp)(coc, coc)) // merge sort
{ 
	Maincmp = cmp; // set the comparison function
	merge_sort_do(arr, 0, (int)arr.size()-1); // start the recursion; the cast keeps high at -1 for an empty list
}

int main()
{
	string strin;
	ch_dict<int> chara_dict;

	vector<coc> chara_lst_by_ori;
	vector<coc> chara_lst_by_count;
	vector<coc> chara_lst_by_Pinyin;

	strin = load_file("./test.txt");
	//cin >> strin;
	cout << strin;
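	for (int i = 0; i + 1 < (int)strin.size(); i += 2) // i + 1 < size() is also safe when strin is empty
	{
		if (strin[i] > 0) // ASCII byte (assumes char is signed): advance by one byte only
		{
			i--;
			continue;
		}
		string tmp = strin.substr(i,2);
		coc coctmp(tmp);
		// skip anything that is not a GB2312 Chinese character (areas 16 to 87, second byte 0xa1 to 0xfe)
		if ((unsigned char)tmp[1] < 0xa1 || (unsigned char)tmp[1] > 0xfe || coctmp.qwm < 1601 || coctmp.qwm > 8794)
		{
			continue;
		}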

		if (chara_dict[tmp] == 0)
		{
			chara_dict[tmp] = 1;
			chara_lst_by_ori.push_back(coctmp);
			chara_lst_by_Pinyin.push_back(coctmp);
			chara_lst_by_count.push_back(coctmp);
		}
		else
		{
			chara_dict[tmp]++;
		}
	}

	
	for (int i = 0; i < chara_lst_by_count.size(); i++)
	{
		chara_lst_by_count[i].counts = chara_dict[chara_lst_by_count[i].chara];
	}
	merge_sort(chara_lst_by_count, cmp4coc_counts);
	merge_sort(chara_lst_by_Pinyin, cmp4coc_Pinyin);
	
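	cout << "Sorted by order of appearance:\n";
	save_file("./output_chara_lst_by_ori.txt", chara_lst_by_ori,chara_dict);
	cout << endl;
	cout << "Sorted by pinyin (area-position code) order:\n";
	save_file("./output_chara_lst_by_Pinyin.txt", chara_lst_by_Pinyin,chara_dict);
	cout << endl;
	cout << "Sorted by descending frequency (ties in area-position code order):\n";
	save_file("./output_chara_lst_by_count.txt",chara_lst_by_count);
	cout << endl;


	while (true) 
	{
		cout << "Enter a Chinese character to look up its frequency (q to quit): ";
		if (!(cin >> strin) || strin == "q") { // also stop at end of input
			break;
		}
		if (strin[0] > 0) {
			cout << "Please enter a single Chinese character.\n\n";
			continue;
		}
		if (strin.length() != 2) {
			cout << "Please enter a single Chinese character.\n\n";
			continue;
		}
		coc coctmp(strin);
		// reject second bytes outside 0xa1..0xfe so the dictionary index stays in range
		if ((unsigned char)strin[1] < 0xa1 || (unsigned char)strin[1] > 0xfe || coctmp.qwm < 1601 || coctmp.qwm > 8794) {
			cout << "Please enter a single Chinese character.\n\n";
			continue;
		}
		cout << strin << "   " << chara_dict[strin] << "   area-position code: " << coctmp.qwm << endl << endl;
	}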


	
	return 0;
}

ch_dict.h

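#pragma once
#include<string>
using namespace std;

template<class T>
class ch_dict // dictionary: Chinese character --> T
{
protected:
	T* hashmap;
	int hash(const string& key) const;
public:
	ch_dict();
	ch_dict(const int & size);
	~ch_dict();
	ch_dict(const ch_dict&) = delete; // the raw array has a single owner
	ch_dict& operator=(const ch_dict&) = delete;
	T& operator[](const string& key);
};


template<class T>
int ch_dict<T>::hash(const string& key) const
{
	// QWM[0]: position code, QWM[1]: area code; the casts avoid a narrowing conversion in the braced initializer
	unsigned char QWM[2] = { (unsigned char)(key[1] - 0xa0), (unsigned char)(key[0] - 0xa0) };
	int index = (QWM[1] - 16) * 94 + QWM[0] - 1;
	return index;
}

template<class T>
ch_dict<T>::ch_dict()
{
	hashmap = new T[6769](); // 72 areas * 94 positions = 6768 slots are used; () zero-initializes for any T
}

template<class T>
ch_dict<T>::ch_dict(const int& size)
{
	hashmap = new T[size](); // caller-chosen capacity, zero-initialized
}

template<class T>
ch_dict<T>::~ch_dict()
{
	delete[] hashmap; // release the array allocated in the constructor
}

template<class T>
T& ch_dict<T>::operator[](const string& key)
{
	return hashmap[hash(key)];
}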