Counting Word Frequencies in a Short Text


Tander_Tang


Categories: C++ Basics, Introduction to Algorithms, Data Structures

Article link: https://blog.csdn.net/tander_tang/article/details/50809990


An application of hash-based lookup: given an English text file, count how often each word in the file occurs.

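The core operation is to look up each word, as it is read, among the words seen so far: if the word is already present, increment its count; otherwise insert it with a count of 1. The C++ program in this post stores the words in an open-addressing hash table and resolves collisions with double hashing. The table size is the largest prime that does not exceed the requested size.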
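As a small aside before the full program, the two ideas just mentioned can be sketched in isolation. The helper names `largestPrimeAtMost` and `probeSequence` are hypothetical and do not appear in the original program; the sketch assumes a non-negative hash value and a table size of at least 3. It only illustrates why a prime table size matters: with a prime size, every step value between 1 and size - 1 makes the probe sequence visit each slot exactly once.

```cpp
#include <vector>

// Hypothetical helper: largest prime <= n (assumes n >= 2).
int largestPrimeAtMost(int n) {
    for (int candidate = n; candidate >= 2; --candidate) {
        bool isPrime = true;
        for (int d = 2; d * d <= candidate; ++d) {
            if (candidate % d == 0) { isPrime = false; break; }
        }
        if (isPrime) return candidate;
    }
    return 2; // only reached when n < 2
}

// Hypothetical helper: the slots visited by double hashing for a
// non-negative hashValue in a table of prime size tableSize (>= 3).
std::vector<int> probeSequence(int hashValue, int tableSize) {
    int k1 = hashValue % tableSize;            // starting slot
    int k2 = hashValue % (tableSize - 2) + 1;  // step in 1..tableSize-2, never 0
    std::vector<int> slots;
    for (int i = 0; i < tableSize; ++i)
        slots.push_back((k1 + i * k2) % tableSize);
    return slots;
}
```

For instance, `probeSequence(12, 7)` starts at slot 5 and steps by 3, so it visits 5, 1, 4, 0, 3, 6, 2, which covers all seven slots.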

#include<iostream>
#include<algorithm>
#include<fstream>
#include<iomanip>
#include<string>
using namespace std;
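class HashEntry{
public:
	string words_;                // the word stored in this slot
	int totalTimes_;              // how many times words_ has occurred (0 marks an empty slot)
	// Orders entries by descending count, so sort() puts the most frequent words first.
	bool operator<(HashEntry const&a) const{
		return totalTimes_ > a.totalTimes_;
	}
};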
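class HashTable{
private:
	HashEntry*hash;
	int nextPrime_;             // table size, always a prime
	int numberOfWords_;         // number of distinct words stored
public:
	HashTable(int size);
	~HashTable(){ delete[] hash; }                    // release the array allocated in the constructor
	HashTable(const HashTable&) = delete;             // copying would double-free the array
	HashTable& operator=(const HashTable&) = delete;
	int getNumberofwords(){ return numberOfWords_; }
	int hashFunction(string key);         // maps a word to an integer
	void insertKey(string key);           // inserts a word or increments its count
	void showWord(double percentage);     // prints the given fraction of the most frequent words
};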
HashTable::HashTable(int size){
	int i;
	bool flag;
	if (size < 3)
		size = 3;      // double hashing needs at least 3 slots
	if (size % 2 == 0)
		size--;        // make size odd
	/* walk down through the odd numbers to the largest prime not exceeding size */
	while (size){
		flag = true;
		for (i = 2; i*i <= size; i +=1){
			if (size%i == 0){
				flag = false;
				break;
			}
		}
		if (flag)
			break;
		size-=2;
	}
	nextPrime_ = size;
	numberOfWords_ = 0;
	hash = new HashEntry[nextPrime_];          // allocate the table
	for (i = 0; i < nextPrime_; i++)
		hash[i].totalTimes_= 0;               // every slot starts empty
}
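int HashTable::hashFunction(string key){      // compute the raw hash value
	int i, length;
	unsigned int num = 0;                     // unsigned, so overflow wraps instead of being undefined
	length = key.size()>8 ? 8 : static_cast<int>(key.size());   // use at most the first 8 characters
	for (i = length - 1; i >= 0; i--)
		num = num * 10 + static_cast<unsigned char>(key[i]);    // fold the characters into one integer, base 10
	return static_cast<int>(num & 0x7fffffff);                  // clear the sign bit so the result is never negative
}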

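void HashTable::insertKey(string key){
	int k1, k2,position,i;
	int hashValue = hashFunction(key);
	k1= hashValue%nextPrime_;                  // first hash: the starting slot
	k2 = hashValue % (nextPrime_ - 2) + 1;      // second hash: the probe step, never 0
	for (i = 0; i < nextPrime_; i++){
		// double hashing; the product is computed in long long so it cannot overflow for large tables
		position = static_cast<int>((k1 + static_cast<long long>(i)*k2)%nextPrime_);
		if (hash[position].totalTimes_ == 0 || hash[position].words_ == key){
			if (hash[position].totalTimes_ == 0)
				numberOfWords_++;              // empty slot: this is a new word
			hash[position].words_ = key;
			hash[position].totalTimes_++;
			break;
		}
	}
	// If the table is full and the word is not in it, the loop ends and the word is silently dropped.
}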

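void HashTable::showWord(double percentage){
	int i;
	int words = int(percentage*numberOfWords_);   // number of words to print
	if (words > numberOfWords_)
		words = numberOfWords_;                   // never print empty slots
	// Sort by count, descending. This destroys the hash layout, so call it only after all insertions.
	sort(hash, hash + nextPrime_);
	for (i = 0; i < words; i++){
		cout << setw(15) << setfill(' ') << hash[i].words_ << "       " << hash[i].totalTimes_ << endl;
	}
}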

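bool CheckWords(string &key){   // a simple filter for input tokens
	if (key.empty())
		return false;
	// reject tokens that do not start with an ASCII letter
	if (key[0]<'A' || key[0]>'z' || (key[0]<'a'&&key[0]>'Z'))
		return false;
	// strip one trailing period or comma before measuring the length
	if (key[key.size() - 1] == '.' || key[key.size() - 1] == ',')
		key.erase(key.size() - 1);
	// ignore words shorter than 3 characters
	if (key.size() < 3)
		return false;
	return true;
}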

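int main(){
	/* test driver */
	string s;
	HashTable Hash(5000);
	ifstream fin("a.txt");      // renamed from sin to avoid confusion with the math function
	if (!fin){
		cerr << "cannot open a.txt" << endl;
		return 1;
	}
	while (fin >> s){
		if (CheckWords(s))
			Hash.insertKey(s);
	}
	Hash.showWord(10.00/100);  // print the most frequent 10% of the distinct words
	return 0;
}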
