[C/C++ 07] Word Frequency Count

I. Problem

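Read a short English passage, discard non-key words such as prepositions, conjunctions, articles, adverbs, and pronouns, and count how many times each remaining word occurs. Then print the top 5 words, ordered by descending count, with ties broken by ascending character order of the word.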

II. Algorithm

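1. Create an fstream object from the <fstream> library and read the file's characters into memory.

2. Split the in-memory string on spaces, punctuation, and newlines. If a resulting word is one that should be counted, store it in a map container, which keeps the per-word counts.

3. Copy the counts from the map into a vector, sort the vector, and print the entries in the required order.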
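
Steps 1 and 2 can also be sketched without a fixed-size buffer, by reading from any input stream one character at a time. This is a minimal sketch rather than the code used in this post: the name `countWords`, the `stopWords` parameter, and the choice to treat every non-letter as a delimiter are assumptions made for illustration.

```cpp
#include <cctype>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical helper: counts lowercased words read from `in`, skipping stop words.
// Assumption: any non-letter character ends the current word.
std::map<std::string, int> countWords(std::istream& in,
                                      const std::unordered_set<std::string>& stopWords)
{
    std::map<std::string, int> counts;
    std::string word;

    // Record the word collected so far (if any) and start a new one.
    auto flush = [&]() {
        if (!word.empty() && stopWords.count(word) == 0)
            ++counts[word];
        word.clear();
    };

    char ch;
    while (in.get(ch))
    {
        unsigned char c = static_cast<unsigned char>(ch);
        if (std::isalpha(c))
            word += static_cast<char>(std::tolower(c));
        else
            flush();
    }
    flush();    // the input may end in the middle of a word
    return counts;
}

// Convenience overload (also hypothetical) for text that is already in memory.
std::map<std::string, int> countWords(const std::string& text,
                                      const std::unordered_set<std::string>& stopWords)
{
    std::istringstream in(text);
    return countWords(in, stopWords);
}
```

The stream overload accepts a `std::ifstream` opened on the test file as well as a `std::istringstream`. Because every non-letter is a delimiter in this sketch, a contraction such as "don't" is split into "don" and "t"; whether that is acceptable depends on the text being analyzed.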

III. Code

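#define _CRT_SECURE_NO_WARNINGS 1	// silences MSVC's warning about strtok

#include <iostream>
#include <fstream>
#include <cstring>	// strtok
#include <cctype>	// tolower
#include <string>
#include <map>
#include <vector>
#include <iterator>	// back_inserter
#include <algorithm>
using namespace std;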

// Words excluded from the count: prepositions, conjunctions, articles, adverbs, pronouns,
// plus a few forms of "be", question words, and modal verbs
vector<string> g_delWord = {
	"to", "in", "on", "for", "of", "from", "between", "behind", "by", "about", "at", "with", "than",
	"a", "an", "the", "this", "that", "there",
	"and", "but", "or", "so", "yet",
	"often", "very", "then", "therefore",
	"i", "you", "we", "he", "she", "my", "your", "his", "her", "our", "us", "it", "they", "them",
	"am", "is", "are", "was", "were", "be",
	"when", "where", "who", "what", "how",
	"will", "would", "can", "could"
};

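// Functor: orders entries by descending count
struct Compare
{
	// Take the pairs by const reference to avoid copying the strings on every comparison
	bool operator()(const pair<string, int>& e1, const pair<string, int>& e2) const
	{
		return e1.second > e2.second;
	}
};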

int main()
{
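	// 1. Read the file
	//    ofstream: write to a file
	//    ifstream: read from a file
	//    fstream:  read and write
	fstream f;

	//    ios::in      open for reading
	//    ios::out     open for writing
	//    ios::app     append; used together with ios::out
	//    ios::trunc   discard existing contents; used together with ios::out
	//    ios::binary  open in binary mode
	f.open("./test1.txt", ios::in);
	if (!f.is_open())
	{
		cout << "file open failed!" << endl;
		return 1;
	}
	char text[4096] = { 0 };
	// Read at most sizeof(text) - 1 bytes so the buffer stays null-terminated for strtok;
	// anything beyond that is silently truncated
	f.read(text, sizeof(text) - 1);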

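	// 2. Split the text into words and count them in a map
	map<string, int> wordMap;
	const char* cut = " ,.!?;:\t\r\n";	// word delimiters (not exhaustive)
	char* w = strtok(text, cut);
	while (w)
	{
		string word = w;

		// Convert the word to lowercase; go through unsigned char because
		// passing a negative char value to tolower is undefined behavior
		string lwrWord;
		transform(word.begin(), word.end(), back_inserter(lwrWord),
			[](unsigned char c) { return static_cast<char>(::tolower(c)); });

		// Skip words that are excluded from the count
		if (find(g_delWord.begin(), g_delWord.end(), lwrWord) == g_delWord.end())
		{
			// map::operator[] inserts the key with a zero count if it is absent and
			// returns a reference to the mapped value, so ++ covers new and existing words
			wordMap[lwrWord]++;
		}

		w = strtok(NULL, cut);
	}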

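	// 3. Sort by frequency
	vector<pair<string, int>> wordVec;
	for (auto& e : wordMap)
	{
		wordVec.push_back(e);
	}

	// std::sort is not guaranteed to be stable (it is typically implemented as introsort).
	// stable_sort keeps equal-count words in their incoming order, and map iterates keys
	// in ascending order, so ties end up sorted by word
	stable_sort(wordVec.begin(), wordVec.end(), Compare());

	// Print at most 5 entries; the text may contain fewer than 5 distinct words
	for (size_t i = 0; i < wordVec.size() && i < 5; ++i)
	{
		cout << "<" << wordVec[i].first << "," << wordVec[i].second << ">" << endl;
	}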

	return 0;
}
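
The program above gets its secondary ordering indirectly: `map` iterates keys in ascending order, and `stable_sort` preserves that order among words with equal counts. The same ordering can be written directly into the comparator, so that it no longer depends on the input order or on a stable sort. The following is a minimal sketch under that assumption; `topWords` and its parameter `n` are hypothetical names, and `std::partial_sort` is swapped in for `stable_sort` because only the first few entries are needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns up to n (word, count) pairs,
// ordered by descending count, then by ascending word.
std::vector<std::pair<std::string, int>> topWords(const std::map<std::string, int>& counts,
                                                  std::size_t n)
{
    std::vector<std::pair<std::string, int>> v(counts.begin(), counts.end());
    n = std::min(n, v.size());    // never request more entries than exist

    // Only the first n positions need to end up in their final order.
    std::partial_sort(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(n), v.end(),
        [](const std::pair<std::string, int>& a, const std::pair<std::string, int>& b) {
            if (a.second != b.second)
                return a.second > b.second;    // higher count first
            return a.first < b.first;          // tie: ascending word order
        });

    v.resize(n);
    return v;
}
```

Calling `topWords(wordMap, 5)` would replace the copy loop, the sort, and the bounds check in step 3. For short texts the difference from a full sort is negligible; the main benefit is that the tie-breaking rule is visible in one place.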

IV. Test

Test document test1.txt

No one can help others as much as you do. 
No one can express himself like you. 
No one can express what you want to convey. 
No one can comfort others in your own way. 
No one can be as understanding as you are. 
No one can feel happy, carefree, and no one can smile as much as you do. 
In a word, no one can show your features to anyone else.
hi, how are you? I love you!

Output

For the test document above, the program prints:

<no,8>
<one,8>
<as,6>
<do,2>
<express,2>

