Word Frequency Counting (PTA)

17111_Chaochao1984a

Published 2024-06-11 08:16:03



Article link: https://blog.csdn.net/Chaochao1984a/article/details/139586075


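Word frequency counting (Term Frequency Analysis, or TFA for short) is a text analysis method that counts how often each word occurs in a text. In Python, it can be implemented with the standard library's collections module or with third-party libraries such as nltk. The following is a simple example of word frequency counting in Python: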

Example: word frequency counting

1. Prepare the text

Suppose we have the following text:

text = "Hello world! Hello universe. This is a test text for word frequency analysis. Hello again."

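2. Clean and tokenize

First, we clean the text: convert it to lowercase and remove punctuation. Then we tokenize it, that is, split it into a list of words.

import re
# Convert to lowercase and remove punctuation with a regular expression
cleaned_text = re.sub(r'[^\w\s]', '', text.lower())
# Tokenize: split on whitespace
words = cleaned_text.split()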
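One detail of the regular expression above is that it also deletes apostrophes and hyphens, so "don't" becomes "dont". As a hedged alternative, the sketch below extracts tokens with re.findall instead of deleting punctuation. The helper name tokenize, the sample sentence, and the ASCII-only pattern are illustrative assumptions, not part of the original example.

```python
import re

def tokenize(text):
    """Hypothetical helper: lowercase the text and return runs of ASCII
    letters/digits, allowing apostrophes inside a word (e.g. "don't")."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

# Sample sentence chosen for illustration only
sample = "Hello world! Don't panic, it's only a test."
tokens = tokenize(sample)
print(tokens)
```

Whether contractions should stay intact depends on the task; for the sample text in this article, both approaches produce the same word list.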

3. Count word frequencies

Use collections.Counter to count how many times each word occurs.

from collections import Counter
word_counts = Counter(words)
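To make clear what Counter does here, the following minimal sketch tallies the same kind of list by hand with a plain dict and compares the result. The short word list and the name manual_counts are assumptions made for illustration.

```python
from collections import Counter

# Small illustrative word list (not the article's full text)
words = ["hello", "world", "hello", "universe", "hello"]

# Manual tally: the same bookkeeping Counter performs, written out with a dict
manual_counts = {}
for word in words:
    manual_counts[word] = manual_counts.get(word, 0) + 1

word_counts = Counter(words)

print(manual_counts)
# Unlike a plain dict, a Counter returns 0 for a word it has never seen
print(word_counts["hello"], word_counts["missing"])
```

Counter is a dict subclass, so the result supports the usual dictionary operations while saving the explicit loop.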

4. Print the results

Print the word frequency counts.

for word, count in word_counts.items():
    print(f"{word}: {count}")
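Iterating over items() yields the words in the order they were first seen, not by frequency. If a ranking is wanted, Counter.most_common can be used instead. The sketch below assumes a small illustrative word list and a cutoff of two entries.

```python
from collections import Counter

# Small illustrative word list (not the article's full text)
words = ["hello", "world", "hello", "universe", "hello", "world"]
word_counts = Counter(words)

# most_common(n) returns the n most frequent (word, count) pairs,
# ordered from highest to lowest count
top_two = word_counts.most_common(2)
for word, count in top_two:
    print(f"{word}: {count}")
```

Calling most_common() without an argument returns every word, sorted by descending count.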

Complete code

Combining the steps above gives the complete word frequency counting code:

import re
from collections import Counter
# Original text
text = "Hello world! Hello universe. This is a test text for word frequency analysis. Hello again."
# Clean the text and tokenize it
cleaned_text = re.sub(r'[^\w\s]', '', text.lower())
words = cleaned_text.split()
# Count word frequencies
word_counts = Counter(words)
# Print the word frequency results
for word, count in word_counts.items():
    print(f"{word}: {count}")