Stop words, length filtering, and punctuation removal in Python word-frequency analysis

EaSoNgo111

Last modified 2023-04-10 17:07:48


First published 2023-04-10 17:07:31

Article link: https://blog.csdn.net/EaSoNgo111/article/details/130064651


# coding=utf-8
import collections
import string

import yake
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')


def word_count(items):
    # Counter maps each word or phrase to its number of occurrences.
    return collections.Counter(items)


def phrase_extract(text):
    text = text.lower()
    # Ask YAKE for up to 100 English keywords and keep only multi-word phrases.
    custom_kw_extractor = yake.KeywordExtractor(top=100, lan="en")
    keywords = custom_kw_extractor.extract_keywords(text)
    phrase_list = []
    for keyword, score in keywords:
        if len(keyword.split(' ')) > 1:
            phrase_list.append(keyword.lower())
    # Repeat each phrase once per occurrence so Counter can tally it later.
    phrases_list = []
    for phrase in phrase_list:
        phrases_list.extend([phrase] * text.count(phrase))
    return phrases_list


def words_count(text):
    phrases_list = phrase_extract(text)
    # Split on any whitespace so newlines and tabs also separate words.
    word_list = text.split()
    # Strip punctuation, then keep purely alphabetic tokens.
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in word_list]
    tokens = [word for word in tokens if word.isalpha()]
    # Drop English stop words; the NLTK list is lowercase, so compare in lowercase.
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w.lower() not in stop_words]
    # Drop single-character tokens.
    tokens = [word for word in tokens if len(word) > 1]
    tokens = tokens + phrases_list
    return word_count(tokens)

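The code above analyzes the input text and returns a dictionary that maps each word and phrase to its frequency. The steps in detail:

1. phrase_extract(text) uses the yake module to extract keywords from the lowercased text, keeps only the multi-word phrases, and returns a list in which each phrase appears once per occurrence in the text.
2. words_count(text) splits the original text into a word list with split(), removes punctuation with a translation table built by str.maketrans(), and uses isalpha() to keep only words made up entirely of letters.
3. Next, it takes the English stop-word list from the stopwords module of the nltk library, filters out every word that appears in that list, and discards words of length 1.
4. Finally, the phrase list from step 1 is concatenated with the remaining words, and the combined list is passed to word_count, which returns a dictionary of word and phrase frequencies.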
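
The filtering and counting steps above can be tried out without any third-party packages. The sketch below is only an illustration under stated assumptions: DEMO_STOP_WORDS is a small hand-written stand-in for the NLTK stop-word list, the phrases are passed in by hand instead of coming from yake, and the function names filter_tokens and count_words_and_phrases are made up for this example and are not part of the original code.

```python
import collections
import string

# Assumption: a tiny hand-written stop-word set standing in for NLTK's English list.
DEMO_STOP_WORDS = {"the", "a", "is", "of", "and", "in"}


def filter_tokens(text, stop_words=DEMO_STOP_WORDS):
    """Split on whitespace, strip punctuation, and drop unwanted tokens."""
    table = str.maketrans("", "", string.punctuation)
    tokens = [w.translate(table) for w in text.split()]
    tokens = [w for w in tokens if w.isalpha()]                  # letters only
    tokens = [w for w in tokens if w.lower() not in stop_words]  # no stop words
    return [w for w in tokens if len(w) > 1]                     # no single letters


def count_words_and_phrases(text, phrases):
    """Count the filtered words plus each occurrence of the given phrases."""
    lowered = text.lower()
    phrase_hits = []
    for phrase in phrases:
        # str.count returns the number of non-overlapping occurrences.
        phrase_hits.extend([phrase] * lowered.count(phrase))
    return collections.Counter(filter_tokens(text) + phrase_hits)


sample = "Machine learning is fun. The machine learning of a model, in 2023!"
counts = count_words_and_phrases(sample, ["machine learning"])
print(counts)
```

Note that single words keep their original case here, as in the code above, so "Machine" and "machine" are counted separately; lowercasing the text before splitting would merge them.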
