Python: counting Chinese word frequencies, with word segmentation and stop-word removal

Latest recommended article published on 2023-02-07 12:15:29

songhao8080

Article link: https://blog.csdn.net/songhao8080/article/details/103670179

Python

# coding: utf-8
import jieba

text = '''新乡SEO 昊天 seo 168seo.cn 免费分享最新的SEO技术,本站的目的是与同行交流SEO知识,并提供企业网站优化、企业网站诊断等服务,白帽SEO从我做起,专注用户体验研究'''

# Segment the text in search-engine mode
seg_list = jieba.cut_for_search(text)
data = list(seg_list)  # cut_for_search returns a generator; convert it to a list

# Read the stop words (one per line) into a set for fast lookup
with open('stopwords.txt', 'r', encoding='gbk') as f:
    stopwords = {line.rstrip() for line in f}

# Remove the stop words
data = [d for d in data if d not in stopwords]

# Build a dict keyed by token, with every count starting at 0
c = dict.fromkeys(data, 0)
for x in data:
    c[x] += 1  # count occurrences

# Sort by frequency, most frequent first
newc = sorted(c.items(), key=lambda x: x[1], reverse=True)
print(newc)
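The counting and sorting steps above can also be written with `collections.Counter` from the standard library. The following is only a sketch of that alternative, not the original author's method. The helper name `count_tokens` and the sample token list are hypothetical, and skipping whitespace-only tokens is an added assumption, since the segmenter can emit the spaces in the input as tokens of their own.

```python
from collections import Counter


def count_tokens(tokens, stopwords):
    """Count token frequencies, skipping stop words and whitespace-only tokens.

    `count_tokens` is a hypothetical helper, not part of jieba.
    """
    kept = (t for t in tokens if t.strip() and t not in stopwords)
    # most_common() returns (token, count) pairs, highest count first
    return Counter(kept).most_common()


# Hypothetical pre-segmented tokens, standing in for list(jieba.cut_for_search(text))
tokens = ["SEO", " ", "技术", "的", "SEO", "知识", "的", "SEO"]
stopwords = {"的"}

result = count_tokens(tokens, stopwords)
print(result)
```

In real use, `tokens` would come from `jieba.cut_for_search(text)` and `stopwords` from the stop-word file, as in the script above.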