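stopwords.txt Chinese and English datasets: Sichuan University Machine Intelligence Laboratory stopword list, HIT (Harbin Institute of Technology) stopword list, Chinese stopword list, Baidu stopword list, Baidu Netdisk download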


woshishui68892


Article link: https://blog.csdn.net/woshishui68892/article/details/108203121


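Today I spent a long time hunting for a stopwords.txt dataset, and it was infuriating: many of the downloads cost coins, when a dataset like this really ought to be shared freely. So I collected several lists myself, mainly the Sichuan University Machine Intelligence Laboratory stopword list, the HIT stopword list, the Chinese stopword list, the Baidu stopword list, and some other stopword text files. Finally, I used Python to merge them into one complete dataset, stopword.txt.
Baidu Netdisk link: https://pan.baidu.com/s/1KBkOzYk-wRYaWno6HSOE9g (extraction code: 4sm6)
The files are small, so they can be downloaded directly. The detailed directory listing is below.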
[Image: directory listing of the downloaded files]
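
Once downloaded, the merged stopword.txt can be loaded into a set and used to filter tokens. The following is only a minimal sketch, assuming the file is UTF-8 with one stopword per line; `load_stopwords` and `remove_stopwords` are hypothetical helper names, not part of the download.

```python
def load_stopwords(path):
    """Read one stopword per line into a set, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]
```

For example, `remove_stopwords(tokens, load_stopwords("stopword.txt"))` would drop every token that appears in the list. A set is used because membership tests stay fast even when the merged list holds thousands of entries.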

The Python code for merging the txt files is below; it is also included in the Baidu Netdisk download.

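# -*- coding: utf-8 -*-
# Author  : Administrator
# Created : 2020/8/24 16:33
# File    : stopword.py
# IDE     : PyCharm
import os

# Merge the stopword text files
mergefiledir = os.path.join(os.getcwd(), 'stopwords-master')   # folder holding the source lists
filenames = os.listdir(mergefiledir)

# Write with an explicit encoding and close the file before it is read back
with open('stopwords.txt', 'w', encoding='utf-8') as file:
    for filename in filenames:
        filepath = os.path.join(mergefiledir, filename)
        print(filepath)
        # Encoding problems may show up here if a list is not UTF-8; so far it runs without issues
        with open(filepath, 'r', encoding='utf-8') as src:
            for line in src:
                file.write(line)
        file.write('\n')   # separate files whose last line has no trailing newline

# Remove duplicates
# NOTE: the original listing is cut off at "for line"; the loop below is a
# reconstruction (an assumption) of the deduplication step it starts.
with open('stopwords.txt', 'r', encoding='utf-8') as lines, \
        open('stopword.txt', 'w', encoding='utf-8') as newfile:
    new = []
    for line in lines:
        if line not in new:
            new.append(line)
            newfile.write(line)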
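
The comment in the script above warns that encoding problems may appear when reading the source lists. A more defensive variant of the merge might look like the sketch below. It is an illustration under stated assumptions, not the author's code: the function names `read_lines` and `merge_stopwords` are hypothetical, and the GB18030 fallback is only a guess at what a non-UTF-8 Chinese list would use. It also swaps the list-based duplicate check for dictionary keys, which keep first-seen order while making each lookup cheap.

```python
import os


def read_lines(path, encodings=("utf-8-sig", "gb18030")):
    """Return the lines of a text file, trying each encoding in turn."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read().splitlines()
        except UnicodeDecodeError:
            continue  # assumption: a failed decode means the wrong encoding
    raise ValueError(f"could not decode {path} with any of {encodings}")


def merge_stopwords(src_dir, out_path):
    """Merge every file in src_dir into out_path, one unique word per line."""
    words = {}  # dict keys keep insertion order, so first occurrences win
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        for line in read_lines(path):
            word = line.strip()
            if word:  # drop blank lines
                words.setdefault(word, None)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(words) + "\n")
    return len(words)
```

Calling `merge_stopwords("stopwords-master", "stopword.txt")` would write the deduplicated list and return the number of unique entries. Note that stripping each line also discards entries made up only of whitespace, which may or may not be wanted for a given stopword list.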
