Data preprocessing with jieba: tokenization, filtering stopwords and punctuation, and extracting word frequencies and keywords

First, clean up the stopword list: drop empty lines and strip the whitespace from both ends of each entry.

#encoding=utf-8
filename = "stop_words.txt"

# Read the raw stopword file, skipping blank lines and
# stripping whitespace from both ends of each entry.
result = []
with open(filename, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        result.append(line)

# Write the cleaned stopwords back out, one per line.
with open("stop_words2.txt", "w", encoding="utf-8") as fw:
    for word in result:
        fw.write(word + "\n")
print("end")

Next, tokenize the corpus and filter out stopwords (including punctuation).

#encoding=utf-8
import jieba

filename = "../data/1000页洗好2.txt"
stopwords_file = "../data/stop_words2.txt"

# Load the cleaned stopword list; a set makes the
# membership test below O(1) instead of O(n).
stop_words = set()
with open(stopwords_file, "r", encoding="utf-8") as stop_f:
    for line in stop_f:
        line = line.strip()
        if not line:
            continue
        stop_words.add(line)

print(len(stop_words))

# Segment each line with jieba in accurate mode, then drop
# stopwords (the list also covers punctuation) and tabs.
result = []
with open(filename, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        outstr = ""
        for word in jieba.cut(line, cut_all=False):
            if word not in stop_words and word != "\t":
                outstr += word + " "
        result.append(outstr.strip())

with open("../data/test2.txt","w",encoding='utf-8') as fw:
    for sentence in result:
        sentence.encode('utf-8'
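
With the segmented text on disk, word frequencies fall out of a simple counter. Below is a minimal sketch using the standard library's collections.Counter, assuming the test2.txt file written in the previous step.

#encoding=utf-8
from collections import Counter

# Count word frequencies over the segmented output from the
# previous step; each line is already space-separated words.
counter = Counter()
with open("../data/test2.txt", "r", encoding="utf-8") as f:
    for line in f:
        counter.update(line.split())

# Show the 20 most frequent words with their counts.
for word, freq in counter.most_common(20):
    print(word, freq)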
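
For keyword extraction, jieba ships its own extractors in jieba.analyse. The following is a sketch, not the only approach, reusing the corpus and cleaned stopword paths from above; extract_tags ranks terms by TF-IDF, and textrank is a bundled alternative.

#encoding=utf-8
import jieba.analyse

# Point jieba.analyse at the same cleaned stopword list so the
# extractors ignore those words too.
jieba.analyse.set_stop_words("../data/stop_words2.txt")

with open("../data/1000页洗好2.txt", "r", encoding="utf-8") as f:
    text = f.read()

# TF-IDF keywords: the top 20 terms and their weights.
for word, weight in jieba.analyse.extract_tags(text, topK=20, withWeight=True):
    print(word, weight)

# TextRank is an alternative extractor bundled with jieba.
print(jieba.analyse.textrank(text, topK=20))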