Install the jieba ("结巴") segmenter before use. (This way of classifying is just one idea, but the idea can be put to quite a few other uses.) Processing a million keywords takes only minutes.

Code below. If there are bugs, please help fix them; a wrapped-up version will follow this weekend. Put the keywords to classify in ceshi.txt. Results are written to text.txt in the format: category:keyword
#coding:utf-8
#by@qiuye
import jieba
import jieba.analyse

# Extract the topK highest-weighted words from the whole file;
# these act as the category labels.
f1 = open('ceshi.txt', 'r')
s1 = f1.read()
f1.close()
tags = jieba.analyse.extract_tags(s1, topK=20)

# Segment each keyword line on its own and join the tokens with '|',
# so a keyword can later be matched token by token against the tags.
# (l3 was undefined in the original paste; this per-line segmentation
# reconstructs what the surrounding code expects it to hold.)
f2 = open('ceshi.txt', 'r')
l3 = ['|'.join(jieba.cut(line.strip())) for line in f2 if line.strip()]
f2.close()

# A keyword belongs to a category when the category word
# appears among its segmented tokens.
l4 = []
for word in tags:
    for seg in l3:
        if word in seg.split('|'):
            l4.append(word + ':' + seg)

# Write "category:keyword" lines, stripping the '|' separators
# to restore the original keyword text.
f3 = open('text.txt', 'w')
for item in l4:
    f3.write(item.replace('|', '') + '\n')
f3.close()
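The matching step itself does not depend on jieba: once the keywords are segmented, classification is just a token-membership test. Here is a minimal stdlib-only sketch of that step, with hand-segmented token lists standing in for `jieba.cut` output and made-up example keywords and tags (the `classify` helper and all data below are hypothetical, not part of the original script):

```python
# Sketch of the category-matching step; jieba's segmenter is replaced
# by hand-segmented token lists so the example is self-contained.

def classify(tags, segmented_keywords):
    """Pair each tag with every keyword containing it as a token.

    tags: category words (stand-in for jieba.analyse.extract_tags output).
    segmented_keywords: one token list per keyword (stand-in for jieba.cut).
    Returns lines in the post's 'category:keyword' output format.
    """
    results = []
    for tag in tags:
        for tokens in segmented_keywords:
            if tag in tokens:
                # Re-join the tokens to recover the original keyword text
                # (mirrors the '|'-strip in the script above).
                results.append(tag + ':' + ''.join(tokens))
    return results

# Hypothetical data: three keywords, already segmented into tokens.
keywords = [
    ['cheap', 'flight', 'tickets'],
    ['flight', 'status'],
    ['hotel', 'deals'],
]
tags = ['flight', 'hotel']

for line in classify(tags, keywords):
    print(line)
```

Note that one keyword can match several tags, so a keyword may appear under more than one category in text.txt.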