Chinese Word Segmentation with a Stop-Word List

0KI0

Article link: https://blog.csdn.net/qq_39477242/article/details/79885065


import jieba

# Path to the stop-word list
filepathj = "C:/Users/Administrator/Desktop/junkwords.txt"
# Path to the corpus
filepathji = "E:/2018_taidibei/code/why2.txt"

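# Load the stop-word list from a file (one word per line) into a Python list
def stopwordslist(filepathj):
    # encoding='utf-8' is an assumption; change it if the file is saved as GBK
    with open(filepathj, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

# Load the stop words once so that the functions below can use them
stopwords = stopwordslist(filepathj)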

# Segment a sentence and return the tokens that are not stop words, as a list
def seg_sentence(sentence):
    sentence_seged = jieba.cut(sentence.strip())
    seg_words = []
    for word in sentence_seged:
        # Skip stop words and tab characters
        if word not in stopwords and word != '\t':
            seg_words.append(word)
    return seg_words

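# Segment a sentence and return the kept tokens as one string, each followed by a space.
# This variant has its own name so that it does not override the list version above.
def seg_sentence_str(sentence):
    sentence_seged = jieba.cut(sentence.strip())
    outstr = ''
    for word in sentence_seged:
        # Skip stop words and tab characters
        if word not in stopwords and word != '\t':
            outstr += word
            outstr += " "
    return outstr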

# Example
# sentence = '高速公路上驾驶会扣多少分？'
# Without the stop-word list: 高速公路上驾驶会扣多少分
# With the stop-word list, as a list: ['高速公路', '驾驶', '会扣', '分']
# With the stop-word list, as a string: '高速公路 驾驶 会扣 分 '
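
The filtering step above can also be sketched in a form that runs without the author's files. This is a minimal sketch under stated assumptions, not the author's code: `load_stopwords`, `remove_stopwords`, and `segment` are hypothetical helper names, UTF-8 is an assumed file encoding, and the tokenizer is passed in as a parameter so that the caller can supply `jieba.cut`. It keeps the stop words in a set, so each membership check is constant-time on average, and it drops every whitespace-only token, which is slightly broader than the tab check above.

```python
from typing import Callable, Iterable, List, Set


def load_stopwords(path: str, encoding: str = "utf-8") -> Set[str]:
    """Read one stop word per line into a set, skipping blank lines.

    The UTF-8 default is an assumption; pass another encoding if needed.
    """
    with open(path, "r", encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Keep tokens that are neither stop words nor whitespace-only."""
    return [t for t in tokens if t not in stopwords and t.strip()]


def segment(
    sentence: str,
    tokenizer: Callable[[str], Iterable[str]],
    stopwords: Set[str],
) -> List[str]:
    """Tokenize a sentence with the given tokenizer, then filter stop words."""
    return remove_stopwords(tokenizer(sentence.strip()), stopwords)
```

With jieba installed, `segment(sentence, jieba.cut, load_stopwords(filepathj))` would give the list form, and `' '.join(...)` over that list gives the string form without a trailing space.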
