pkuseg model configuration, simple file segmentation, and stop-word removal

普通且自信66

https://github.com/lancopku/pkuseg-python
Installation, downloads, and other setup steps are not covered here.

Model configuration

pkuseg.pkuseg(model_name = "default", user_dict = "default", postag = False)
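	model_name		Model path.
				"default", the default; uses the pretrained mixed-domain model provided by pkuseg (only for users who installed via pip).
				"news", uses the news-domain model.
				"web", uses the web-domain model.
				"medicine", uses the medicine-domain model.
				"tourism", uses the tourism-domain model.
				model_path, loads a model from a user-specified path.
	user_dict		Sets the user dictionary.
				"default", the default; uses the dictionary provided by pkuseg.
				None, uses no dictionary.
				dict_path, uses a custom user dictionary in addition to the default one; pass the path to your own dictionary file, formatted as one word per line.
	postag			Whether to perform part-of-speech tagging.
				False, the default; segmentation only, no part-of-speech tagging.
				True, performs part-of-speech tagging along with segmentation.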
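
To make the parameters above concrete, here is a minimal sketch of how they might be combined. The helper names (`write_user_dict`, `build_segmenter`, `demo`), the file name `my_dict.txt`, and the sample words are all hypothetical and only for illustration; the only library call used is `pkuseg.pkuseg` with the three parameters listed above.

```python
from pathlib import Path


def write_user_dict(words, path):
    """Write a user dictionary in the one-word-per-line format described above."""
    # Strip whitespace, drop empty entries, and de-duplicate while keeping order.
    unique = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


def build_segmenter(model_name="default", user_dict="default", postag=False):
    """Create a segmenter with the given configuration."""
    import pkuseg  # imported lazily; requires `pip install pkuseg`
    return pkuseg.pkuseg(model_name=model_name, user_dict=user_dict, postag=postag)


def demo():
    # Hypothetical dictionary file and sample words.
    write_user_dict(["北京大学", "天安门"], "my_dict.txt")
    seg = build_segmenter(user_dict="my_dict.txt", postag=True)
    # With postag=True, each item returned by cut() should be a (word, tag) pair.
    print(seg.cut("我爱北京天安门"))
```

Calling `demo()` loads the default model together with the custom dictionary. Passing a domain name such as "news" as `model_name` selects that domain's model instead.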

Segmenting a file

pkuseg.test(readFile, outputFile, model_name = "default", user_dict = "default", postag = False, nthread = 10)
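	readFile		Input file path.
	outputFile		Output file path.
	model_name		Model path. Same as in pkuseg.pkuseg.
	user_dict		Sets the user dictionary. Same as in pkuseg.pkuseg.
	postag			Whether to enable part-of-speech tagging. Same as in pkuseg.pkuseg.
	nthread			Number of processes to start when running the test.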
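
A rough sketch of calling `pkuseg.test` and reading the result back might look like the following. The wrapper `segment_file`, the parser `read_segmented`, and the file names are hypothetical. The parser also assumes the output file holds space-separated tokens, one input line per output line, with `word/tag` tokens when `postag` is on; that is an assumption about the output format, so check it against a real output file before relying on it.

```python
def segment_file(read_file, output_file, nthread=4, **options):
    """Thin wrapper around pkuseg.test; extra options are passed through unchanged."""
    import pkuseg  # imported lazily; requires `pip install pkuseg`
    pkuseg.test(read_file, output_file, nthread=nthread, **options)


def read_segmented(path, postag=False):
    """Parse a segmented file (assumed format: space-separated tokens per line)."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if postag:
                # Assumed token shape with tagging enabled: word/tag
                tokens = [tuple(t.rsplit("/", 1)) for t in tokens]
            rows.append(tokens)
    return rows


def demo():
    # Hypothetical file names.
    segment_file("input.txt", "output.txt", nthread=4)
    print(read_segmented("output.txt")[:3])


# Because nthread starts multiple processes, it is safest to call demo()
# from under an `if __name__ == "__main__":` guard in a script.
```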

Code example


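import pkuseg
from collections import Counter
import pprint

# Read the text to be segmented
with open('first2.txt', encoding='utf-8') as f:
    content = f.read()

# Load the model with the default configuration plus a user dictionary.
# Words listed in t01.txt (one per line) should be kept intact during
# segmentation; pkuseg reads the file itself, so only the path is passed.
seg = pkuseg.pkuseg(user_dict='t01.txt')

# Segment the text; cut() returns a list of words
text = seg.cut(content)

# Load the stop words into a set, one per line. Testing membership against
# the raw file contents (a single string) would be a substring match and
# would wrongly drop any word that appears inside a longer stop word.
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f if line.strip()}

# Remove stop words
new_text = [w for w in text if w not in stopwords]
print(new_text)

# Optional: count word frequencies
# counter = Counter(new_text)
# pprint.pprint(counter.most_common(20))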
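
The stop-word filtering and the frequency count can also be tried without any files or a loaded model. The sketch below is only an illustration: `load_stopwords` and `top_words` are hypothetical helper names, and the token list is made-up sample data standing in for the output of `seg.cut`.

```python
from collections import Counter


def load_stopwords(lines):
    """Build a stop-word set from an iterable of lines (one word per line)."""
    return {line.strip() for line in lines if line.strip()}


def top_words(tokens, stopwords, n=20):
    """Drop stop words and whitespace-only tokens, then return the n most common words."""
    kept = [w for w in tokens if w.strip() and w not in stopwords]
    return Counter(kept).most_common(n)


# Made-up tokens standing in for the result of seg.cut(content)
tokens = ["我", "爱", "北京", "的", "天安门", "北京"]
stopwords = load_stopwords(["的\n", "我\n"])
print(top_words(tokens, stopwords, n=2))  # [('北京', 2), ('爱', 1)]
```

Using a set keeps each lookup an exact match. To use this with real data, pass `load_stopwords` an open stop-word file and pass `top_words` the list returned by `seg.cut`.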