1. Stopwords
Chinese has a class of words that carry little meaning on their own, such as the particle 的, the conjunction 以及, the adverb 甚至, and the modal particle 吧. These are called stopwords. A sentence with its stopwords removed is still understandable. What counts as a stopword varies by task: in a website system, for example, banned sensitive words may also be treated as stopwords. Stopword filtering is therefore a common preprocessing step.
2. Implementation approach
Since the stopword dictionary contains single-character words, a double-array trie is the more economical way to store it:
def load_from_file(path):
    """
    Load a DoubleArrayTrie from a dictionary file.
    :param path: path to the dictionary file
    :return: the double-array trie
    """
    tree_map = JClass('java.util.TreeMap')()  # create a TreeMap instance
    with open(path, encoding='utf-8') as src:
        for word in src:
            word = word.strip()  # strip the trailing '\n' that Python reads in
            tree_map[word] = word
    return JClass('com.hankcs.hanlp.collection.trie.DoubleArrayTrie')(tree_map)

def load_from_words(*words):
    """
    Build a double-array trie from individual words.
    :param words: any number of words
    :return: the double-array trie
    """
    tree_map = JClass('java.util.TreeMap')()  # create a TreeMap instance
    for word in words:
        tree_map[word] = word
    return JClass('com.hankcs.hanlp.collection.trie.DoubleArrayTrie')(tree_map)
Loading the stopwords yields a double-array trie. Then iterate over the terms in the segmentation result and drop every term that appears in the trie:
def remove_stopwords_termlist(termlist, trie):
    return [term.word for term in termlist if not trie.containsKey(term.word)]
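The same filtering can be sketched in plain Python, with a set standing in for the trie's containsKey lookup; the function name below is illustrative, not part of HanLP:

```python
def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

print(remove_stopwords(["停用", "词", "的", "意义", "吧"], {"的", "吧"}))
# → ['停用', '词', '意义']
```

A Python set gives the same O(1)-per-token membership test; the double-array trie earns its keep in the next step, where substring search over unsegmented text is needed.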
In the sensitive-word filtering scenario, the sensitive words usually need to be replaced with a special string such as ***. You can segment the text first and then replace, or skip segmentation entirely and use the trie's search interface to locate and replace the sensitive words directly:
def replace_stopwords_text(text, replacement, trie):
    searcher = trie.getLongestSearcher(JString(text), 0)
    offset = 0
    result = ''
    while searcher.next():
        begin = searcher.begin
        end = begin + searcher.length
        if begin > offset:
            result += text[offset:begin]  # copy the text between matches
        result += replacement
        offset = end
    if offset < len(text):
        result += text[offset:]  # copy the tail after the last match
    return result
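For intuition, the longest-match scan can be sketched in pure Python over a plain word set. The control flow mirrors what getLongestSearcher does, though the double-array trie answers the longest-prefix query far more efficiently; the function name here is hypothetical:

```python
def replace_longest_match(text, replacement, words):
    """At each position, try the longest candidate substring first;
    replace it if it is a known word, otherwise keep one character."""
    max_len = max((len(w) for w in words), default=0)
    result = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in words:
                result.append(replacement)
                i += length
                break
        else:
            result.append(text[i])  # no word starts here; keep the character
            i += 1
    return ''.join(result)

print(replace_longest_match("停用词的意义相对而言无关紧要吧。", "**",
                            {"的", "相对而言", "吧"}))
# → 停用词**意义**无关紧要**。
```

Trying the longest candidate first is what makes 相对而言 match as a whole instead of being left untouched when only its single characters are absent from the word set.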
3. Complete code:
from jpype import JString
from pyhanlp import *

def load_from_file(path):
    """
    Load a DoubleArrayTrie from a dictionary file.
    :param path: path to the dictionary file
    :return: the double-array trie
    """
    tree_map = JClass('java.util.TreeMap')()  # create a TreeMap instance
    with open(path, encoding='utf-8') as src:
        for word in src:
            word = word.strip()  # strip the trailing '\n' that Python reads in
            tree_map[word] = word
    return JClass('com.hankcs.hanlp.collection.trie.DoubleArrayTrie')(tree_map)

def load_from_words(*words):
    """
    Build a double-array trie from individual words.
    :param words: any number of words
    :return: the double-array trie
    """
    tree_map = JClass('java.util.TreeMap')()  # create a TreeMap instance
    for word in words:
        tree_map[word] = word
    return JClass('com.hankcs.hanlp.collection.trie.DoubleArrayTrie')(tree_map)

def remove_stopwords_termlist(termlist, trie):
    return [term.word for term in termlist if not trie.containsKey(term.word)]

def replace_stopwords_text(text, replacement, trie):
    searcher = trie.getLongestSearcher(JString(text), 0)
    offset = 0
    result = ''
    while searcher.next():
        begin = searcher.begin
        end = begin + searcher.length
        if begin > offset:
            result += text[offset:begin]  # copy the text between matches
        result += replacement
        offset = end
    if offset < len(text):
        result += text[offset:]  # copy the tail after the last match
    return result

if __name__ == '__main__':
    HanLP.Config.ShowTermNature = False
    trie = load_from_file(HanLP.Config.CoreStopWordDictionaryPath)
    text = "停用词的意义相对而言无关紧要吧。"
    segment = DoubleArrayTrieSegment()
    termlist = segment.seg(text)
    print("Segmentation result:", termlist)
    print("After stopword removal:", remove_stopwords_termlist(termlist, trie))
    trie = load_from_words("的", "相对而言", "吧")
    print("Replacement without segmentation:", replace_stopwords_text(text, "**", trie))
4. Output:
Segmentation result: [停用, 词, 的, 意义, 相对而言, 无关紧要, 吧, 。]
After stopword removal: [停用, 词, 意义, 无关紧要]
Replacement without segmentation: 停用词**意义**无关紧要**。