AllenNLP typically uses spaCy to tokenize English, but spaCy cannot segment Chinese, so I wanted to add a word_splitter for Chinese word segmentation. After comparing several Chinese segmentation packages, I narrowed the choice down to THULAC and jieba; the former is said to be more accurate and the latter faster. To favor accuracy I went with THULAC (I don't know whether there is anything better). Following the spaCy-based splitter code in AllenNLP, I wrote a THUNLPSplitter.
Test code (pos_tags controls whether to tag parts of speech; only_tokens controls whether to keep only the token text and drop the POS and other attributes; user_dict is a user-defined dictionary given as a file path):
from allennlp.data.tokenizers.word_splitter import THUNLPSplitter
from allennlp.data.tokenizers.token import show_token
splitter = THUNLPSplitter(pos_tags=False)
print(splitter.split_words("武汉市长江大桥"))
splitter2 = THUNLPSplitter(pos_tags=True, only_tokens=False)
tokens = splitter2.split_words("武汉市长江大桥")
for token in tokens:
    print(show_token(token))
splitter3 = THUNLPSplitter(pos_tags=False, user_dict='F:\\test\\userdict.txt')
print(splitter3.split_words("中美合拍,文体两开花。皮皮虾我们走"))
The results are as follows:
The user dictionary is a plain txt file with one custom word per line. I had not defined "皮皮虾" here, so it was segmented incorrectly; once it is added to the dictionary, "皮皮虾" comes out as one word.
The result after adding "皮皮虾" to the user dictionary:
However, the user dictionary does not always take effect. According to the THULAC authors, it is applied as a post-processing step: if a character of a user-defined word has already been merged into another word during segmentation, the user-defined word is not used. This feels like a drawback.
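For reference, here is a minimal sketch of building such a user dictionary (the file name and the UTF-8 encoding are my assumptions; the format is simply one word per line, and its path is what gets passed as user_dict):

# Hypothetical example: create a user dictionary with one custom word per line
# (UTF-8 is assumed); pass its path as ``user_dict``.
with open('userdict.txt', 'w', encoding='utf-8') as f:
    f.write('皮皮虾\n')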
The full code is as follows:
from typing import List

from overrides import overrides

from allennlp.data.tokenizers.token import Token
from allennlp.data.tokenizers.word_splitter import WordSplitter


@WordSplitter.register('thunlp')
class THUNLPSplitter(WordSplitter):
    """
    A ``WordSplitter`` that uses THULAC's tokenizer to split Chinese sentences.

    pos_tags : whether to tag each word with its part of speech.
    simplify : convert traditional characters to simplified characters.
    filt : remove meaningless words.
    only_tokens : keep only the token text, dropping POS and other attributes.
    user_dict : path to a txt file with one user-defined word per line.
    """
    def __init__(self,
                 pos_tags: bool = False,
                 simplify: bool = False,
                 filt: bool = False,
                 only_tokens: bool = True,
                 user_dict: str = None) -> None:
        import thulac
        # THULAC only produces POS tags when seg_only is False.
        self.thunlp = thulac.thulac(seg_only=not pos_tags,
                                    T2S=simplify,
                                    filt=filt,
                                    user_dict=user_dict)
        self._only_tokens = only_tokens
    def _sanitize(self, tokens: List[List[str]]) -> List[Token]:
        """
        Converts THULAC's (word, pos) pairs to AllenNLP ``Token``s.
        If ``only_tokens`` is True, only the text is kept.
        """
        sanitize_tokens = []
        for token_attri in tokens:
            if self._only_tokens:
                token = Token(token_attri[0])
            else:
                # token_attri is a (word, pos) pair produced by THULAC.
                token = Token(text=token_attri[0], pos_=token_attri[1])
            sanitize_tokens.append(token)
        return sanitize_tokens
    @overrides
    def batch_split_words(self, sentences: List[str]) -> List[List[Token]]:
        # ``cut`` returns all (word, pos) pairs for a sentence, so each
        # sentence is sanitized as a whole list.
        return [self._sanitize(self.thunlp.cut(sentence, text=False))
                for sentence in sentences]

    @overrides
    def split_words(self, sentence: str) -> List[Token]:
        return self._sanitize(self.thunlp.cut(sentence, text=False))
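Since the splitter is registered under the name 'thunlp', it can be plugged in wherever AllenNLP expects a word_splitter. Below is a minimal usage sketch with AllenNLP 0.x's WordTokenizer (the exact constructor arguments are assumptions based on that API version):

from allennlp.data.tokenizers import WordTokenizer

# Sketch: wrap the splitter in a WordTokenizer so DatasetReaders can use it
# just like the default spaCy-based tokenizer.
tokenizer = WordTokenizer(word_splitter=THUNLPSplitter(pos_tags=True, only_tokens=False))
print(tokenizer.tokenize("武汉市长江大桥"))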