Replacing spaCy's tokenizer with BERT's tokenizer
spaCy supports user-defined tokenizers, so it is straightforward to swap its default tokenizer for BERT's.
- Write a subclass of spacy.tokenizer.Tokenizer and pass in the BERT tokenizer when defining the custom tokenization rules
import spacy
from spacy.tokenizer import Tokenizer
from spacy.tokens import Doc

class CustomTokenizer(Tokenizer):
    def __init__(self, vocab, tokenizer):
        super().__init__(vocab)
        # Keep a reference to the BERT tokenizer
        self.bert_tokenizer = tokenizer

    def __call__(self, text):
        # Custom tokenization logic: delegate to the BERT tokenizer
        tokens = self.bert_tokenizer.tokenize(text)
        doc = Doc(self.vocab, words=tokens)
        return doc
- Load a spaCy model and the BERT tokenizer, instantiate the custom Tokenizer class, and replace spaCy's default tokenizer with it
from transformers import AutoTokenizer

nlp = spacy.load('en_core_web_lg')
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer = CustomTokenizer(nlp.vocab, bert_tokenizer)
nlp.tokenizer = tokenizer
- Test
text = "I am spiderman."
doc = nlp(text)
for token in doc:
    print(token.text)
'''output
i
am
spider
##man
.
'''
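The "##man" piece in the output comes from BERT's WordPiece algorithm: an out-of-vocabulary word is split greedily into the longest vocabulary prefix, with continuation pieces marked by "##". A minimal sketch of that greedy longest-match step, using a toy vocabulary rather than the real bert-base-uncased one:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match WordPiece split of a single lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate substring until it is found in the vocabulary
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the "##" prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, just large enough to reproduce the split above
toy_vocab = {"i", "am", "spider", "##man", "."}
print(wordpiece("spiderman", toy_vocab))  # → ['spider', '##man']
```

The real tokenizer additionally lowercases (for uncased models), splits on punctuation, and caps the number of characters per word, but the core split is this greedy loop.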