The TreebankWordTokenizer from the NLTK package has many common English tokenization rules built in. For example, it separates sentence-ending punctuation (?!.;,) from adjacent tokens, while keeping a decimal number containing a period as a single token. It also includes rules for English contractions; for example, "don't" is split into ["do", "n't"].
from nltk.tokenize import TreebankWordTokenizer

sentence = """Monticello wasn't designated as UNESCO World Heritage Site until 1987."""
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(sentence))
# ['Monticello', 'was', "n't", 'designated', 'as', 'UNESCO', 'World',
#  'Heritage', 'Site', 'until', '1987', '.']
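The decimal-number rule mentioned above can be checked in the same way. A small sketch (the example sentence is my own, not from the source): the tokenizer splits off the comma and final period as separate tokens, but keeps the period inside "19.99" intact.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Hypothetical sentence containing a decimal number and a comma.
tokens = tokenizer.tokenize("The ticket cost 19.99 dollars, cash only.")
print(tokens)
# The decimal stays whole, while ',' and '.' become their own tokens.
```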