Reference: How to avoid NLTK's sentence tokenizer splitting on abbreviations?
NLTK's built-in nltk.tokenize module can split English text into sentences, but it splits incorrectly when the text contains abbreviations:
from nltk.tokenize import sent_tokenize
sens = sent_tokenize('Fig. 2 shows a U.S.A. map.Look!')
print(sens)
"""
Output: ['Fig.', '2 shows a U.S.A. map.Look!']
"""
You can instead use nltk.tokenize.punkt and supply a custom abbreviation list when splitting sentences:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
abbreviation = ['fig', 'u.s.a']
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
sens = tokenizer.tokenize('Fig. 2 shows a U.S.A. map.Look!')
print(sens)
"""
Output: ['Fig. 2 shows a U.S.A. map.Look!']
"""
Notes:
- Entries in the custom abbreviation list must be all lowercase, with the trailing period removed
- Sentence-ending punctuation must be followed by a space, otherwise the tokenizer will fail to split there
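To illustrate the second note, here is a minimal sketch (assuming NLTK is installed) showing that inserting a space after the sentence-ending period lets the custom tokenizer split the text correctly:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
# Abbreviations: all lowercase, trailing period removed
punkt_param.abbrev_types = set(['fig', 'u.s.a'])
tokenizer = PunktSentenceTokenizer(punkt_param)

# Same text as above, but with a space after "map." so Punkt
# can recognize the sentence boundary
sens = tokenizer.tokenize('Fig. 2 shows a U.S.A. map. Look!')
print(sens)
# ['Fig. 2 shows a U.S.A. map.', 'Look!']
```

With the space present, "Fig." and "U.S.A." are still kept inside the first sentence thanks to the abbreviation list, while "map." is correctly treated as a sentence boundary.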