MWETokenizer lets you define custom multi-word expressions that are merged into single tokens:
>>> import nltk
>>> from nltk.tokenize import MWETokenizer
>>> test = 'In a little or a little bit or a lot in spite of'
>>> nltk.word_tokenize(test)
['In', 'a', 'little', 'or', 'a', 'little', 'bit', 'or', 'a', 'lot', 'in', 'spite', 'of']
>>> tokenizer = MWETokenizer([('a', 'little'), ('a', 'little', 'bit'), ('a', 'lot')], separator='_')
>>> tokenizer.add_mwe(('in', 'spite', 'of'))
>>> tokenizer.tokenize(nltk.word_tokenize(test))
['In', 'a_little', 'or', 'a_little_bit', 'or', 'a_lot', 'in_spite_of']
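
To make the merging step concrete, here is a minimal pure-Python sketch of greedy longest-match merging. It is only an illustration: `merge_mwes` is a hypothetical helper, not part of NLTK, and NLTK's own implementation may behave differently in edge cases. It splits the sentence on whitespace instead of calling `nltk.word_tokenize`, so it runs without NLTK installed.

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Hypothetical helper: merge the longest matching expression at each position."""
    # Try longer expressions first so ('a', 'little', 'bit') wins over ('a', 'little').
    # Empty expressions are ignored because they would never advance the index.
    ordered = sorted((tuple(m) for m in mwes if m), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for mwe in ordered:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts at this position; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Whitespace splitting stands in for nltk.word_tokenize in this sketch.
tokens = "In a little or a little bit or a lot in spite of".split()
mwes = [("a", "little"), ("a", "little", "bit"), ("a", "lot"), ("in", "spite", "of")]
result = merge_mwes(tokens, mwes)
print(result)
```

Matching in this sketch is exact and case-sensitive, so the capitalized `'In'` is never treated as the start of `('in', 'spite', 'of')`.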
Reference documentation