Tokenizing text into sentences
para = "Hello World. It's good to see you. Thanks for buying this book."
from nltk.tokenize import sent_tokenize
sent_tokenize(para)
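sent_tokenize relies on NLTK's pre-trained punkt model; a minimal sketch (not from the original notes) that fetches it once and shows the expected split:

import nltk
nltk.download('punkt')  # one-time download of the punkt sentence tokenizer models
from nltk.tokenize import sent_tokenize
para = "Hello World. It's good to see you. Thanks for buying this book."
sent_tokenize(para)
# ['Hello World.', "It's good to see you.", 'Thanks for buying this book.']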
Tokenizing sentences into words
from nltk.tokenize import word_tokenize
word_tokenize('Hello World.')
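A small sketch combining both steps (assumes punkt is already downloaded): split the paragraph into sentences, then each sentence into words. Note that word_tokenize splits contractions such as It's into It and 's, while the RegexpTokenizer below keeps them whole:

from nltk.tokenize import sent_tokenize, word_tokenize
para = "Hello World. It's good to see you. Thanks for buying this book."
[word_tokenize(s) for s in sent_tokenize(para)]
# [['Hello', 'World', '.'],
#  ['It', "'s", 'good', 'to', 'see', 'you', '.'],
#  ['Thanks', 'for', 'buying', 'this', 'book', '.']]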
Tokenizing with regular expressions
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r"[\w']+")
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']
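RegexpTokenizer can also match the gaps between tokens instead of the tokens themselves; a short sketch using gaps=True to split on whitespace (the trailing period stays attached, since only whitespace counts as a separator):

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\s+', gaps=True)
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction.']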
Filtering stopwords
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
words = ["Can't", 'is', 'a', 'contraction']
[word for word in words if word not in english_stops]
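The same filtering can be wrapped into a small helper (the name remove_stopwords is my own, not from the notes); assumes nltk.download('stopwords') has been run once:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')   # one-time download of the stopword lists
english_stops = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # Drop tokens found in the English stopword list (lowercased comparison).
    return [t for t in tokens if t.lower() not in english_stops]

remove_stopwords(["Can't", 'is', 'a', 'contraction'])
# ["Can't", 'contraction']
stopwords.fileids()   # other languages are bundled too, e.g. 'french', 'german', 'spanish'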
Lemmatizing words
>>> from nltk.stem import PorterStemmer  # assumed completion of the import
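A minimal sketch (my own example, not from the notes) contrasting the two: PorterStemmer crudely strips suffixes, while WordNetLemmatizer returns dictionary forms and accepts a part-of-speech hint; it assumes the WordNet corpus has been downloaded:

>>> import nltk
>>> nltk.download('wordnet')
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')        # stemming chops suffixes without a dictionary
'cook'
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('cookbooks')
'cookbook'
>>> lemmatizer.lemmatize('cooking', pos='v')   # pos hint: treat the word as a verb
'cook'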