Sentence Tokenization
NLTK's word tokenization works at the sentence level, so for a document you first split the text into sentences, and then tokenize each sentence into words.
from nltk.tokenize import sent_tokenize
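# note: sent_tokenize relies on the Punkt model; download it once with nltk.download('punkt')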
text = """Hello Mr. Smith, how are you doing today? The weather is great, and
city is awesome.The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and \ncity is awesome.The sky is pinkish-blue.', "You shouldn't eat cardboard"]
Note that "awesome.The sky is pinkish-blue." is not split into two sentences: without a space after the period, the Punkt tokenizer does not detect a sentence boundary there.
Word Tokenization
import nltk
sent = "Study hard and improve every day."
token = nltk.word_tokenize(sent)
print(token)
['Study', 'hard', 'and', 'improve', 'every', 'day', '.']
Removing Punctuation
Call a punctuation-removal function on each token: string.punctuation contains all ASCII punctuation characters, and the function replaces each of them in the token with a space (a sketch of such a function is shown below).
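A minimal sketch of such a helper, assuming a hypothetical function name remove_punct and reusing the tokens from the word-tokenization example above; it builds a translation table from string.punctuation and maps every punctuation character to a space.
import string

# hypothetical helper: replace every punctuation character in a token with a space,
# then strip the surrounding whitespace
def remove_punct(token):
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return token.translate(table).strip()

tokens = ['Study', 'hard', 'and', 'improve', 'every', 'day', '.']
cleaned = [remove_punct(t) for t in tokens]
cleaned = [t for t in cleaned if t]  # drop tokens that were pure punctuation
print(cleaned)
['Study', 'hard', 'and', 'improve', 'every', 'day']
Tokens that consist only of punctuation (such as the final '.') become empty strings after the replacement, so they are filtered out.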