English Natural Language Processing
1. Splitting text into sentences
import nltk
from nltk.tokenize import sent_tokenize  # sentence splitting: breaks when it sees "."
text = ' Welcome readers. I hope you find it interesting. Please do reply.'
# print(sent_tokenize(text))
# Result: [' Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']
# To split large batches of text into sentences, load PunktSentenceTokenizer and use its tokenize() function
# from nltk.tokenize import PunktSentenceTokenizer  # this import is not used here
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = " Hello everyone. Hope all are fine and doing well. Hope you find the book interesting."
# print(tokenizer.tokenize(text))
# Result: [' Hello everyone.', 'Hope all are fine and doing well.', 'Hope you find the book interesting.']
# for row in tokenizer.tokenize(text):
# print(row)
# Result: Hello everyone.
#         Hope all are fine and doing well.
#         Hope you find the book interesting.
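As noted above, PunktSentenceTokenizer can also be imported and instantiated directly instead of loading the pickle via nltk.data.load. A minimal sketch — note that instantiating it without arguments uses the default, untrained parameters rather than the pre-trained English model, which is sufficient for simple text like this:

```python
from nltk.tokenize import PunktSentenceTokenizer

# With no arguments this uses default (untrained) parameters,
# not the pre-trained 'tokenizers/punkt/english.pickle' model.
tokenizer = PunktSentenceTokenizer()
text = "Hello everyone. Hope all are fine and doing well."
sentences = tokenizer.tokenize(text)
print(sentences)
```

For production use, loading the pre-trained model (as shown above) handles abbreviations such as "Dr." or "e.g." more reliably.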
2. Splitting sentences into words (text --> sentences --> words)
# The word_tokenize() function
# word_tokenize uses an instance of NLTK's TreebankWordTokenizer class to perform word tokenization