NLTK
NLTK的全称是natural language toolkit,是一套基于python的自然语言处理工具集。
NLTK是Python很强大的第三方库,可以很方便的完成很多自然语言处理(NLP)的任务,包括分词、词性标注、命名实体识别(NER)及句法分析。
NLTK的安装
nltk的安装十分便捷,只需要pip就可以。
pip install nltk
在nltk中集成了语料与模型等的包管理器,通过在python解释器中执行
import nltk
nltk.download()
from nltk.corpus import brown
brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
一、NLTK进行分词
nltk.sent_tokenize(text) #对文本按照句子进行分割
nltk.word_tokenize(sent) #对句子进行分词
假设我们有如下的示例文本:
Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude.
为了将这个文本标记化为句子,我们可以使用句子标记器:
from nltk.tokenize import sent_tokenize
mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))
['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']
你可能会说,这是一件容易的事情。我不需要使用 NLTK 标记器,并且我可以使用正则表达式来分割句子,因为每个句子前后都有标点符号或者空格。
那么,看看下面的文字:
Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude.
呃!Mr. 是一个词,虽然带有一个符号。让我们来试试使用 NLTK 进行分词:
from nltk.tokenize import sent_tokenize
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))
['Hello Mr. Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']
Great!结果棒极了。然后我们尝试使用词语标记器来看看它是如何工作的:
from nltk.tokenize import word_tokenize
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(word_tokenize(mytext))
['Hello', 'Mr.', 'Adam', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'going', 'well&#