Python 31: Using NLTK for Natural Language Processing (NLP)

Author: 智能建造研究生

Published: 2024-07-11 08:30:12

Article link: https://blog.csdn.net/Argulo/article/details/140340743

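1. About Natural Language Processing (NLP)

Natural language processing (NLP) is a subfield of artificial intelligence and computer science that focuses on the interaction between computers and human (natural) language. Its goal is to enable computers to understand, interpret, and generate human language. NLP sits at the intersection of linguistics, computer science, and artificial intelligence, and it processes and analyzes large amounts of natural language data using statistical, machine learning, and deep learning methods.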

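Core tasks and applications

NLP covers many tasks and applications, which fall mainly into the following categories:

1. Text processing

Tokenization: splitting text into individual words or phrases.
Part-of-speech tagging: labeling each word in a sentence with its part of speech (noun, verb, adjective, and so on).
Syntactic parsing: analyzing the grammatical structure of a sentence, including dependency and phrase-structure analysis.

2. Text classification

Sentiment analysis: detecting the sentiment of a text, such as positive, neutral, or negative.
Topic modeling: identifying the topics or hidden semantic structure in a text.
Spam filtering: detecting and filtering out junk email.

3. Information extraction

Named entity recognition (NER): identifying entities in text, such as people, places, and organizations.
Relation extraction: extracting the relations between entities from text.
Event extraction: identifying and classifying event information in text.

4. Machine translation

Automatic translation: translating text from one language into another, as Google Translate does.

5. Text generation

Language generation: producing natural language text that is semantically related to the input.
Summarization: extracting the key content of a long text to produce a summary.

6. Dialogue systems

Chatbots: holding natural language conversations with users, such as customer service bots.
Intelligent assistants: providing services such as information lookup and task management, such as Siri and Alexa.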

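Main techniques and methods

1. Statistical methods

Early NLP methods relied mainly on statistical models, such as n-gram models, hidden Markov models (HMM), and conditional random fields (CRF), for a variety of language processing tasks.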
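To make the n-gram idea concrete, here is a minimal sketch of a bigram model built with the standard library only. The function names `train_bigram_model` and `bigram_probability` and the toy sentence are illustrative assumptions, not part of NLTK or of the original article; the probability is a plain maximum-likelihood estimate without smoothing.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, curr in zip(tokens, tokens[1:]):
        counts[prev][curr] += 1
    return counts


def bigram_probability(counts, prev, curr):
    """Maximum-likelihood estimate of P(curr | prev); 0.0 if prev is unseen."""
    total = sum(counts[prev].values())
    return counts[prev][curr] / total if total else 0.0


# Toy corpus (hypothetical, for illustration only)
tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(tokens)

print(bigram_probability(model, "the", "cat"))  # "the" is followed by "cat" in 2 of 3 cases
print(bigram_probability(model, "the", "mat"))  # and by "mat" in 1 of 3 cases
```

A real model would usually add smoothing so that unseen word pairs do not get probability zero.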

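2. Machine learning

Traditional machine learning methods, such as support vector machines (SVM), naive Bayes, and decision trees, are widely used for tasks such as text classification and sentiment analysis.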
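As a rough illustration of text classification with naive Bayes, the sketch below trains NLTK's `NaiveBayesClassifier` on four hand-written examples. The training data and the helper `bag_of_words` are assumptions made up for this example; a usable classifier would need far more data.

```python
import nltk


def bag_of_words(text):
    """Turn a text into the feature dict that NLTK classifiers expect."""
    return {word: True for word in text.lower().split()}


# Tiny hand-written training set (hypothetical data, for illustration only)
train_data = [
    ("good great fun", "pos"),
    ("great plot", "pos"),
    ("bad boring", "neg"),
    ("awful bad plot", "neg"),
]
train_set = [(bag_of_words(text), label) for text, label in train_data]

# Train the classifier on (feature dict, label) pairs
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(bag_of_words("great fun")))
print(classifier.classify(bag_of_words("bad awful")))
```

The classifier needs no downloaded NLTK data, because it learns only from the feature dictionaries passed to `train`.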

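3. Deep learning

In recent years, deep learning has driven significant progress in NLP. The main architectures include:

Recurrent neural networks (RNN): in particular, long short-term memory (LSTM) and gated recurrent unit (GRU) networks are used to process sequence data in tasks such as text generation and machine translation.
Convolutional neural networks (CNN): used for text classification and sentence modeling.
Transformer: the Transformer architecture proposed by Google and the models derived from it (such as BERT and GPT) perform very well across many NLP tasks.

Tools and libraries

NLTK: a Python natural language processing library that provides a rich set of tools and resources.
spaCy: an efficient NLP library suited to industrial applications.
Gensim: a library for topic modeling and document similarity computation.
Transformers: a library from Hugging Face that includes many pretrained models, such as BERT and GPT.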

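2. The NLTK Library

NLTK (Natural Language Toolkit) is a widely used open-source Python library for processing natural language text. It provides a rich set of tools and resources for NLP tasks, including text preprocessing, part-of-speech tagging, syntactic parsing, semantic analysis, and machine translation. NLTK is well suited to teaching and research, and it is also an ideal tool for getting started with NLP.

Core components and features

NLTK contains many modules and subpackages that provide a range of NLP features. Some of the core components and features are listed below:

1. Text preprocessing

Tokenization: splitting text into individual words or sentences.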

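# Import the NLTK library
import nltk

# Download the punkt data package, used for sentence and word tokenization
# (NLTK 3.9 and later may ask for 'punkt_tab' instead; the LookupError
# message names the resource to download)
nltk.download('punkt')

# Define a sentence
sentence = "Natural language processing is fun."

# Tokenize the sentence with NLTK's word_tokenize function, which splits
# the input string into a list of word and punctuation tokens
words = nltk.word_tokenize(sentence)

# Print the list of tokens
print(words)

# Output: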
'''
['Natural', 'language', 'processing', 'is', 'fun', '.']
'''
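When punctuation tokens are not wanted, a regular-expression tokenizer is one alternative. The sketch below uses NLTK's `RegexpTokenizer` with the pattern `\w+`, which is an assumption chosen for this example; it keeps runs of word characters, drops everything else, and needs no downloaded data.

```python
from nltk.tokenize import RegexpTokenizer

# Keep only runs of word characters; punctuation is discarded
tokenizer = RegexpTokenizer(r"\w+")

sentence = "Natural language processing is fun."
words_no_punct = tokenizer.tokenize(sentence)

print(words_no_punct)
```

Compared with `word_tokenize`, this drops the final period, but it also splits contractions such as "don't" at the apostrophe, so the choice depends on the task.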

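Stopword removal: removing common words that carry little meaning (such as "the" and "is").

# Import stopwords from NLTK's corpus module
from nltk.corpus import stopwords

# Download the stopwords data package, which contains common stopwords for many languages
nltk.download('stopwords')

# Get the set of English stopwords
stop_words = set(stopwords.words('english'))

# Filter the stopwords out of the token list:
# keep each word w whose lowercase form is not in the stopword set
filtered_words = [w for w in words if w.lower() not in stop_words]

# Print the filtered list; punctuation tokens are not stopwords, so '.' remains
print(filtered_words)

# Output: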
'''
['Natural', 'language', 'processing', 'fun', '.']
'''

Stemming: reducing words to their stem form.

# Import PorterStemmer from NLTK's stem module
from nltk.stem import PorterStemmer

# Create a PorterStemmer object
stemmer = PorterStemmer()

# Stem every word in the token list;
# stemmer.stem(w) returns the stem of the word, in lowercase
stemmed_words = [stemmer.stem(w) for w in words]

# Print the list of stems; a stem is not always a dictionary word
print(stemmed_words)

# Output:
'''
['natur', 'languag', 'process', 'is', 'fun', '.']
'''
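The effect of stemming is easier to see on several inflected forms of the same word. The word list in this small sketch is an assumption chosen for illustration; `PorterStemmer` needs no downloaded data.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms collapse to a shared stem,
# but the stem itself is not always a dictionary word
for word in ["running", "runs", "studies", "processing"]:
    print(word, "->", stemmer.stem(word))
```

Here "running" and "runs" both reduce to "run", while "studies" becomes "studi", which shows that stemming is rule-based suffix stripping and not a dictionary lookup.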

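Lemmatization: reducing words to their base (dictionary) form.

# Import WordNetLemmatizer from NLTK's stem module
from nltk.stem import WordNetLemmatizer

# Download the wordnet data package, the dictionary used for lemmatization
nltk.download('wordnet')

# Create a WordNetLemmatizer object
lemmatizer = WordNetLemmatizer()

# Lemmatize every word in the token list.
# lemmatize() treats each word as a noun by default (pos='n'), which is why
# 'processing' is left unchanged here; pass the pos argument to lemmatize
# a word as another part of speech
lemmatized_words = [lemmatizer.lemmatize(w) for w in words]

# Print the list of lemmas
print(lemmatized_words)

# Output: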
'''
['Natural', 'language', 'processing', 'is', 'fun', '.']
'''

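2. Part-of-Speech Tagging

Tag each word in the sentence with its part of speech:

# Download the model data for the part-of-speech tagger
# (NLTK 3.9 and later may ask for 'averaged_perceptron_tagger_eng' instead)
nltk.download('averaged_perceptron_tagger')

# Tag the token list;
# nltk.pos_tag(words) assigns a part-of-speech tag to every token
tagged_words = nltk.pos_tag(words)

# Print the result, a list of (word, tag) tuples
print(tagged_words)

# Output: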
'''
[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('is', 'VBZ'), ('fun', 'NN'), ('.', '.')]
'''
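The tags in this output come from the Penn Treebank tag set: `JJ` is an adjective, `NN` a singular noun, and `VBZ` a third-person singular present-tense verb. As a small sketch of how the (word, tag) tuples can be used downstream, the code below groups the words by tag and picks out the nouns. The list is copied from the output above, so the example runs without any NLTK data; the variable names are illustrative.

```python
from collections import defaultdict

# Tagged tokens copied from the output above (Penn Treebank tags)
tagged_words = [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'),
                ('is', 'VBZ'), ('fun', 'NN'), ('.', '.')]

# Group the words by their tag
words_by_tag = defaultdict(list)
for word, tag in tagged_words:
    words_by_tag[tag].append(word)

# Noun tags all start with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged_words if tag.startswith("NN")]

print(dict(words_by_tag))
print(nouns)
```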

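3. Named Entity Recognition

Identify the named entities in a sentence:

# Download the model data for named entity recognition
# (recent NLTK releases may request a differently named resource;
# the LookupError message names the one to download)
nltk.download('maxent_ne_chunker')

# Download the words data package, a word list used by the named entity chunker
nltk.download('words')

# Run named entity recognition on the part-of-speech-tagged tokens;
# nltk.chunk.ne_chunk(tagged_words) marks the named entities in the sentence
entities = nltk.chunk.ne_chunk(tagged_words)

# Print the result, a tree in which named entities appear as labeled subtrees.
# The sample sentence contains no named entities, so the tree is flat
print(entities)

# Output: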
'''
(S Natural/JJ language/NN processing/NN is/VBZ fun/NN ./.)
'''
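Because the sample sentence has no named entities, the output above does not show what an entity looks like in the tree. The sketch below builds a small tree by hand that imitates the shape of `ne_chunk` output for a sentence with a person and an organization, then extracts the entities from it. The tree is an assumption for illustration and not real model output, and `extract_entities` is a hypothetical helper name.

```python
from nltk import Tree

# Hand-built tree imitating the shape of ne_chunk output
# (an assumption for illustration, not real model output)
tree = Tree('S', [
    Tree('PERSON', [('John', 'NNP')]),
    ('works', 'VBZ'),
    ('at', 'IN'),
    Tree('ORGANIZATION', [('Google', 'NNP')]),
    ('.', '.'),
])


def extract_entities(chunked):
    """Return (label, text) pairs for every labeled chunk directly under the root."""
    entities = []
    for node in chunked:
        # Plain tokens are (word, tag) tuples; entities are Tree objects
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities


print(extract_entities(tree))
```

The same function can be applied to the `entities` tree returned by `nltk.chunk.ne_chunk`, since that result is also an `nltk.Tree`.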

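4. Syntactic Parsing

Parse the grammatical structure of a sentence:

# Import the context-free grammar (CFG) class from NLTK
from nltk import CFG

# Define a context-free grammar:
#   S  -> NP VP : a sentence S consists of a noun phrase NP and a verb phrase VP
#   VP -> V NP  : a verb phrase VP consists of a verb V and a noun phrase NP
#   NP          : a noun phrase NP is one of three names
#   V           : a verb V is 'loves' or 'hates'
# The explanations are kept outside the grammar string, because the grammar
# reader may reject a comment that trails a rule on the same line
grammar = CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'John' | 'Mary' | 'Bob'
V -> 'loves' | 'hates'
""")

# Create a ChartParser from the grammar
parser = nltk.ChartParser(grammar)

# Split the sentence into tokens
sentence = "John loves Mary".split()

# Parse the sentence with the parser
for tree in parser.parse(sentence):
    # Print each parse tree
    print(tree)

# Output: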
'''
(S (NP John) (VP (V loves) (NP Mary)))
'''
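A grammar can also license more than one tree for the same sentence. The sketch below uses the well-known "I shot an elephant in my pajamas" grammar from the NLTK book, where the prepositional phrase can attach either to the noun phrase or to the verb phrase; it is offered as an illustration of ambiguity, not as part of the original example.

```python
import nltk

# Small ambiguous grammar (the "groucho" grammar from the NLTK book)
grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

parser = nltk.ChartParser(grammar)
tokens = "I shot an elephant in my pajamas".split()

# Collect every parse the grammar allows for the sentence
trees = list(parser.parse(tokens))

print(len(trees))
for tree in trees:
    print(tree)
```

The parser returns two trees: in one, "in my pajamas" modifies the elephant; in the other, it modifies the shooting.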

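5. Corpora and lexical resources

NLTK ships with a rich collection of corpora and lexical resources covering many languages and application areas. The main corpora include the Gutenberg Corpus (classic literature, such as Moby Dick), the Brown Corpus (a balanced corpus of English), the Reuters Corpus (news documents), the Inaugural Address Corpus (US presidential inaugural addresses), the Movie Reviews Corpus (film reviews), the Web Text Corpus (text from the internet), the Shakespeare Corpus (Shakespeare's plays), and the Treebank Corpus (text annotated with parse trees and part-of-speech tags).

The main lexical resources include WordNet (a large lexical database of English), the Names Corpus (common male and female first names), the Stopwords Corpus (stopword lists for many languages), the Swadesh Corpus (basic vocabulary lists), the CMU Pronouncing Dictionary (English pronunciations), and the Opinion Lexicon (lists of positive and negative sentiment words). These resources supply the underlying data for a wide range of NLP tasks. For example, the Gutenberg Corpus can be read as follows: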

# Import the gutenberg corpus from NLTK's corpus module
from nltk.corpus import gutenberg

# Download the gutenberg corpus
nltk.download('gutenberg')

# Get the raw text of the file 'melville-moby_dick.txt'
sample = gutenberg.raw('melville-moby_dick.txt')

# Print the first 500 characters of the text
print(sample[:500])

# Output: the first 500 characters of Moby Dick, read from NLTK's Gutenberg corpus:
'''
[Moby Dick by Herman Melville 1851]


ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.  He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.

"While you take in hand to school others, and to teac
'''

3. Code examples

① Text preprocessing and part-of-speech tagging

The following complete example shows how to use NLTK for text preprocessing and part-of-speech tagging:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the required NLTK data packages
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Load the text
text = "NLTK is a leading platform for building Python programs to work with human language data."

# Tokenize
words = nltk.word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w.lower() not in stop_words]

# Stem
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in filtered_words]

# Lemmatize
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(w) for w in filtered_words]

# Tag parts of speech
tagged_words = nltk.pos_tag(lemmatized_words)

print("Tokens:", words)
print("Without stopwords:", filtered_words)
print("Stems:", stemmed_words)
print("Lemmas:", lemmatized_words)
print("POS tags:", tagged_words)

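② Visualizing word distribution and frequency analysis with NLTK and Matplotlib

import nltk
import matplotlib.pyplot as plt

# Download the required resources first: nltk.book loads its texts at
# import time, so the data must already be present
nltk.download('genesis')
nltk.download('book')
nltk.download('inaugural')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.book import *  # import everything from the NLTK book (text1 ... text9, FreqDist, ...)

# Search for a word in context
text1.concordance("monstrous")
text2.concordance("affection")
text3.concordance("lived")
text5.concordance("lol")  # chat corpus

# Find words that appear in similar contexts
text1.similar("monstrous")
print("-----分割线----")
text2.similar("monstrous")  # the similar words differ from text to text

# Find the contexts shared by two words
text2.common_contexts(["monstrous","very"])

# Dispersion plot over the Inaugural Address Corpus: the speeches are in
# chronological order, so the plot shows how word usage changes over time
text4.dispersion_plot(["citizens","democracy","freedom","duties","America"])

# Generate text from text1 (text1 must already be loaded).
# generate() prints the generated text and also returns it,
# so the print call below shows it a second time
generated_text = text1.generate(50)  # generate 50 tokens of text
print(generated_text)

# Count tokens
token_count = len(text3)
print(token_count)
sorted_vocabulary = sorted(set(text3))  # distinct tokens, sorted
vocabulary_size = len(set(text3))  # number of distinct tokens

# Average token repetition
print(len(text3)/len(set(text3)))  # each distinct token occurs about 16 times on average

# Keyword density
print(text3.count("smote"))
print(100* text4.count("a")/len(text4))
def lexical_diversity(text):
    # Average number of occurrences per distinct token
    return len(text)/len(set(text))
def percentage(count,total):
    return 100*count/total
print(lexical_diversity(text3))
print("text5 diversity is {}".format(lexical_diversity(text5)))
print("in text4,a percentage {}%".format(percentage(text4.count("a"),len(text4))))


# Build a frequency distribution of the tokens
fdist1 = FreqDist(text1)

# Print a summary of the frequency distribution
print(fdist1)

# Get the vocabulary
vocabulary1 = fdist1.keys()

# Plot the frequency distribution
plt.figure(figsize=(10, 6))
fdist1.plot(30, cumulative=False)  # plot the 30 most frequent tokens
plt.show()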

# Output:
'''
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
Displaying 25 of 79 matches:
, however , and , as a mark of his affection for the three girls , he left them
t . It was very well known that no affection was ever supposed to exist between
deration of politeness or maternal affection on the side of the former , the tw
d the suspicion -- the hope of his affection for me may warrant , without impru
hich forbade the indulgence of his affection . She knew that his mother neither
rd she gave one with still greater affection . Though her late conversation wit
 can never hope to feel or inspire affection again , and if her home be uncomfo
m of the sense , elegance , mutual affection , and domestic comfort of the fami
, and which recommended him to her affection beyond every thing else . His soci
ween the parties might forward the affection of Mr . Willoughby , an equally st
 the most pointed assurance of her affection . Elinor could not be surprised at
he natural consequence of a strong affection in a young and ardent mind . This 
 opinion . But by an appeal to her affection for her mother , by representing t
 every alteration of a place which affection had established as perfect with hi
e will always have one claim of my affection , which no other can possibly shar
f the evening declared at once his affection and happiness . " Shall we see you
ause he took leave of us with less affection than his usual behaviour has shewn
ness ." " I want no proof of their affection ," said Elinor ; " but of their en
onths , without telling her of his affection ;-- that they should part without 
ould be the natural result of your affection for her . She used to be all unres
distinguished Elinor by no mark of affection . Marianne saw and listened with i
th no inclination for expense , no affection for strangers , no profession , an
till distinguished her by the same affection which once she had felt no doubt o
al of her confidence in Edward ' s affection , to the remembrance of every mark
 was made ? Had he never owned his affection to yourself ?" " Oh , no ; but if 
Displaying 25 of 38 matches:
ay when they were created . And Adam lived an hundred and thirty years , and be
ughters : And all the days that Adam lived were nine hundred and thirty yea and
nd thirty yea and he died . And Seth lived an hundred and five years , and bega
ve years , and begat Enos : And Seth lived after he begat Enos eight hundred an
welve years : and he died . And Enos lived ninety years , and begat Cainan : An
 years , and begat Cainan : And Enos lived after he begat Cainan eight hundred 
ive years : and he died . And Cainan lived seventy years and begat Mahalaleel :
rs and begat Mahalaleel : And Cainan lived after he begat Mahalaleel eight hund
years : and he died . And Mahalaleel lived sixty and five years , and begat Jar
s , and begat Jared : And Mahalaleel lived after he begat Jared eight hundred a
and five yea and he died . And Jared lived an hundred sixty and two years , and
o years , and he begat Eno And Jared lived after he begat Enoch eight hundred y
 and two yea and he died . And Enoch lived sixty and five years , and begat Met
 ; for God took him . And Methuselah lived an hundred eighty and seven years , 
 , and begat Lamech . And Methuselah lived after he begat Lamech seven hundred 
nd nine yea and he died . And Lamech lived an hundred eighty and two years , an
ch the LORD hath cursed . And Lamech lived after he begat Noah five hundred nin
naan shall be his servant . And Noah lived after the flood three hundred and fi
xad two years after the flo And Shem lived after he begat Arphaxad five hundred
at sons and daughters . And Arphaxad lived five and thirty years , and begat Sa
ars , and begat Salah : And Arphaxad lived after he begat Salah four hundred an
begat sons and daughters . And Salah lived thirty years , and begat Eber : And 
y years , and begat Eber : And Salah lived after he begat Eber four hundred and
 begat sons and daughters . And Eber lived four and thirty years , and begat Pe
y years , and begat Peleg : And Eber lived after he begat Peleg four hundred an
Displaying 25 of 822 matches:
ast PART 24 / m boo . 26 / m and sexy lol U115 boo . JOIN PART he drew a girl w
ope he didnt draw a penis PART ewwwww lol & a head between her legs JOIN JOIN s
a bowl i got a blunt an a bong ...... lol JOIN well , glad it worked out my cha
e " PART Hi U121 in ny . ACTION would lol @ U121 . . . but appearently she does
30 make sure u buy a nice ring for U6 lol U7 Hi U115 . ACTION isnt falling for 
 didnt ya hear !!!! PART JOIN geeshhh lol U6 PART hes deaf ppl here dont get it
es nobody here i wanna misbeahve with lol JOIN so read it . thanks U7 .. Im hap
ies want to chat can i talk to him !! lol U121 !!! forwards too lol JOIN ALL PE
k to him !! lol U121 !!! forwards too lol JOIN ALL PErvs ... redirect to U121 '
 loves ME the most i love myself JOIN lol U44 how do u know that what ? jerkett
ng wrong ... i can see it in his eyes lol U20 = fiance Jerketts lmao wtf yah I 
cooler by the minute what 'd I miss ? lol noo there too much work ! why not ?? 
 that mean I want you ? U6 hello room lol U83 and this .. has been the grammar 
 the rule he 's in PM land now though lol ah ok i wont bug em then someone wann
flight to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 80
ht to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 808265
082653953 K-Fed got his ass kicked .. Lol . ACTION laughs . i got a first class
 . i got a first class ticket to hell lol U7 JOIN any texas girls in here ? any
 . whats up U155 i was only kidding . lol he 's a douchebag . Poor U121 i 'm bo
 ??? sits with U30 Cum to my shower . lol U121 . ACTION U1370 watches his nads 
 ur nad with a stick . ca u U23 ewwww lol *sniffs* ewwwwww PART U115 ! owww spl
ACTION is resisting . ur female right lol U115 beeeeehave Remember the LAst tim
pm's me . charge that is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLO
 is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLOLLL U12 thats not nic
s . lmao no U115 Check my record . :) Lol lick em U7 U23 how old r u lol Way to
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
-----分割线----
very so exceedingly heartily a as good great extremely remarkably
sweet vast amazingly
am_glad a_pretty a_lucky is_pretty be_glad
long , from one to the top - mast , and no coffin and went out a sea
captain -- this peaking of the whales . , so as to preserve all his
might had in former years abounding with them , they toil with their
lances , strange tales
long , from one to the top - mast , and no coffin and went out a sea
captain -- this peaking of the whales . , so as to preserve all his
might had in former years abounding with them , they toil with their
lances , strange tales
44764
16.050197203298673
5
1.457806031353621
16.050197203298673
text5 diversity is 7.420046158918563
in text4,a percentage 1.457806031353621%
<FreqDist with 19317 samples and 260819 outcomes>

<Figure size 640x480 with 1 Axes>
<Figure size 1000x600 with 1 Axes>
'''

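Visualization output:

The dispersion plot shows where the specified words occur in the text, which makes it easy to compare how they are distributed across its different parts.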
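The counting and frequency steps in this example can be tried without downloading the book corpora. The sketch below wraps a short hand-written token list in `nltk.Text` and `nltk.FreqDist`; the token list is an assumption standing in for a real corpus, and `lexical_diversity` repeats the definition used in the example above.

```python
import nltk

# A short hand-written token list stands in for a corpus
# (an assumption for illustration; no NLTK data download is needed)
tokens = ("the whale swam and the ship followed the whale "
          "until the whale dived").split()

text = nltk.Text(tokens)
fdist = nltk.FreqDist(text)


def lexical_diversity(text):
    """Average number of occurrences per distinct token."""
    return len(text) / len(set(text))


print(len(text))             # total number of tokens
print(text.count("whale"))   # occurrences of a single word
print(fdist.most_common(2))  # the two most frequent tokens with their counts
print(lexical_diversity(text))
```

With 13 tokens and 8 distinct words, the ratio is 1.625, far lower than the roughly 16 reported for text3, because a short text repeats its vocabulary much less often.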