NLTK: Introduction and Usage Examples

Latest recommended article published on 2025-03-31 23:51:26

风情客家__

Category: AI. Tags: artificial intelligence, nlp, nltk, introduction

Article link: https://blog.csdn.net/justlpf/article/details/121707391

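References:

"NLTK, a natural language processing library" (满腹的小不甘, CSDN blog)

"Setting up an NLP development environment" (村雨遥, CSDN blog)

"Common NLTK methods" (飘过的春风, CSDN blog)

"Summary of NLTK basics" (村雨遥, CSDN blog)

NLTK :: Natural Language Toolkit (official site)

NLTK :: Sample usage for stem

"Manually downloading and installing nltk_data" (justlpf, CSDN blog)

NLTK (Baidu Baike)

GitHub - nltk/nltk_data: NLTK Data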

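1. Introduction

The Natural Language Toolkit (NLTK) is one of the most widely used Python libraries in the NLP field.
NLTK is an open-source project that includes Python modules, datasets, and tutorials for NLP research and development.
NLTK was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.
NLTK includes graphical demonstrations and sample data. Its tutorials explain the basic concepts behind the language processing tasks the toolkit supports.

NLTK (www.nltk.org) is the package you will meet most often when working with corpora, classifying text, or analyzing linguistic structure. It provides comprehensive, easy-to-use interfaces to a large collection of public datasets and models, covering word tokenization, part-of-speech tagging (POS tagging), named entity recognition (NER), syntactic parsing, and other NLP functionality.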

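What can NLTK do?

Search text:
Word search;
Similar-word search;
Similar keyword identification;
Lexical dispersion plots;
Text generation;
Vocabulary counting;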

NLTK design goals

Simplicity;
Consistency;
Extensibility;
Modularity;

Corpora in NLTK

Gutenberg corpus: gutenberg;
Web and chat text corpora: webtext, nps_chat;
Brown corpus: brown;
Reuters corpus: reuters;
Inaugural address corpus: inaugural;
Other corpora;

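Text corpus structure

isolated: a collection of independent texts;
categorized: texts grouped into categories;
overlapping: categories that overlap;
temporal: texts that vary over time;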

Basic corpus functions

Conditional frequency distributions
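
A conditional frequency distribution keeps one frequency count per condition, for example one word count per genre. NLTK provides `nltk.ConditionalFreqDist` for this; the sketch below uses only the standard library to show the idea, and the `(condition, word)` pairs are made-up sample data, not drawn from any NLTK corpus.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, word) pairs; in practice they would come from a
# corpus, e.g. one (genre, word) pair for every word in every genre.
pairs = [
    ("news", "the"), ("news", "market"), ("news", "the"),
    ("romance", "the"), ("romance", "heart"), ("romance", "heart"),
]

# Keep one Counter per condition.
cfd = defaultdict(Counter)
for condition, word in pairs:
    cfd[condition][word] += 1

print(cfd["news"]["the"])              # how often "the" occurs under "news"
print(cfd["romance"].most_common(1))   # most frequent word under "romance"
```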

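Setting up an NLP development environment involves the following steps:

Install Python
See: "Setting up a Python development environment on Windows" (justlpf, CSDN blog)
Install NLTK
The automatic download of nltk_data usually fails, so download and configure nltk_data manually. See: "Manually downloading and installing nltk_data" (justlpf, CSDN blog)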

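NLTK modules and features:

1. Tokenization

Text is made up of paragraphs, paragraphs are made up of sentences, and sentences are made up of words. Tokenization is the first step in text analysis: it breaks a passage into smaller units such as words or sentences. Each unit is called a token: a token is a word within a sentence, or a sentence within a paragraph. NLTK supports both sentence tokenization and word tokenization.

(1) Sentence tokenization

Split a paragraph into sentences: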

from nltk.tokenize import sent_tokenize

# Adjacent string literals are joined, so the text contains no line break.
text = ("Hello Mr. Smith, how are you doing today? The weather is great, and "
        "city is awesome.The sky is pinkish-blue. You shouldn't eat cardboard")

tokenized_text = sent_tokenize(text)

print(tokenized_text)
'''
Result:
  ['Hello Mr. Smith, how are you doing today?', 
   'The weather is great, and city is awesome.The sky is pinkish-blue.', 
   "You shouldn't eat cardboard"]
'''
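
To see why a trained sentence tokenizer is useful, it helps to compare it with a naive rule. The sketch below is not how `sent_tokenize` works; `naive_sent_split` is a hypothetical helper that splits after every `.`, `?` or `!` followed by whitespace. It wrongly breaks the text at the abbreviation "Mr.", which `sent_tokenize` keeps inside the first sentence.

```python
import re

def naive_sent_split(text):
    """Split after '.', '?' or '!' when followed by whitespace (illustration only)."""
    return re.split(r"(?<=[.?!])\s+", text.strip())

sentences = naive_sent_split(
    "Hello Mr. Smith, how are you doing today? The weather is great."
)
print(sentences)
# ['Hello Mr.', 'Smith, how are you doing today?', 'The weather is great.']
```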

(2) Word tokenization

Split a sentence into words:

import nltk
 
sent = "I am almost dead this time"
token = nltk.word_tokenize(sent)
# Result: token == ['I', 'am', 'almost', 'dead', 'this', 'time']
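
As a rough picture of what word tokenization does, the following standard-library sketch separates runs of word characters from punctuation marks. `simple_word_tokenize` is a hypothetical helper, not NLTK's algorithm; `nltk.word_tokenize` applies more careful rules, for instance around contractions and abbreviations.

```python
import re

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("I am almost dead this time"))
# ['I', 'am', 'almost', 'dead', 'this', 'time']
print(simple_word_tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
```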

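2. Processing tokens

Processing tokens involves removing punctuation, removing stopwords, and normalizing words.

(1) Removing punctuation

The code below strips punctuation from a string. string.punctuation contains all ASCII punctuation characters, and each of them is replaced with a space.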

import string

import nltk

"""Remove punctuation"""
if __name__ == '__main__':
    text = "Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome."

    # Method 1: map every punctuation character to a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    print("s: ", text.translate(table))

    # Method 2: tokenize first, then drop tokens that are punctuation marks
    english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']
    text_list = [word for word in nltk.word_tokenize(text) if word not in english_punctuations]
    print("text: ", text_list)
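
If the punctuation should be deleted outright instead of being replaced with spaces, the third argument of `str.maketrans` can list the characters to drop. This is a small variation on the first method, shown on a shortened sample sentence.

```python
import string

# The third argument of str.maketrans lists characters to delete.
delete_punct = str.maketrans("", "", string.punctuation)

text = "Hello Mr. Smith, how are you doing today?"
cleaned = text.translate(delete_punct)
print(cleaned)  # Hello Mr Smith how are you doing today
```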

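(2) Removing stopwords

Stopwords are noise words in a text that carry little meaning on their own. Common English stopwords include is, am, are, this, a, an, and the. NLTK's corpora include a stopword list, and stopwords have to be filtered out of the token list explicitly.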

import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')
# Downloading package stopwords to 
# C:\Users\Administrator\AppData\Roaming\nltk_data\corpora\stopwords.zip.
# Unzipping the stopwords.zip

"""Remove stopwords"""
stop_words = stopwords.words("english")

if __name__ == '__main__':
    text = "Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome."

    word_tokens = nltk.tokenize.word_tokenize(text.strip())
    filtered_word = [w for w in word_tokens if w not in stop_words]

    print("word_tokens: ", word_tokens)
    print("filtered_word: ", filtered_word)
    '''
    word_tokens: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
     'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.']
    filtered_word: ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'city', 'awesome', '.']
    '''
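
In the output above, 'The' survives the filter because the English stopword list is all lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter follows; the `stop_words` set here is a small hand-picked subset used for illustration, standing in for `stopwords.words("english")`.

```python
# A small, hand-picked subset of English stopwords, for illustration only.
stop_words = {"the", "is", "and", "how", "are", "you", "doing"}

word_tokens = ["The", "weather", "is", "great", ",", "and", "city", "is", "awesome", "."]

# Compare the lowercased token against the list, but keep the original casing.
filtered = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered)  # ['weather', 'great', ',', 'city', 'awesome', '.']
```

Using a set instead of a list also makes each membership test faster on long texts.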

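3. Lexicon Normalization

Lexicon normalization reduces the various derived forms of a word to a common base form. NLTK offers two approaches: stemming (for example, the Porter stemmer) and lemmatization (based on WordNet).

(1) Lemmatization

Lemmatization uses context and part of speech to resolve a word's inflected form and returns the corresponding base form, called the lemma. The result is a real word.
It is a dictionary-based mapping. NLTK expects the part of speech to be given explicitly (it defaults to noun); otherwise the result may be wrong. The usual order is therefore tokenization, then POS tagging, then lemmatization.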

from nltk.stem import WordNetLemmatizer  
lemmatizer = WordNetLemmatizer()  
lemmatizer.lemmatize('leaves') 
# Output: 'leaf'
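
`nltk.pos_tag` returns Penn Treebank tags such as 'VBD' or 'NNS', while `lemmatize` takes a WordNet part-of-speech letter ('n', 'v', 'a' or 'r') as its second argument. A coarse mapping between the two can be sketched as follows; `penn_to_wordnet` is a hypothetical helper, not part of NLTK, and falling back to noun for every other tag is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for everything else

tagged = [("was", "VBD"), ("born", "VBN"), ("leaves", "NNS"), ("quickly", "RB")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('was', 'v'), ('born', 'v'), ('leaves', 'n'), ('quickly', 'r')]
```

The returned letter can then be passed along with the word, for example `lemmatizer.lemmatize(word, penn_to_wordnet(tag))`.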

(2) Stemming

Stemming strips affixes from a word and returns the stem, which may not be a real word.

# Porter stemming algorithm
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
porter_stemmer.stem('maximum')

# Lancaster stemming algorithm
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
lancaster_stemmer.stem('maximum')

# Snowball stemming algorithm
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
snowball_stemmer.stem('maximum')

from nltk.stem.wordnet import WordNetLemmatizer  # from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()  # lemmatization

from nltk.stem.porter import PorterStemmer   # from nltk.stem import PorterStemmer
stem = PorterStemmer()   # stemming
 
word = "flying"
print("Lemmatized Word:",lem.lemmatize(word,"v"))
print("Stemmed Word:",stem.stem(word))
'''
Lemmatized Word: fly
Stemmed Word: fli
'''

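4. Part-of-speech tagging

The main goal of part-of-speech (POS) tagging is to identify the grammatical category of a given word. A POS tagger looks at the relationships within a sentence and assigns each word a corresponding tag.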

import nltk

sent = "Albert Einstein was born in Ulm, Germany in 1879."
tokens = nltk.word_tokenize(sent)

tags = nltk.pos_tag(tokens)
print(tags)

'''
[('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), 
('in', 'IN'), ('Ulm', 'NNP'), (',', ','), ('Germany', 'NNP'), ('in', 'IN'), ('1879', 'CD'), ('.', '.')]
'''
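
Since the tagger returns a plain list of (word, tag) pairs, ordinary Python is enough to work with the result. As a small sketch, the pairs from the output above are copied in by hand and grouped by tag, which makes it easy to pull out, say, all proper nouns (NNP).

```python
from collections import defaultdict

# Tagged tokens copied from the output above.
tags = [('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'), ('born', 'VBN'),
        ('in', 'IN'), ('Ulm', 'NNP'), (',', ','), ('Germany', 'NNP'),
        ('in', 'IN'), ('1879', 'CD'), ('.', '.')]

# Collect the words that received each tag, in sentence order.
words_by_tag = defaultdict(list)
for word, tag in tags:
    words_by_tag[tag].append(word)

print(words_by_tag['NNP'])  # ['Albert', 'Einstein', 'Ulm', 'Germany']
print(words_by_tag['CD'])   # ['1879']
```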

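5. Getting synonyms

Use synsets() to look up the synonym sets of a word. Its pos parameter restricts the lookup to a given part of speech. The WordNet interface is a semantically oriented English dictionary, similar to a traditional dictionary, and it is part of the NLTK corpora.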

import nltk
nltk.download('wordnet')  # Downloading package wordnet to C:\Users\Administrator\AppData\Roaming\nltk_data...Unzipping corpora\wordnet.zip.
 
from nltk.corpus import wordnet
 
word = wordnet.synsets('spectacular')
print(word)
# [Synset('spectacular.n.01'), Synset('dramatic.s.02'), Synset('spectacular.s.02'), Synset('outstanding.s.02')]
 
print(word[0].definition())
print(word[1].definition())
print(word[2].definition())
print(word[3].definition())
'''
a lavishly produced performance
sensational in appearance or thrilling in effect
characteristic of spectacles or drama
having a quality that thrusts itself into attention
'''

6. Other examples

6.1 Word frequency extraction

Rank a tokenized word list by frequency (sorted by number of occurrences):

import nltk

all_words = nltk.FreqDist(w.lower() for w in nltk.word_tokenize("I'm foolish foolish man"))
print(all_words.keys())
all_words.plot()

Output (the key order may differ between Python versions):
dict_keys(["'m", 'man', 'i', 'foolish'])

Consider only the two most frequent words and draw a cumulative plot:

all_words.plot(2, cumulative=True)

6.2 Other code examples
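
To see roughly which numbers such a cumulative plot draws, the counts can be reproduced without plotting. `FreqDist` builds on `collections.Counter`, so this sketch uses `Counter` directly; the token list is written out by hand as an assumption, standing in for the lowercased tokens of the sentence above.

```python
from collections import Counter
from itertools import accumulate

# Hand-written stand-in for the lowercased tokens of "I'm foolish foolish man".
tokens = ["i", "'m", "foolish", "foolish", "man"]
counts = Counter(tokens)

# Take the two most frequent tokens and add up their counts from left to right.
top_two = counts.most_common(2)
cumulative = list(accumulate(count for _, count in top_two))

print(top_two[0])   # ('foolish', 2)
print(cumulative)   # [2, 3]
```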

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2018-9-28 22:21
# @Author  : Manu
# @Site    : 
# @File    : python_base.py
# @Software: PyCharm

from __future__ import division
import nltk
import matplotlib
from nltk.book import *
from nltk.util import bigrams

# Concordance search
print('Concordance search')
text1.concordance('boy')
text2.concordance('friends')

# Similar-word search
print('Similar-word search')
text3.similar('time')

# Common-context search
print('Common-context search')
text2.common_contexts(['monstrous','very'])

# Lexical dispersion plot
print('Lexical dispersion plot')
text4.dispersion_plot(['citizens', 'American', 'freedom', 'duties'])

# Vocabulary count
print('Vocabulary count')
print(len(text5))
sorted(set(text5))
print(len(set(text5)))

# Average number of times each distinct word is used
print('Average repetitions per word')
print(len(text8) / len(set(text8)))

# Keyword density
print('Keyword density')
print(text9.count('girl'))
print(text9.count('girl') * 100 / len(text9))

# Frequency distribution
fdist = FreqDist(text1)

vocabulary = fdist.keys()
for i in vocabulary:
    print(i)

# Cumulative plot of the 20 most frequent words
fdist.plot(20, cumulative = True)

# Hapaxes (words that occur only once)
print('Hapaxes:')
print(fdist.hapaxes())

# Bigrams
print('Bigrams')
words = list(bigrams(['louder', 'words', 'speak']))
print(words)
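
The counting steps in the script are plain arithmetic on token lists, so they can be tried on any list of strings without loading `nltk.book`. The helpers below are hypothetical names written for this sketch, and the token list is made-up sample data; `simple_bigrams` pairs each token with its successor, the same pairing `nltk.util.bigrams` yields for the three-word list used above.

```python
def average_repetition(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

def percentage(count, total):
    """Share of `count` in `total`, as a percentage."""
    return 100 * count / total

def simple_bigrams(tokens):
    """Pair every token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

tokens = ["the", "girl", "saw", "the", "other", "girl"]
print(average_repetition(tokens))                      # 6 tokens / 4 distinct = 1.5
print(percentage(tokens.count("girl"), len(tokens)))   # share of "girl" in the text
print(simple_bigrams(["louder", "words", "speak"]))
# [('louder', 'words'), ('words', 'speak')]
```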