Learning Natural Language Processing Together: Loading NLTK Data Packages and String Operations

Latest recommended article published on 2024-07-08 16:14:38

小陈步吃人

Category column: Natural Language Learning Notes. Article tags: python, natural language processing

Article link: https://blog.csdn.net/Itsme_MrJJ/article/details/123660118

String Operations

I. Tokenization
II. Normalization
III. Substitution and Correction
IV. Applying Zipf's Law to Text
V. Similarity Measures

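🙈🙈 Nothing essential here; I just want to explain why I am studying this and give some background. I am a beginner working through a book, and I write these posts partly as reading notes and partly to push myself to finish the whole book. I also find that I learn faster this way 😏. As for the background: my company recently introduced a more standardized process for listing optimization, so working out how to extract keywords has become a possible way for me to stretch myself, and I am reading in my spare time. Now for one more bit of preamble.
👇👇👇
Natural Language Processing (NLP) is concerned with the interaction between natural language and computers. It is one of the major branches of Artificial Intelligence (AI) and computational linguistics. It aims to provide seamless interaction between computers and humans and, with the help of machine learning, enables computers to understand human language. In programming languages (such as C, C++, Java, and Python), the basic data type used to represent the contents of a file or document is called a string.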

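Enough preamble; let's begin! Here we will explore the various operations that can be performed on strings and that help with a range of NLP tasks.
A craftsman who wants to do good work must first sharpen his tools, so we start by installing the NLTK package, because everything that follows is built on it. There are two ways to do this.
The first is to install nltk directly and then run: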

import nltk
nltk.download()

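These commands open the download window, which looks roughly like this:
[Image: NLTK downloader window]
However, the files are large, and every attempt ended with a connection error (the connected party did not respond properly after a period of time, or the connected host failed to respond). That was a real headache, so I decisively switched to the second method.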

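The second is to download the data packages from the NLTK website, unzip the download, and copy everything under packages into D:\Anaconda\nltk_data. That is my Anaconda path; use whatever path matches your own installation. The zip archives inside each subfolder must also be extracted into the folder that contains them, otherwise NLTK will not find the files later. To test the setup, run from nltk.book import *; if you see output like the following, everything is fine.
[Image: output of from nltk.book import *]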
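Extracting every archive by hand is tedious, so here is a minimal sketch that does it with the standard library only. The function name unzip_in_place is my own invention (it is not part of NLTK), and it assumes the data folder has the usual layout of subfolders that each contain .zip files.

```python
import zipfile
from pathlib import Path


def unzip_in_place(data_root):
    """Extract every .zip found under data_root into the folder that contains it.

    Hypothetical helper for illustration; returns the list of archives it extracted.
    """
    extracted = []
    # sorted() collects all archive paths first, so files created while
    # extracting are never picked up by the directory walk.
    for archive in sorted(Path(data_root).rglob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
        extracted.append(archive)
    return extracted
```

On my machine the call would be unzip_in_place(r"D:\Anaconda\nltk_data"); substitute your own path. The archives themselves are left in place, so running it twice simply overwrites the extracted files.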

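I. Tokenization

Tokenization is the process of splitting text into smaller units called tokens, and it is regarded as an important step in NLP.
Once NLTK is installed and the Python interactive development environment (IDLE) is running, we can split a text or paragraph into individual sentences. To do so, we import the sentence tokenization function, whose argument is the text to be split. The sent_tokenize function uses an instance of NLTK's PunktSentenceTokenizer class. This instance has been trained on several European languages to recognize the letters and punctuation marks that signal where sentences begin and end.

Example 1: Splitting text into sentences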

text=" Welcome readers. I hope you find it interesting. Please do reply."
from nltk.tokenize import sent_tokenize
sent_tokenize(text)

Out[8]: [' Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']

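In this way, a given text is split into individual sentences, which we can then process further.
To tokenize a large batch of sentences, we can load a PunktSentenceTokenizer and call its tokenize() function, as the following code shows: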

import nltk
tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
text=" Hello everyone. Hope all are fine and doing well. Hope you find the book interesting"
tokenizer.tokenize(text)
Out[10]: 
[' Hello everyone.',
 'Hope all are fine and doing well.',
 'Hope you find the book interesting']
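To see why a trained sentence tokenizer is worth having, it helps to compare it with the most naive approach. The sketch below is my own illustration, not how Punkt works: it simply splits after '.', '!' or '?' when whitespace follows. The function name naive_sentences is hypothetical.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (illustration only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(naive_sentences(" Welcome readers. I hope you find it interesting. Please do reply."))
# ['Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']

# An abbreviation such as "Nov." breaks the naive rule:
print(naive_sentences("He will join on Nov. 29. Please do reply."))
# ['He will join on Nov.', '29.', 'Please do reply.']
```

The second call shows the weakness: a period after an abbreviation is treated as a sentence end. Handling cases like this is what the trained Punkt models are for.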

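Example 2: Tokenizing text in other languages

To tokenize text in a language other than English, we can load that language's pickle file (found under tokenizers/punkt) and pass the text to be split to tokenize(). For French text, we use the french.pickle file as follows: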

french_tokenizer=nltk.data.load('tokenizers/punkt/french.pickle')
french_tokenizer.tokenize(
"""Deux agressions en quelques jours, 
voilà ce qui a motivé hier matin le débrayage collège franco-britannique de Levallois-Perret. Deux agressions en quelques jours, 
voilà ce qui a motivé hier matin le débrayage Levallois. L'équipe 
pédagogique de ce collège de 750 élèves avait déjà été choquée 
par l'agression, janvier , d'un professeur d'histoire. L'équipe 
pédagogique de ce collège de 750 élèves avait déjà été choquée par 
l'agression, mercredi , d'un professeur d'histoire""")

Out[11]: ['Deux agressions en quelques jours, \nvoilà ce qui a motivé hier matin le débrayage collège franco-britannique de Levallois-Perret.',
 'Deux agressions en quelques jours, \nvoilà ce qui a motivé hier matin le débrayage Levallois.',
 "L'équipe \npédagogique de ce collège de 750 élèves avait déjà été choquée \npar l'agression, janvier , d'un professeur d'histoire.",
 "L'équipe \npédagogique de ce collège de 750 élèves avait déjà été choquée par \nl'agression, mercredi , d'un professeur d'histoire"]

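Example 3: Splitting sentences into words

Next we process individual sentences by splitting them into words, using the word_tokenize() function. word_tokenize uses an instance of NLTK's TreebankWordTokenizer class to perform the word tokenization.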

text = nltk.word_tokenize("PierreVinken , 59 years old , will join as a nonexecutive director on Nov. 29")
text
Out[12]:
['PierreVinken', ',', '59', 'years', 'old', ',', 'will', 'join', 'as', 'a', 'nonexecutive',
 'director', 'on', 'Nov.', '29']

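Word tokenization can also be done by instantiating TreebankWordTokenizer and calling its tokenize() function, whose argument is the sentence to be split into words. This tokenizer is rule-based rather than trained: it splits a sentence into words on whitespace and punctuation, following the Penn Treebank conventions.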

from nltk.tokenize import TreebankWordTokenizer 
tokenizer = TreebankWordTokenizer() 
tokenizer.tokenize("Have a nice day. I hope you find the book interesting")

Out[13]:
['Have', 'a', 'nice', 'day.', 'I', 'hope', 'you', 'find', 'the', 
'book', 'interesting']

Different tokenizers split the same sentence differently, for example:

text = nltk.word_tokenize(" Don't hesitate to ask questions")
print(text)
['Do', "n't", 'hesitate', 'to', 'ask', 'questions']

from nltk.tokenize import WordPunctTokenizer
tokenizer=WordPunctTokenizer() 
print(tokenizer.tokenize(" Don't hesitate to ask questions"))
['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']
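The WordPunctTokenizer result is easy to reproduce by hand, which makes clear where the split around the apostrophe comes from. As far as I know, NLTK documents this tokenizer as a regular-expression tokenizer using the pattern \w+|[^\w\s]+ (runs of word characters, or runs of punctuation). The sketch below applies that pattern with the standard re module; the function name wordpunct_like is my own.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (that is, punctuation).
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def wordpunct_like(text):
    """Split text into alternating word and punctuation tokens."""
    return WORD_OR_PUNCT.findall(text)


print(wordpunct_like(" Don't hesitate to ask questions"))
# ['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']
```

Because the apostrophe is not a word character, Don't falls apart into three tokens, whereas word_tokenize keeps the contraction n't together.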

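Example 4: Tokenizing with regular expressions

Word tokenization can be implemented with two kinds of regular expression:

By matching words.
By matching whitespace or gaps.
We can import RegexpTokenizer from NLTK and build a regular expression that matches the tokens in the text: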

# Match the words themselves
from nltk.tokenize import RegexpTokenizer 
tokenizer=RegexpTokenizer(r"[\w']+") 
tokenizer.tokenize("Don't hesitate to ask questions") 
["Don't", 'hesitate', 'to', 'ask', 'questions']

# Match the whitespace (gaps) between words
from nltk.tokenize import RegexpTokenizer 
tokenizer=RegexpTokenizer(r'\s+',gaps=True) 
tokenizer.tokenize("Don't hesitate to ask questions")
["Don't", 'hesitate', 'to', 'ask', 'questions']

# Select words that start with an uppercase letter and contain an apostrophe
from nltk.tokenize import RegexpTokenizer 
tokenizer=RegexpTokenizer(r"[A-Z]\w+'\w") 
tokenizer.tokenize("Don't hesitate to ask questions") 
["Don't"]
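The two strategies (matching tokens versus matching the gaps between them) correspond to re.findall and re.split in the standard library. The sketch below is only my illustration of that correspondence; the second sentence containing Alice and Bob is a made-up example.

```python
import re

sentence = "Don't hesitate to ask questions"

# Strategy 1: describe what a token looks like and collect every match.
by_match = re.findall(r"[\w']+", sentence)

# Strategy 2: describe the gaps between tokens and split on them.
by_gaps = [tok for tok in re.split(r"\s+", sentence) if tok]

# Variation: keep only tokens that start with an uppercase letter.
capitalized = re.findall(r"\b[A-Z][\w']*", "Don't hesitate to ask Alice or Bob")

print(by_match)     # ["Don't", 'hesitate', 'to', 'ask', 'questions']
print(by_gaps)      # ["Don't", 'hesitate', 'to', 'ask', 'questions']
print(capitalized)  # ["Don't", 'Alice', 'Bob']
```

Both strategies agree on this sentence; they differ once punctuation is attached to words, because splitting on whitespace keeps a trailing comma or period with the word.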

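Summary: there are many ways to tokenize text, and different tokenizers produce different results. To process natural language text consistently, we therefore need to normalize it, which is the subject of the next section.

II. Normalization

Normalization mainly involves removing punctuation, converting the whole text to upper or lower case, converting numbers to words, expanding abbreviations, and canonicalizing the text.

Example 1: Removing punctuation

Often, the punctuation produced during tokenization is of no use to later analysis, so we want to remove it. Removing punctuation is regarded as one of the main normalization tasks when working with NLTK.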

# Tokenization result before removing punctuation
text = [" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."]
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer() 
tokenized_docs = [tokenizer.tokenize(doc) for doc in text]
print(tokenized_docs)
[['It', 'is', 'a', 'pleasant', 'evening', '.'], 
['Guests', ',', 'who', 'came', 'from', 'US', 'arrived', 'at', 'the', 'venue'],
['Food', 'was', 'tasty', '.']]

# Remove punctuation
import re 
import string
from nltk.tokenize import WordPunctTokenizer
text = [" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."]
# string.punctuation holds all ASCII punctuation: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
# re.escape escapes every character that is special in a regular expression
tokenizer = WordPunctTokenizer() 
tokenized_docs = [tokenizer.tokenize(doc) for doc in text]
x = re.compile('[%s]' % re.escape(string.punctuation)) 
tokenized_docs_no_punctuation = []
for review in tokenized_docs: 
    new_review = [] 
    for token in review: 
        # x.sub('', token) is equivalent to re.sub(x, '', token):
        # every match of the pattern in token is replaced with ''
        new_token = x.sub('', token) 
        if new_token != '': 
            new_review.append(new_token) 
    tokenized_docs_no_punctuation.append(new_review)
    
print(tokenized_docs_no_punctuation)
[['It', 'is', 'a', 'pleasant', 'evening'], 
['Guests', 'who', 'came', 'from', 'US', 'arrived', 'at', 'the', 'venue'], 
['Food', 'was', 'tasty']]
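The same cleanup can be done without a regular expression. The sketch below is an alternative of my own, not the book's method: it uses str.translate with a table built from string.punctuation, and the helper name drop_punctuation is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def drop_punctuation(tokens):
    """Remove punctuation characters from each token and discard empty tokens."""
    cleaned = (tok.translate(STRIP_PUNCT) for tok in tokens)
    return [tok for tok in cleaned if tok]


docs = [
    ['It', 'is', 'a', 'pleasant', 'evening', '.'],
    ['Guests', ',', 'who', 'came', 'from', 'US', 'arrived', 'at', 'the', 'venue'],
    ['Food', 'was', 'tasty', '.'],
]
print([drop_punctuation(doc) for doc in docs])
```

Like the regular-expression version, this only knows about ASCII punctuation; curly quotes or Chinese punctuation would need to be added to the table.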

Example 2: Converting text case

The lower() and upper() functions convert a given text entirely to lower case or upper case. Case conversion is also part of text normalization.

text='HARdWork IS KEy to SUCCESS' 
print(text.lower()) 
hardwork is key to success 
print(text.upper()) 
HARDWORK IS KEY TO SUCCESS

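Example 3: Handling stop words

Stop words are words that are filtered out when performing information retrieval or other natural language tasks, because they contribute little to the overall meaning of a sentence. Many search engines remove stop words in order to narrow the search space. Removing stop words is regarded as one of the most important normalization tasks in NLP.
NLTK provides stop word lists for many languages. To access them from nltk_data/corpora/stopwords, we need to unzip the stopwords archive.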

from nltk.corpus import stopwords
stops = set(stopwords.words('english')) 
words=["Don't", 'hesitate','to','ask','questions'] 
[word for word in words if word not in stops]
["Don't", 'hesitate', 'ask', 'questions']
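Notice that Don't survived the filter even though don't is in the English stop word list: the list is lowercase and the comparison is case-sensitive. The sketch below illustrates the effect with a tiny hand-written stop word set (SAMPLE_STOPS is a made-up subset so that the example runs without NLTK data); the helper name remove_stops is hypothetical.

```python
# A tiny stand-in for stopwords.words('english'); assumption for illustration only.
SAMPLE_STOPS = {"to", "don't", "the", "a", "is"}


def remove_stops(words, stops):
    """Drop stop words, comparing in lowercase so capitalized words are caught too."""
    return [w for w in words if w.lower() not in stops]


words = ["Don't", 'hesitate', 'to', 'ask', 'questions']

# Case-sensitive comparison, as in the snippet above: "Don't" is kept.
print([w for w in words if w not in SAMPLE_STOPS])
# ["Don't", 'hesitate', 'ask', 'questions']

# Case-insensitive comparison: "Don't" is removed as well.
print(remove_stops(words, SAMPLE_STOPS))
# ['hesitate', 'ask', 'questions']
```

Whether removing Don't is desirable depends on the task; for sentiment-style work a negation may be exactly the word you want to keep.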

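The stopwords corpus is an instance of the nltk.corpus.reader.WordListCorpusReader class, which has a words() function that takes a fileid argument. Here the argument is 'english', which refers to all the stop words in the English file. If words() is called with no argument, it returns the stop words of all languages.
The languages for which stop word removal is available, that is, the languages that have a stop word file in NLTK, can be listed with the fileids() function: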

print(stopwords.fileids())
['arabic', 'azerbaijani', 'bengali', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']

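Any of the languages listed above can be passed to words() to obtain the stop words for that language. We can also print the stop words of a specific language: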

print(stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", 
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself',
 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 
 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 
 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 
 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 
 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 
 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 
 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once',
  'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
  'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 
  'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', 
  "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', 
  "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't",
   'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', 
   "wouldn't"]

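III. Substitution and Correction

In this section we discuss replacing tokens with other tokens. We also discuss how to correct the spelling of tokens (by replacing misspelled tokens with correctly spelled ones).

Example 1: Replacing words using regular expressions

Word replacement is needed to eliminate errors or to normalize text. One way to do it is with regular expressions. Earlier, we ran into problems when tokenizing contractions. With text replacement we can replace a contraction with its expanded form; for example, doesn't can be replaced with does not.
Start by writing the following code and save it as replacers.py in the folder that contains nltk_data (in my case the Anaconda folder), so that these replacement rules can be reused later.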

import re 
# Each pair is (pattern to match, replacement text).
# Raw strings keep the backslashes in \w and \g<1> intact.
replacement_patterns = [ 
(r'won\'t', 'will not'), 
(r'can\'t', 'cannot'), 
(r'i\'m', 'i am'), 
(r'ain\'t', 'is not'), 
(r'(\w+)\'ll', r'\g<1> will'), 
(r'(\w+)n\'t', r'\g<1> not'), 
(r'(\w+)\'ve', r'\g<1> have'), 
(r'(\w+)\'s', r'\g<1> is'), 
(r'(\w+)\'re', r'\g<1> are'), 
(r'(\w+)\'d', r'\g<1> would') 
] 
class RegexpReplacer(object): 
    def __init__(self, patterns=replacement_patterns): 
        # Compile each pattern once so that replace() can reuse it
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns] 

    def replace(self, text): 
        s = text 
        # Apply every rule in order to the running result
        for (pattern, repl) in self.patterns: 
            s = pattern.sub(repl, s) 
        return s

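Here we define the replacement patterns: the first item of each pair is the pattern to match, and the second is its replacement. The RegexpReplacer class compiles the pattern pairs and provides a replace() method that applies each replacement in turn.
Testing it: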

import nltk 
from replacers import RegexpReplacer 
replacer = RegexpReplacer() 
replacer.replace("Don't hesitate to ask questions") 
'Do not hesitate to ask questions'
replacer.replace("She must've gone to the market but she didn't go")
'She must have gone to the market but she did not go'
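One detail of the pattern list is easy to miss: the order of the rules matters. Specific contractions such as won't have to come before the general n't rule, otherwise the general rule gets there first. The sketch below demonstrates this with just two rules; the expand helper is a stripped-down stand-in of my own for RegexpReplacer.replace.

```python
import re

specific_first = [
    (r"won't", "will not"),         # special case
    (r"(\w+)n't", r"\g<1> not"),    # general rule
]
general_first = list(reversed(specific_first))


def expand(text, rules):
    """Apply each (pattern, replacement) rule in order."""
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text


print(expand("I won't go", specific_first))  # I will not go
print(expand("I won't go", general_first))   # I wo not go
```

The patterns are also case-sensitive, so a rule written as i'm will not touch I'm unless the pattern is compiled with re.IGNORECASE or the text is lowercased first.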

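Example 2: Replacing before tokenizing

Token replacement can be performed before tokenization to avoid the problems that arise when tokenizing contractions: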

import nltk 
from nltk.tokenize import word_tokenize 
from replacers import RegexpReplacer 
replacer=RegexpReplacer() 
word_tokenize("Don't hesitate to ask questions") 
['Do', "n't", 'hesitate', 'to', 'ask', 'questions']

word_tokenize(replacer.replace("Don't hesitate to ask questions"))
['Do', 'not', 'hesitate', 'to', 'ask', 'questions']

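Example 3: Handling repeated characters

People sometimes write words with repeated characters that make them misspelled. Consider the sentence: I like it a lotttttt. Here lotttttt means lot. We will remove such repeated characters using backreferences, where a reference such as \2 matches the same text that an earlier group in the regular expression matched. Removing repeated characters is also regarded as a normalization task.
First, append the following code to the replacers.py file created earlier: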

class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'


    def replace(self, word):
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

Testing it:

from replacers import RepeatReplacer 
replacer=RepeatReplacer() 
replacer.replace('lotttt') 
'lot'
replacer.replace('ohhhhh') 
'oh'

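The problem with RepeatReplacer is that it turns happy into hapy, which is wrong. To avoid this, we can use WordNet alongside it. (WordNet is an English lexical database grounded in cognitive linguistics, designed jointly by psychologists, linguists, and computer engineers at Princeton University. Rather than merely listing words alphabetically, it organizes them by meaning into a "network of words"; for our purposes it can be treated as a dictionary.) The code becomes: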

import re 
from nltk.corpus import wordnet 
class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'


    def replace(self, word):
        # If WordNet knows the word, return it unchanged; otherwise remove a repeated character
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

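To understand the code above: the regular expression matches zero or more word characters (\w*), then a single word character (\w) immediately followed by the same character again (\2), then zero or more word characters (\w*). \1, \2, and \3 are backreferences to the text captured by the first, second, and third groups.
For example, because \w* is greedy, lotttt is split as (lott)(t)t(); the replacement \1\2\3 drops one t, and the string becomes lottt.
The process repeats until no doubled character remains, and the final result is lot.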
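To watch the shrinking happen step by step, the sketch below records every intermediate string. It uses the same regular expression as RepeatReplacer but a loop instead of recursion, and a plain set of known words stands in for the WordNet lookup (an assumption so that the example runs without NLTK data). The name collapse_steps is my own.

```python
import re

REPEAT = re.compile(r"(\w*)(\w)\2(\w*)")


def collapse_steps(word, known_words=frozenset()):
    """Return every intermediate form while doubled characters are removed.

    Stops early as soon as the current form is in known_words.
    """
    steps = [word]
    while word not in known_words:
        shorter = REPEAT.sub(r"\1\2\3", word)
        if shorter == word:  # no doubled character left
            break
        word = shorter
        steps.append(word)
    return steps


print(collapse_steps("lotttt"))             # ['lotttt', 'lottt', 'lott', 'lot']
print(collapse_steps("happy"))              # ['happy', 'hapy']
print(collapse_steps("happy", {"happy"}))   # ['happy']
print(collapse_steps("goooood", {"good"}))  # ['goooood', 'gooood', 'goood', 'good']
```

The last two calls show what the dictionary check buys us: the process stops at the first form that is a real word instead of collapsing every double letter.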

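Example 4: Replacing words with synonyms

Now we will see how to replace a given word with a synonym. We add a class named WordReplacer to the existing replacers.py file; it holds a mapping from words to their synonyms.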

class WordReplacer(object):
    def __init__(self, word_map):
        self.word_map = word_map


    def replace(self, word):
        return self.word_map.get(word, word)

Testing it:

from replacers import WordReplacer 
replacer=WordReplacer({'congrats':'congratulations'}) 
replacer.replace('congrats')
'congratulations'
replacer.replace('maths') 
'maths'

In the code above, replace() looks the word up in word_map. If the word has a synonym there, it is replaced by that synonym; otherwise no replacement is made and the word itself is returned.

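IV. Applying Zipf's Law to Text

Zipf's law states that the frequency of a token in a text is inversely proportional to its rank in the frequency-sorted list. The law describes how tokens are distributed in a language: some tokens occur very frequently, others occur less often, and some hardly occur at all.

How should we understand this? It resembles the 80/20 rule: if we put all the words of a text together, a small share of the distinct words accounts for most of the occurrences. The exact proportions vary from text to text.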
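A quick way to check the law on any text is to list each word with its rank and its count: if frequency is inversely proportional to rank, then rank times count should stay roughly constant. The sketch below does this with collections.Counter; the function name rank_frequency and the toy sentence are my own, and a sentence this short can only hint at the pattern.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, rank * count) rows, most frequent word first."""
    counts = Counter(tokens)
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


tokens = "the cat and the dog and the bird saw the fox".split()
for row in rank_frequency(tokens)[:3]:
    print(row)
# (1, 'the', 4, 4)
# (2, 'and', 2, 4)
# (3, 'cat', 1, 3)
```

On a real corpus you would feed in the tokens of a whole book and plot count against rank on log-log axes, where an approximately straight line is the usual sign of Zipf-like behavior.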

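V. Similarity Measures

Many similarity measures are available for NLP tasks. The nltk.metrics package in NLTK provides a range of evaluation and similarity measures that are useful for a wide variety of NLP tasks. To test the performance of taggers, chunkers, and similar tools in NLP, we can use the standard scores borrowed from information retrieval.
Let us see how to analyze the output of a named entity recognizer against standard labels (obtained from a training file):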

from nltk.metrics import accuracy, precision, recall, f_measure 
training='PERSON OTHER PERSON OTHER OTHER ORGANIZATION'.split() 
testing='PERSON OTHER OTHER OTHER OTHER OTHER'.split()
print(accuracy(training,testing))
0.6666666666666666

# precision, recall and f_measure compare sets, so positions and duplicates are ignored
trainset=set(training)
testset=set(testing)
print(precision(trainset,testset))
1.0

print(recall(trainset,testset))
0.6666666666666666

print(f_measure(trainset,testset))
0.8
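To see where these four numbers come from, the scores can be recomputed by hand. The sketch below uses the textbook definitions (position-wise agreement for accuracy, set overlap for precision and recall, and their harmonic mean for the F-measure). The function names are my own, this is not NLTK's source code, and it assumes both sets are non-empty.

```python
def token_accuracy(reference, test):
    """Fraction of positions where the two label sequences agree."""
    if len(reference) != len(test):
        raise ValueError("sequences must have the same length")
    return sum(r == t for r, t in zip(reference, test)) / len(reference)


def set_precision(reference, test):
    """Share of the test labels that also appear in the reference set."""
    return len(reference & test) / len(test)


def set_recall(reference, test):
    """Share of the reference labels that were found in the test set."""
    return len(reference & test) / len(reference)


def set_f1(reference, test):
    """Harmonic mean of precision and recall."""
    p, r = set_precision(reference, test), set_recall(reference, test)
    return 2 * p * r / (p + r) if p + r else 0.0


training = 'PERSON OTHER PERSON OTHER OTHER ORGANIZATION'.split()
testing = 'PERSON OTHER OTHER OTHER OTHER OTHER'.split()
print(token_accuracy(training, testing))            # 4 of 6 positions agree
print(set_precision(set(training), set(testing)))   # 2 of 2 test labels are in the reference
print(set_recall(set(training), set(testing)))      # 2 of 3 reference labels were found
print(set_f1(set(training), set(testing)))
```

Because the sets contain label types rather than individual tokens, the precision of 1.0 only says that every label the recognizer used also occurs in the reference; it says nothing about whether each token received the right label.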

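1. Measuring similarity with the edit distance algorithm

The edit distance, or Levenshtein distance, between two strings is the number of characters that must be inserted, substituted, or deleted to make the two strings equal.
Let D(i,j) be the edit distance between the first i characters of the first string s and the first j characters of the second string t. The algorithm takes the minimum over the following operations:

Copy a letter from the first string to the second (cost 0), or substitute one letter for another (cost 1):
D(i−1,j−1) + d(si,tj) (substitution / copy)
Delete a letter from the first string (cost 1):
D(i−1,j) + 1 (deletion)
Insert a letter of the second string (cost 1):
D(i,j−1) + 1 (insertion)

Putting these together: D(i,j) = min( D(i−1,j−1) + d(si,tj), D(i−1,j) + 1, D(i,j−1) + 1 ), where d(si,tj) is 0 if the two letters are equal and 1 otherwise.

The Python code for the edit distance algorithm can be found in the nltk.metrics package.

Let us look at code that uses the nltk.metrics package to compute the edit distance: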

import nltk 
from nltk.metrics import * 
edit_distance("relate","relation")
3

edit_distance("suggestion","calculation")
7

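Here, computing the edit distance between relate and relation takes three operations (one substitution and two insertions). Computing the edit distance between suggestion and calculation takes seven operations (six substitutions and one insertion).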
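For readers who want to see the recurrence at work, here is a minimal dynamic-programming sketch of the Levenshtein distance. It is my own implementation for illustration, not the code from nltk.metrics, and it keeps only two rows of the table D at a time. Here D(i,j) is the distance between the first i characters of s and the first j characters of t.

```python
def levenshtein(s, t):
    """Number of insertions, deletions and substitutions needed to turn s into t."""
    # previous[j] holds D(i-1, j); row 0 is the cost of building t[:j] from "".
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]  # D(i, 0): delete all i characters of s[:i]
        for j, t_char in enumerate(t, start=1):
            substitution = previous[j - 1] + (s_char != t_char)  # copy costs 0
            deletion = previous[j] + 1       # drop s_char
            insertion = current[j - 1] + 1   # add t_char
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("relate", "relation"))         # 3
print(levenshtein("suggestion", "calculation"))  # 7
```

The two results match the edit_distance output above. Memory use is proportional to the length of t, and the running time to the product of the two lengths.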

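2. Measuring similarity with the Jaccard coefficient

The Jaccard coefficient, or Tanimoto coefficient, measures the similarity of two sets X and Y as the size of their intersection relative to the size of their union. It is defined as follows:

Jaccard(X,Y) = |X∩Y| / |X∪Y|.
Jaccard(X,X) = 1.
Jaccard(X,Y) = 0 if X∩Y = ∅.
Let us look at the Jaccard implementation in NLTK, which returns the Jaccard distance: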

import nltk 
from nltk.metrics import * 
X=set([10,20,30,40]) 
Y=set([20,30,60]) 
print(jaccard_distance(X,Y))
0.6

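When X == Y the Jaccard coefficient is 1; when X and Y are disjoint it is 0.
The Jaccard distance expresses how dissimilar two samples or sets are: the larger the Jaccard distance, the lower the similarity. Its drawback is that it only applies to sets of binary (present or absent) data. The Jaccard similarity and the Jaccard distance are related by Jaccard distance(A, B) = 1 - Jaccard(A, B), which is why the code above prints 0.6 for two sets whose Jaccard coefficient is 2/5 = 0.4.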
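Both quantities take only a few lines with Python's built-in set operations. The sketch below is my own illustration of the definitions, not NLTK's code; the function names are hypothetical, and treating two empty sets as identical is an assumption I made to avoid dividing by zero.

```python
def jaccard_similarity(x, y):
    """|X ∩ Y| / |X ∪ Y| for two sets."""
    union = x | y
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(x & y) / len(union)


def jaccard_dissimilarity(x, y):
    """Jaccard distance, defined as 1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(x, y)


X = {10, 20, 30, 40}
Y = {20, 30, 60}
print(jaccard_similarity(X, Y))     # 2 shared elements out of 5 distinct ones
print(jaccard_dissimilarity(X, Y))  # the complement of the similarity
```

For the sets used above, the intersection {20, 30} has 2 elements and the union has 5, so the similarity is 0.4 and the distance 0.6, in line with the jaccard_distance output.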

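3. Other string similarity measures

Binary distance is a string similarity measure. It returns 0.0 if two labels are identical and 1.0 otherwise.
Let us see how the binary distance measure is used in NLTK: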

import nltk 
from nltk.metrics import * 

X = set([10,20,30,40]) 
Y= set([30,50,70]) 
binary_distance(X, Y)
1.0

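We have now learned a variety of operations that can be performed on text (made up of collections of strings). We have covered the concepts of tokenization, substitution, and normalization, and applied various similarity measures to strings using NLTK. We also discussed Zipf's law, which may hold for some existing documents.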