Case Studies in Corpus Data Processing (Computing Collocation Strength, Removing Stopwords from a Word List, KWIC Implementation of Corpus Retrieval)

7.5 Computing Collocation Strength

  • Collocation is a hallmark of idiomatic language use and an important indicator distinguishing native from non-native language, so both corpus linguistics and language teaching attach great importance to the study of collocation. For example, the Chinese 吃饭 is a verb-noun collocation, pairing the verb 吃 ('eat') with the noun 饭 ('meal'), whereas English speakers typically say "have a meal" and rarely "eat a meal"; Chinese allows both 喝茶 ('drink tea') and 吃茶 (literally 'eat tea'), while English only says "drink tea", never "eat tea"; similarly, Chinese says 吃药 (literally 'eat medicine') where English says "take medicine". Computational linguistics also attaches great importance to collocation research, and sometimes uses the term collocation for the Ngrams introduced in the previous section.
  • We can measure collocation strength simply by counting Ngram frequencies, or with statistical tests such as the chi-square test, pointwise mutual information (PMI), and the log-likelihood ratio. The NLTK library provides modules that compute these measures for bigrams and other ngrams.
  • This subsection discusses how to compute Ngram frequencies and how to compute chi-square values, PMI values, and similar scores for Ngrams. In addition, in Section 7.12, when we discuss the Stanford CoreNLP toolkit, we will also show how to extract verb-noun, adjective-noun, and other collocations through syntactic parsing and compute their collocation strength.
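  • For reference, the pointwise mutual information of a word pair (x, y) is standardly defined as follows (a textbook formulation; NLTK's pmi measure implements this log-base-2 definition):

$$\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

Here P(x, y) is the probability that the two words co-occur, and P(x) and P(y) are their individual probabilities; the larger the value, the more strongly the pair co-occurs beyond what chance alone would predict.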

7.5.1 Computing Collocation Frequencies

  • In Section 7.4, we extracted the text's word chunks (Ngrams), cleaned them, and saved the result in the list n_grams_AlphaNum. We can use the frequency of these chunks to represent their strength, for example by extracting only chunks with a frequency of at least 2. See the code below.
# to compute the frequency of ngrams in n_grams_AlphaNum
# put this snippet of code after the second snippet of code in Section 7.4
freq_dict = {}

# tally the frequency of each four-word chunk
for i in n_grams_AlphaNum:
    if i in freq_dict:
        freq_dict[i] += 1
    else:
        freq_dict[i] = 1

# print only the chunks that occur at least twice
for j in freq_dict:
    if freq_dict[j] >= 2:
        print(j[0], j[1], j[2], j[3], '\t', freq_dict[j])
  • Because the text in this example is very small, every extracted four-word chunk has a frequency of 1, so nothing is printed here. Readers can experiment with a larger text.
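  • As a side note, the standard-library collections.Counter produces the same counts more compactly. Below is a minimal alternative sketch, assuming n_grams_AlphaNum holds the four-word chunks from Section 7.4.
from collections import Counter

# Counter tallies the frequency of every chunk in one pass
freq_dict = Counter(n_grams_AlphaNum)
for chunk, freq in freq_dict.items():
    if freq >= 2:  # keep only chunks occurring at least twice
        print(' '.join(chunk), '\t', freq)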

7.5.2 Computing the Collocation Strength of Two-Word Chunks

  • The collocations module of the NLTK library provides BigramAssocMeasures and related classes for computing the association strength of two-word chunks. See the code below.
import nltk.collocations
string = '''I give Pirrip as my father's family name, on the authority of his tombstone and my sister,--Mrs. Joe Gargery, who married the blacksmith. As I never saw my father or my mother, and never saw any likeness of either of them (for their days were long before the days of photographs), my first fancies regarding what they were like were unreasonably derived from their tombstones. The shape of the letters on my father's, gave me an odd idea that he was a square, stout, dark man, with curly black hair.'''
string_tokenized = nltk.word_tokenize(string.lower())
finder = nltk.collocations.BigramCollocationFinder.from_words(string_tokenized)
bgm = nltk.collocations.BigramAssocMeasures()
scored = finder.score_ngrams(bgm.likelihood_ratio)
scored  # evaluating scored in an interactive session displays the list; use print(scored) in a script

  • In the code above, we first import nltk.collocations and define the text to be processed, and then tokenize the text with nltk.word_tokenize. Next, we use the BigramCollocationFinder.from_words() function in nltk.collocations to extract the two-word chunks from the tokenized text and assign them to finder. If we execute print(finder), the returned result is <nltk.collocations.BigramCollocationFinder object at 0x919eacc>; in other words, finder is actually a BigramCollocationFinder object.
  • Next, we instantiate nltk.collocations.BigramAssocMeasures() and use the finder.score_ngrams() function to compute the likelihood_ratio score of the bigrams. finder.score_ngrams() takes only one argument, the statistical measure to compute, which may be given as bgm.likelihood_ratio, bgm.student_t, bgm.chi_sq, bgm.pmi, bgm.dice, and so on. In this example, we set the association measure to bgm.likelihood_ratio. Finally, the printed result is as follows (excerpt):
[(('never', 'saw'), 19.91866838483344),
 (('my', 'father'), 19.09923162702352),
 (('father', "'s"), 16.09958337506456),
 (('(', 'for'), 11.354974483983558),
 (('--', 'mrs.'), 11.354974483983558),
 (('a', 'square'), 11.354974483983558),
 (('an', 'odd'), 11.354974483983558),
 (('any', 'likeness'), 11.354974483983558),
 (('black', 'hair'), 11.354974483983558),
 (('curly', 'black'), 11.354974483983558),
 (('dark', 'man'), 11.354974483983558),
 .......
 (('of', 'the'), 1.6576837641065838),
 ((',', 'my'), 0.4658676573818533)]
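  • If only the top-ranked pairs are needed, the finder object also provides nbest(), and apply_freq_filter() can discard low-frequency bigrams before scoring. Below is a minimal sketch continuing from the finder and bgm objects above; the frequency cutoff of 2 and the choice of 10 pairs are arbitrary.
# discard bigrams that occur fewer than 2 times
finder.apply_freq_filter(2)

# the 10 bigrams with the highest likelihood-ratio scores
print(finder.nbest(bgm.likelihood_ratio, 10))

# the same ranking under PMI instead of the likelihood ratio
print(finder.nbest(bgm.pmi, 10))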

7.5.3 Computing the Collocation Strength of Three-Word Chunks

  • Analogous to the bigram case, the collocations module of the NLTK library provides TrigramAssocMeasures and related classes for computing the association strength of three-word chunks. See the code below.
import nltk
import nltk.collocations

string = '''I give Pirrip as my father's family name, on the authority of his tombstone and my sister,--Mrs. Joe Gargery, who married the blacksmith. As I never saw my father or my mother, and never saw any likeness of either of them (for their days were long before the days of photographs), my first fancies regarding what they were like were unreasonably derived from their tombstones. The shape of the letters on my father's, gave me an odd idea that he was a square, stout, dark man, with curly black hair.'''

string_tokenized = nltk.word_tokenize(string.lower())
bgm = nltk.collocations.TrigramAssocMeasures()
finder = nltk.collocations.TrigramCollocationFinder.from_words(string_tokenized)
scored = finder.score_ngrams(bgm.likelihood_ratio)
print(scored)
  • The only difference from the bigram code is that BigramAssocMeasures and BigramCollocationFinder are replaced with TrigramAssocMeasures and TrigramCollocationFinder. As before, we set the association measure to bgm.likelihood_ratio.
  • Finally, we print the results. The output is as follows (excerpt):
[(('my', 'father', "'s"), 35.19881500208808),
 (('never', 'saw', 'any'), 28.50105414657721),
 (('my', 'father', 'or'), 26.635121101238198),
 (('and', 'never', 'saw'), 25.747333628748777),
 (('i', 'never', 'saw'), 25.747333628748777),
 ((',', 'on', 'the'), 9.527528647915378)]
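  • Punctuation tokens such as ',' appear among the scored trigrams above. The finder's apply_word_filter() method can remove candidate n-grams containing unwanted tokens before scoring. Below is a minimal sketch continuing from the finder and bgm objects above; the punctuation test shown is just one possible filtering choice.
import string as string_module  # aliased so it does not clash with the variable string above

# drop any trigram that contains a token consisting entirely of punctuation
finder.apply_word_filter(lambda w: all(ch in string_module.punctuation for ch in w))
scored = finder.score_ngrams(bgm.likelihood_ratio)
print(scored)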

7.6 Removing Stopwords from a Word List

  • Stopwords are words that occur with very high frequency in a text, such as pronouns, prepositions, and adverbs. When analyzing a word list, we often want to focus on content words and may not care much about stopwords, so we can use Python to delete the stopwords from a word list and concentrate on the remaining words. The NLTK library ships with built-in stopword lists for multiple languages; the English list can be referenced with the statement stopwords.words('english'). See the code below.
import nltk
from nltk.corpus import stopwords

stopwords_list = stopwords.words('english')
print(stopwords_list) # print the stopword list


string = '''I give Pirrip as my father's family name, on the authority of his tombstone and my sister,--Mrs. Joe Gargery, who married the blacksmith. As I never saw my father or my mother, and never saw any likeness of either of them (for their days were long before the days of photographs), my first fancies regarding what they were like were unreasonably derived from their tombstones. The shape of the letters on my father's, gave me an odd idea that he was a square, stout, dark man, with curly black hair.'''

wordlist = nltk.word_tokenize(string.lower())

for word in wordlist:
    if word not in stopwords_list:
        print(word)
  • We define the text to be processed, tokenize it with the nltk.word_tokenize() function to build the text's word list, and assign the list to the variable wordlist. Finally, a for ... in loop iterates over the words in wordlist and prints every word that is not in the stopword list.
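  • If the filtered words should be collected in a new list rather than only printed, a list comprehension does the same job. Below is a minimal sketch, continuing from the wordlist and stopwords_list variables above; the name content_words is our own choice.
# collect every token that is not in the stopword list
content_words = [word for word in wordlist if word not in stopwords_list]
print(content_words)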

7.7 KWIC Implementation of Corpus Retrieval

  • When retrieving keywords with WordSmith or AntConc, the results are often displayed in Key Word in Context (KWIC) format: the keyword is aligned in a centered column, with a certain number of words or characters of context on each side, making it convenient for researchers to read the keyword in context. The NLTK library's concordance() function implements KWIC keyword retrieval.
  • Consider the following example. We want to search the text ge.txt for the keyword 'but' and return the results in KWIC format. The code is as follows:
import nltk

file_in = open(r'D:\works\文本分析\leopythonbookdata-master\texts\ge.txt', 'r', encoding='utf-8')  # assuming the file is UTF-8 encoded
raw_text = file_in.read()
tokens = nltk.word_tokenize(raw_text)

# concordance() only works on nltk.Text objects, so convert the token list first
nltk_text = nltk.Text(tokens)
nltk_text.concordance('but')
  • In the code above, we first import nltk and then define a file handle for ge.txt. The next two lines read ge.txt and tokenize it. The second-to-last line converts the tokenized text list into NLTK's Text data type with the nltk.Text() function, because concordance() can only search NLTK Text data. The last line retrieves 'but' with concordance().
  • The basic form of the concordance() function is concordance(keyword, width=75, lines=25), where keyword is the search term; by default, each returned line is 75 characters wide and at most 25 concordance lines are shown. If the defaults are acceptable, no parameters besides the keyword need to be set; width and lines can also be given other values.
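  • For example, to widen the context and limit the number of hits (the values 100 and 10 below are arbitrary choices), we can continue from the nltk_text object above:
# each concordance line 100 characters wide, at most 10 lines shown
nltk_text.concordance('but', width=100, lines=10)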