Mastering Natural Language Processing with Python (《精通Python自然语言处理》)
Deepti Chopra (India)
Translated by Wang Wei (王威)
Chapter 8  Information Retrieval: Accessing Information
8.1 Introduction to Information Retrieval
Information retrieval can be defined as the process of retrieving the most appropriate information in response to a user's query. The accuracy of an information retrieval task is measured in terms of precision and recall:
| Measure | Formula |
|---|---|
| Recall | R = \|X ∩ Y\| / \|Y\| |
| Precision | P = \|X ∩ Y\| / \|X\| |
| F-measure | F = 2\|X ∩ Y\| / (\|X\| + \|Y\|) |

Here X is the set of retrieved documents and Y is the set of relevant documents, so the F-measure is the harmonic mean of precision and recall.
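As a quick check of these formulas, here is a minimal sketch (not from the book) that computes all three measures from a retrieved set X and a relevant set Y:

def evaluate(X, Y):
    # X: set of retrieved documents, Y: set of relevant documents
    overlap = len(X & Y)
    precision = overlap / len(X)
    recall = overlap / len(Y)
    f_measure = 2 * overlap / (len(X) + len(Y))
    return precision, recall, f_measure

# 3 documents retrieved, 4 relevant, 2 in common
print(evaluate({'d1', 'd2', 'd3'}, {'d2', 'd3', 'd4', 'd5'}))
# ≈ (0.667, 0.5, 0.571); F is the harmonic mean of P and R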
8.1.1 Stop Word Removal
Obtain the set of stop words that can be detected for English:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))
Compute the fraction of words that are not stop words:
def not_stopwords(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)

print(not_stopwords(nltk.corpus.reuters.words()))
Convert uppercase letters to lowercase and then remove the stop words:
import nltk
from collections import Counter
import string
from nltk.corpus import stopwords
def get_tokens():
    # Read the file, lowercase it, strip punctuation, and tokenize.
    # str.maketrans replaces the Python 2 translate(None, ...) idiom.
    with open('/home/d/TRY/NLTK/STOP.txt') as stopl:
        text = stopl.read().lower()
        text = text.translate(str.maketrans('', '', string.punctuation))
    return nltk.word_tokenize(text)

if __name__ == "__main__":
    tokens = get_tokens()
    print("tokens[:20] = %s" % (tokens[:20]))
    count1 = Counter(tokens)
    print("before: len(count1) = %s" % (len(count1)))
    filtered1 = [w for w in tokens if w not in stopwords.words('english')]
    print("filtered1 tokens[:20] = %s" % (filtered1[:20]))
    count1 = Counter(filtered1)
    print("after: len(count1) = %s" % (len(count1)))
    print("most_common = %s" % (count1.most_common(10)))
    tagged1 = nltk.pos_tag(filtered1)
    print("tagged1[:2] = %s" % (tagged1[:2]))
8.1.2 Information Retrieval Using a Vector Space Model
One way of representing documents as vectors is to use TF-IDF (Term Frequency-Inverse Document Frequency):
| Term | Definition |
|---|---|
| TF (term frequency) | The number of times a given token occurs in a document, divided by the total number of tokens in that document; in other words, the frequency of a particular term within a given document. |
| IDF (inverse document frequency) | A measure of how rare a term is across the corpus, computed from the number of documents in the corpus that contain the given term (the fewer documents contain it, the higher its IDF). |
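Before the Foursquare-based example below, here is a minimal, self-contained sketch of these two definitions on a toy corpus (corpus invented for illustration):

import math

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

def tf(term, doc):
    # Frequency of the term within a single document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of (number of documents / number of documents containing the term)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

print(tf("the", corpus[0]))                       # 2/6 ≈ 0.333
print(idf("cat", corpus))                         # log(3/1) ≈ 1.099
print(tf("cat", corpus[0]) * idf("cat", corpus))  # TF-IDF ≈ 0.183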
Tokenize every document in the corpus and compute the frequency of each term:
# OAuthHandler and API come from a Foursquare client library (e.g. pyfoursquare);
# CLIENT_ID, CLIENT_SECRET, CALLBACK, and ACCESS_TOKEN are your app credentials.
# These imports are shared by the following examples.
import re
import math
import nltk
from nltk import bigrams, trigrams
from nltk.tokenize import RegexpTokenizer

authen = OAuthHandler(CLIENT_ID, CLIENT_SECRET, CALLBACK)
authen.set_access_token(ACCESS_TOKEN)
ap = API(authen)
venue = ap.venues(id='4bd47eeb5631c9b69672a230')
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = RegexpTokenizer(r"[\w']+", flags=re.UNICODE)

def freq(word, tokens):
    return tokens.count(word)

# Compute the frequency for each term.
vocabulary = []
docs = {}
all_tips = []
for tip in venue.tips():
    tokens = tokenizer.tokenize(tip.text)
    bitokens = bigrams(tokens)
    tritokens = trigrams(tokens)
    tokens = [token.lower() for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]
    bitokens = [' '.join(token).lower() for token in bitokens]
    bitokens = [token for token in bitokens if token not in stopwords]
    tritokens = [' '.join(token).lower() for token in tritokens]
    tritokens = [token for token in tritokens if token not in stopwords]
    ftokens = []
    ftokens.extend(tokens)
    ftokens.extend(bitokens)
    ftokens.extend(tritokens)
    docs[tip.text] = {'freq': {}}
    for token in ftokens:
        docs[tip.text]['freq'][token] = freq(token, ftokens)
print(docs)
Normalize the tf (term frequency) vectors:
# Same setup and imports as in the previous example.
authen = OAuthHandler(CLIENT_ID, CLIENT_SECRET, CALLBACK)
authen.set_access_token(ACCESS_TOKEN)
ap = API(authen)
venue = ap.venues(id='4bd47eeb5631c9b69672a230')
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = RegexpTokenizer(r"[\w']+", flags=re.UNICODE)

def freq(word, tokens):
    return tokens.count(word)

def word_count(tokens):
    return len(tokens)

def tf(word, tokens):
    return (freq(word, tokens) / float(word_count(tokens)))

# Compute the frequency for each term.
vocabulary = []
docs = {}
all_tips = []
for tip in venue.tips():
    tokens = tokenizer.tokenize(tip.text)
    bitokens = bigrams(tokens)
    tritokens = trigrams(tokens)
    tokens = [token.lower() for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]
    bitokens = [' '.join(token).lower() for token in bitokens]
    bitokens = [token for token in bitokens if token not in stopwords]
    tritokens = [' '.join(token).lower() for token in tritokens]
    tritokens = [token for token in tritokens if token not in stopwords]
    ftokens = []
    ftokens.extend(tokens)
    ftokens.extend(bitokens)
    ftokens.extend(tritokens)
    docs[tip.text] = {'freq': {}, 'tf': {}}
    for token in ftokens:
        # The computed frequency
        docs[tip.text]['freq'][token] = freq(token, ftokens)
        # The normalized frequency (term frequency)
        docs[tip.text]['tf'][token] = tf(token, ftokens)
print(docs)
Compute the inverse document frequency (IDF) of each term:
# Same setup and imports as in the previous examples.
authen = OAuthHandler(CLIENT_ID, CLIENT_SECRET, CALLBACK)
authen.set_access_token(ACCESS_TOKEN)
ap = API(authen)
venue = ap.venues(id='4bd47eeb5631c9b69672a230')
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = RegexpTokenizer(r"[\w']+", flags=re.UNICODE)

def freq(word, doc):
    return doc.count(word)

def word_count(doc):
    return len(doc)

def tf(word, doc):
    return (freq(word, doc) / float(word_count(doc)))

def num_docs_containing(word, list_of_docs):
    count = 0
    for document in list_of_docs:
        if freq(word, document) > 0:
            count += 1
    return 1 + count

def idf(word, list_of_docs):
    return math.log(len(list_of_docs) / float(num_docs_containing(word, list_of_docs)))

# Compute the frequency for each term.
vocabulary = []
docs = {}
all_tips = []
for tip in venue.tips():
    tokens = tokenizer.tokenize(tip.text)
    bitokens = bigrams(tokens)
    tritokens = trigrams(tokens)
    tokens = [token.lower() for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]
    bitokens = [' '.join(token).lower() for token in bitokens]
    bitokens = [token for token in bitokens if token not in stopwords]
    tritokens = [' '.join(token).lower() for token in tritokens]
    tritokens = [token for token in tritokens if token not in stopwords]
    ftokens = []
    ftokens.extend(tokens)
    ftokens.extend(bitokens)
    ftokens.extend(tritokens)
    docs[tip.text] = {'freq': {}, 'tf': {}, 'idf': {}}
    for token in ftokens:
        # The frequency computed for each tip
        docs[tip.text]['freq'][token] = freq(token, ftokens)
        # The term frequency (normalized frequency)
        docs[tip.text]['tf'][token] = tf(token, ftokens)
    vocabulary.append(ftokens)

for doc in docs:
    for token in docs[doc]['tf']:
        # The inverse document frequency
        docs[doc]['idf'][token] = idf(token, vocabulary)
print(docs)
Compute the TF-IDF value of each term in the documents:
# Same setup and imports as in the previous examples.
authen = OAuthHandler(CLIENT_ID, CLIENT_SECRET, CALLBACK)
authen.set_access_token(ACCESS_TOKEN)
ap = API(authen)
venue = ap.venues(id='4bd47eeb5631c9b69672a230')
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = RegexpTokenizer(r"[\w']+", flags=re.UNICODE)

def freq(word, doc):
    return doc.count(word)

def word_count(doc):
    return len(doc)

def tf(word, doc):
    return (freq(word, doc) / float(word_count(doc)))

def num_docs_containing(word, list_of_docs):
    count = 0
    for document in list_of_docs:
        if freq(word, document) > 0:
            count += 1
    return 1 + count

def idf(word, list_of_docs):
    return math.log(len(list_of_docs) / float(num_docs_containing(word, list_of_docs)))

def tf_idf(word, doc, list_of_docs):
    return (tf(word, doc) * idf(word, list_of_docs))

# Compute the frequency for each term.
vocabulary = []
docs = {}
all_tips = []
for tip in venue.tips():
    tokens = tokenizer.tokenize(tip.text)
    bitokens = bigrams(tokens)
    tritokens = trigrams(tokens)
    tokens = [token.lower() for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]
    bitokens = [' '.join(token).lower() for token in bitokens]
    bitokens = [token for token in bitokens if token not in stopwords]
    tritokens = [' '.join(token).lower() for token in tritokens]
    tritokens = [token for token in tritokens if token not in stopwords]
    ftokens = []
    ftokens.extend(tokens)
    ftokens.extend(bitokens)
    ftokens.extend(tritokens)
    docs[tip.text] = {'freq': {}, 'tf': {}, 'idf': {}, 'tf-idf': {}, 'tokens': []}
    for token in ftokens:
        # The frequency computed for each tip
        docs[tip.text]['freq'][token] = freq(token, ftokens)
        # The term frequency (normalized frequency)
        docs[tip.text]['tf'][token] = tf(token, ftokens)
    docs[tip.text]['tokens'] = ftokens
    vocabulary.append(ftokens)

for doc in docs:
    for token in docs[doc]['tf']:
        # The inverse document frequency
        docs[doc]['idf'][token] = idf(token, vocabulary)
        # The tf-idf (stored under 'tf-idf', not 'idf')
        docs[doc]['tf-idf'][token] = tf_idf(token, docs[doc]['tokens'], vocabulary)

# Now let's find out the most relevant words by tf-idf.
words = {}
for doc in docs:
    for token in docs[doc]['tf-idf']:
        if token not in words:
            words[token] = docs[doc]['tf-idf'][token]
        else:
            if docs[doc]['tf-idf'][token] > words[token]:
                words[token] = docs[doc]['tf-idf'][token]
for item in sorted(words.items(), key=lambda x: x[1], reverse=True):
    print("%f <= %s" % (item[1], item[0]))
Map keywords to vector dimensions:
def getVectorKeywordIndex(self, documentList):
    # Map every unique keyword in the document list to a vector dimension
    vocabString = " ".join(documentList)
    vocabList = self.parser.tokenise(vocabString)
    vocabList = self.parser.removeStopWords(vocabList)
    uniqueVocabList = util.removeDuplicates(vocabList)
    vectorIndex = {}
    offset = 0
    for word in uniqueVocabList:
        vectorIndex[word] = offset
        offset += 1
    return vectorIndex
Map document strings onto vectors:
def makeVector(self, wordString):
    # Build a term-count vector for the document string
    vector = [0] * len(self.vectorKeywordIndex)
    wordList = self.parser.tokenise(wordString)
    wordList = self.parser.removeStopWords(wordList)
    for word in wordList:
        vector[self.vectorKeywordIndex[word]] += 1
    return vector
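The two methods above depend on parser and util helpers that are not shown in this excerpt. A standalone sketch of the same keyword-index and count-vector idea (whitespace tokenization, no stop word removal; all names invented for illustration):

def build_index(documents):
    # Map each unique word to a vector dimension
    words = " ".join(documents).lower().split()
    return {w: i for i, w in enumerate(dict.fromkeys(words))}

def make_vector(text, index):
    # Count occurrences of each indexed word in the text
    vector = [0] * len(index)
    for w in text.lower().split():
        if w in index:
            vector[index[w]] += 1
    return vector

index = build_index(["the cat sat", "the dog ran"])
print(make_vector("the cat and the dog", index))  # [2, 1, 0, 1, 0]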
8.2 Vector Space Scoring and Query Operator Interaction
The vector length denotes the size of the vectors used to represent a particular context. For context modeling, window-based and dependency-based methods can be used:
| Approach | How the context is determined |
|---|---|
| Window-based | The context consists of the words that occur within a window of a particular size around the target word. |
| Dependency-based | The context consists of words that stand in a particular syntactic relation to the target word. |
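Once documents and queries are represented as vectors (for example, the output of makeVector above), a vector space model typically scores a document against a query by cosine similarity. A minimal sketch, not from the book:

import math

def cosine(v1, v2):
    # Cosine of the angle between two term-count vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

print(cosine([1, 0, 2], [1, 1, 1]))  # ≈ 0.775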
8.3 Developing an IR System Using Latent Semantic Indexing
Latent semantic indexing (LSI) can be regarded as a method of information retrieval and indexing that uses a mathematical technique known as Singular Value Decomposition (SVD); SVD is also used for pattern recognition.
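As an illustration of the idea (toy term-document matrix invented for this sketch; it uses NumPy rather than any code from the book), LSI truncates the SVD of a term-document matrix to a small number of latent dimensions:

import numpy as np

# Rows are terms, columns are documents (toy counts)
A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                         # keep the 2 largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of A
print(A_k.round(2))                           # terms/documents in the latent space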
Some applications of latent semantic indexing:
- Information discovery;
- Automatic document classification and text summarization;
- Relationship discovery;
- Automatic generation of link charts for individuals and organizations;
- Matching technical papers and grants with reviewers;
- Online customer support;
- Determining document authorship;
- Automatic keyword annotation of images;
- Understanding software source code;
- Filtering spam;
- Information visualization;
- Essay scoring;
- Literature-based discovery.
8.4 Text Summarization
Text summarization is the process of generating a summary for a given long text.
Perform text summarization:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest
class Summarize_Frequency:
    def __init__(self, cut_min=0.2, cut_max=0.8):
        """
        Initialize the text summarizer. Words that have a term frequency
        lower than cut_min or higher than cut_max will be ignored.
        """
        self._cut_min = cut_min
        self._cut_max = cut_max
        self._stopwords = set(stopwords.words('english') + list(punctuation))

    def _compute_frequencies(self, word_sent):
        """
        Compute the frequency of each word.
        Input:
            word_sent, a list of sentences already tokenized.
        Output:
            freq, a dictionary where freq[w] is the frequency of w.
        """
        freq = defaultdict(int)
        for s in word_sent:
            for word in s:
                if word not in self._stopwords:
                    freq[word] += 1
        # frequencies normalization and filtering
        m = float(max(freq.values()))
        for w in list(freq.keys()):  # iterate over a copy since keys are deleted
            freq[w] = freq[w] / m
            if freq[w] >= self._cut_max or freq[w] <= self._cut_min:
                del freq[w]
        return freq

    def summarize(self, text, n):
        """
        Return a list of the n highest-ranked sentences as the
        summary of text.
        """
        sents = sent_tokenize(text)
        assert n <= len(sents)
        word_sent = [word_tokenize(s.lower()) for s in sents]
        self._freq = self._compute_frequencies(word_sent)
        ranking = defaultdict(int)
        for i, sent in enumerate(word_sent):
            for w in sent:
                if w in self._freq:
                    ranking[i] += self._freq[w]
        sents_idx = self._rank(ranking, n)
        return [sents[j] for j in sents_idx]

    def _rank(self, ranking, n):
        """Return the first n sentences with highest ranking."""
        return nlargest(n, ranking, key=ranking.get)
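A hypothetical usage of the class above (sample text invented for illustration):

text = ("Information retrieval is the activity of obtaining relevant "
        "information from a collection of documents. Search engines are the "
        "most visible applications of information retrieval. Text "
        "summarization produces a short summary of a longer document.")
fs = Summarize_Frequency()
for sentence in fs.summarize(text, 2):  # keep the 2 highest-ranked sentences
    print(sentence)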
8.5 Question Answering Systems
Building a question answering system involves many issues, among them:
- How to represent questions and answers in the system;
- How to represent questions and their corresponding answers in the knowledge base.
A question answering system comprises three phases:
- Fact extraction;
- Question understanding;
- Answer generation.
Accept a query from the user:
import nltk
from nltk import *
import string
print( "Enter your question")
ques = raw input()
ques = ques.lower ()
stopwords = nltk.corpus.stopwords.words('english')
cont = nltk.word_tokenize (question)
analysis_keywords = list( set (cont) - set (stopwords) )
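The snippet above stops at keyword extraction (question understanding). As a hypothetical continuation of the three phases, the extracted keywords could be matched against a small fact base to generate an answer (the fact base and scoring below are invented for illustration):

facts = {
    "guido van rossum created the python language": "Guido van Rossum created Python.",
    "nltk is a toolkit for natural language processing": "NLTK is a toolkit for natural language processing.",
}

def generate_answer(keywords):
    # Score each fact by keyword overlap and return the best match
    best = max(facts, key=lambda f: len(set(f.split()) & set(keywords)))
    return facts[best]

print(generate_answer(analysis_keywords))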
“”"***笔者的话:整理了《精通Python自然语言处理》的第八章内容:信息检索。对于一个系统来说,信息检索的结果会直接影响用户的体验,同时也显示了系统的性能。检索出最为合适的信息是至关重要的。后续会整理这本书的后面章节。本博客记录了书中的每段代码。希望对阅读这本书的人有所帮助。FIGHTING...(热烈欢迎大家批评指正,互相讨论)
(Good things are worth the wait.
) ***"""
(Chapter 7): Sentiment Analysis (https://blog.csdn.net/cjx14060307101/article/details/88580981)
(Chapter 6): Semantic Analysis (https://blog.csdn.net/cjx14060307101/article/details/88541214)
(Chapter 5): Syntactic Analysis (https://blog.csdn.net/cjx14060307101/article/details/88378177)
(Chapter 4): Parts-of-Speech Tagging (https://blog.csdn.net/cjx14060307101/article/details/88357016)
(Chapter 3): Morphology (https://blog.csdn.net/cjx14060307101/article/details/88316108)
(Chapter 2): Statistical Language Modeling (https://blog.csdn.net/cjx14060307101/article/details/88087305)
(Chapter 1): String Operations (https://blog.csdn.net/cjx14060307101/article/details/87980631)