Mastering Natural Language Processing with Python (《精通Python自然语言处理》)
Deepti Chopra (India)
Translated by Wang Wei (王威)
Chapter 8  Information Retrieval: Accessing Information
8.1 Introduction to Information Retrieval
Information retrieval can be defined as the process of retrieving the most appropriate information in response to a user's query. The accuracy of an information retrieval task is measured in terms of precision and recall:
| Measure | Formula |
|---|---|
| Recall | R = \|X ∩ Y\| / \|Y\| |
| Precision | P = \|X ∩ Y\| / \|X\| |
| F-measure | F = 2\|X ∩ Y\| / (\|X\| + \|Y\|) |

Here X is the set of retrieved documents and Y is the set of relevant documents, so the F-measure is the harmonic mean of precision and recall.
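As a quick check of these formulas, here is a minimal sketch (not from the book) that computes all three measures from a retrieved set X and a relevant set Y:

def evaluate(X, Y):
    # X: set of retrieved documents, Y: set of relevant documents
    overlap = len(X & Y)
    precision = overlap / len(X)
    recall = overlap / len(Y)
    f_measure = 2 * overlap / (len(X) + len(Y))
    return precision, recall, f_measure

# 3 documents retrieved, 4 relevant, 2 in common
print(evaluate({'d1', 'd2', 'd3'}, {'d2', 'd3', 'd4', 'd5'}))
# ≈ (0.667, 0.5, 0.571); F is the harmonic mean of P and R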
8.1.1 Stop Word Removal
Obtain the set of stop words that can be detected for English:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))
Compute the fraction of words that are not stop words:
def not_stopwords(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)

print(not_stopwords(nltk.corpus.reuters.words()))
Convert uppercase letters to lowercase and then remove the stop words:
import nltk
from collections import Counter
import string
from nltk.corpus import stopwords
def get_tokens():
    # Read the file, lowercase it, strip punctuation, and tokenize.
    # str.maketrans replaces the Python 2 translate(None, ...) idiom.
    with open('/home/d/TRY/NLTK/STOP.txt') as stopl:
        text = stopl.read().lower()
        text = text.translate(str.maketrans('', '', string.punctuation))
    return nltk.word_tokenize(text)

if __name__ == "__main__":
    tokens = get_tokens()
    print("tokens[:20] = %s" % (tokens[:20]))
    count1 = Counter(tokens)
    print("before: len(count1) = %s" % (len(count1)))
    filtered1 = [w for w in tokens if w not in stopwords.words('english')]
    print("filtered1 tokens[:20] = %s" % (filtered1[:20]))
    count1 = Counter(filtered1)
    print("after: len(count1) = %s" % (len(count1)))
    print("most_common = %s" % (count1.most_common(10)))
    tagged1 = nltk.pos_tag(filtered1)
    print("tagged1[:2] = %s" % (tagged1[:2]))
8.1.2 Information Retrieval Using a Vector Space Model
One way of representing documents as vectors is to use TF-IDF (Term Frequency-Inverse Document Frequency):
| Term | Definition |
|---|---|
| TF (term frequency) | The number of times a given token occurs in a document, divided by the total number of tokens in that document; in other words, the frequency of a particular term within a given document. |
| IDF (inverse document frequency) | A measure of how rare a term is across the corpus, computed from the number of documents in the corpus that contain the given term (the fewer documents contain it, the higher its IDF). |
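Before the Foursquare-based example below, here is a minimal, self-contained sketch of these two definitions on a toy corpus (corpus invented for illustration):

import math

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

def tf(term, doc):
    # Frequency of the term within a single document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of (number of documents / number of documents containing the term)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

print(tf("the", corpus[0]))                       # 2/6 ≈ 0.333
print(idf("cat", corpus))                         # log(3/1) ≈ 1.099
print(tf("cat", corpus[0]) * idf("cat", corpus))  # TF-IDF ≈ 0.183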
Tokenize every document in the corpus and compute the frequency of each term:
# OAuthHandler and API come from a Foursquare client library (e.g. pyfoursquare);
# CLIENT_ID, CLIENT_SECRET, CALLBACK, and ACCESS_TOKEN are your app credentials.
# These imports are shared by the following examples.
import re
import math
import nltk
from nltk import bigrams, trigrams
from nltk.tokenize import RegexpTokenizer

authen = OAuthHandler(CLIENT_ID, CLIENT_SECRET, CALLBACK)
authen.set_access_token(ACCESS_TOKEN)
ap = API(authen)
venue = ap.venues(id='4bd47eeb5631c9b69672a230')
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = RegexpTokenizer(r"[\w']+", flags=re.UNICODE)

def freq(word, tokens):
    return tokens.count(word)

# Compute the frequency for each term.
vocabulary = []
docs = {}
all_tips = []
for tip in venue.tips():
    tokens = tokenizer.tokenize(tip.text)
    bitokens = bigrams(tokens)
    tritokens = trigrams(tokens)
    tokens = [token.lower() for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]
    bitokens = [' '.join(token).lower() for token in bitokens]
    bitokens = [token for token in bitokens if token not in stopwords]
    tritokens = [' '.join(token).lower() for token in tritokens]
    tritokens = [token for token in tritokens if token not in stopwords]
    ftokens = []
    ftokens.extend(tokens)
    ftokens.extend(bitokens)
    ftokens.extend(tritokens)
    docs[tip.text] = {'freq': {}}
    for token in ftokens:
        docs[tip.text]['freq'][token] = freq(token, ftokens)
print(docs)
Normalize the tf (term frequency) vectors:
# Same setup and imports as in the previous example.
authen = OAuthHandler(CLIENT_ID, CLIENT_SECRET, CALLBACK)
authen.set_access_token(ACCESS_TOKEN)
ap = API(authen)
venue = ap.venues(id='4bd47eeb5631c9b69672a230')
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = RegexpTokenizer(r"[\w']+", flags=re.UNICODE)

def freq(word, tokens):
    return tokens.count(word)

def word_count(tokens):
    return len(tokens)

def tf(word, tokens):
    return (freq(word, tokens) / float(word_count(tokens)))

# Compute the frequency for each term.
vocabulary = []
docs = {}
all_tips = []
for tip in venue.tips():
    tokens = tokenizer.tokenize(tip.text)
    bitokens = bigrams(tokens)
    tritokens = trigrams(tokens)
    tokens = [token.lower() for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]
    bitokens = [' '.join(token).lower() for token in bitokens]
    bitokens = [token for token in bitokens if token not in stopwords]
    tritokens = [' '.join(token).lower() for token in tritokens]
    tritokens = [token for token in tritokens if token not in stopwords]
    ftokens = []
    ftokens.extend(tokens)
    ftokens.extend(bitokens)
    ftokens.extend(tritokens)
    docs[tip.text] = {'freq': {}, 'tf': {}}
    for token in ftokens:
        # The computed frequency
        docs[tip.text]['freq'][token] = freq(token, ftokens)
        # The normalized frequency (term frequency)
        docs[tip.text]['tf'][token] = tf(token, ftokens)
print(docs)
Compute the inverse document frequency (IDF) of each term:
# Same setup and imports as in the previous examples.
authen = OAuthHandler(CLIENT_ID, CLIENT_SECRET, CALLBACK)
authen.set_access_token(ACCESS_TOKEN)
ap = API(authen)
venue = ap.venues(id='4bd47eeb5631c9b69672a230')
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = RegexpTokenizer(r"[\w']+", flags=re.UNICODE)

def freq(word, doc):
    return doc.count(word)

def word_count(doc):
    return len(doc)

def tf(word, doc):
    return (freq(word, doc) / float(word_count(doc)))

def num_docs_containing(word, list_of_docs):
    count = 0
    for document in list_of_docs:
        if freq(word, document) > 0:
            count += 1
    return 1 + count

def idf(word, list_of_docs):
    return math.log(len(list_of_docs) / float(num_docs_containing(word, list_of_docs)))

# Compute the frequency for each term.
vocabulary = []
docs = {}
all_tips = []
for tip in venue.tips():
    tokens = tokenizer.tokenize(tip.text)
    bitokens = bigrams(tokens)
    tritokens = trigrams(tokens)
    tokens = [token.lower() for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]
    bitokens = [' '.join(token).lower() for token in bitokens]
    bitokens = [token for token in bitokens if token not in stopwords]
    tritokens = [' '.join(token).lower() for token in tritokens]
    tritokens = [token for token in tritokens if token not in stopwords]
    ftokens = []
    ftokens.extend(tokens)
    ftokens.extend(bitokens)
    ftokens.extend(tritokens)
    docs[tip.text] = {'freq': {}, 'tf': {}, 'idf': {}}
    for token in ftokens:
        # The frequency computed for each tip
        docs[tip.text]['freq'][token] = freq(token, ftokens)
        # The term frequency (normalized frequency)
        docs[tip.text]['tf'][token] = tf(token, ftokens)
    vocabulary.append(ftokens)

for doc in docs:
    for token in docs[doc]['tf']:
        # The inverse document frequency
        docs[doc]['idf'][token] = idf(token, vocabulary)
print(docs)
Compute the TF-IDF value of each term in the documents:
# Same setup and imports as in the previous examples.
authen = OAuthHandler(CLIENT_ID, CLIENT_SECRET, CALLBACK)
authen.set_access_token(ACCESS_TOKEN)
ap = API(authen)
venue = ap.venues(id='4bd47eeb5631c9b69672a230')
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = RegexpTokenizer(r"[\w']+", flags=re.UNICODE)

def freq(word, doc):
    return doc.count(word)

def word_count(doc):
    return len(doc)

def tf(word, doc):
    return (freq(word, doc) / float(word_count(doc)))

def num_docs_containing(word, list_of_docs):
    count = 0
    for document in list_of_docs:
        if freq(word, document) > 0:
            count += 1
    return 1 + count

def idf(word, list_of_docs):
    return math.log(len(list_of_docs) / float(num_docs_containing(word, list_of_docs)))

def tf_idf(word, doc, list_of_docs):
    return (tf(word, doc) * idf(word, list_of_docs))

# Compute the frequency for each term.
vocabulary = []
docs = {}
all_tips = []
for tip in venue.tips():
    tokens = tokenizer.tokenize(tip.text)
    bitokens = bigrams(tokens)
    tritokens = trigrams(tokens)
    tokens = [token.lower() for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]
    bitokens = [' '.join(token).lower() for token in bitokens]
    bitokens = [token for token in bitokens if token not in stopwords]
    tritokens = [' '.join(token).lower() for token in tritokens]
    tritokens = [token for token in tritokens if token not in stopwords]
    ftokens = []
    ftokens.extend(tokens)
    ftokens.extend(bitokens)
    ftokens.extend(tritokens)
    docs[tip.text] = {'freq': {}, 'tf': {}, 'idf': {}, 'tf-idf': {}, 'tokens': []}
    for token in ftokens:
        # The frequency computed for each tip
        docs[tip.text]['freq'][token] = freq(token, ftokens)
        # The term frequency (normalized frequency)
        docs[tip.text]['tf'][token] = tf(token, ftokens)
    docs[tip.text]['tokens'] = ftokens
    vocabulary.append(ftokens)

for doc in docs:
    for token in docs[doc]['tf']:
        # The inverse document frequency
        docs[doc]['idf'][token] = idf(token, vocabulary)
        # The tf-idf (stored under 'tf-idf', not 'idf')
        docs[doc]['tf-idf'][token] = tf_idf(token, docs[doc]['tokens'], vocabulary)

# Now let's find out the most relevant words by tf-idf.
words = {}
for doc in docs:
    for token in docs[doc]['tf-idf']:
        if token not in words:
            words[token] = docs[doc]['tf-idf'][token]
        else:
            if docs[doc]['tf-idf'][token] > words[token]:
                words[token] = docs[doc]['tf-idf'][token]
for item in sorted(words.items(), key=lambda x: x[1], reverse=True):
    print("%f <= %s" % (item[1], item[0]))
Map keywords to vector dimensions:
def getVectorKeywordIndex(self, documentList):
    # Map every unique keyword in the document list to a vector dimension
    vocabString = " ".join(documentList)
    vocabList = self.parser.tokenise(vocabString)
    vocabList = self.parser.removeStopWords(vocabList)
    uniqueVocabList = util.removeDuplicates(vocabList)
    vectorIndex = {}
    offset = 0
    for word in uniqueVocabList:
        vectorIndex[word] = offset
        offset += 1
    return vectorIndex
Map document strings onto vectors:
def makeVector(self, wordString):
    # Build a term-count vector for the document string
    vector = [0] * len(self.vectorKeywordIndex)
    wordList = self.parser.tokenise(wordString)
    wordList = self.parser.removeStopWords(wordList)
    for word in wordList:
        vector[self.vectorKeywordIndex[word]] += 1
    return vector
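The two methods above depend on parser and util helpers that are not shown in this excerpt. A standalone sketch of the same keyword-index and count-vector idea (whitespace tokenization, no stop word removal; all names invented for illustration):

def build_index(documents):
    # Map each unique word to a vector dimension
    words = " ".join(documents).lower().split()
    return {w: i for i, w in enumerate(dict.fromkeys(words))}

def make_vector(text, index):
    # Count occurrences of each indexed word in the text
    vector = [0] * len(index)
    for w in text.lower().split():
        if w in index:
            vector[index[w]] += 1
    return vector

index = build_index(["the cat sat", "the dog ran"])
print(make_vector("the cat and the dog", index))  # [2, 1, 0, 1, 0]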
8.2 Vector Space Scoring and Query Operator Interaction
The vector length denotes the size of the vectors used to represent a particular context. For context modeling, window-based and dependency-based methods can be used:
| Approach | How the context is determined |
|---|---|
| Window-based | The context consists of the words that occur within a window of a particular size around the target word. |
| Dependency-based | The context consists of words that stand in a particular syntactic relation to the target word. |
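Once documents and queries are represented as vectors (for example, the output of makeVector above), a vector space model typically scores a document against a query by cosine similarity. A minimal sketch, not from the book:

import math

def cosine(v1, v2):
    # Cosine of the angle between two term-count vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

print(cosine([1, 0, 2], [1, 1, 1]))  # ≈ 0.775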
8.3 Developing an IR System Using Latent Semantic Indexing
Latent semantic indexing (LSI) can be regarded as a method of information retrieval and indexing that uses a mathematical technique known as Singular Value Decomposition (SVD); SVD is also used for pattern recognition.
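As an illustration of the idea (toy term-document matrix invented for this sketch; it uses NumPy rather than any code from the book), LSI truncates the SVD of a term-document matrix to a small number of latent dimensions:

import numpy as np

# Rows are terms, columns are documents (toy counts)
A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                         # keep the 2 largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of A
print(A_k.round(2))                           # terms/documents in the latent space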
Some applications of latent semantic indexing:
- Information discovery;
- Automatic document classification and text summarization;
- Relationship discovery;
- Automatic generation of link charts for individuals and organizations;
- Matching technical papers and grants with reviewers;
- Online customer support;
- Determining document authorship;
- Automatic keyword annotation of images;
- Understanding software source code;
- Filtering spam;
- Information visualization;
- Essay scoring;
- Literature-based discovery.
8.4 Text Summarization
Text summarization is the process of generating a summary for a given long text.
Perform text summarization:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest
class Summarize_Frequency:
    def __init__(self, cut_min=0.2, cut_max=0.8):
        """
        Initialize the text summarizer. Words that have a term frequency
        lower than cut_min or higher than cut_max will be ignored.
        """
        self._cut_min = cut_min
        self._cut_max = cut_max
        self._stopwords = set(stopwords.words('english') + list(punctuation))

    def _compute_frequencies(self, word_sent):
        """
        Compute the frequency of each word.
        Input:
            word_sent, a list of sentences already tokenized.
        Output:
            freq, a dictionary where freq[w] is the frequency of w.
        """
        freq = defaultdict(int)
        for s in word_sent:
            for word in s:
                if word not in self._stopwords:
                    freq[word] += 1
        # frequencies normalization and filtering
        m = float(max(freq.values()))
        for w in list(freq.keys()):  # iterate over a copy since keys are deleted
            freq[w] = freq[w] / m
            if freq[w] >= self._cut_max or freq[w] <= self._cut_min:
                del freq[w]
        return freq

    def summarize(self, text, n):
        """
        Return a list of the n highest-ranked sentences as the
        summary of text.
        """
        sents = sent_tokenize(text)
        assert n <= len(sents)
        word_sent = [word_tokenize(s.lower()) for s in sents]
        self._freq = self._compute_frequencies(word_sent)
        ranking = defaultdict(int)
        for i, sent in enumerate(word_sent):
            for w in sent:
                if w in self._freq:
                    ranking[i] += self._freq[w]
        sents_idx = self._rank(ranking, n)
        return [sents[j] for j in sents_idx]

    def _rank(self, ranking, n):
        """Return the first n sentences with highest ranking."""
        return nlargest(n, ranking, key=ranking.get)
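A hypothetical usage of the class above (sample text invented for illustration):

text = ("Information retrieval is the activity of obtaining relevant "
        "information from a collection of documents. Search engines are the "
        "most visible applications of information retrieval. Text "
        "summarization produces a short summary of a longer document.")
fs = Summarize_Frequency()
for sentence in fs.summarize(text, 2):  # keep the 2 highest-ranked sentences
    print(sentence)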
8.5 Question Answering Systems
Building a question answering system involves many issues, among them:
- How to represent questions and answers in the system;
- How to represent questions and their corresponding answers in the knowledge base.
A question answering system comprises three phases:
- Fact extraction;
- Question understanding;
- Answer generation.
Accept a query from the user:
import nltk
from nltk import *
import string
print( "Enter your question")
ques = raw input()
ques = ques.lower ()
stopwords = nltk.corpus.stopwords.words('english')
cont = nltk.word_tokenize (question)
analysis_keywords = list( set (cont) - set (stopwords) )
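The snippet above stops at keyword extraction (question understanding). As a hypothetical continuation of the three phases, the extracted keywords could be matched against a small fact base to generate an answer (the fact base and scoring below are invented for illustration):

facts = {
    "guido van rossum created the python language": "Guido van Rossum created Python.",
    "nltk is a toolkit for natural language processing": "NLTK is a toolkit for natural language processing.",
}

def generate_answer(keywords):
    # Score each fact by keyword overlap and return the best match
    best = max(facts, key=lambda f: len(set(f.split()) & set(keywords)))
    return facts[best]

print(generate_answer(analysis_keywords))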
“”"***笔者的话:整理了《精通Python自然语言处理》的第八章内容:信息检索。对于一个系统来说,信息检索的结果会直接影响用户的体验,同时也显示了系统的性能。检索出最为合适的信息是至关重要的。后续会整理这本书的后面章节。本博客记录了书中的每段代码。希望对阅读这本书的人有所帮助。FIGHTING...(热烈欢迎大家批评指正,互相讨论)
(Good things are worth the wait.
) ***"""
(Chapter 7): Sentiment Analysis (https://blog.csdn.net/cjx14060307101/article/details/88580981)
(Chapter 6): Semantic Analysis (https://blog.csdn.net/cjx14060307101/article/details/88541214)
(Chapter 5): Syntactic Analysis (https://blog.csdn.net/cjx14060307101/article/details/88378177)
(Chapter 4): Parts-of-Speech Tagging (https://blog.csdn.net/cjx14060307101/article/details/88357016)
(Chapter 3): Morphology (https://blog.csdn.net/cjx14060307101/article/details/88316108)
(Chapter 2): Statistical Language Modeling (https://blog.csdn.net/cjx14060307101/article/details/88087305)
(Chapter 1): String Operations (https://blog.csdn.net/cjx14060307101/article/details/87980631)