Reading Notes on Mastering Natural Language Processing with Python (Deepti Chopra), Chapter 8: Information Retrieval

Mastering Natural Language Processing with Python

Deepti Chopra (India)
Translated by Wang Wei


Chapter 8 Information Retrieval: Accessing Information


8.1 Introduction to Information Retrieval

Information retrieval can be defined as the process of retrieving the most appropriate information in response to a user's query.
The accuracy of an information retrieval task is measured in terms of precision and recall.

Recall (R) = |X ∩ Y| / |Y|
Precision (P) = |X ∩ Y| / |X|
F-Measure = 2 * |X ∩ Y| / (|X| + |Y|)

Here X can be read as the set of retrieved documents and Y as the set of relevant documents; the F-measure is then the harmonic mean of precision and recall.
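As a quick illustration (my own toy example, not from the book), the three measures can be computed directly from the retrieved set X and the relevant set Y; the document IDs below are made up:

# Hypothetical toy example: X = retrieved documents, Y = relevant documents.
retrieved = {"d1", "d2", "d3", "d5"}   # X
relevant = {"d1", "d3", "d4"}          # Y

overlap = len(retrieved & relevant)                         # |X ∩ Y| = 2
precision = overlap / len(retrieved)                        # 2 / 4 = 0.50
recall = overlap / len(relevant)                            # 2 / 3 ≈ 0.67
f_measure = 2 * overlap / (len(retrieved) + len(relevant))  # 4 / 7 ≈ 0.57

print(precision, recall, f_measure)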
8.1.1 Stop Word Removal
Obtain the set of English stop words that can be detected:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))
Compute the fraction of words that are not stop words:
def not_stopwords(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)
print(not_stopwords(nltk.corpus.reuters.words()))
Convert uppercase letters to lowercase and then remove the stop words:
import nltk
from collections import Counter
import string
from nltk.corpus import stopwords

def get_tokens():
    with open('/home/d/TRY/NLTK/STOP.txt') as stopl:
        # lowercase the file contents and strip punctuation before tokenizing
        text = stopl.read().lower().translate(str.maketrans('', '', string.punctuation))
        tokens = nltk.word_tokenize(text)
    return tokens

if __name__ == "__main__":
    tokens = get_tokens()
    print("tokens[:20] = %s" % (tokens[:20],))

    count1 = Counter(tokens)
    print("before: len(count1) = %s" % (len(count1),))

    filtered1 = [w for w in tokens if w not in stopwords.words('english')]
    print("filtered1 tokens[:20] = %s" % (filtered1[:20],))
    count1 = Counter(filtered1)
    print("after: len(count1) = %s" % (len(count1),))
    print("most_common = %s" % (count1.most_common(10),))

    tagged1 = nltk.pos_tag(filtered1)
    print("tagged1[:2] = %s" % (tagged1[:2],))
8.1.2 Information Retrieval Using the Vector Space Model

One way to represent documents as vectors is to use TF-IDF (Term Frequency-Inverse Document Frequency).

TF (term frequency): the number of times a given token appears in a document divided by the total number of tokens in that document; in other words, the frequency with which a particular term occurs in the given document.
IDF (inverse document frequency): a measure of how rare a term is across the corpus. It is derived from the number of documents in the corpus that contain the given term, typically as the logarithm of the total number of documents divided by that count.
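For reference, here is a minimal self-contained sketch of tf and idf on a made-up three-document corpus (my own example; it is independent of the Foursquare-based listings that follow and uses the same 1 + count smoothing in idf):

import math

# Made-up toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

def tf(term, doc):
    # term frequency: occurrences of the term divided by the document length
    return doc.count(term) / float(len(doc))

def idf(term, docs):
    # inverse document frequency: log of (number of documents / (1 + documents containing the term))
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / float(1 + containing))

for term in ("cat", "the"):
    print(term, "tf:", tf(term, corpus[0]), "idf:", idf(term, corpus))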
Tokenize each document (here, each tip for the venue) in the corpus:
# Imports needed by this and the following listings; OAuthHandler, API, and the
# credential constants come from the Foursquare client used in the book and are
# assumed to be set up already.
import re
import math
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams

authen = OAuthHandler(CLIENT_ID, CLIENT_SECRET, CALLBACK)
authen.set_access_token(ACCESS_TOKEN)
ap = API(authen)
venue = ap.venues(id='4bd47eeb5631c9b69672a230')
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = RegexpTokenizer(r"[\w']+", flags=re.UNICODE)

def freq(word, tokens):
    return tokens.count(word)

#Compute the frequency for each term.
vocabulary = []
docs = {}
all_tips = []
for tip in venue.tips():
    tokens = tokenizer.tokenize(tip.text)

    bitokens = bigrams(tokens)
    tritokens = trigrams(tokens)
    tokens = [token.lower() for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]

    bitokens = [' '.join(token).lower() for token in bitokens]
    bitokens = [token for token in bitokens if token not in stopwords]

    tritokens = [' '.join(token).lower() for token in tritokens]
    tritokens = [token for token in tritokens if token not in stopwords]

    ftokens = []
    ftokens.extend(tokens)
    ftokens.extend(bitokens)
    ftokens.extend(tritokens)
    docs[tip.text] = {'freq': {}}

    for token in ftokens:
        docs[tip.text]['freq'][token] = freq(token, ftokens)

print(docs)
Perform tf (term frequency) normalization:
# Imports and the Foursquare client setup are the same as in the previous listing.
authen = OAuthHandler(CLIENT_ID, CLIENT_SECRET, CALLBACK)
authen.set_access_token(ACCESS_TOKEN)
ap = API(authen)

venue = ap.venues(id='4bd47eeb5631c9b69672a230')
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = RegexpTokenizer(r"[\w']+", flags=re.UNICODE)

def freq(word, tokens):
    return tokens.count(word)

def word_count(tokens):
    return len(tokens)

def tf(word, tokens):
    return (freq(word, tokens) / float(word_count(tokens)))

#Compute the frequency for each term.
vocabulary = []
docs = {}
all_tips = []
for tip in venue.tips():
    tokens = tokenizer.tokenize(tip.text)

    bitokens = bigrams(tokens)
    tritokens = trigrams(tokens)
    tokens = [token.lower() for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]

    bitokens = [' '.join(token).lower() for token in bitokens]
    bitokens = [token for token in bitokens if token not in stopwords]

    tritokens = [' '.join(token).lower() for token in tritokens]
    tritokens = [token for token in tritokens if token not in stopwords]

    ftokens = []
    ftokens.extend(tokens)
    ftokens.extend(bitokens)
    ftokens.extend(tritokens)
    docs[tip.text] = {'freq': {}, 'tf': {}}

    for token in ftokens:
        #The computed frequency
        docs[tip.text]['freq'][token] = freq(token, ftokens)
        #Normalized frequency (term frequency)
        docs[tip.text]['tf'][token] = tf(token, ftokens)

print(docs)
Compute the IDF (inverse document frequency) values:
# Imports and the Foursquare client setup are the same as in the first listing.
authen = OAuthHandler(CLIENT_ID, CLIENT_SECRET, CALLBACK)
authen.set_access_token(ACCESS_TOKEN)
ap = API(authen)

venue = ap.venues(id='4bd47eeb5631c9b69672a230')
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = RegexpTokenizer(r"[\w']+", flags=re.UNICODE)

def freq(word, doc):
    return doc.count(word)

def word_count(doc):
    return len(doc)

def tf(word, doc):
    return (freq(word, doc) / float(word_count(doc)))

def num_docs_containing(word, list_of_docs):
    count = 0
    for document in list_of_docs:
        if freq(word, document) > 0:
            count += 1
    return 1 + count

def idf(word, list_of_docs):
    return math.log(len(list_of_docs) / float(num_docs_containing(word, list_of_docs)))

#Compute the frequency for each term.
vocabulary = []
docs = {}
all_tips = []
for tip in venue.tips():
    tokens = tokenizer.tokenize(tip.text)

    bitokens = bigrams(tokens)
    tritokens = trigrams(tokens)
    tokens = [token.lower() for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]

    bitokens = [' '.join(token).lower() for token in bitokens]
    bitokens = [token for token in bitokens if token not in stopwords]

    tritokens = [' '.join(token).lower() for token in tritokens]
    tritokens = [token for token in tritokens if token not in stopwords]

    ftokens = []
    ftokens.extend(tokens)
    ftokens.extend(bitokens)
    ftokens.extend(tritokens)
    docs[tip.text] = {'freq': {}, 'tf': {}, 'idf': {}}

    for token in ftokens:
        #The frequency computed for each tip
        docs[tip.text]['freq'][token] = freq(token, ftokens)
        #The term frequency (normalized frequency)
        docs[tip.text]['tf'][token] = tf(token, ftokens)

    vocabulary.append(ftokens)

for doc in docs:
    for token in docs[doc]['tf']:
        #The inverse document frequency
        docs[doc]['idf'][token] = idf(token, vocabulary)

print(docs)
Compute the TF-IDF value for each term in the documents:
# Imports and the Foursquare client setup are the same as in the first listing.
authen = OAuthHandler(CLIENT_ID, CLIENT_SECRET, CALLBACK)
authen.set_access_token(ACCESS_TOKEN)
ap = API(authen)

venue = ap.venues(id='4bd47eeb5631c9b69672a230')
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = RegexpTokenizer(r"[\w']+", flags=re.UNICODE)

def freq(word, doc):
    return doc.count(word)

def word_count(doc):
    return len(doc)

def tf(word, doc):
    return (freq(word, doc) / float(word_count(doc)))

def num_docs_containing(word, list_of_docs):
    count = 0
    for document in list_of_docs:
        if freq(word, document) > 0:
            count += 1
    return 1 + count

def idf(word, list_of_docs):
    return math.log(len(list_of_docs) / float(num_docs_containing(word, list_of_docs)))

def tf_idf(word, doc, list_of_docs):
    return (tf(word, doc) * idf(word, list_of_docs))

#Compute the frequency for each term.
vocabulary = []
docs = {}
all_tips = []
for tip in venue.tips():
    tokens = tokenizer.tokenize(tip.text)

    bitokens = bigrams(tokens)
    tritokens = trigrams(tokens)
    tokens = [token.lower() for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]

    bitokens = [' '.join(token).lower() for token in bitokens]
    bitokens = [token for token in bitokens if token not in stopwords]

    tritokens = [' '.join(token).lower() for token in tritokens]
    tritokens = [token for token in tritokens if token not in stopwords]

    ftokens = []
    ftokens.extend(tokens)
    ftokens.extend(bitokens)
    ftokens.extend(tritokens)
    docs[tip.text] = {'freq': {}, 'tf': {}, 'idf': {}, 'tf-idf': {}, 'tokens': []}

    for token in ftokens:
        #The frequency computed for each tip
        docs[tip.text]['freq'][token] = freq(token, ftokens)
        #The term frequency (normalized frequency)
        docs[tip.text]['tf'][token] = tf(token, ftokens)
    docs[tip.text]['tokens'] = ftokens
    vocabulary.append(ftokens)

for doc in docs:
    for token in docs[doc]['tf']:
        #The inverse document frequency
        docs[doc]['idf'][token] = idf(token, vocabulary)
        #The tf-idf
        docs[doc]['tf-idf'][token] = tf_idf(token, docs[doc]['tokens'], vocabulary)

#Now let's find out the most relevant words by tf-idf.
words = {}
for doc in docs:
    for token in docs[doc]['tf-idf']:
        if token not in words:
            words[token] = docs[doc]['tf-idf'][token]
        else:
            if docs[doc]['tf-idf'][token] > words[token]:
                words[token] = docs[doc]['tf-idf'][token]

for item in sorted(words.items(), key=lambda x: x[1], reverse=True):
    print("%f  <=  %s" % (item[1], item[0]))
Map keywords to vector dimensions:
def getVectorKeywordIndex(self, documentList):
    """Map each keyword in the vocabulary to a vector dimension (index)."""
    vocabString = " ".join(documentList)
    vocabList = self.parser.tokenise(vocabString)
    vocabList = self.parser.removeStopWords(vocabList)
    uniqueVocabList = util.removeDuplicates(vocabList)
    vectorIndex = {}
    offset = 0
    for word in uniqueVocabList:
        vectorIndex[word] = offset
        offset += 1
    return vectorIndex
Map document strings to vectors:
def makeVector(self, wordString):
    """Map a document string onto a term-count vector."""
    vector = [0] * len(self.vectorKeywordIndex)
    wordList = self.parser.tokenise(wordString)
    wordList = self.parser.removeStopWords(wordList)
    for word in wordList:
        vector[self.vectorKeywordIndex[word]] += 1
    return vector
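Once documents and queries are mapped to vectors in this way, their relatedness is typically scored with cosine similarity. Below is a minimal standalone sketch (my own, not the book's class; the two vectors are made up):

import math

def cosine(vector1, vector2):
    # cosine similarity = dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a * a for a in vector1))
    norm2 = math.sqrt(sum(b * b for b in vector2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical term-count vectors, e.g. as built by makeVector
doc_vector = [2, 1, 0, 1]
query_vector = [1, 0, 0, 1]
print(cosine(doc_vector, query_vector))   # higher value = more similar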

8.2 Vector Space Scoring and Query Operator Interaction

Vector size here refers to the size of the vectors we use to represent a particular context. For context modeling, window-based and dependency-based approaches can be used (a small sketch of the window-based idea follows the list):

  • Window-based: the context is determined by the words that occur within a window of a given size around the target word.
  • Dependency-based: the context is determined when a word stands in a particular syntactic relation to its target word.
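A minimal sketch of the window-based approach (my own example, not from the book; the sentence and window size are made up):

def window_context(tokens, target, window=2):
    # collect the words within `window` positions of each occurrence of `target`
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append(left + right)
    return contexts

tokens = "the quick brown fox jumps over the lazy dog".split()
print(window_context(tokens, "fox"))   # [['quick', 'brown', 'jumps', 'over']]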

8.3 Developing an IR System Using Latent Semantic Indexing

Latent semantic indexing (LSI) can be regarded as a method of information retrieval and indexing that uses a mathematical technique known as Singular Value Decomposition (SVD). SVD is used for pattern recognition.
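As an illustration (my own sketch, not code from the book), a toy term-document count matrix can be decomposed with NumPy's SVD and truncated to k latent dimensions:

import numpy as np

# Hypothetical term-document count matrix: rows = terms, columns = documents
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 1, 1],
    [0, 0, 1, 2],
], dtype=float)

# Singular value decomposition: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-k singular values (the latent dimensions)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(s)      # singular values in decreasing order
print(A_k)    # rank-k approximation used for latent semantic indexing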
Some applications of latent semantic indexing:

  • Information discovery;
  • Automatic text classification and text summarization;
  • Relationship discovery;
  • Automatic generation of link charts of individuals and organizations;
  • Matching technical papers and grants with reviewers;
  • Online customer support;
  • Determining document authorship;
  • Automatic keyword annotation of images;
  • Understanding software source code;
  • Spam filtering;
  • Information visualization;
  • Essay scoring;
  • Literature-based knowledge discovery.

8.4 Text Summarization

Text summarization is the process of generating a summary for a given long text.

Perform text summarization:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest

class Summarize_Frequency:
    def __init__(self, cut_min=0.2, cut_max=0.8):
        """
        Initialize the text summarizer. Words that have a normalized frequency
        lower than cut_min or higher than cut_max will be ignored.
        """
        self._cut_min = cut_min
        self._cut_max = cut_max
        self._stopwords = set(stopwords.words('english') + list(punctuation))

    def _compute_frequencies(self, word_sent):
        """
        Compute the frequency of each word.
        Input:
        word_sent, a list of sentences already tokenized.
        Output:
        freq, a dictionary where freq[w] is the frequency of w.
        """
        freq = defaultdict(int)
        for s in word_sent:
            for word in s:
                if word not in self._stopwords:
                    freq[word] += 1
        # frequencies normalization and filtering
        m = float(max(freq.values()))
        for w in list(freq.keys()):
            freq[w] = freq[w] / m
            if freq[w] >= self._cut_max or freq[w] <= self._cut_min:
                del freq[w]
        return freq

    def summarize(self, text, n):
        """
        Return a list of the n sentences that best summarize the text.
        """
        sents = sent_tokenize(text)
        assert n <= len(sents)
        word_sent = [word_tokenize(s.lower()) for s in sents]
        self._freq = self._compute_frequencies(word_sent)
        ranking = defaultdict(int)
        for i, sent in enumerate(word_sent):
            for w in sent:
                if w in self._freq:
                    ranking[i] += self._freq[w]
        sents_idx = self._rank(ranking, n)
        return [sents[j] for j in sents_idx]

    def _rank(self, ranking, n):
        """Return the indices of the first n sentences with the highest ranking."""
        return nlargest(n, ranking, key=ranking.get)
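A short usage sketch (my own; the input text is made up) showing how the class above might be called:

# Hypothetical input text for the summarizer defined above
text = ("Information retrieval returns the most relevant documents for a query. "
        "Precision and recall measure how accurate the returned set is. "
        "Summarization condenses a long text into a few representative sentences. "
        "A frequency-based summarizer ranks sentences by the frequency of their words.")

summarizer = Summarize_Frequency()
for sentence in summarizer.summarize(text, 2):
    print(sentence)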

8.5 Question Answering Systems

A question answering system involves a number of issues:

  • How to represent questions and answers within the system;
  • How to represent a question and its corresponding answer in the knowledge base.

A question answering system consists of three phases:

  1. Extracting facts;
  2. Understanding the question;
  3. Generating the answer.
Accept the user's query:
import nltk

print("Enter your question")
ques = input()          # raw_input() in Python 2
ques = ques.lower()
stopwords = nltk.corpus.stopwords.words('english')
cont = nltk.word_tokenize(ques)
# keywords to analyze = query tokens minus stop words
analysis_keywords = list(set(cont) - set(stopwords))
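To make the three phases concrete, here is a made-up continuation (the fact base and scoring are my own illustration, not the book's): the extracted keywords are matched against a small set of stored facts, and the best-matching fact is returned as the answer.

# Hypothetical fact base: each "fact" is a sentence the system already knows.
facts = [
    "Python was created by Guido van Rossum.",
    "NLTK is a Python library for natural language processing.",
    "Information retrieval returns relevant documents for a query.",
]

def answer(keywords, facts):
    # score each fact by how many query keywords it contains
    best_fact, best_score = None, 0
    for fact in facts:
        fact_words = set(nltk.word_tokenize(fact.lower()))
        score = len(set(keywords) & fact_words)
        if score > best_score:
            best_fact, best_score = fact, score
    return best_fact

print(answer(analysis_keywords, facts))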

“”"***笔者的话:整理了《精通Python自然语言处理》的第八章内容:信息检索。对于一个系统来说,信息检索的结果会直接影响用户的体验,同时也显示了系统的性能。检索出最为合适的信息是至关重要的。后续会整理这本书的后面章节。本博客记录了书中的每段代码。希望对阅读这本书的人有所帮助。FIGHTING...(热烈欢迎大家批评指正,互相讨论)
Good things are worth the wait.
***"""


(Chapter 7): Sentiment Analysis (https://blog.csdn.net/cjx14060307101/article/details/88580981)
(Chapter 6): Semantic Analysis (https://blog.csdn.net/cjx14060307101/article/details/88541214)
(Chapter 5): Syntactic Analysis (https://blog.csdn.net/cjx14060307101/article/details/88378177)
(Chapter 4): Part-of-Speech Tagging (https://blog.csdn.net/cjx14060307101/article/details/88357016)
(Chapter 3): Morphology (https://blog.csdn.net/cjx14060307101/article/details/88316108)
(Chapter 2): Statistical Language Modeling (https://blog.csdn.net/cjx14060307101/article/details/88087305)
(Chapter 1): String Operations (https://blog.csdn.net/cjx14060307101/article/details/87980631)
