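An inverted index (also called a reverse index) is an indexing method that records, for each word, which documents contain it. It is one of the most widely used data structures in information retrieval systems. With an inverted index, the list of documents containing a given word can be retrieved quickly.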
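To make the idea concrete, here is a minimal sketch in Python 3. The three toy documents and the variable names are made up for illustration and are not part of the original exercise.

```python
# Hypothetical toy data: document id -> text (not from the original post).
documents = {
    1: "the cat sat",
    2: "the dog sat",
    3: "a cat ran",
}

# Invert the mapping: word -> set of document ids containing that word.
inverted = {}
for doc_id, text in documents.items():
    for word in text.split():
        inverted.setdefault(word, set()).add(doc_id)

# Looking up a word is now a single dictionary access.
cat_docs = sorted(inverted["cat"])
sat_docs = sorted(inverted["sat"])
print(cat_docs)  # [1, 3]
print(sat_docs)  # [1, 2]
```

The forward mapping answers "which words are in this document", while the inverted mapping answers "which documents contain this word" without scanning every document.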
The program implements the following three functions:
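(1). Build the index: First, 100 lines of text are read to build the inverted index. Each line consists of words made up entirely of lowercase letters, with no punctuation, separated by spaces. The words are read in order and stored in a dictionary of <word, set of line numbers on which the word appears> pairs, where line numbers start at 1.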
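(2). Print the index: Output each word and the positions where it appears, with words in alphabetical order and the line numbers for each word in ascending order. For example, if "created" appears on lines 3 and 20 and "dead" appears on lines 14, 20, and 22, the output is as follows (every colon and comma is followed by one space, and line numbers are not repeated):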
…
created: 3, 20
dead: 14, 20, 22
…
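(3). Query: Next, query strings are read, one query per line. Each query consists of several keywords separated by spaces, and every keyword is an all-lowercase word. For each query, output on one line the numbers (in ascending order) of the lines that contain all of the keywords. If any keyword never appears in any line, or no line contains all of the keywords, output "None". An empty line marks the end of the query input. For the index built above, the query "created" outputs "3, 20", the query "created dead" outputs "20", and the query "abcde dead" outputs "None".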
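The output format in (2) and the query rule in (3) can be checked against the example with a small Python 3 sketch. The index entries come from the example above. The helper names format_entry and search are hypothetical, and intersecting the keyword sets is just one possible way to find the common lines.

```python
# Index entries taken from the example in the text; the helper names
# format_entry and search are hypothetical, chosen for this sketch.
index = {
    "created": {3, 20},
    "dead": {14, 20, 22},
}

def format_entry(word, rows):
    """Format one index entry as 'word: r1, r2, ...' with ascending rows."""
    return word + ": " + ", ".join(str(r) for r in sorted(rows))

def search(index, query):
    """Return the line numbers containing every keyword, or 'None'."""
    keywords = query.split()
    # A keyword missing from the index means no line can match.
    if not keywords or any(word not in index for word in keywords):
        return "None"
    # Lines containing all keywords = intersection of the keyword sets.
    common = set.intersection(*(index[word] for word in keywords))
    return ", ".join(str(r) for r in sorted(common)) if common else "None"

for word in sorted(index):
    print(format_entry(word, index[word]))
for query in ("created", "created dead", "abcde dead"):
    print(search(index, query))
```

Run as written, this prints the two index lines from the example followed by "3, 20", "20", and "None".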
The code below is offered only as a reference for anyone interested; suggestions for improvement are welcome.
Source code:
# -*- coding: utf-8 -*-
# Written for Python 3 (input() and print() as functions).
'''Part 1 : Build index'''
index = {}  # maps each word to the set of line numbers it appears on
n = 100     # number of input lines used to build the index
for row in range(1, n + 1):  # line numbers start at 1
    line_words = input().split()  # split the line into words on whitespace
    for word in line_words:
        # setdefault creates an empty set the first time a word is seen,
        # then the current line number is added to that word's set.
        index.setdefault(word, set()).add(row)

'''Part 2 : Print index'''
for word in sorted(index):  # words in alphabetical order
    rows = sorted(index[word])  # line numbers in ascending order
    # Convert the line numbers to strings and print "word: r1, r2, ...".
    print(word + ': ' + ', '.join(str(r) for r in rows))

'''Part 3 : Query'''
# judger checks whether every keyword of a query is in the index.
def judger(index, query):
    list_query = query.split()
    for word in list_query:
        if word not in index:
            return False  # at least one keyword never appears
    return True  # all keywords appear in the index

# Read queries until an empty line marks the end of the input.
query_list = []
while True:
    query = input()
    if query == '':
        break
    query_list.append(query)

# Answer every query in query_list.
for query in query_list:
    keywords = query.split()
    if not keywords or not judger(index, query):
        print('None')
        continue
    # Intersect the line sets of all keywords to get the common lines.
    query_set = set.intersection(*(index[word] for word in keywords))
    if len(query_set) == 0:
        print('None')  # no single line contains every keyword
    else:
        print(', '.join(str(r) for r in sorted(query_set)))
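Typing 100 lines by hand makes the program awkward to try out. One possible way to experiment, sketched here in Python 3, is to put the same three steps into a function that reads from any text stream and takes the number of index lines as a parameter. The function name run, its parameters, and the sample text are assumptions made for this sketch, not part of the original code.

```python
import io

def run(stream, n):
    """Index the first n lines of stream, then answer the queries after them.

    Returns the output lines as a list instead of printing them.
    """
    index = {}
    for row in range(1, n + 1):  # line numbers start at 1
        for word in stream.readline().split():
            index.setdefault(word, set()).add(row)

    output = []
    for word in sorted(index):
        output.append(word + ": " + ", ".join(map(str, sorted(index[word]))))

    for line in stream:
        line = line.rstrip("\n")
        if line == "":  # an empty line ends the queries
            break
        # A keyword missing from the index contributes an empty set.
        sets = [index.get(word, set()) for word in line.split()]
        common = set.intersection(*sets) if sets else set()
        output.append(", ".join(map(str, sorted(common))) if common else "None")
    return output

# Hypothetical sample: three indexed lines, three queries, then an empty line.
sample = io.StringIO(
    "hello world\nhello python\nworld of python\n"
    "hello\nhello python\nruby\n\n"
)
result = run(sample, 3)
print("\n".join(result))
```

With n set to 3, the first three lines of the sample are indexed and the remaining lines are treated as queries, so the behaviour can be checked on a small input before feeding the full 100 lines.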