Implementing phrase queries in Python

Original post, November 20, 2015, 09:33:31

The searchPhrase function implements phrase queries on top of a positional inverted index:

def word_split(text):
    """
    Split a text into words. Returns a list of (location, word) tuples,
    where location is the 1-based position of the word in the text.
    Also normalizes each word to lower case.
    """
    word_list = []
    wcurrent = []
    windex = 0
    # Walk the text one character at a time, collecting alphanumeric runs
    for c in text:
        if c.isalnum():
            wcurrent.append(c)
        elif wcurrent:
            windex = windex + 1
            word = u''.join(wcurrent).lower()
            word_list.append((windex, word))
            wcurrent = []

    if wcurrent:
        windex = windex + 1
        word = u''.join(wcurrent).lower()
        word_list.append((windex, word))

    return word_list

def inverted_index(text):
    """
    Create an Inverted-Index of the specified text document.
        {word:[locations]}
    """
    inverted = {}

    for index, word in word_split(text):
        # setdefault works like get, but it also inserts the key with the default value when the key is missing
        locations = inverted.setdefault(word, [])
        locations.append(index)

    return inverted

def inverted_index_add(inverted, doc_id, doc_index):
    """
    Add the Inverted-Index doc_index of the document doc_id to the
    Multi-Document Inverted-Index (inverted),
    using doc_id as document identifier.
        {word:{doc_id:[locations]}}
    """
    for word, locations in doc_index.items():
        indices = inverted.setdefault(word, {})
        indices[doc_id] = locations
    return inverted

def search(inverted, query):
    """
    Returns the set of document ids that contain all the words in your query.
    """
    words = [word for _, word in word_split(query)]
    # A word that is missing from the index cannot be matched by any document
    if not words or any(word not in inverted for word in words):
        return set()
    # Each word contributes the set of documents it occurs in
    results = [set(inverted[word]) for word in words]
    # Keep only the documents common to every word
    return set.intersection(*results)

def searchPhrase(inverted, query):
    """
    Returns the list of document ids that contain the query as an exact
    phrase, i.e. all query words at consecutive positions.
    """
    words = [word for _, word in word_split(query)]
    # An empty query, or a word missing from the index, cannot match anything
    if not words or any(word not in inverted for word in words):
        return []
    # Only documents that contain every query word can contain the phrase
    candidates = [doc_id for doc_id in inverted[words[0]]
                  if all(doc_id in inverted[word] for word in words[1:])]
    results = []
    for doc_id in candidates:
        # positions[i] holds the locations of the i-th query word in this document
        positions = [set(inverted[word][doc_id]) for word in words]
        # The phrase starts at p if the i-th query word occurs at p + i
        if any(all(p + i in positions[i] for i in range(1, len(words)))
               for p in positions[0]):
            results.append(doc_id)
    return results

doc1 = """
Niners head coach Mike Singletary will let Alex Smith remain his starting 
quarterback, but his vote of confidence is anything but a long-term mandate.
Smith now will work on a week-to-week basis, because Singletary has voided 
his year-long lease on the job.
"I think from this point on, you have to do what's best for the football team,"
Singletary said Monday, one day after threatening to bench Smith during a 
27-24 loss to the visiting Eagles.
"""

doc2 = """
The fifth edition of West Coast Green, a conference focusing on "green" home 
innovations and products, rolled into San Francisco's Fort Mason last week 
intent, per usual, on making our living spaces more environmentally friendly 
- one used-tire house at a time.Zero-rated buildings 
To that end, there were presentations on topics such as water efficiency and 
the burgeoning future of Net Zero-rated buildings that consume no energy and 
produce no carbon emissions.on a job,on the job
"""
inverted = {}
documents = {'doc1':doc1, 'doc2':doc2}
for doc_id, text in documents.items():
    doc_index = inverted_index(text)
    inverted_index_add(inverted, doc_id, doc_index)

# Print Inverted-Index
#for word, doc_locations in inverted.items():
    #print(word, doc_locations)

# Search for single words
print("*****search common words*****")
queries = ['Week', 'Niners', 'coast']
for query in queries:
    result_docs = search(inverted, query)
    print("Search for '%s': %r" % (query, result_docs))

# Search for phrases
print("")
print("*****search phrases*****")
newQueries = ['Zero-rated buildings', 'on the job', 'West Coast']
for query in newQueries:
    result_docs = searchPhrase(inverted, query)
    print("Search for '%s': %r" % (query, result_docs))
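
The position arithmetic behind a phrase query is easier to follow on a tiny index. The following is a minimal, self-contained sketch in Python 3, not the code from this post: the names build_index and phrase_docs are hypothetical, and tokenizing with a regular expression is an assumption made to keep the example short. A start position p of the first word survives only if the word at offset k in the phrase occurs at position p + k in the same document.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to {doc_id: [1-based word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Assumption: a word is a run of ASCII letters or digits, lower-cased.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(tokens, start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_docs(index, phrase):
    """Return the sorted ids of documents that contain the phrase."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words or any(word not in index for word in words):
        return []
    # Candidate start positions: every occurrence of the first word.
    starts = {doc_id: set(positions)
              for doc_id, positions in index[words[0]].items()}
    for offset, word in enumerate(words[1:], start=1):
        postings = index[word]
        narrowed = {}
        for doc_id, candidates in starts.items():
            occupied = set(postings.get(doc_id, ()))
            # Keep a start p only if this word sits at p + offset.
            narrowed[doc_id] = {p for p in candidates if p + offset in occupied}
        starts = narrowed
    return sorted(doc_id for doc_id, candidates in starts.items() if candidates)


docs = {
    "a": "on a job, on the job",
    "b": "the job is on",
}
index = build_index(docs)
print(phrase_docs(index, "on the job"))  # ['a']
print(phrase_docs(index, "the job"))     # ['a', 'b']
print(phrase_docs(index, "job on"))      # ['a']
```

Document "b" contains all three words of "on the job" but not at consecutive positions, so it is rejected. That is exactly the difference between a phrase query and the plain word search above.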

Copyright notice: this is an original article; unauthorized copying will be pursued.