Train2: Finding the Most Similar English Sentences

gogoout123

Category: Training Exercises. Tags: text search, Python, inverted index

Article link: https://blog.csdn.net/vancl_wang/article/details/80951059


##Problem Description##

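The file sentence.dat consists of many lines of English sentences. Process this file, build a suitable data structure, and develop a system that quickly finds the most similar sentences (the similarity measure is given in the note below). The system must do the following:

1. At any time, the user can type an English sentence of at most 8 words into the console; the system then immediately prints the ten sentences in sentence.dat that are most similar to the input, together with their similarity scores.
2. After each query, the system prints the total running time of that query. The total time from reading the console input to finishing the query must be under 100 milliseconds.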

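Similarity formula

$\frac{|A \cap B|}{|A \cup B|}$
The numerator is the number of distinct words the two sentences share, and the denominator is the number of distinct words that appear in either sentence. For example, for sentence A = "we are the world" and sentence B = "they are the students", $|A \cap B|$ is 2 (the shared words are are and the) and $|A \cup B|$ is 6 (the words are we, are, the, world, they, students), so the similarity of A and B is 2/6 = 0.3333. Two identical sentences have similarity 1, and two sentences with no words in common have similarity 0.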
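
The formula can be checked against the worked example with a few lines of Python. This is a minimal sketch, not part of the original solution: the function name `jaccard` is hypothetical, and returning 0.0 for two empty sentences is an assumption, since the problem statement does not define that case.

```python
def jaccard(sentence_a, sentence_b):
    """Similarity |A ∩ B| / |A ∪ B| over the distinct words of two sentences."""
    a = set(sentence_a.split())
    b = set(sentence_b.split())
    union = a | b
    if not union:
        # Assumption: the problem statement does not cover two empty sentences.
        return 0.0
    # float() keeps the division exact under Python 2 as well as Python 3.
    return float(len(a & b)) / len(union)


example = jaccard("we are the world", "they are the students")
print(round(example, 4))
```

Splitting on whitespace and converting to `set` is what makes repeated words count only once, which matches the definition above.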

##Problem Analysis##

##Version1##

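The most basic approach to this problem is:

1. Preprocess the sentence corpus, turning each English sentence into an array of words;
2. Read a query (Query) and tokenize it into an array of words;
3. Loop over the corpus, computing the similarity between Query and every sentence;
4. Sort by similarity and output the top 10 sentences.

Note: when computing the similarity, the word arrays of Query and of each corpus sentence can be converted to the set type, so that set intersection and union do the work.

Because this approach loops over every sentence for each query, it is slow: one query takes about 500 milliseconds.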

#!/usr/bin/env python2
# -*- coding: utf-8 -*-

import datetime as dt

def FindSimilar(Query, Sentences_File):
    # Tokenize the query and deduplicate its words
    query = set(Query.split())

    # Read the sentence corpus (this version re-reads it on every query)
    Sentences_Connection = open(Sentences_File, 'r')
    Sentences = Sentences_Connection.readlines()
    Sentences_Connection.close()

    # Maps each sentence to its similarity with the query
    Similarity = {}
    # Loop over every sentence in the corpus
    for line in Sentences:
        line = line.strip('\n')
        sentence = set(line.split())
        # Intersection
        joint = query & sentence
        # Union
        union = query | sentence
        # An empty query against a blank line has an empty union; skip it
        if not union:
            continue
        # Similarity
        similarity = round(1.0*len(joint) / len(union), 3)
        Similarity[line] = similarity
    # Sort by similarity, highest first
    Similarity = sorted(Similarity.items(), key = lambda x : x[1], reverse = True)
    # Keep the top 10
    return Similarity[0:10]

# Path to the sentence corpus
Sentences_File = '/Users/vancl/Desktop/达观数据_新兵训练营/train2/f.dat.txt'

# Query loop
while True:
    # Read the query string
    Query = raw_input('Please input your query (length <= 8):')
    begin = dt.datetime.now()
    # Compute the similarities
    Result = FindSimilar(Query, Sentences_File)
    # Print each sentence with its similarity
    for item in Result:
        print(item[0] + ': ' + str(item[1]))

    end = dt.datetime.now()
    print("The time of this query is :" + str(end-begin))

##Version1: Results##

(Figure: output of the loop-based query.)

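##Version2: Improved Method##
One of the most common techniques in search engines is the inverted index. The core idea is to build a word-sentence relation matrix $I_{m \times n}$ from the corpus, where $m$ is the number of distinct words, $n$ is the number of sentences, and the entry $I_{i,j}$ is 1 if word $i$ occurs in sentence $j$ and 0 otherwise. Once the inverted index is built, the same structure can be reused for every query (search). For example, a simple inverted index looks like this: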

| Word \ Sentence | Sentences1 | Sentences2 | Sentences3 |
| - | - | - | - |
| Word1 | 1 | 0 | 0 |
| Word2 | 0 | 0 | 1 |
| Word3 | 1 | 1 | 1 |
| Word4 | 0 | 1 | 0 |
| … | … | … | … |

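As the table shows, building the inverted index speeds up the program in two ways: first, the similarity computation turns into matrix or vector operations; second, the structure can be reused across queries.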
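
To make the vector computation concrete, here is a small sketch that applies it to the example table above. The query (Word3, Word4) and the helper name `jaccard_scores` are hypothetical, chosen only for illustration. It relies on the identity $|A \cup B| = |A| + |B| - |A \cap B|$, so the union never has to be built explicitly.

```python
import numpy as np

# Rows are words, columns are sentences, copied from the example table.
words = ["Word1", "Word2", "Word3", "Word4"]
word_id = {w: i for i, w in enumerate(words)}
index = np.array([
    [1, 0, 0],   # Word1
    [0, 0, 1],   # Word2
    [1, 1, 1],   # Word3
    [0, 1, 0],   # Word4
], dtype=float)

# Number of distinct words in each sentence (column sums).
sentence_sizes = index.sum(axis=0)


def jaccard_scores(query_words):
    """Similarity of the query against every sentence at once."""
    query = set(query_words)
    joint = np.zeros(index.shape[1])
    for w in query:
        if w in word_id:
            # Adding a word's row counts, per sentence, one shared word.
            joint += index[word_id[w]]
    # |A ∪ B| = |A| + |B| - |A ∩ B|; clamp to 1 so an empty union gives 0.
    union = np.maximum(sentence_sizes + len(query) - joint, 1)
    return joint / union


scores = jaccard_scores(["Word3", "Word4"])
print(scores)
```

Each query touches only the rows of its own words, at most eight of them, instead of running set operations against every sentence.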

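In the program, to avoid counting repeated words and to speed things up, both Query and each tokenized sentence are deduplicated. Building the inverted index takes about 4 seconds, and the query time drops to 50 milliseconds. The code is as follows: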

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# Function module; the run script imports it as Train2_V6_Function.

import numpy as np

# Find the indices of the K largest values
def FindTop(Array, K):
    count = 0
    Id = []
    Array = list(Array)
    K = min(K, len(Array))
    while count < K:
        Id.append(Array.index(max(Array)))
        # Similarities are never negative, so -1 marks the entry as used
        # and keeps it from being selected again
        Array[Id[count]] = -1
        count = count + 1
    return Id

# Build the dictionary
def Dictionary(Sentences_File):
    # Read the file
    File_Connection = open(Sentences_File, 'rb')
    Sentences = File_Connection.readlines()
    File_Connection.close()
    # Tokenize each sentence
    Sentences_Process = [item.strip('\n').split() for item in Sentences]
    # Number of distinct words in each sentence
    Sentences_Length = np.array([len(set(item)) for item in Sentences_Process])

    # Maps each word to its row in the index
    Wordlist = {}
    # Next unused row number
    count = 0

    for sentence in Sentences_Process:
        # Give every new word its own row
        for word in sentence:
            if word not in Wordlist:
                Wordlist[word] = count
                count = count + 1
    return Wordlist, Sentences, Sentences_Process, Sentences_Length

# Build the index as a matrix
def InvertedIndex(Wordlist, Sentences_Process):
    # Index matrix: one row per word, one column per sentence
    Index = np.zeros((len(Wordlist), len(Sentences_Process)))

    count = 0
    for sentence in Sentences_Process:
        # Mark every word that occurs in this sentence
        for word in sentence:
            word_id = Wordlist[word]
            Index[word_id, count] = 1
        count = count + 1
    return Index

# Query using the inverted index
def SearchQuery(Query, Index, Wordlist, Sentences_Length, Sentences):
    # Size of the intersection with each sentence
    Joint = np.zeros(len(Sentences_Length))
    # Deduplicate the query words
    Query = set(Query)
    # Add up the rows of the query words
    for word in Query:
        if word in Wordlist:
            word_id = Wordlist[word]
            Joint = Joint + Index[word_id]
    # Size of the union; clamp to 1 so an empty union yields similarity 0
    Union = np.maximum(Sentences_Length + len(Query) - Joint, 1)
    # Similarity
    Similarity = Joint / Union
    # Find the top 10
    Top_10_Id = FindTop(Similarity, 10)
    # Print each sentence with its similarity
    for Id in Top_10_Id:
        print(Sentences[Id].strip('\n') + ': ' + str(round(Similarity[Id], 3)))
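
FindTop above rescans the whole array K times. As an alternative sketch (not the author's method), NumPy's `np.argpartition` can pick out the K largest entries with a partial sort, after which only those K need a full sort. The helper name `top_k_indices` and the sample scores are hypothetical, chosen only for illustration.

```python
import numpy as np


def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    scores = np.asarray(scores, dtype=float)
    k = min(k, scores.size)
    if k == 0:
        return []
    # Partition on the negated scores: the k largest scores end up in the
    # first k positions, in no particular order.
    candidates = np.argpartition(-scores, k - 1)[:k]
    # Fully sort only those k candidates, highest score first.
    ordered = candidates[np.argsort(-scores[candidates])]
    return ordered.tolist()


top = top_k_indices([0.1, 0.9, 0.0, 0.5, 0.9], 3)
print(top)
```

The order among tied scores is not guaranteed, which is harmless here because tied sentences have the same similarity.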

###Run Script###

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Wed Jul  4 15:15:52 2018

@author: vancl
"""

# Import the functions
from Train2_V6_Function import *
import datetime as dt

begin = dt.datetime.now()
# English sentence corpus file
Sentences_File = '/Path/To/File/f.dat.txt'
# Dictionary and sentence arrays
Wordlist, Sentences, Sentences_Process, Sentences_length = Dictionary(Sentences_File)
# Inverted index
Index = InvertedIndex(Wordlist, Sentences_Process)

end = dt.datetime.now()
print('The preparing time is :' + str(end-begin))

# Query loop
while True:
    # Read the query string
    Query = raw_input('Please input your query (length <= 8):')
    begin = dt.datetime.now()
    Query = Query.split()

    if len(Query) <= 8:
        # Print the results
        SearchQuery(Query, Index, Wordlist, Sentences_length, Sentences)
        # Stop the timer and report the query time
        end = dt.datetime.now()
        print('The searching time is :' + str(end-begin))
    else:
        print('The query is too long.')
    # Ask whether to continue
    End = raw_input('Continue or not (y/n)?')
    if End == 'n':
        break

###Version2: Results###
(Figure: output of the Version2 query.)
