Inverted index in Python: sorting strings by order of high-frequency terms from an Elasticsearch inverted index...

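I used https://stackoverflow.com/a/15174569/61903 to compute the cosine similarity of two strings as the base similarity algorithm. In general, I put all strings into a list. Then I set an index parameter i to 0 and loop over i as long as it is within the length of the list. Inside that loop I iterate a position p from i + 1 to length(list). I then find the maximum cosine between list[i] and list[p]. Both text strings are put into an out list so that they are not considered in later similarity calculations. The two text strings, together with the cosine value, are also put into a result list; the data structure is VectorResult.

Afterwards the list is sorted by cosine value. We now have unique pairs of strings in descending order of cosine (i.e. similarity value). HTH.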

import re
import math
import timeit
from collections import Counter

WORD = re.compile(r'\w+')

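def get_cosine(vec1, vec2):
    """Return the cosine similarity of two word-count vectors."""
    # Only words present in both vectors contribute to the dot product.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in vec1.keys()])
    sum2 = sum([vec2[x] ** 2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has zero length; report zero similarity.
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    """Split text into \\w+ tokens and count them."""
    words = WORD.findall(text)
    return Counter(words)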

class VectorResult(object):
    """A pair of texts and their cosine; instances compare by cosine only."""

    def __init__(self, cosine, text_1, text_2):
        self.cosine = cosine
        self.text_1 = text_1
        self.text_2 = text_2

    def __eq__(self, other):
        return self.cosine == other.cosine

    def __le__(self, other):
        return self.cosine <= other.cosine

    def __ge__(self, other):
        return self.cosine >= other.cosine

    def __lt__(self, other):
        return self.cosine < other.cosine

    def __gt__(self, other):
        return self.cosine > other.cosine

def main():
    start = timeit.default_timer()

    # One string per line; readlines() keeps each trailing newline.
    with open('data.txt', 'r') as f:
        texts = f.readlines()

    cosmap = []  # one VectorResult per matched pair
    out = []     # strings already paired, skipped in later comparisons
    i = 0
    while i < len(texts):
        max_cosine = 0.0
        current = None
        for p in range(i + 1, len(texts)):
            if texts[i] in out or texts[p] in out:
                continue
            vector1 = text_to_vector(texts[i])
            vector2 = text_to_vector(texts[p])
            cosine = get_cosine(vector1, vector2)
            # Remember the best partner found so far for texts[i].
            if cosine > max_cosine:
                current = VectorResult(cosine, texts[i], texts[p])
                max_cosine = cosine
        if current:
            out.extend([current.text_1, current.text_2])
            cosmap.append(current)
        i += 1

    # Sort ascending by cosine, then print in descending order.
    cosmap = sorted(cosmap)
    for item in reversed(cosmap):
        print(item.cosine, item.text_1, item.text_2)

    end = timeit.default_timer()
    print("Similarity Sorting of {} strings lasted {} s.".format(len(texts), end - start))


if __name__ == '__main__':
    main()
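
To illustrate what the \w+ tokenizer does before any cosine is computed, here is a minimal sketch using two address strings that differ only in the spacing after '#'. The variable names are chosen for illustration only. Since '#', '/' and whitespace are not word characters, they act purely as separators and both strings produce identical tokens, so their cosine comes out as 1.0 (up to floating-point rounding).

```python
import re

WORD = re.compile(r'\w+')

a = '# 51/3 AGRAHARA YELAHANKA'
b = '#51/3 AGRAHARA YELAHANKA'

# \w+ matches runs of letters, digits and underscores; everything else
# ('#', '/', spaces) only separates tokens and is discarded.
tokens_a = WORD.findall(a)
tokens_b = WORD.findall(b)

print(tokens_a)              # ['51', '3', 'AGRAHARA', 'YELAHANKA']
print(tokens_a == tokens_b)  # True
```

Spelling variants are a different matter: 'YELAHANKA' and 'YALAHANKA' are distinct tokens, so such pairs score below 1.0.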

Result

1.0000000000000002 NO 15& 16 1ST FLOOR,2ND MAIN ROAD,KHB COLONY,GANDINAGAR YELAHANKA
NO 15& 16 1ST FLOOR,2ND MAIN ROAD,KHB COLONY,GANDINAGAR YELAHANKA

1.0 # 51/3 AGRAHARA YELAHANKA
#51/3 AGRAHARA YELAHANKA

0.9999999999999999 # C M C ROAD,YALAHANKA
# C M C ROAD,YALAHANKA

0.8728715609439696 # 1002/B B B ROAD,YELAHANKA
0,B B ROAD,YELAHANKA

0.8432740427115678 # LAKSHMI COMPLEX C M C ROAD,YALAHANKA
# SRI LAKSHMAN COMPLEX C M C ROAD,YALAHANKA

0.8333333333333335 # 85/1 B B M P OFFICE ROAD,KOGILU YELAHANKA
#85/1 B B M P OFFICE NEAR KOGILU YALAHANKA

0.8249579113843053 # 689 3RD A CROSS SHESHADRIPURAM CALLEGE OPP YELAHANKA
# 715 3RD CROSS A SECTUR SHESHADRIPURAM CALLEGE OPP YELAHANKA

0.8249579113843053 # 10 RAMAIAIA COMPLEX B B ROAD,YALAHANKA
# JAMATI COMPLEX B B ROAD,YALAHANKA

[ SNIPPED ]

Similarity Sorting of 702 strings lasted 8.955146235887025 s.
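
Most of the running time probably goes into re-tokenizing the same strings inside the inner loop and into the linear `in` checks on a list. The following is a rough sketch of the same greedy pairing idea that tokenizes each string once and tracks used strings in a set. The names `cosine` and `pair_by_similarity` are hypothetical, it returns plain tuples instead of VectorResult, and the order of pairs with equal cosine may differ from the code above. It has not been timed against the original data.

```python
import math
import re
from collections import Counter

WORD = re.compile(r'\w+')


def cosine(vec1, vec2):
    """Cosine similarity of two word-count vectors (Counter objects)."""
    numerator = sum(vec1[w] * vec2[w] for w in vec1.keys() & vec2.keys())
    denominator = (math.sqrt(sum(c * c for c in vec1.values()))
                   * math.sqrt(sum(c * c for c in vec2.values())))
    return numerator / denominator if denominator else 0.0


def pair_by_similarity(texts):
    """Greedily pair each text with its most similar later text.

    Returns (cosine, text_1, text_2) tuples in descending order of cosine.
    """
    vectors = [Counter(WORD.findall(t)) for t in texts]  # tokenize once
    used = set()  # strings already placed in a pair
    pairs = []
    for i, text in enumerate(texts):
        if text in used:
            continue
        best = None
        for p in range(i + 1, len(texts)):
            if texts[p] in used:
                continue
            score = cosine(vectors[i], vectors[p])
            if score > (best[0] if best else 0.0):
                best = (score, text, texts[p])
        if best:
            used.update(best[1:])
            pairs.append(best)
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs


if __name__ == '__main__':
    sample = ["a b c", "x y z", "a b d", "x y z"]
    for score, text_1, text_2 in pair_by_similarity(sample):
        print(score, text_1, text_2)
```

The comparison is still quadratic in the number of strings; the sketch only removes repeated work per comparison.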
