Large-scale data storage and search with Elasticsearch, applicable to reverse image search


yangdeshun888


Article link: https://blog.csdn.net/yangdashi888/article/details/92628948


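1. Large-scale data search comes down to two main problems:

First, data processing: the raw data has to be converted into a suitable format. For reverse image search, for example, each image is reduced to a hash code before it is stored, which greatly cuts the data volume while still representing the image.

Second, finding similar matches in a huge data set as quickly as possible. Here we use Elasticsearch, a store built for holding and searching large volumes of data, which handles the lookup well.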
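To make the first step concrete, here is a minimal sketch of turning an image into a hash code. It uses a simple average hash as an illustrative stand-in; the post does not say which hash the author uses, and the names `average_hash` and `hamming_distance` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def average_hash(gray, hash_size=8):
    """Return hash_size * hash_size bits (0/1) for a 2-D grayscale image."""
    gray = np.asarray(gray, dtype=float)
    h, w = gray.shape
    if h < hash_size or w < hash_size:
        raise ValueError('image is smaller than the hash grid')
    # crop so that both sides divide evenly into hash_size blocks
    bh, bw = h // hash_size, w // hash_size
    cropped = gray[:bh * hash_size, :bw * hash_size]
    # average each block to get a hash_size x hash_size thumbnail
    blocks = cropped.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    # one bit per block: is it brighter than the overall mean?
    return (blocks > blocks.mean()).astype(np.uint8).ravel()


def hamming_distance(a, b):
    """Number of positions at which two codes differ."""
    return int(np.count_nonzero(a != b))


# usage: a synthetic 64x64 gradient stands in for a real image
image = np.arange(64 * 64).reshape(64, 64)
code = average_hash(image)
brighter = average_hash(image + 10)      # uniform brightness change
inverted = average_hash(4095 - image)    # photographic negative
print(hamming_distance(code, brighter), hamming_distance(code, inverted))
```

A 64x64 image shrinks to 64 bits here, which is the reduction in data volume described above, and similar images end up with codes that are close in Hamming distance.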

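2. Data processing:

Search over a large data set in Elasticsearch is fuzzy (relevance-ranked) search, so storing only the hash code causes a problem. The hash code is too fine-grained: almost no stored record has exactly the same code as the query, so the database falls back to whatever it considers closest. Its notion of closeness is not the Hamming-distance comparison we use to match hash codes, so the results it returns are not accurate enough.

What we need to do is derive a set of word attributes from each hash code. The following code produces them: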

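import numpy as np


def get_words(array, k, N):
    """Split an array into N words of length k; words may overlap.

    For example, with array = [0, 1, 2, 0, -1, -2, 0, 1], k = 3 and N = 4,
    the words start at positions 0, 2, 4 and 6:

    [0, 1, 2]
    [2, 0, -1]
    [-1, -2, 0]
    [0, 1]

    The last word runs past the end of the array, so it is zero-padded
    to [0, 1, 0].

    Args:
        array (numpy.ndarray): array to split into words
        k (int): word length
        N (int): number of words

    Returns:
        an array with N rows of length k

    """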
    # generate starting positions of each word
    word_positions = np.linspace(0, array.shape[0],
                                 N, endpoint=False).astype('int')

    # check that inputs make sense
    if k > array.shape[0]:
        raise ValueError('Word length cannot be longer than array length')
    if word_positions.shape[0] > array.shape[0]:
        raise ValueError('Number of words cannot be more than array length')

    # create empty words array
    words = np.zeros((N, k)).astype('int8')

    for i, pos in enumerate(word_positions):
        if pos + k <= array.shape[0]:
            words[i] = array[pos:pos+k]
        else:
            temp = array[pos:].copy()
            temp.resize(k,refcheck=False)
            words[i] = temp

    return words

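The words generated here form a [64, 16] array, i.e. 64 new words of length 16. These words can then serve as the conditions for the fuzzy search in the database, which keeps the results closer to what we actually expect.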
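The sketch below illustrates why words help: a near-duplicate signature no longer matches as a whole, but most of its words still match exactly. Several things here are assumptions rather than the author's code: the signature length of 512, the value range [-2, 2], the base-5 `word_to_int` encoding, and the `make_record` helper. Only the `simple_word_` field prefix is taken from the search code in the next section, and `get_words` is restated in compact form without its input checks so the block runs on its own.

```python
import numpy as np


def get_words(array, k, N):
    # compact restatement of get_words above, without the input checks
    positions = np.linspace(0, array.shape[0], N, endpoint=False).astype('int')
    words = np.zeros((N, k), dtype='int8')
    for i, pos in enumerate(positions):
        chunk = array[pos:pos + k]
        words[i, :chunk.shape[0]] = chunk  # a word past the end stays zero-padded
    return words


def word_to_int(word, low=-2, base=5):
    # hypothetical encoding: read the word as a base-5 number
    # (assumes every signature value lies in [-2, 2])
    return sum((int(v) - low) * base ** i for i, v in enumerate(word))


def make_record(signature, k=16, N=64):
    # hypothetical document layout: the full signature plus one field per word
    rec = {'signature': signature.tolist()}
    for i, word in enumerate(get_words(signature, k, N)):
        rec['simple_word_%d' % i] = word_to_int(word)
    return rec


rng = np.random.default_rng(0)
signature = rng.integers(-2, 3, size=512).astype('int8')
near_copy = signature.copy()
near_copy[100] = 2 if signature[100] != 2 else -2  # change a single element

rec_a = make_record(signature)
rec_b = make_record(near_copy)
word_keys = [key for key in rec_a if key.startswith('simple_word_')]
shared = sum(rec_a[key] == rec_b[key] for key in word_keys)
print(len(word_keys), shared)
```

With these sizes the words start every 8 elements and are 16 long, so each element falls into two words. Changing one element therefore changes two words and leaves the other 62 identical, which is what lets term matching on words find near-duplicates.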

3. Searching the data in the database:

    def search_single_record(self, rec, pre_filter=None):
        path = rec.pop('path')
        signature = rec.pop('signature')
        if 'metadata' in rec:
            rec.pop('metadata')

        # build the 'should' list
        should = [{'term': {word: rec[word]}} for word in rec]
        body = {
            'query': {
                   'bool': {'should': should}
            },
            '_source': {'excludes': ['simple_word_*']}
        }

        if pre_filter is not None:
            body['query']['bool']['filter'] = pre_filter

        res = self.es.search(index=self.index,
                              doc_type=self.doc_type,
                              body=body,
                              size=self.size,
                              timeout=self.timeout)['hits']['hits']

        sigs = np.array([x['_source']['signature'] for x in res])

        if sigs.size == 0:
            return []
        # compute the distance between each returned signature and the query signature
        dists = normalized_distance(sigs, np.array(signature))

        formatted_res = [{'id': x['_id'],
                          'score': x['_score'],
                          'metadata': x['_source'].get('metadata'),
                          'path': x['_source'].get('url', x['_source'].get('path'))}
                         for x in res]

        for i, row in enumerate(formatted_res):
            row['dist'] = dists[i]
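        # drop results whose distance is too large; a list comprehension is used
        # because filter() returns an iterator, not a list, on Python 3
        formatted_res = [y for y in formatted_res if y['dist'] < self.distance_cutoff]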

        return formatted_res
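The search method calls `normalized_distance`, which the post does not define. As a sketch under assumptions, one plausible definition divides the Euclidean distance between two signatures by the sum of their norms, so the result lies between 0 (identical) and 1 (opposite). The `nan_value` parameter and the cutoff of 0.45 are hypothetical choices made for this example.

```python
import numpy as np


def normalized_distance(target_array, vec, nan_value=1.0):
    """Distance from vec to each row of target_array, scaled to [0, 1]."""
    # cast to int first so int8 signatures cannot overflow when subtracted
    target_array = np.asarray(target_array).astype(int)
    vec = np.asarray(vec).astype(int)
    diff_norm = np.linalg.norm(vec - target_array, axis=1)
    denom = np.linalg.norm(vec) + np.linalg.norm(target_array, axis=1)
    with np.errstate(divide='ignore', invalid='ignore'):
        dists = diff_norm / denom
    # 0 / 0 happens when both signatures are all zeros
    dists[np.isnan(dists)] = nan_value
    return dists


# usage: three stored signatures compared against one query
stored = np.array([[1, 0, -1, 2],     # identical to the query
                   [1, 0, -1, 1],     # one element differs
                   [-1, 0, 1, -2]])   # the query negated
query = np.array([1, 0, -1, 2])
dists = normalized_distance(stored, query)

distance_cutoff = 0.45  # hypothetical threshold
kept = [i for i, d in enumerate(dists) if d < distance_cutoff]
print(dists, kept)
```

The term query only narrows the candidates down. This distance against the full signature, followed by the cutoff, is what decides which candidates count as matches.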
