PySparNN Study Notes: K-Nearest Neighbors on Sparse Data

Article link: https://blog.csdn.net/SuiXin_123/article/details/83385552

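I've recently been learning how to find the K nearest neighbors of a dataset and came across PySparNN, so here are my notes.
When you look for K nearest neighbors in Python and your data is dense, faiss and annoy are a good fit. When your data is high-dimensional and sparse, PySparNN is worth considering.
Prerequisites: numpy and scipy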

The two official examples below show how PySparNN is used.
Example 1:

import pysparnn.cluster_index as ci
import numpy as np
from scipy.sparse import csr_matrix

features = np.random.binomial(1, 0.01, size=(1000, 20000))
features = csr_matrix(features)    # store the data in compressed sparse row (CSR) format

# build the search index
data_to_return = range(1000)
cp = ci.MultiClusterIndex(features, data_to_return)
# run the K-nearest-neighbor query
cp.search(features[:5], k=1, return_distance=False)
>> [[0], [1], [2], [3], [4]]

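This returns the single nearest neighbor of each of the first five rows of features. Since those rows are themselves in the index, each one is its own nearest neighbor.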

Example 2:

import pysparnn.cluster_index as ci
from sklearn.feature_extraction.text import TfidfVectorizer

data = [
    'hello world',
    'oh hello there',
    'Play it',
    'Play it again Sam',
]    

tv = TfidfVectorizer()
tv.fit(data)

features_vec = tv.transform(data)

# build the search index
cp = ci.MultiClusterIndex(features_vec, data)

# search the index with a sparse matrix
search_data = [
    'oh there',
    'Play it again Frank'
]

search_features_vec = tv.transform(search_data)
# run the K-nearest-neighbor query
cp.search(search_features_vec, k=1, k_clusters=2, return_distance=False)
>> [['oh hello there'], ['Play it again Sam']]

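This returns the single nearest neighbor of 'oh there' and of 'Play it again Frank', which are 'oh hello there' and 'Play it again Sam'.
As the two examples show, finding the K nearest neighbors takes two steps: 1. build the index (ci.MultiClusterIndex); 2. run the query (cp.search). There are two ways to build the index: ci.ClusterIndex and ci.MultiClusterIndex. The latter gives more accurate results than the former but takes longer, so if the extra time is acceptable, prefer the latter. ci.MultiClusterIndex builds num_indexes separate sets of clusters. For each set, it computes the distance from the query to every leader, then the distance from the query to every element in the clusters of the closest leaders. Finally, it picks the K nearest from all the distances computed.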
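PS: when your sparse matrix is very high-dimensional and the dataset is large, consider storing only the positions of the 1s and building the matrix with csr_matrix((data, indices, indptr), shape=(row, col)). (I'll write separate study notes on csr_matrix later; if you're curious, look it up online in the meantime.)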
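To make the csr_matrix((data, indices, indptr), shape=...) form more concrete, here is a minimal sketch. The input layout (ones_per_row, a list of column positions per row) and the value of n_cols are made up for illustration; only the csr_matrix call itself comes from scipy.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical input: for each row, the column positions that hold a 1.
ones_per_row = [[0, 3], [], [1, 2, 4]]
n_cols = 5  # assumed number of columns

# indices: all column positions, concatenated row by row.
indices = np.array([c for row in ones_per_row for c in row], dtype=np.int64)
# indptr: indices[indptr[i]:indptr[i + 1]] are the columns of row i.
indptr = np.concatenate(([0], np.cumsum([len(row) for row in ones_per_row])))
# data: the stored values, all 1 for a binary matrix.
data = np.ones(len(indices), dtype=np.int8)

features = csr_matrix((data, indices, indptr), shape=(len(ones_per_row), n_cols))
dense = features.toarray()  # only for inspection; avoid this on large matrices
```

Built this way, the full dense matrix never has to exist in memory, which is the point of storing only the positions of the 1s.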
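Also, the query in Example 2 passes k_clusters=2. PySparNN first partitions the data into many clusters, each with a representative leader. For every query, it computes the distance from the query to each cluster's leader and uses that as the distance from the query to the cluster. The k_clusters parameter then decides how many of the closest clusters are searched element by element for the K nearest neighbors; the default is 1. A larger k_clusters generally gives more accurate neighbors, but it also moves the search closer to brute force, so queries take longer.
The k parameter is the number of neighbors to return, i.e. the K in K nearest neighbors.
The return_distance parameter is self-explanatory: it controls whether the distance between the query and each neighbor is returned. False omits it, True includes it.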
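The leader-and-cluster idea behind k_clusters can be sketched in plain NumPy. This is not PySparNN's actual implementation: it uses dense arrays, a single set of clusters, and randomly chosen leaders, and the helper names (cosine_distance, build_clusters, search) are invented here for illustration.

```python
import numpy as np

def cosine_distance(queries, points):
    """Pairwise cosine distance between the rows of two dense arrays."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = points / np.linalg.norm(points, axis=1, keepdims=True)
    return 1.0 - q @ p.T

def build_clusters(points, n_leaders, rng):
    """Pick random leaders (an assumption) and assign each point to its closest one."""
    leader_ids = rng.choice(len(points), size=n_leaders, replace=False)
    owner = cosine_distance(points, points[leader_ids]).argmin(axis=1)
    members = [np.flatnonzero(owner == c) for c in range(n_leaders)]
    return leader_ids, members

def search(query, points, leader_ids, members, k=1, k_clusters=1):
    """Scan only the k_clusters clusters whose leaders are closest to the query."""
    leader_dist = cosine_distance(query[None, :], points[leader_ids])[0]
    chosen = np.argsort(leader_dist)[:k_clusters]
    candidates = np.concatenate([members[c] for c in chosen])
    dist = cosine_distance(query[None, :], points[candidates])[0]
    return candidates[np.argsort(dist)[:k]]

rng = np.random.default_rng(0)
points = rng.random((200, 16))
leader_ids, members = build_clusters(points, n_leaders=10, rng=rng)
query = rng.random(16)

# k_clusters=1 scans one cluster; k_clusters=10 scans all of them (brute force).
fast = search(query, points, leader_ids, members, k=3, k_clusters=1)
full = search(query, points, leader_ids, members, k=3, k_clusters=10)
exact = np.argsort(cosine_distance(query[None, :], points)[0])[:3]
```

With k_clusters equal to the number of clusters, every point becomes a candidate, so the result matches the exhaustive search. With k_clusters=1 only one cluster is scanned, which is faster but can miss neighbors that fell into another cluster.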

That's all for now. I'll add more if anything comes up, and corrections are welcome!