Secondary in-memory index representations in Python

I am searching for an efficient way to build a secondary in-memory index in Python using a high-level, optimised numerical package such as NumPy or Arrow. I am excluding pandas for performance reasons.

Definition

"A secondary index contains an entry for each existing value of the attribute to be indexed. This entry can be seen as a key/value pair with the attribute value as key and as value a list of pointers to all records in the base table that have this value." - JV. D'Silva et al. (2017)

Let's take a simple example, we can scale this later on to produce some benchmarks:

import numpy as np

pk = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='uint32')

val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], dtype='float32')

Interestingly, the pyarrow.Array.dictionary_encode method can transform the value array into a dictionary-encoded representation that is close to a secondary index. Note that val must first be wrapped in an Arrow array:

import pyarrow as pa

pa.array(val).dictionary_encode()

Out[55]:

-- dictionary:
  [
    15.5,
    3.75,
    142.88,
    nan,
    7.2,
    2.1
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    3,
    3,
    3,
    4,
    5
  ]

I have opened an issue here

So, the question is how fast you can build a secondary index in memory using Python data structures that hold values and indices efficiently. But this is only half the story: the index is useful only if it also serves filtering queries (point, range) and transformations well, i.e. reconstruction of rows, columns, and associations (a.k.a. hyperedges in TRIADB). And even this quick description does not cover how easy it will be to update this kind of index.
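As a baseline to benchmark the optimised solutions against, the key/pointer-list structure from the definition can be sketched in plain Python (using the example arrays from above, with None marking missing data):

```python
from collections import defaultdict

# Plain-Python baseline: map each attribute value to the list of
# primary keys of the rows that contain it.
pk  = [1, 2, 3, 4, 5, 6, 7, 8, 9]
val = [15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1]

index = defaultdict(list)
for key, v in zip(pk, val):
    if v is not None:          # skip missing data
        index[v].append(key)

# index[142.88] -> [3, 4]
```

This is trivially easy to update, but a dict of Python lists is exactly the kind of pointer-chasing structure the question tries to avoid for large data sets.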

For many reasons, I have started investigating a possible PyArrow-based open-source solution. A sorted, dictionary-encoded representation should meet the requirements of the problem with an excellent combination of a smaller memory footprint and fast, flexible, zero-copy I/O processing.

Solution

I have searched, both in the past and more recently, for an open-source solution to this problem, but I have not found one that satisfies my appetite. This time I decided to start building my own and to discuss its implementation openly, covering also the null case, i.e. the missing-data scenario.

Do notice that a secondary index is very close to an adjacency-list representation, a core element in my TRIADB project, and that is the main reason behind my search for a solution.

Let's start with a one-line solution using numpy (plus the structured dtype it relies on):

struct_type = np.dtype([('pk', np.uint32), ('val', np.float32)])

idx = np.sort(np.array(list(zip(pk, val)), dtype=struct_type), order='val')

idx['val']

Out[68]:

array([ 2.1 , 3.75, 7.2 , 15.5 , 142.88, 142.88, nan, nan, nan], dtype=float32)

idx['pk']

Out[69]: array([9, 2, 8, 1, 3, 4, 5, 6, 7], dtype=uint32)

Faster solution (less generic)

This is the special, but perfectly valid, case where pk takes values in range(n), i.e. the primary key coincides with the 0-based row position.

idx_pk = np.argsort(val)

idx_pk

Out[91]: array([8, 1, 7, 0, 2, 3, 4, 5, 6])

idx_val = val[idx_pk]

idx_val

Out[93]: array([ 2.1 , 3.75, 7.2 , 15.5 , 142.88, 142.88, nan, nan, nan], dtype=float32)

There are a few more steps to get a secondary index representation according to the definition of JV. D'Silva et al.:

1. Get rid of nan (missing values)
2. Calculate the unique values of the secondary index
3. For each unique value, calculate the list of primary-key indices to all rows of the table that contain that value

Unique Secondary Index with adjacency lists

def secondary_index_with_adjacency_list(arr):
    idx_pk = np.argsort(arr)                    # NaNs sort to the end
    idx_val = arr[idx_pk]
    cnt = np.count_nonzero(~np.isnan(idx_val))  # number of non-missing values
    usec_ndx, split_ndx, cnt_arr = np.unique(idx_val[:cnt], return_index=True, return_counts=True)
    adj_list = np.split(idx_pk[:cnt], split_ndx)[1:]
    return usec_ndx, cnt_arr, adj_list

ndx, freq, adj = secondary_index_with_adjacency_list(val)

# pandas is used here only to pretty-print the result
import pandas as pd

pd.DataFrame({'val': ndx, 'freq': freq, 'adj': adj})

Out[11]:

val freq adj

0 2.10 1 [8]

1 3.75 1 [1]

2 7.20 1 [7]

3 15.50 1 [0]

4 142.88 2 [2, 3]
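With the unique index in hand, a point query is just a binary search over the sorted unique values followed by a fetch of the adjacency list. A minimal sketch, with ndx and adj restated literally from the output above so it runs standalone (casting the probe to float32 so it compares exactly against the stored float32 values):

```python
import numpy as np

# ndx: sorted unique values; adj: aligned adjacency lists of row positions,
# both restated from the DataFrame output above.
ndx = np.array([2.1, 3.75, 7.2, 15.5, 142.88], dtype='float32')
adj = [np.array([8]), np.array([1]), np.array([7]),
       np.array([0]), np.array([2, 3])]

def lookup(v):
    # Binary-search the sorted unique values, then return the pointer list.
    v = np.float32(v)                # match the index dtype exactly
    i = np.searchsorted(ndx, v)
    if i < len(ndx) and ndx[i] == v:
        return adj[i]
    return np.array([], dtype='int64')

# lookup(142.88) -> array([2, 3]); lookup(99.0) -> empty array
```

A range query works the same way with side='left'/side='right' bounds on np.searchsorted, concatenating the adjacency lists in between.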

Discussion

In practice it is faster to use the representation of the secondary index with repeated values than the one with lists of pointers to table records, but the second one has the interesting property of being closer to the hypergraph representation that I am using in TRIADB.

The kind of secondary index described in this solution is more suitable for analysis and filtering of big data sets that don't fit in memory and are stored on disk in a column-store format. In that case, for a specific set of columns, it is possible to reconstruct a subset of records in memory in column-store format, and even present it as a hypergraph (stay tuned for the next release of TRIADB).
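To make the reconstruction point concrete, here is a sketch of gathering the base-table rows behind one index entry (the arrays are restated from the example; the row ids are the 0-based positions stored in the adjacency lists):

```python
import numpy as np

# Base table columns, restated from the example above.
pk  = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='uint32')
val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1],
               dtype='float32')   # None becomes nan in a float array

# Adjacency list for the value 142.88, i.e. 0-based row positions.
rows = np.array([2, 3])

# Fancy indexing reconstructs the matching slice of each column.
subset_pk, subset_val = pk[rows], val[rows]
# subset_pk -> array([3, 4], dtype=uint32)
```

Because each column is gathered independently with the same row positions, the same adjacency list reconstructs any subset of columns of the base table.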
