Secondary in-memory index representations in Python

I am searching for an efficient way to build a secondary in-memory index in Python using a high-level, optimised numerical package such as NumPy or Arrow. I am excluding pandas for performance reasons.

Definition

"A secondary index contains an entry for each existing value of the attribute to be indexed. This entry can be seen as a key/value pair with the attribute value as key and as value a list of pointers to all records in the base table that have this value." - JV. D'Silva et al. (2017)

Let's take a simple example, we can scale this later on to produce some benchmarks:

import numpy as np

pk = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='uint32')

val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], dtype='float32')

Interestingly, the pyarrow.Array.dictionary_encode method can transform the value array into a dictionary-encoded representation that is close to a secondary index. Note that val must first be wrapped in an Arrow array:

import pyarrow as pa

pa.array(val).dictionary_encode()

Out[55]:

-- dictionary:
  [
    15.5,
    3.75,
    142.88,
    nan,
    7.2,
    2.1
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    3,
    3,
    3,
    4,
    5
  ]

I have opened an issue here

So, the question is how fast you can build a secondary index in memory using Python data structures that hold values and indices efficiently. But this is only half the story: the index is useful only if it also serves filtering queries (point, range) and transformations well, i.e. reconstruction of rows, columns, and associations (a.k.a. hyperedges in TRIADB). And even this quick description does not cover how easy it will be to update this kind of index.
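As a baseline to benchmark the optimised solutions against, the key/pointer-list structure from the definition can be sketched in plain Python (using the example arrays from above, with None marking missing data):

```python
from collections import defaultdict

# Plain-Python baseline: map each attribute value to the list of
# primary keys of the rows that contain it.
pk  = [1, 2, 3, 4, 5, 6, 7, 8, 9]
val = [15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1]

index = defaultdict(list)
for key, v in zip(pk, val):
    if v is not None:          # skip missing data
        index[v].append(key)

# index[142.88] -> [3, 4]
```

This is trivially easy to update, but a dict of Python lists is exactly the kind of pointer-chasing structure the question tries to avoid for large data sets.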

For many reasons, I have started investigating a possible PyArrow-based open-source solution. A sorted, dictionary-encoded representation should meet the requirements of the problem with an excellent combination of a smaller memory footprint and fast, flexible, zero-copy I/O processing.

Solution

I have searched, both in the past and more recently, for an open-source solution to this problem, but I have not found one that satisfies my appetite. This time I decided to start building my own and to discuss its implementation openly, covering also the null case, i.e. the missing-data scenario.

Do notice that a secondary index is very close to an adjacency-list representation, a core element in my TRIADB project, and that is the main reason behind my search for a solution.

Let's start with a one-line solution using numpy (plus the structured dtype it relies on):

struct_type = np.dtype([('pk', np.uint32), ('val', np.float32)])

idx = np.sort(np.array(list(zip(pk, val)), dtype=struct_type), order='val')

idx['val']

Out[68]:

array([ 2.1 , 3.75, 7.2 , 15.5 , 142.88, 142.88, nan, nan, nan], dtype=float32)

idx['pk']

Out[69]: array([9, 2, 8, 1, 3, 4, 5, 6, 7], dtype=uint32)

Faster solution (less generic)

This is the special, but perfectly valid, case where pk takes values in range(n), i.e. the primary key coincides with the 0-based row position.

idx_pk = np.argsort(val)

idx_pk

Out[91]: array([8, 1, 7, 0, 2, 3, 4, 5, 6])

idx_val = val[idx_pk]

idx_val

Out[93]: array([ 2.1 , 3.75, 7.2 , 15.5 , 142.88, 142.88, nan, nan, nan], dtype=float32)

There are a few more steps to get a secondary index representation according to the definition of JV. D'Silva et al.:

1. Get rid of nan (missing values)
2. Calculate the unique values of the secondary index
3. For each unique value, calculate the list of primary-key indices to all rows of the table that contain that value

Unique Secondary Index with adjacency lists

def secondary_index_with_adjacency_list(arr):
    idx_pk = np.argsort(arr)                    # NaNs sort to the end
    idx_val = arr[idx_pk]
    cnt = np.count_nonzero(~np.isnan(idx_val))  # number of non-missing values
    usec_ndx, split_ndx, cnt_arr = np.unique(idx_val[:cnt], return_index=True, return_counts=True)
    adj_list = np.split(idx_pk[:cnt], split_ndx)[1:]
    return usec_ndx, cnt_arr, adj_list

ndx, freq, adj = secondary_index_with_adjacency_list(val)

# pandas is used here only to pretty-print the result
import pandas as pd

pd.DataFrame({'val': ndx, 'freq': freq, 'adj': adj})

Out[11]:

val freq adj

0 2.10 1 [8]

1 3.75 1 [1]

2 7.20 1 [7]

3 15.50 1 [0]

4 142.88 2 [2, 3]
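With the unique index in hand, a point query is just a binary search over the sorted unique values followed by a fetch of the adjacency list. A minimal sketch, with ndx and adj restated literally from the output above so it runs standalone (casting the probe to float32 so it compares exactly against the stored float32 values):

```python
import numpy as np

# ndx: sorted unique values; adj: aligned adjacency lists of row positions,
# both restated from the DataFrame output above.
ndx = np.array([2.1, 3.75, 7.2, 15.5, 142.88], dtype='float32')
adj = [np.array([8]), np.array([1]), np.array([7]),
       np.array([0]), np.array([2, 3])]

def lookup(v):
    # Binary-search the sorted unique values, then return the pointer list.
    v = np.float32(v)                # match the index dtype exactly
    i = np.searchsorted(ndx, v)
    if i < len(ndx) and ndx[i] == v:
        return adj[i]
    return np.array([], dtype='int64')

# lookup(142.88) -> array([2, 3]); lookup(99.0) -> empty array
```

A range query works the same way with side='left'/side='right' bounds on np.searchsorted, concatenating the adjacency lists in between.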

Discussion

In practice it is faster to use the representation of the secondary index with repeated values than the one with lists of pointers to table records, but the second one has the interesting property of being closer to the hypergraph representation that I am using in TRIADB.

The kind of secondary index described in this solution is more suitable for analysis and filtering of big data sets that don't fit in memory and are stored on disk in a column-store format. In that case, for a specific set of columns, it is possible to reconstruct a subset of records in memory in column-store format, and even present it as a hypergraph (stay tuned for the next release of TRIADB).
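To make the reconstruction point concrete, here is a sketch of gathering the base-table rows behind one index entry (the arrays are restated from the example; the row ids are the 0-based positions stored in the adjacency lists):

```python
import numpy as np

# Base table columns, restated from the example above.
pk  = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='uint32')
val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1],
               dtype='float32')   # None becomes nan in a float array

# Adjacency list for the value 142.88, i.e. 0-based row positions.
rows = np.array([2, 3])

# Fancy indexing reconstructs the matching slice of each column.
subset_pk, subset_val = pk[rows], val[rows]
# subset_pk -> array([3, 4], dtype=uint32)
```

Because each column is gathered independently with the same row positions, the same adjacency list reconstructs any subset of columns of the base table.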
