Handling large, dense matrices in Python

Basically, what is the best way to go about storing and using dense matrices in python?

I have a project that generates similarity metrics between every pair of items in an array.

Each item is a custom class, and stores a pointer to the other class and a number representing its "closeness" to that class.
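For illustration only, a stripped-down sketch of the kind of link object I mean might look like this (class and attribute names are hypothetical, not the actual project code):

class SimilarityLink:
    """One link: a reference to another item plus a closeness score."""
    __slots__ = ('other', 'closeness')   # avoid a per-instance __dict__

    def __init__(self, other, closeness):
        self.other = other               # reference to the other item
        self.closeness = closeness       # e.g. a float in [0, 1]

class Item:
    def __init__(self, name):
        self.name = name
        self.links = []                  # list of SimilarityLink objects

    def add_link(self, other, closeness):
        self.links.append(SimilarityLink(other, closeness))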

Right now, it works brilliantly up to ~8,000 items, after which it fails with an out-of-memory error.

Basically, if you assume that each comparison uses ~30 bytes to store the similarity (which seems accurate based on testing), that means the total required memory is:

numItems^2 * itemSize = Memory

So the memory usage grows quadratically with the number of items.

In my case, the memory size is ~30 bytes per link, so:

8000 * 8000 * 30 = 1,920,000,000 bytes, or 1.9 GB

which is right around the memory limit for a single process.
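As a quick sanity check of that arithmetic (a throwaway snippet, using the ~30-bytes-per-link figure from testing):

num_items = 8000
bytes_per_link = 30                          # rough per-link cost observed in testing
total = num_items ** 2 * bytes_per_link
print(total, "bytes =", total / 1e9, "GB")   # 1920000000 bytes = 1.92 GB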

It seems to me that there has to be a more effective way of doing this. I've looked at memmapping, but it's already computationally intensive just to generate the similarity values, and bottlenecking it all through a hard drive seems a little ridiculous.
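For reference, the memmapping I looked at would be something along the lines of numpy.memmap, which keeps the full matrix in a file and only pages in what you touch (a sketch; the file name, shape, and dtype are placeholders):

import numpy as np

n = 20000
# File-backed array: the data lives on disk, only the touched pages sit in RAM.
sim = np.memmap('similarity.dat', dtype=np.float32, mode='w+', shape=(n, n))
sim[0, 1] = 0.87   # writes go through the OS page cache to the file
sim.flush()        # force any dirty pages out to disk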

Edit

I've looked at numpy and scipy. Unfortunately, they don't support very large arrays either:

>>> np.zeros((20000, 20000), dtype=np.uint16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
>>>

Further Edit

Numpy seems to be popular. However, numpy won't really do what I want, at least without another abstraction layer.

I don't want to store numbers, I want to store references to class instances. Numpy supports objects, but that doesn't really address the array size issues. I brought up numpy just as an example of what isn't working.
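For completeness, by "numpy supports objects" I mean object arrays like the sketch below, which still allocate a full n-by-n grid of references and so don't address the size problem at all:

import numpy as np

links = np.empty((100, 100), dtype=object)   # each cell holds a Python object reference
links[0, 1] = ("item_0", "item_1", 0.87)     # placeholder for a real link object
# Each cell still costs a pointer (~8 bytes) plus the linked object itself,
# so an n-by-n object array is no smaller than my current approach.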

Any advice?

Edit Well, I wound up just rewriting all the logic so it no longer stores any redundant values, reducing the number of stored links from n^2 to n*(n-1)/2.

Basically, this whole affair is a version of the handshake problem, so I've switched from storing all links to only a single version of each link.
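Concretely, storing only one copy of each link comes down to the standard mapping from an unordered pair (i, j) to a slot in a flat array of n*(n-1)/2 entries; a minimal sketch (the function name is mine, not from the project):

def tri_index(i, j, n):
    """Map an unordered pair (i, j), i != j, to an index into a flat
    array that holds only the n*(n-1)/2 unique links."""
    if i > j:
        i, j = j, i                          # store each pair only once
    return i * n - (i * (i + 1)) // 2 + (j - i - 1)

n = 5
closeness = [0.0] * (n * (n - 1) // 2)       # 10 unique pairs for n = 5
closeness[tri_index(3, 1, n)] = 0.42         # lands in the same slot as (1, 3)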

It's not a complete solution, but I generally don't have any datasets large enough to overflow it, so I think it will work out. PyTables is really interesting, but I don't know any SQL, and there doesn't appear to be any nice, traditional slicing- or index-based way to access the table data. I may revisit the issue in the future.

Solution

PyTables can handle tables of arbitrary size (millions of columns!) by using memmap and some clever compression.

Ostensibly, it provides SQL-like performance to Python. It will, however, require significant code modifications.
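For anyone trying this, the disk-backed array route in PyTables looks roughly like the following (a sketch against the tables 3.x API; the file and node names are placeholders, not anything from the project):

import numpy as np
import tables

n = 20000
filters = tables.Filters(complevel=5, complib='blosc')   # transparent compression

with tables.open_file('similarity.h5', mode='w') as h5:
    # A chunked, compressed on-disk array; only the chunks you touch hit RAM.
    sim = h5.create_carray(h5.root, 'sim', atom=tables.Float32Atom(),
                           shape=(n, n), filters=filters)
    sim[0, :] = np.random.rand(n).astype(np.float32)   # write one row
    row = sim[0, :]                                     # read it back as a numpy array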

I'm not going to accept this answer until I've done a more thorough vetting, to ensure it can actually do what I want. Or someone provides a better solution.
