Key:value store in Python for possibly 100 GB of data, without a client/server

There are many solutions to serialize a small dictionary: json.loads/json.dumps, pickle, shelve, ujson, or even by using sqlite.
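For the small-dictionary case, the standard-library shelve module already gives the d[key] = value feel. A minimal sketch (keys must be strings, values are pickled; the filename small_store is just an example):

import shelve

# Persist a dict-like object of pickled values (string keys only).
with shelve.open('small_store') as d:
    d['hello'] = 17
    d['vector'] = [12, 14, 24]
    print(d['hello'], d['vector'])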

But when dealing with possibly 100 GB of data, such modules are no longer an option: they may rewrite the whole dataset when closing / serializing.

redis is not really an option because it uses a client/server scheme.

Question: Which serverless key:value stores, able to work with 100+ GB of data, are frequently used in Python?

I'm looking for a solution with a standard "Pythonic" d[key] = value syntax:

import mydb

d = mydb.mydb('myfile.db')
d['hello'] = 17        # able to use string or int or float as key
d[183] = [12, 14, 24]  # able to store lists as values (will probably internally jsonify it?)
d.flush()              # easy to flush on disk

Note: BsdDB (BerkeleyDB) seems to be deprecated. There seem to be LevelDB bindings for Python, but they don't appear to be widely used - and I haven't found a version ready to use on Windows. Which ones are the most common?
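For reference, one of those LevelDB bindings is the third-party plyvel package. A sketch of its use (keys and values are raw bytes, and Windows support is exactly the part you would have to verify):

import plyvel

# Open (or create) a LevelDB database directory; keys and values are bytes.
db = plyvel.DB('./my_leveldb', create_if_missing=True)
db.put(b'hello', b'17')
print(db.get(b'hello'))  # b'17'
db.close()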

Solution

You can use sqlitedict, which provides a key-value interface to an SQLite database.

The SQLite limits page says the theoretical maximum is 140 TB, depending on page_size and max_page_count. However, the defaults for Python 3.5.2-2ubuntu0~16.04.4 (sqlite3 2.6.0) are page_size=1024 and max_page_count=1073741823. That gives a maximum database size of roughly 1100 GB, which fits your requirement.
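You can check those defaults on your own build with the sqlite3 module. A minimal sketch (test_limits.db is just a throwaway file used for inspection):

import sqlite3

conn = sqlite3.connect('test_limits.db')
page_size = conn.execute('PRAGMA page_size').fetchone()[0]        # bytes per page
max_page_count = conn.execute('PRAGMA max_page_count').fetchone()[0]  # maximum number of pages
print('page_size:', page_size)
print('max_page_count:', max_page_count)
print('max database size (GB): %.0f' % (page_size * max_page_count / 1e9))
conn.close()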

You can use the package like:

from sqlitedict import SqliteDict

mydict = SqliteDict('./my_db.sqlite', autocommit=True)
mydict['some_key'] = any_picklable_object
print(mydict['some_key'])
for key, value in mydict.items():
    print(key, value)
print(len(mydict))
mydict.close()
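If autocommit=True turns out to be too slow for bulk loading, sqlitedict also lets you batch writes and commit explicitly. A sketch, assuming the same my_db.sqlite file:

from sqlitedict import SqliteDict

# Batch many writes, then persist them in one transaction.
mydict = SqliteDict('./my_db.sqlite', autocommit=False)
for i in range(1000):
    mydict[str(i)] = {'value': i}
mydict.commit()  # flush pending writes to disk, similar to the d.flush() asked for above
mydict.close()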

Update

About memory usage: SQLite doesn't need your dataset to fit in RAM. By default it caches up to cache_size pages, which is barely 2 MiB (with the same Python as above; see the cache_size check after the measurements below). Here's a script you can use to check it with your data. Before running it, install the dependencies:

pip install lipsum psutil matplotlib psrecord sqlitedict

sqlitedct.py

#!/usr/bin/env python3
import os
import random
from contextlib import closing

import lipsum
from sqlitedict import SqliteDict

def main():
    with closing(SqliteDict('./my_db.sqlite', autocommit=True)) as d:
        for _ in range(100000):
            v = lipsum.generate_paragraphs(2)[0:random.randint(200, 1000)]
            d[os.urandom(10)] = v

if __name__ == '__main__':
    main()

Run it like ./sqlitedct.py & psrecord --plot=plot.png --interval=0.1 $!. In my case it produces a chart of CPU and memory usage over the run (plot.png).

And database file:

$ du -h my_db.sqlite
84M     my_db.sqlite
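To see the page-cache default mentioned above on your own build, a minimal sketch:

import sqlite3

conn = sqlite3.connect('./my_db.sqlite')
cache_size = conn.execute('PRAGMA cache_size').fetchone()[0]
# A negative value is in KiB (newer SQLite defaults to -2000, i.e. ~2 MiB);
# a positive value is a number of pages (older builds defaulted to 2000 pages).
print('cache_size:', cache_size)
conn.close()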
