mongo + python
Requirement:
Insert / update millions of records into MongoDB without duplicates.
Approach:
When the data volume is small and speed does not matter, just use insert_one() / update_one() directly;
for large volumes (millions of records), try the three methods below.
But!!! They still did not fully solve the problem: even method 3 was slow, so in the end I sliced the dataset and updated the data with multiple processes (a minimal sketch of that approach is given after method 3).
Before using any of the methods, connect to the database first.
import pymongo
## connect to mongo
host = 'xxxxxx'
port = 27017
conn = pymongo.MongoClient(host, port)
## get db and collection
db = conn.db_name
col = db.col_name
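For the small-data case mentioned in the approach above, one upserted write per document is usually enough. This is only a minimal sketch, assuming each item in data carries a 'url' field that identifies it uniquely (the same key used in the method 3 filter below):
## small data: one upserted write per document, no batching
for item in data:
    col.update_one({'url': item['url']},   ## filter on the assumed dedup key
                   {'$set': item},
                   upsert=True)            ## insert the document if nothing matches the filter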
1. insert_one
Insert the records one at a time.
for i in data:
    col.insert_one(i)
print('Insert DONE!')
2. insert_many
Batch insertion is somewhat faster than inserting one record at a time, but it cannot deduplicate while inserting (a common workaround is sketched after the code).
lst = []
for item in data:
    lst.append(item)
    if len(lst) == 50000:            ## flush every 50,000 documents
        col.insert_many(lst)
        lst = []
        print('Insert DONE!')
if lst:                              ## insert the remainder; insert_many rejects an empty list
    col.insert_many(lst)
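As a side note, a common way to make insert_many skip duplicates (not what the workflow above uses) is a unique index on the dedup key plus ordered=False, so the server rejects duplicate documents while still inserting the rest. A minimal sketch, assuming 'url' is the unique key (taken from the method 3 filter fields) and docs is one batch of documents:
from pymongo.errors import BulkWriteError

col.create_index('url', unique=True)         ## documents whose 'url' already exists are rejected
try:
    col.insert_many(docs, ordered=False)     ## unordered: keep inserting past duplicate-key errors
except BulkWriteError as e:
    ## duplicates surface here as E11000 write errors; the non-duplicates are already inserted
    print(len(e.details['writeErrors']), 'duplicates skipped')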
3. bulk_write (UpdateOne)
Even this method is still slow at the million-record scale, but it can update (upsert) data into the collection in batches without creating duplicates.
from pymongo import UpdateOne
import datetime

lst = []
for item in data:
    one = UpdateOne(filter_content,                   ## match on the dedup key(s), see below
                    {'$set': dict_need_to_insert},    ## fields to write, see below
                    upsert=True)                      ## note this: insert if nothing matches the filter
    lst.append(one)
    if len(lst) == 50000:                             ## flush every 50,000 operations
        col.bulk_write(lst)
        lst = []
        print('Insert DONE!')
if lst:                                               ## flush the remainder; bulk_write rejects an empty list
    col.bulk_write(lst)
'''
filter_content:
{'url': url,
'domain_name': domain_name, ...}
dict_need_to_insert:
{'domain_name': domain_name,
'source': source,
'url': url,
'insert_date_time': datetime.datetime.now()}
'''
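The multi-process fallback mentioned at the top of this note looks roughly like the following. This is only a sketch under my own assumptions: data is a list of dicts carrying the fields from the comment block above, 'url' + 'domain_name' form the dedup filter, host/port/db_name/col_name are the same as in the connection snippet, and each worker opens its own MongoClient (a client should not be shared across processes); upsert_chunk and n_workers are names made up for this example.
import datetime
import multiprocessing
import pymongo
from pymongo import UpdateOne

def upsert_chunk(chunk):
    ## each process gets its own connection
    client = pymongo.MongoClient(host, port)
    col = client.db_name.col_name
    ops = []
    for item in chunk:
        ops.append(UpdateOne({'url': item['url'], 'domain_name': item['domain_name']},
                             {'$set': {'domain_name': item['domain_name'],
                                       'source': item['source'],
                                       'url': item['url'],
                                       'insert_date_time': datetime.datetime.now()}},
                             upsert=True))
        if len(ops) == 50000:                  ## flush every 50,000 operations
            col.bulk_write(ops, ordered=False)
            ops = []
    if ops:
        col.bulk_write(ops, ordered=False)
    client.close()

if __name__ == '__main__':
    n_workers = 8                              ## hypothetical: tune to CPU cores and mongod capacity
    chunk_size = len(data) // n_workers + 1
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with multiprocessing.Pool(n_workers) as pool:
        pool.map(upsert_chunk, chunks)
    print('Upsert DONE!')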
REF:
https://blog.csdn.net/qq_42401024/article/details/102562332?spm=1001.2014.3001.5502
https://blog.csdn.net/nihaoxiaocui/article/details/95060906?spm=1001.2014.3001.5502
https://www.cnblogs.com/lkd8477604/p/9848958.html
https://www.cnblogs.com/lkd8477604/p/10201137.html
https://www.cnpython.com/qa/1103795
https://www.cnblogs.com/sanduzxcvbnm/p/10276845.html