mongodb 版本3.4 所以有些pymongo的api是老的:
代码如下:
代码来自:https://github.com/zshameel/MySQL2MongoDB
1,代码不支持多线程,修改代码支持了多线程。2,代码是高版本pymongo,如果使用mongodb3.4,已改成低版本的API。3,代码不支持流式传输,下面有流式传输的版本。
600万行的mysql sql文件大小400M select出来很快,导入2分片的Mongo 需要 45分钟 。因为不断在balance平衡,数据迁移所以有点慢。
python mysql2mongo.py
Script started at: 2022-02-10 20:50:41.123675
Existing documents will not be deleted from collections
Connecting to MySQL server...
Connection to MySQL Server succeeded.
Connecting to MongoDB server...
Connection to MongoDB Server succeeded.
Migration started...
The database exists.
1
Processing table: sha256_4b...
Processing table: sha256_4b completed. 6093126 documents inserted.
Migration completed.
1 of 1 tables migrated successfully.
Script completed at: 2022-02-10 21:34:51.752212
Total execution time: 0:44:10.628537
如果不分片,只用18分钟
python mysql2mongo.py
Script started at: 2022-02-10 19:27:21.900784
Existing documents will not be deleted from collections
Connecting to MySQL server...
Connection to MySQL Server succeeded.
Connecting to MongoDB server...
Connection to MongoDB Server succeeded.
Migration started...
The database does not exist, it is being created.
1
Processing table: sha256_4b...
Processing table: sha256_4b completed. 6093126 documents inserted.
Migration completed.
1 of 1 tables migrated successfully.
Script completed at: 2022-02-10 19:46:11.934297
Total execution time: 0:18:50.033513
多线程+mongo分片+非流式传输:
2个线程,导4张600完行,400M的表,(2线程4张表,当然跑了两次)用时1小时52分。
相当于如果2个线程,半个小时一张表
python mysql-to-mongo.py
Script started at: 2022-02-14 19:22:10.102511
Delete existing documents from collections (y)es/(n)o/(a)bort?y
Are you sure (y)es/(n)?y
Existing documents will be deleted from collections
Connecting to MySQL server...
Connection to MySQL Server succeeded.
Connecting to MongoDB server...
Connection to MongoDB Server succeeded.
Migration started...
The database exists.
4
Processing table: sha256_1b...
Processing table: sha256_3b...
Processing table: sha256_1b completed.
Processing table: sha256_3b completed.
Processing table: sha256_4b...
Processing table: sha256_ab...
Processing table: sha256_4b completed.
Processing table: sha256_ab completed.
Migration completed.
4 of 4 tables migrated successfully.
Script completed at: 2022-02-14 21:14:34.487499
Total execution time: 1:52:24.384988
使用流式传输的版本:
python mysql-to-mongo.py
Script started at: 2022-02-15 15:03:28.231224
Delete existing documents from collections (y)es/(n)o/(a)bort?y
Are you sure (y)es/(n)?y
Existing documents will be deleted from collections
Connecting to MySQL server...
Connection to MySQL Server succeeded.
Connecting to MongoDB server...
Connection to MongoDB Server succeeded.
Migration started...
The database exists.
4
Processing table: sha256_1b...
Processing table: sha256_3b...
Processing table: sha256_4b...
Processing table: sha256_ab...
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib64/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "mysql-to-mongo.py", line 66, in migrate_table
for row in mycursor:
File "/usr/local/lib64/python3.6/site-packages/pymysql/cursors.py", line 429, in fetchone
row = self.read_next()
File "/usr/local/lib64/python3.6/site-packages/pymysql/cursors.py", line 424, in read_next
return self._conv_row(self._result._read_rowdata_packet_unbuffered())
File "/usr/local/lib64/python3.6/site-packages/pymysql/connections.py", line 1244, in _read_rowdata_packet_unbuffered
packet = self.connection._read_packet()
File "/usr/local/lib64/python3.6/site-packages/pymysql/connections.py", line 711, in _read_packet
recv_data = self._read_bytes(bytes_to_read)
File "/usr/local/lib64/python3.6/site-packages/pymysql/connections.py", line 749, in _read_bytes
CR.CR_SERVER_LOST, "Lost connection to MySQL server during query"
pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
Exception ignored in: <bound method SSCursor.close of <pymysql.cursors.SSDictCursor object at 0x7f0c2e840b38>>
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/pymysql/cursors.py", line 401, in close
self._result._finish_unbuffered_query()
File "/usr/local/lib64/python3.6/site-packages/pymysql/connections.py", line 1261, in _finish_unbuffered_query
packet = self.connection._read_packet()
File "/usr/local/lib64/python3.6/site-packages/pymysql/connections.py", line 692, in _read_packet
packet_header = self._read_bytes(4)
File "/usr/local/lib64/python3.6/site-packages/pymysql/connections.py", line 729, in _read_bytes
self._sock.settimeout(self._read_timeout)
AttributeError: 'NoneType' object has no attribute 'settimeout'
2, 这个问题应该是connect 和 cursor没有close。属于实验检测。
python mysql-to-mongo.py
Script started at: 2022-02-15 18:08:36.461916
Delete existing documents from collections (y)es/(n)o/(a)bort?y
Are you sure (y)es/(n)?y
Existing documents will be deleted from collections
Connecting to MySQL server...
Connection to MySQL Server succeeded.
Connecting to MongoDB server...
Connection to MongoDB Server succeeded.
Migration started...
The database exists.
4
Processing table: sha256_1b...
Processing table: sha256_3b...
Processing table: sha256_4b...
Processing table: sha256_ab...
/usr/local/lib64/python3.6/site-packages/pymysql/connections.py:799: UserWarning: Previous unbuffered result was left incomplete
warnings.warn("Previous unbuffered result was left incomplete")
效率结论,改了这两处代码后,程序完全跑完了。
但效率比非流式反而慢了。四个线程同时导4张表,流式传输,一共用了4个小时。而非流式2个线程4个表导两次只需要两个小时。
python mysql-to-mongo.py
Script started at: 2022-02-16 11:59:04.748957
Delete existing documents from collections (y)es/(n)o/(a)bort?y
Are you sure (y)es/(n)?y
Existing documents will be deleted from collections
Connecting to MySQL server...
Connection to MySQL Server succeeded.
Connecting to MongoDB server...
Connection to MongoDB Server succeeded.
Migration started...
The database exists.
4
Processing table: sha256_1b...
Processing table: sha256_3b...
Processing table: sha256_4b...
Processing table: sha256_ab...
Processing table: sha256_3b completed. at: 2022-02-16 15:45:42.770882
Processing table: sha256_1b completed. at: 2022-02-16 15:46:06.436514
Processing table: sha256_ab completed. at: 2022-02-16 15:46:30.114281
Processing table: sha256_4b completed. at: 2022-02-16 15:55:32.537300
Migration completed.
0 of 4 tables migrated successfully.
Script completed at: 2022-02-16 15:55:32.538377
Total execution time: 3:56:27.789420
mysql-connector 是 MySQL 官方提供的驱动器, 它在Python中重新实现MySQL协议,它比较慢,但不需要C库,因此可移植性好。
MySQLdb是一个对于_mysql的Python包装器。其中_mysql也是该作者开发的模块,它依赖C库,所以说MYSQLdb也是依赖C库的。因此它的可移植性不太好,但是由于是基于C库实现的,它的速度会快一些。一些开发者为了效率甚至直接使用_mysql模块。
pymysql是由yutaka.matsubara开发维护的纯python实现的驱动器。它相对于mysql.connector, MYSQLdb来说比较年轻。它的效率和可移植性和mysql-connector理论上是差不多的。
语法查询:
https://www.runoob.com/python3/python-mysql-connector.html
pymysql 官方文档:
https://pymysql.readthedocs.io/en/latest/modules/connections.html
https://pymysql.readthedocs.io/en/latest/modules/cursors.html
在执行select等sql语句后
cur.execute(sql)
可以通过for
循环遍历cur
,此时cur
相当于生成器,不会直接存储所有数据,而是在循环时一条一条生成数据。
for i in cur:
print(i)
import MySQLdb.cursors
connection=MySQLdb.connect(
host="thehost",user="theuser",
passwd="thepassword",db="thedb", cursorclass = MySQLdb.cursors.SSCursor) cursor=connection.cursor() cursor.execute(query) for row in cursor: print(row)
- 使用迭代器而不用 fetchall ,即省内存又能很快拿到数据。
- 因为 SSCursor 是没有缓存的游标,结果集只要没取完,这个 conn 是不能再处理别的 sql,包括另外生成一个 cursor 也不行的。如果需要干别的,请另外再生成一个连接对象。
-
每次读取后处理数据要快,不能超过 60 s,否则 mysql 将会断开这次连接,也可以修改 SET NET_WRITE_TIMEOUT = xx 来增加超时间隔。
- In fact, the MySQL server will not store results in server even if a
SSCursor
is used, it will produce the part of results before fill them into the network packages. So don't worry the memory use of the server when useSSCursor
. 实际上,MySQL server端也不存数据,而是直接发网络数据包里,这也是第3点NET_WRITE_TIMEOUT的原因。