使用python将mysql数据导入mongodb分片数据

mongodb 版本3.4  所以有些pymongo的api是老的:

代码如下:

代码来自:https://github.com/zshameel/MySQL2MongoDB

1,代码不支持多线程,修改代码支持了多线程。2,代码是高版本pymongo,如果使用mongodb3.4,已改成低版本的API。3,代码不支持流式传输,下面有流式传输的版本。

import sys
import mysql.connector
import pymongo
import datetime
import enum
import numpy
import threading
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
class MsgType(enum.Enum):
    HEADER = 1
    OKBLUE = 2
    OKCYAN = 3
    OKGREEN = 4
    WARNING = 5
    FAIL = 6
    ENDC = 7
    BOLD = 8
    UNDERLINE = 9
#Pretty Print Function
def prettyprint(msg_text, msg_type):
    if msg_type == MsgType.HEADER:
        print(f"{bcolors.HEADER}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.OKBLUE:
        print(f"{bcolors.OKBLUE}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.OKCYAN:
        print(f"{bcolors.OKCYAN}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.OKGREEN:
        print(f"{bcolors.OKGREEN}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.WARNING:
        print(f"{bcolors.WARNING}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.FAIL:
        print(f"{bcolors.FAIL}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.BOLD:
        print(f"{bcolors.BOLD}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.UNDERLINE:
        print(f"{bcolors.UNDERLINE}{msg_text}{bcolors.ENDC}")
#Function migrate_table
def migrate_table(mysql_host, mysql_database, mysql_user, mysql_password, table_name):
    db = mysql.connector.connect(
        host=mysql_host,
        database=mysql_database,
        user=mysql_user,
        password=mysql_password
    )
    #TODO: Sanitize table name to conform to MongoDB Collection naming restrictions
    #For example, the $ sign is allowed in MySQL table names but not in MongoDB Collection names
    mycursor = db.cursor(dictionary=True)
    mycursor.execute("SELECT * FROM " + table_name + ";")
    myresult = mycursor.fetchall()
    mycol = mydb[table_name]
    if delete_existing_documents:
        #delete all documents in the collection
        mycol.delete_many({})
    #insert the documents
    if len(myresult) > 0:
        x = mycol.insert_many(myresult)
        return len(x.inserted_ids)
    else:
        return 0
begin_time = datetime.datetime.now()
abort = False
prettyprint(f"Script started at: {begin_time}", MsgType.HEADER)
delete_existing_documents = True;
mysql_host="192"
mysql_database="wh"
mysql_schema = "white"
mysql_user=""
mysql_password="pe"
#mongodb_host = "mongodb://1:27017/"
mongodb_host = "mongodb://:p@192:27017/admin"
mongodb_dbname = "w"
if (delete_existing_documents):
    confirm_delete = input("Delete existing documents from collections (y)es/(n)o/(a)bort?")
    if confirm_delete.lower() == "a":
        abort = True
    elif confirm_delete.lower() == "n":
        delete_existing_documents = False
    else:
        #Confirm again
        confirm_delete = input("Are you sure (y)es/(n)?")
        if confirm_delete.lower() == "y":
            delete_existing_documents = True
        else:
            abort = True
if abort:
    prettyprint("Script aborted by user", MsgType.FAIL)
else:
    if (delete_existing_documents):
        prettyprint("Existing documents will be deleted from collections", MsgType.FAIL)
    else:
        prettyprint("Existing documents will not be deleted from collections", MsgType.OKGREEN)
    #MySQL connection
    prettyprint("Connecting to MySQL server...", MsgType.HEADER)
    mysqldb = mysql.connector.connect(
        host=mysql_host,
        database=mysql_database,
        user=mysql_user,
        password=mysql_password
    )
    prettyprint("Connection to MySQL Server succeeded.", MsgType.OKGREEN)
    #MongoDB connection
    prettyprint("Connecting to MongoDB server...", MsgType.HEADER)
    myclient = pymongo.MongoClient(mongodb_host)
    mydb = myclient[mongodb_dbname]
    prettyprint("Connection to MongoDB Server succeeded.", MsgType.OKGREEN)
    #Start migration
    prettyprint("Migration started...", MsgType.HEADER)
    #dblist = myclient.list_database_names()
    dblist = myclient.database_names()
    if mongodb_dbname in dblist:
        prettyprint("The database exists.", MsgType.OKBLUE)
    else:
        prettyprint("The database does not exist, it is being created.", MsgType.WARNING)
    #Iterate through the list of tables in the schema
    table_list_cursor = mysqldb.cursor()
    table_list_cursor.execute("SELECT table_name FROM information_schema.tables WHERE table_schema = %s ORDER BY table_name LIMIT 15;", (mysql_schema,))
    tables = table_list_cursor.fetchall()
    total_count = len(tables)
    prettyprint(total_count, MsgType.OKGREEN)
    success_count = 0
    fail_count = 0
    tables_thread=numpy.array(tables).reshape(2,int(len(tables)/2))
    for table in tables_thread:
        try:
            prettyprint(f"Processing table: {table[0]}...", MsgType.OKCYAN)    
            #inserted_count = migrate_table(mysqldb, table[0])
            thread_0 = threading.Thread(target=migrate_table, args=(mysql_host, mysql_database, mysql_user, mysql_password, table[0]))
            thread_0.start()
            prettyprint(f"Processing table: {table[1]}...", MsgType.OKCYAN)    
            thread_1 = threading.Thread(target=migrate_table, args=(mysql_host, mysql_database, mysql_user, mysql_password, table[1]))
            thread_1.start()
            thread_0.join()
            thread_1.join()
            success_count += 2
            table_completed_time = datetime.datetime.now()
            prettyprint(f"Processing table: {table[0]} completed. at: {table_completed_time}", MsgType.OKGREEN)
            prettyprint(f"Processing table: {table[1]} completed. at: {table_completed_time}", MsgType.OKGREEN)
            #prettyprint(f"Processing table: {table[1]} completed. {inserted_count} documents inserted.", MsgType.OKGREEN)
        except Exception as e:
            fail_count += 2
            prettyprint(f"{e}", MsgType.FAIL)
    prettyprint("Migration completed.", MsgType.HEADER)
    prettyprint(f"{success_count} of {total_count} tables migrated successfully.", MsgType.OKGREEN)
    if fail_count > 0:
        prettyprint(f"Migration of {fail_count} tables failed. See errors above.", MsgType.FAIL)
end_time = datetime.datetime.now()
prettyprint(f"Script completed at: {end_time}", MsgType.HEADER)
prettyprint(f"Total execution time: {end_time-begin_time}", MsgType.HEADER)
运行效率:
单线程:

600万行的mysql sql文件大小400M select出来很快,导入2分片的Mongo 需要 45分钟 。因为不断在balance平衡,数据迁移所以有点慢。
python mysql2mongo.py
Script started at: 2022-02-10 20:50:41.123675
Existing documents will not be deleted from collections
Connecting to MySQL server...
Connection to MySQL Server succeeded.
Connecting to MongoDB server...
Connection to MongoDB Server succeeded.
Migration started...
The database exists.
1
Processing table: sha256_4b...
Processing table: sha256_4b completed. 6093126 documents inserted.
Migration completed.
1 of 1 tables migrated successfully.
Script completed at: 2022-02-10 21:34:51.752212
Total execution time: 0:44:10.628537

如果不分片,只用18分钟
python mysql2mongo.py
Script started at: 2022-02-10 19:27:21.900784
Existing documents will not be deleted from collections
Connecting to MySQL server...
Connection to MySQL Server succeeded.
Connecting to MongoDB server...
Connection to MongoDB Server succeeded.
Migration started...
The database does not exist, it is being created.
1
Processing table: sha256_4b...
Processing table: sha256_4b completed. 6093126 documents inserted.
Migration completed.
1 of 1 tables migrated successfully.
Script completed at: 2022-02-10 19:46:11.934297
Total execution time: 0:18:50.033513

多线程+mongo分片+非流式传输:

2个线程,导4张600完行,400M的表,(2线程4张表,当然跑了两次)用时1小时52分。

相当于如果2个线程,半个小时一张表

python mysql-to-mongo.py
Script started at: 2022-02-14 19:22:10.102511
Delete existing documents from collections (y)es/(n)o/(a)bort?y
Are you sure (y)es/(n)?y
Existing documents will be deleted from collections
Connecting to MySQL server...
Connection to MySQL Server succeeded.
Connecting to MongoDB server...
Connection to MongoDB Server succeeded.
Migration started...
The database exists.
4
Processing table: sha256_1b...
Processing table: sha256_3b...
Processing table: sha256_1b completed.
Processing table: sha256_3b completed.
Processing table: sha256_4b...
Processing table: sha256_ab...
Processing table: sha256_4b completed.
Processing table: sha256_ab completed.
Migration completed.
4 of 4 tables migrated successfully.
Script completed at: 2022-02-14 21:14:34.487499
Total execution time: 1:52:24.384988

 使用流式传输的版本:

from pickle import APPEND
import sys
import mysql.connector
import pymysql
import pymongo
import datetime
import enum
import numpy
import threading
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
class MsgType(enum.Enum):
    HEADER = 1
    OKBLUE = 2
    OKCYAN = 3
    OKGREEN = 4
    WARNING = 5
    FAIL = 6
    ENDC = 7
    BOLD = 8
    UNDERLINE = 9
#Pretty Print Function
def prettyprint(msg_text, msg_type):
    if msg_type == MsgType.HEADER:
        print(f"{bcolors.HEADER}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.OKBLUE:
        print(f"{bcolors.OKBLUE}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.OKCYAN:
        print(f"{bcolors.OKCYAN}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.OKGREEN:
        print(f"{bcolors.OKGREEN}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.WARNING:
        print(f"{bcolors.WARNING}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.FAIL:
        print(f"{bcolors.FAIL}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.BOLD:
        print(f"{bcolors.BOLD}{msg_text}{bcolors.ENDC}")
    elif msg_type == MsgType.UNDERLINE:
        print(f"{bcolors.UNDERLINE}{msg_text}{bcolors.ENDC}")
#Function migrate_table
def migrate_table(mysql_host, mysql_database, mysql_user, mysql_password, table_name):
    mycol = mydb[table_name]
    if delete_existing_documents:
        #delete all documents in the collection
        mycol.delete_many({})
    db = pymysql.connect(host=mysql_host,user=mysql_user,password=mysql_password,database=mysql_database)
    #TODO: Sanitize table name to conform to MongoDB Collection naming restrictions
    #For example, the $ sign is allowed in MySQL table names but not in MongoDB Collection names
    mycursor = db.cursor(pymysql.cursors.SSDictCursor)
    mycursor.execute("SELECT * FROM " + table_name + ";")
    try:
        for row in mycursor:
            #print(row)
            #insert the documents
            mycol.insert_one(row)
    except Exception as e:
        db.ping(True)
        table_row_run_time = datetime.datetime.now()
        prettyprint(f"Processing table: {table_name} row {row} have wrong. at: {table_row_run_time}", MsgType.WARNING)      
        prettyprint(f"{e}", MsgType.FAIL)
    table_completed_time = datetime.datetime.now()
    prettyprint(f"Processing table: {table_name} completed. at: {table_completed_time}", MsgType.OKGREEN)      
    return 0
begin_time = datetime.datetime.now()
abort = False
prettyprint(f"Script started at: {begin_time}", MsgType.HEADER)
delete_existing_documents = True;
mysql_host="192."
mysql_database="w"
mysql_schema = "w"
mysql_user=""
mysql_password=""
#mongodb_host = "mongodb://1:27017/"
mongodb_host = "mongodb://:@197:27017/admin"
mongodb_dbname = ""
if (delete_existing_documents):
    confirm_delete = input("Delete existing documents from collections (y)es/(n)o/(a)bort?")
    if confirm_delete.lower() == "a":
        abort = True
    elif confirm_delete.lower() == "n":
        delete_existing_documents = False
    else:
        #Confirm again
        confirm_delete = input("Are you sure (y)es/(n)?")
        if confirm_delete.lower() == "y":
            delete_existing_documents = True
        else:
            abort = True
if abort:
    prettyprint("Script aborted by user", MsgType.FAIL)
else:
    if (delete_existing_documents):
        prettyprint("Existing documents will be deleted from collections", MsgType.FAIL)
    else:
        prettyprint("Existing documents will not be deleted from collections", MsgType.OKGREEN)
    #MySQL connection
    prettyprint("Connecting to MySQL server...", MsgType.HEADER)
    mysqldb = mysql.connector.connect(
        host=mysql_host,
        database=mysql_database,
        user=mysql_user,
        password=mysql_password
    )
    prettyprint("Connection to MySQL Server succeeded.", MsgType.OKGREEN)
    #MongoDB connection
    prettyprint("Connecting to MongoDB server...", MsgType.HEADER)
    myclient = pymongo.MongoClient(mongodb_host)
    mydb = myclient[mongodb_dbname]
    prettyprint("Connection to MongoDB Server succeeded.", MsgType.OKGREEN)
    #Start migration
    prettyprint("Migration started...", MsgType.HEADER)
    #dblist = myclient.list_database_names()
    dblist = myclient.database_names()
    if mongodb_dbname in dblist:
        prettyprint("The database exists.", MsgType.OKBLUE)
    else:
        prettyprint("The database does not exist, it is being created.", MsgType.WARNING)
    #Iterate through the list of tables in the schema
    table_list_cursor = mysqldb.cursor()
    table_list_cursor.execute("SELECT table_name FROM information_schema.tables WHERE table_schema = %s ORDER BY table_name LIMIT 15;", (mysql_schema,))
    tables = table_list_cursor.fetchall()
    total_count = len(tables)
    prettyprint(total_count, MsgType.OKGREEN)
    success_count = 0
    fail_count = 0
    #tables_thread=numpy.array(tables).reshape(1,int(len(tables)/1))
    thread_list = []
    for table in tables:
        prettyprint(f"Processing table: {table[0]}...", MsgType.OKCYAN)    
        t = threading.Thread(target=migrate_table, args=(mysql_host, mysql_database, mysql_user, mysql_password, table[0]))
        thread_list.append(t)
    for thread_start in thread_list:
        thread_start.start()
        #success_count += 1
    for thread_join in thread_list:
        thread_join.join()
    prettyprint("Migration completed.", MsgType.HEADER)
    prettyprint(f"{success_count} of {total_count} tables migrated successfully.", MsgType.OKGREEN)
    if fail_count > 0:
        prettyprint(f"Migration of {fail_count} tables failed. See errors above.", MsgType.FAIL)
end_time = datetime.datetime.now()
prettyprint(f"Script completed at: {end_time}", MsgType.HEADER)
prettyprint(f"Total execution time: {end_time-begin_time}", MsgType.HEADER)
使用流式的方式,在测试环境跑了两次,有两次错误,都没跑完,而全量导的时候次次都成功,说明流式方式对程序环境还是有要求的。
这两次报的错误分别是:
1,这个问题加了try catch 解决

python mysql-to-mongo.py
Script started at: 2022-02-15 15:03:28.231224
Delete existing documents from collections (y)es/(n)o/(a)bort?y
Are you sure (y)es/(n)?y
Existing documents will be deleted from collections
Connecting to MySQL server...
Connection to MySQL Server succeeded.
Connecting to MongoDB server...
Connection to MongoDB Server succeeded.
Migration started...
The database exists.
4
Processing table: sha256_1b...
Processing table: sha256_3b...
Processing table: sha256_4b...
Processing table: sha256_ab...
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib64/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "mysql-to-mongo.py", line 66, in migrate_table
for row in mycursor:
File "/usr/local/lib64/python3.6/site-packages/pymysql/cursors.py", line 429, in fetchone
row = self.read_next()
File "/usr/local/lib64/python3.6/site-packages/pymysql/cursors.py", line 424, in read_next
return self._conv_row(self._result._read_rowdata_packet_unbuffered())
File "/usr/local/lib64/python3.6/site-packages/pymysql/connections.py", line 1244, in _read_rowdata_packet_unbuffered
packet = self.connection._read_packet()
File "/usr/local/lib64/python3.6/site-packages/pymysql/connections.py", line 711, in _read_packet
recv_data = self._read_bytes(bytes_to_read)
File "/usr/local/lib64/python3.6/site-packages/pymysql/connections.py", line 749, in _read_bytes
CR.CR_SERVER_LOST, "Lost connection to MySQL server during query"
pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')

Exception ignored in: <bound method SSCursor.close of <pymysql.cursors.SSDictCursor object at 0x7f0c2e840b38>>
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/pymysql/cursors.py", line 401, in close
self._result._finish_unbuffered_query()
File "/usr/local/lib64/python3.6/site-packages/pymysql/connections.py", line 1261, in _finish_unbuffered_query
packet = self.connection._read_packet()
File "/usr/local/lib64/python3.6/site-packages/pymysql/connections.py", line 692, in _read_packet
packet_header = self._read_bytes(4)
File "/usr/local/lib64/python3.6/site-packages/pymysql/connections.py", line 729, in _read_bytes
self._sock.settimeout(self._read_timeout)
AttributeError: 'NoneType' object has no attribute 'settimeout'

2, 这个问题应该是connect 和 cursor没有close。属于实验检测。

python mysql-to-mongo.py
Script started at: 2022-02-15 18:08:36.461916
Delete existing documents from collections (y)es/(n)o/(a)bort?y
Are you sure (y)es/(n)?y
Existing documents will be deleted from collections
Connecting to MySQL server...
Connection to MySQL Server succeeded.
Connecting to MongoDB server...
Connection to MongoDB Server succeeded.
Migration started...
The database exists.
4
Processing table: sha256_1b...
Processing table: sha256_3b...
Processing table: sha256_4b...
Processing table: sha256_ab...
/usr/local/lib64/python3.6/site-packages/pymysql/connections.py:799: UserWarning: Previous unbuffered result was left incomplete
warnings.warn("Previous unbuffered result was left incomplete")

效率结论,改了这两处代码后,程序完全跑完了。

但效率比非流式反而慢了。四个线程同时导4张表,流式传输,一共用了4个小时。而非流式2个线程4个表导两次只需要两个小时。

python mysql-to-mongo.py
Script started at: 2022-02-16 11:59:04.748957
Delete existing documents from collections (y)es/(n)o/(a)bort?y
Are you sure (y)es/(n)?y
Existing documents will be deleted from collections
Connecting to MySQL server...
Connection to MySQL Server succeeded.
Connecting to MongoDB server...
Connection to MongoDB Server succeeded.
Migration started...
The database exists.
4
Processing table: sha256_1b...
Processing table: sha256_3b...
Processing table: sha256_4b...
Processing table: sha256_ab...
Processing table: sha256_3b completed. at: 2022-02-16 15:45:42.770882
Processing table: sha256_1b completed. at: 2022-02-16 15:46:06.436514
Processing table: sha256_ab completed. at: 2022-02-16 15:46:30.114281
Processing table: sha256_4b completed. at: 2022-02-16 15:55:32.537300
Migration completed.
0 of 4 tables migrated successfully.
Script completed at: 2022-02-16 15:55:32.538377
Total execution time: 3:56:27.789420

 --------------------------------------------------------------------------------------------------------------------------------
python连mysql,可以使用三种方式:

mysql-connector 是 MySQL 官方提供的驱动器, 它在Python中重新实现MySQL协议,它比较慢,但不需要C库,因此可移植性好。

MySQLdb是一个对于_mysql的Python包装器。其中_mysql也是该作者开发的模块,它依赖C库,所以说MYSQLdb也是依赖C库的。因此它的可移植性不太好,但是由于是基于C库实现的,它的速度会快一些。一些开发者为了效率甚至直接使用_mysql模块。

pymysql是由yutaka.matsubara开发维护的纯python实现的驱动器。它相对于mysql.connector, MYSQLdb来说比较年轻。它的效率和可移植性和mysql-connector理论上是差不多的。

语法查询:

https://www.runoob.com/python3/python-mysql-connector.html

 pymysql 官方文档:

https://pymysql.readthedocs.io/en/latest/modules/connections.html

https://pymysql.readthedocs.io/en/latest/modules/cursors.html

导大表,使用python mysql流式游标:

在执行select等sql语句后

cur.execute(sql)

可以通过for循环遍历cur,此时cur相当于生成器,不会直接存储所有数据,而是在循环时一条一条生成数据。

for i in cur:
        print(i)
import MySQLdb.cursors
connection=MySQLdb.connect(
    host="thehost",user="theuser",
    passwd="thepassword",db="thedb", cursorclass = MySQLdb.cursors.SSCursor) cursor=connection.cursor() cursor.execute(query) for row in cursor: print(row)
  1. 使用迭代器而不用 fetchall ,即省内存又能很快拿到数据。
  2. 因为 SSCursor 是没有缓存的游标,结果集只要没取完,这个 conn 是不能再处理别的 sql,包括另外生成一个 cursor 也不行的。如果需要干别的,请另外再生成一个连接对象。
  3. 每次读取后处理数据要快,不能超过 60 s,否则 mysql 将会断开这次连接,也可以修改 SET NET_WRITE_TIMEOUT = xx 来增加超时间隔。

  4. In fact, the MySQL server will not store results in server even if a SSCursor is used, it will produce the part of results before fill them into the network packages. So don't worry the memory use of the server when use SSCursor. 实际上,MySQL server端也不存数据,而是直接发网络数据包里,这也是第3点NET_WRITE_TIMEOUT的原因。
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值