Fastest way to load numeric data from MySQL into a python/pandas/numpy array

I want to read some numeric (double, i.e. float64) data from a MySQL table. The size of the data is ~200k rows.

MATLAB reference:

tic;
feature accel off;
conn = database(...);
c = fetch(exec(conn, 'select x,y from TABLENAME'));
cell2mat(c.data);
toc

Elapsed time is ~1 second.

Doing the same in Python, using the several examples found here (I have tried them all: pandas read_frame, frame_query, and the __processCursor function):

How to convert SQL Query result to PANDAS Data Structure?

Reference python code:

import pyodbc
import pandas.io.sql as psql
import pandas

connection_info = "DRIVER={MySQL ODBC 3.51 Driver};SERVER=;DATABASE=;USER=;PASSWORD=;OPTION=3;"
cnxn = pyodbc.connect(connection_info)
cursor = cnxn.cursor()
sql = "select x,y from TABLENAME"
#cursor.execute(sql)
#dataframe = __processCursor(cursor, dataframe=True)
#df = psql.frame_query(sql, cnxn, coerce_float=False)
df = psql.read_frame(sql, cnxn)
cnxn.close()

Takes ~6 seconds. Profiler says all the time spent was in read_frame.

I was wondering if anyone could give me some hints on how this could be sped up to at least match the MATLAB code, and whether that is possible at all in Python.

EDIT:

The bottleneck seems to be inside cursor.execute (in the pymysql library) or cursor.fetchall() (in the pyodbc library). The slowest part is reading the returned MySQL data element by element (row by row, column by column) and converting it to the data type the library inferred earlier.

So far I have managed to get close to the MATLAB speed with this really dirty solution:

import pymysql
import numpy

conn = pymysql.connect(host='', port=, user='', passwd='', db='')
cursor = conn.cursor()
cursor.execute("select x,y from TABLENAME")
rez = cursor.fetchall()
resarray = numpy.array(map(float, rez))
finalres = resarray.reshape((resarray.size/2, 2))

The cursor.execute above IS NOT the stock pymysql execute! I modified it, inside the file "connections.py". In the function def _read_rowdata_packet, the line:

rows.append(self._read_row_from_packet(packet))

is substituted with:

self._read_string_from_packet(rows, packet)

Here _read_string_from_packet is a simplified version of _read_row_from_packet with the code:

def _read_string_from_packet(self, rows, packet):
    for field in self.fields:
        data = packet.read_length_coded_string()
        rows.append(data)

This is an uber-dirty solution which cuts the time from 6 seconds down to 2.5 seconds. I was wondering if all of this could somehow be avoided by using a different library or by passing some parameters?

Hence the solution would be to bulk-read the entire MySQL reply into a list of strings and then bulk-convert it to numeric data types, instead of doing that element by element. Does something like that already exist in Python?
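For the bulk-convert step itself, numpy can already do this: casting a whole array of decimal strings to float64 happens in one vectorized pass at the C level. A minimal sketch (the raw list and the names raw/arr are mine, standing in for the flat string list produced by the patched driver above):

```python
import numpy as np

# Illustrative stand-in for the flat list of strings returned by the
# patched _read_string_from_packet: two columns (x, y), row by row.
raw = ["1.5", "2.5", "3.0", "4.25"]

# One vectorized string-to-float64 cast, then reshape to (rows, 2),
# instead of calling float() on every element from a Python loop.
arr = np.asarray(raw, dtype=np.float64).reshape(-1, 2)
```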

Solution

The "problem" seems to have been the type conversion which occurs from MySQL's decimal type to python's decimal.Decimal that MySQLdb, pymysql and pyodbc does on the data. By changing the converters.py file (at the very last lines) in MySQLdb to have:

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float

instead of decimal.Decimal seems to solve the problem completely, and now the following code:

import MySQLdb
import numpy
import time

t = time.time()
conn = MySQLdb.connect(host='',...)
curs = conn.cursor()
curs.execute("select x,y from TABLENAME")
data = numpy.array(curs.fetchall(), dtype=float)
print(time.time()-t)

Runs in less than a second!

Funnily enough, decimal.Decimal never showed up as the problem in the profiler.

A similar solution should work in the pymysql package. pyodbc is trickier: it is written entirely in C++, so you would have to recompile the whole package.

UPDATE

Here is a solution that does not require modifying the MySQLdb source code:

Python MySQLdb returns datetime.date and decimal

The solution, then, to load numeric data into pandas:

import MySQLdb
import pandas.io.sql as psql
from MySQLdb.converters import conversions
from MySQLdb.constants import FIELD_TYPE

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = MySQLdb.connect(host='', user='', passwd='', db='')
sql = "select * from NUMERICTABLE"
df = psql.read_frame(sql, conn)

Beats MATLAB by a factor of ~4 when loading a 200k x 9 table!
