Fastest way to load numeric data from MySQL into a python/pandas/numpy array

I want to read some numeric (double, i.e. float64) data from a MySQL table. The size of the data is ~200k rows.

MATLAB reference:

tic;
feature accel off;
conn = database(...);
c = fetch(exec(conn, 'select x,y from TABLENAME'));
cell2mat(c.data);
toc

Elapsed time is ~1 second.

Doing the same in Python, using the several examples found here (I have tried them all: pandas read_frame, frame_query, and the __processCursor function):

How to convert SQL Query result to PANDAS Data Structure?

Reference python code:

import pyodbc
import pandas.io.sql as psql
import pandas

connection_info = "DRIVER={MySQL ODBC 3.51 Driver};SERVER=;DATABASE=;USER=;PASSWORD=;OPTION=3;"
cnxn = pyodbc.connect(connection_info)
cursor = cnxn.cursor()
sql = "select x,y from TABLENAME"
#cursor.execute(sql)
#dataframe = __processCursor(cursor, dataframe=True)
#df = psql.frame_query(sql, cnxn, coerce_float=False)
df = psql.read_frame(sql, cnxn)
cnxn.close()

Takes ~6 seconds. Profiler says all the time spent was in read_frame.

I was wondering if anyone could give me some hints on how this could be sped up to at least match the MATLAB code, and whether that is possible at all in Python.

EDIT:

The bottleneck seems to be inside cursor.execute (in the pymysql library) or cursor.fetchall() (in the pyodbc library). The slowest part is reading the returned MySQL data element by element (row by row, column by column) and converting it to the data type the library inferred earlier.

So far I have managed to get close to the MATLAB speed with this really dirty solution:

import pymysql
import numpy

conn = pymysql.connect(host='', port=, user='', passwd='', db='')
cursor = conn.cursor()
cursor.execute("select x,y from TABLENAME")
rez = cursor.fetchall()
resarray = numpy.array(map(float, rez))
finalres = resarray.reshape((resarray.size/2, 2))

The cursor.execute above IS NOT the stock pymysql execute! I modified it, inside the file "connections.py". In the function def _read_rowdata_packet, the line:

rows.append(self._read_row_from_packet(packet))

is substituted with:

self._read_string_from_packet(rows, packet)

Here _read_string_from_packet is a simplified version of _read_row_from_packet with the code:

def _read_string_from_packet(self, rows, packet):
    for field in self.fields:
        data = packet.read_length_coded_string()
        rows.append(data)

This is an uber-dirty solution which cuts the time from 6 seconds down to 2.5 seconds. I was wondering if all of this could somehow be avoided by using a different library or by passing some parameters?

Hence the solution would be to bulk-read the entire MySQL reply into a list of strings and then bulk-convert it to numeric data types, instead of doing that element by element. Does something like that already exist in Python?
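For the bulk-convert step itself, numpy can already do this: casting a whole array of decimal strings to float64 happens in one vectorized pass at the C level. A minimal sketch (the raw list and the names raw/arr are mine, standing in for the flat string list produced by the patched driver above):

```python
import numpy as np

# Illustrative stand-in for the flat list of strings returned by the
# patched _read_string_from_packet: two columns (x, y), row by row.
raw = ["1.5", "2.5", "3.0", "4.25"]

# One vectorized string-to-float64 cast, then reshape to (rows, 2),
# instead of calling float() on every element from a Python loop.
arr = np.asarray(raw, dtype=np.float64).reshape(-1, 2)
```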

Solution

The "problem" seems to have been the type conversion which occurs from MySQL's decimal type to python's decimal.Decimal that MySQLdb, pymysql and pyodbc does on the data. By changing the converters.py file (at the very last lines) in MySQLdb to have:

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float

instead of decimal.Decimal seems to solve the problem completely, and now the following code:

import MySQLdb
import numpy
import time

t = time.time()
conn = MySQLdb.connect(host='',...)
curs = conn.cursor()
curs.execute("select x,y from TABLENAME")
data = numpy.array(curs.fetchall(), dtype=float)
print(time.time()-t)

Runs in less than a second!

Funnily enough, decimal.Decimal never showed up as the problem in the profiler.

A similar solution should work in the pymysql package. pyodbc is trickier: it is written entirely in C++, so you would have to recompile the whole package.

UPDATE

Here is a solution that does not require modifying the MySQLdb source code:

Python MySQLdb returns datetime.date and decimal

The solution, then, to load numeric data into pandas:

import MySQLdb
import pandas.io.sql as psql
from MySQLdb.converters import conversions
from MySQLdb.constants import FIELD_TYPE

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = MySQLdb.connect(host='', user='', passwd='', db='')
sql = "select * from NUMERICTABLE"
df = psql.read_frame(sql, conn)

Beats MATLAB by a factor of ~4 when loading a 200k x 9 table!
