文章目录
前言
场景:现有一个内含1亿行数据的超大csv格式文件(一般大于5GB),需要导入mysql数据库,最终需要从库中快速统计出销售额Top5的5天日期以及对应销售总额,方案简单高效最好。
csv文件数据结构如下:
销售日期 | 货物编号 | 生产序号 | 供应商编号 | 销售数量 | 成本价(元) | 销售价(元) |
---|---|---|---|---|---|---|
20211117 | A-7952 | F-1797 | P-1911 | 683 | 204.11 | 273.51 |
方案介绍
目前mysql公认最快导入方式应该是自带的命令:load data infile方式。其余方案还有如excutemany,sql拼接存储过程,sql批量提交等,亲测均不如load data infile方案。当然所有方案不限语言,本次采用python实现,代码量非常少。另在其他文章中还听闻采用R语言实现,可以将机器性能发挥到极致,实现极速导入,暂时没有实践,有兴趣的小伙伴可以试试,欢迎反馈。
方案优化
即便采用load data infile方式,如果无脑直接load整个5G文件,也不见得很快。受单个连接写入性能瓶颈影响,耗时远超10min,感觉不能接受。单个写入时可以发现cpu以及磁盘写入都还有很大利用空间,所以优化为分库20个表(当然可以更多,需要结合机器硬件指标以及后续数据怎么使用折中考虑),采用线程池初始化对应连接,也就是20个并发写入,导入时间可缩短到5min内。再者得益于拆分了20个表,统计Top5也变得简单。单个表也就500w数据,基本不用怎么设计,性能都不会太差,总统计耗时约10s左右。
执行步骤
一、csv文件拆分
需要使用 rb 模式读取文件,性能最快
def Main(argv):
print('拆分{}个文件'.format(argv[0]))
# 写清楚存储路径,否则该路径默认为程序运行路径
if not os.path.exists(argv[2]):
os.makedirs(argv[2])
# 计数器
flag = 0
# 文件名
name = 1
# 存放数据
dataList = []
# print("拆分文件开始时间")
# print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
with open(argv[1], 'rb') as f_source:
print('跳过首行;' + f_source.readline().decode())
for line in f_source:
flag += 1
dataList.append(line)
if flag == 100000000 / int(argv[0]): # 切割为1亿行 / 20一份的文件
with open(argv[2] + "/data_split_" + str(name) + ".csv", 'wb+') as f_target:
for data in dataList:
f_target.write(data)
name += 1
flag = 0
dataList = []
print("拆分文件完成时间")
print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
二、初始化连接池和线程池
pool = PooledDB(MySQLdb, 20, host=host, user=user, passwd=passwd, db=db,
port=3306)
class myThread(threading.Thread):
def __init__(self, threadID, name, fileName, tableName):
threading.Thread.__init__(self)
self.threadID = threadID
self.name = name
self.fileName = fileName
self.tableName = tableName
def run(self):
load_csv(self.fileName, self.tableName)
class splitThread(threading.Thread):
def __init__(self, threadID, name, argv):
threading.Thread.__init__(self)
self.threadID = threadID
self.name = name
self.argv = argv
def run(self):
mySplit.Main(self.argv)
三、load每个csv文件
如果数据库字段与csv文件列不是一一对应关系,需要在load时做简单转换,例如下面的 “per_profit” 列,可使用set关键字设值,具体可以参考mysql语法教程
def load_csv(csv_file_path, table_name, database=db):
conn = pool.connection()
cur = conn.cursor()
data_sql = "LOAD DATA LOCAL INFILE '%s' INTO TABLE %s CHARACTER SET utf8 FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' (sale_date, goods_no, producer_no, agent_no,sale_amount,per_cost_price,per_sales_price) set per_profit = per_sales_price - per_cost_price;" % (
csv_file_path, table_name)
# 使用数据库
cur.execute('use %s' % database)
cur.execute(' SET GLOBAL local_infile = 1 ;')
# 执行data_sql,导入数据
cur.execute(data_sql)
conn.commit()
# 关闭连接
cur.close()
conn.close()
print('导入【' + csv_file_path + '】完成...', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
四、统计Top5
针对每个表,从中聚合获取每一天的总销售额
def queryAll(num):
# 获取数据库连接
conn = pool.connection()
cursor = conn.cursor()
# 使用数据库
cursor.execute('use %s' % db)
sql = "select sale_date as date,sum(CAST(sale_amount AS DECIMAL(18,2))*CAST(per_profit AS DECIMAL(18,2))) as amount from sale_info_" + str(num) + " group by sale_date"
cursor.execute(sql)
d1 = dict(list(cursor.fetchall()))
dataArr.append(d1)
cursor.close()
conn.close()
待所有查询完后,合并20个结果集,按天累加并排序得到Top5
----Tip:此处应用到python字典类型的按key累加
result = {}
for dict_ in dataArr:
x, y = Counter(result), Counter(dict_)
result = dict(x + y)
list_ = sorted(result.items(), key=lambda kv: (kv[1], kv[0]))
print('top5为:')
for i in range(5):
list__ = list(list_[list_.__len__() - 1 - i])
print ("%s,%s" % (list__[0],list__[1].quantize(Decimal('0.00'))))
完整代码
README.md
#------安装python3环境,如果存在可跳过
wget https://www.python.org/ftp/python/3.8.0/Python-3.8.0.tgz
tar zxf Python-3.8.0.tgz
yum update -y
yum groupinstall -y 'Development Tools'
yum install -y gcc openssl-devel bzip2-devel libffi-devel
./configure prefix=/usr/local/python3 --enable-optimizations
make && make install
#------安装python3环境结束
#如果原设备存在python2.7版本,可使用如下命令临时添加环境变量
PATH=/usr/local/python3/bin:$PATH
export PATH
#使用pip3安装必要连接驱动等依赖
pip3 install mysqlclient
pip3 install DBUtils==1.3
#执行步骤
0、执行ddl.sql中建表脚本并初始化py文件中数据库连接信息
修改MyThread.py中数据库连接信息
修改parallelQuery.py中数据库连接信息
1、导入csv到数据库
shell命令: python3.8 myThread.py 20 source_csv_data_file target_temp_dir
参数说明:source_csv_data_file csv数据包文件地址;target_temp_dir 任意可写文件夹,用于缓存拆分后csv文件
可参考执行脚本:sh startLoad.sh
2、查询结果
shell命令: python3.8 parallelQuery.py
可参考执行脚本:sh startQuery.sh
ddl.sql
create database if not exists tournament;
use tournament;
create table if not exists `sale_info_1`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_2`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_3`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_4`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_5`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_6`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_7`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_8`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_9`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_10`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_11`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_12`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_13`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_14`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_15`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_16`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_17`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_18`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_19`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
create table if not exists `sale_info_20`
(
`ID` bigint(20) not null auto_increment COMMENT '自增ID',
`sale_date` date not null COMMENT '销售日期',
`goods_no` varchar(11) not null COMMENT '货物编号',
`producer_no` varchar(8) not null COMMENT '生产厂家编号',
`agent_no` varchar(8) not null COMMENT '代理商编号',
`sale_amount` int(4) not null COMMENT '售卖件数',
`per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
`per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
`per_profit` varchar(10) not null COMMENT '单件利润(元)',
primary key (`ID`)
using BTREE
) ENGINE = InnoDB;
mySplit.py
# -*- coding: utf-8 -*-
"""
Created on Mon Mar 30 10:24:03 2020
"""
import os
import sys
from datetime import datetime
def Main(argv):
print('拆分{}个文件'.format(argv[0]))
# 写清楚存储路径,否则该路径默认为程序运行路径
if not os.path.exists(argv[2]):
os.makedirs(argv[2])
# 计数器
flag = 0
# 文件名
name = 1
# 存放数据
dataList = []
# print("拆分文件开始时间")
# print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
with open(argv[1], 'rb') as f_source:
print('跳过首行;' + f_source.readline().decode())
for line in f_source:
flag += 1
dataList.append(line)
if flag == 100000000 / int(argv[0]): # 切割为1亿行 / 80一份的文件
with open(argv[2] + "/data_split_" + str(name) + ".csv", 'wb+') as f_target:
for data in dataList:
f_target.write(data)
name += 1
flag = 0
dataList = []
print("拆分文件完成时间")
print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
if __name__ == "__main__":
Main(sys.argv[1:])
myThread.py
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import os
import sys
import threading
import time
import MySQLdb
from DBUtils.PooledDB import PooledDB
import mySplit
host = 'localhost'
user = 'root'
passwd = '123456'
db = 'tournament'
pool = PooledDB(MySQLdb, 20, host=host, user=user, passwd=passwd, db=db,
port=3306)
class myThread(threading.Thread):
def __init__(self, threadID, name, fileName, tableName):
threading.Thread.__init__(self)
self.threadID = threadID
self.name = name
self.fileName = fileName
self.tableName = tableName
def run(self):
load_csv(self.fileName, self.tableName)
class splitThread(threading.Thread):
def __init__(self, threadID, name, argv):
threading.Thread.__init__(self)
self.threadID = threadID
self.name = name
self.argv = argv
def run(self):
mySplit.Main(self.argv)
# load_csv函数,参数分别为csv文件路径,表名称,数据库名称
def load_csv(csv_file_path, table_name, database=db):
conn = pool.connection()
cur = conn.cursor()
data_sql = "LOAD DATA LOCAL INFILE '%s' INTO TABLE %s CHARACTER SET utf8 FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' (sale_date, goods_no, producer_no, agent_no,sale_amount,per_cost_price,per_sales_price) set per_profit = per_sales_price - per_cost_price;" % (
csv_file_path, table_name)
# 使用数据库
cur.execute('use %s' % database)
cur.execute(' SET GLOBAL local_infile = 1 ;')
# 执行data_sql,导入数据
cur.execute(data_sql)
conn.commit()
# 关闭连接
cur.close()
conn.close()
print('导入【' + csv_file_path + '】完成...', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
def Main(argv):
# 开启拆分csv线程
thread = splitThread(1, "Thread-split", argv)
thread.start()
# 开启新线程
print('开始监听导入数据...', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
threadList = []
n = 0
# 导入多少个文件
target = int(argv[0])
while (n < target):
fileName = argv[2] + '/data_split_' + str(n + 1) + '.csv'
if (os.path.exists(fileName)):
threadList.append(myThread(n + 1, "Thread-" + str(n + 1), fileName, 'sale_info_' + str((n % 20 + 1))))
threadList[n].start()
n += 1
else:
time.sleep(1)
for i in range(target):
if (threadList[i].is_alive()):
threadList[i].join()
# print("退出主线程")
print('导入结束...', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
pool.close()
if __name__ == '__main__':
Main(sys.argv[1:])
parallelQuery.py
# -*- coding: utf-8 -*-
"""
Created on Mon Mar 30 10:24:03 2020
"""
from concurrent.futures._base import wait
from concurrent.futures.thread import ThreadPoolExecutor
from datetime import datetime
from collections import Counter
from decimal import Decimal
import MySQLdb
from DBUtils.PooledDB import PooledDB
host = 'localhost'
user = 'root'
passwd = '123456'
db = 'tournament'
pool = PooledDB(MySQLdb, 20, host=host, user=user, passwd=passwd, db=db,
port=3306)
def startBatchQuery():
executor = ThreadPoolExecutor(20)
future_list = [executor.submit(takePartIn, i + 1) for i in range(20)]
# 使用wait方法等待所有任务完成
wait(future_list)
# for f in future_list:
# r = f.result()
executor.shutdown()
def takePartIn(num):
queryAll(num)
return num
dataArr = []
def queryAll(num):
# 获取数据库连接
conn = pool.connection()
cursor = conn.cursor()
# 使用数据库
cursor.execute('use %s' % db)
sql = "select sale_date as date,sum(CAST(sale_amount AS DECIMAL(18,2))*CAST(per_profit AS DECIMAL(18,2))) as amount from sale_info_" + str(num) + " group by sale_date"
cursor.execute(sql)
d1 = dict(list(cursor.fetchall()))
dataArr.append(d1)
cursor.close()
conn.close()
def Main():
print("查询开始时间")
print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
startBatchQuery()
result = {}
for dict_ in dataArr:
x, y = Counter(result), Counter(dict_)
result = dict(x + y)
list_ = sorted(result.items(), key=lambda kv: (kv[1], kv[0]))
print('top5为:')
for i in range(5):
list__ = list(list_[list_.__len__() - 1 - i])
print ("%s,%s" % (list__[0],list__[1].quantize(Decimal('0.00'))))
print("查询结束时间")
print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
if __name__ == "__main__":
Main()
startLoad.sh
#!/bin/bash
current=`date "+%Y-%m-%d %H:%M:%S"`
timeStamp=`date -d "$current" +%s`
currentTimeStamp=$((timeStamp*1000+10#`date "+%N"`/1000000)) #将current转换为时间戳,精确到毫秒
echo 开始时间:$current
#PATH=/usr/local/python3/bin:$PATH
#export PATH
python3.8 myThread.py 20 /home/data.csv /home/tempDir
current1=`date "+%Y-%m-%d %H:%M:%S"`
timeStamp1=`date -d "$current1" +%s`
currentTimeStamp1=$((timeStamp1*1000+10#`date "+%N"`/1000000)) #将current1转换为时间戳,精确到毫秒
echo 结束时间:$current1
span=$[currentTimeStamp1-currentTimeStamp]
echo 导入耗时:$[span/1000]s
exit
startQuery.sh
#!/bin/bash
current=`date "+%Y-%m-%d %H:%M:%S"`
timeStamp=`date -d "$current" +%s`
currentTimeStamp=$((timeStamp*1000+10#`date "+%N"`/1000000)) #将current转换为时间戳,精确到毫秒
echo 开始时间:$current
#PATH=/usr/local/python3/bin:$PATH
#export PATH
python3.8 parallelQuery.py
current1=`date "+%Y-%m-%d %H:%M:%S"`
timeStamp1=`date -d "$current1" +%s`
currentTimeStamp1=$((timeStamp1*1000+10#`date "+%N"`/1000000)) #将current1转换为时间戳,精确到毫秒
echo 结束时间:$current1
span=$[currentTimeStamp1-currentTimeStamp]
echo 导入耗时:$[span/1000]s
exit
总结
受机器cpu以及磁盘性能影响,连接池和线程池参数需要做对应调优才能实现最佳效果。当然有时候最快的也不一定是最好的,还是需要根据后续数据真实场景是如何被使用,设计好表结构甚至一些中间字段。实际情况中最合适,最便捷的的才是最好的。