最简单最快速csv超大文件入库并统计Top5


前言

场景:现有一个内含1亿行数据的超大csv格式文件(一般大于5GB),需要导入mysql数据库,最终需要从库中快速统计出销售额Top5的5天日期以及对应销售总额,方案简单高效最好。
csv文件数据结构如下:

销售日期货物编号生产序号供应商编号销售数量成本价(元)销售价(元)
20211117A-7952F-1797P-1911683204.11273.51

方案介绍

目前mysql公认最快导入方式应该是自带的命令:load data infile方式。其余方案还有如excutemany,sql拼接存储过程,sql批量提交等,亲测均不如load data infile方案。当然所有方案不限语言,本次采用python实现,代码量非常少。另在其他文章中还听闻采用R语言实现,可以将机器性能发挥到极致,实现极速导入,暂时没有实践,有兴趣的小伙伴可以试试,欢迎反馈。

方案优化

即便采用load data infile方式,如果无脑直接load整个5G文件,也不见得很快。受单个连接写入性能瓶颈影响,耗时远超10min,感觉不能接受。单个写入时可以发现cpu以及磁盘写入都还有很大利用空间,所以优化为分库20个表(当然可以更多,需要结合机器硬件指标以及后续数据怎么使用折中考虑),采用线程池初始化对应连接,也就是20个并发写入,导入时间可缩短到5min内。再者得益于拆分了20个表,统计Top5也变得简单。单个表也就500w数据,基本不用怎么设计,性能都不会太差,总统计耗时约10s左右。


执行步骤

一、csv文件拆分

需要使用 rb 模式读取文件,性能最快

def Main(argv):
    print('拆分{}个文件'.format(argv[0]))
    # 写清楚存储路径,否则该路径默认为程序运行路径
    if not os.path.exists(argv[2]):
        os.makedirs(argv[2])
    # 计数器
    flag = 0
    # 文件名
    name = 1
    # 存放数据
    dataList = []

    # print("拆分文件开始时间")
    # print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))

    with open(argv[1], 'rb') as f_source:
        print('跳过首行;' + f_source.readline().decode())
        for line in f_source:
            flag += 1
            dataList.append(line)
            if flag == 100000000 / int(argv[0]):  # 切割为1亿行 / 20一份的文件
                with open(argv[2] + "/data_split_" + str(name) + ".csv", 'wb+') as f_target:
                    for data in dataList:
                        f_target.write(data)
                name += 1
                flag = 0
                dataList = []

    print("拆分文件完成时间")
    print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))

二、初始化连接池和线程池

pool = PooledDB(MySQLdb, 20, host=host, user=user, passwd=passwd, db=db,
                port=3306)

class myThread(threading.Thread):
    def __init__(self, threadID, name, fileName, tableName):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.fileName = fileName
        self.tableName = tableName

    def run(self):
        load_csv(self.fileName, self.tableName)

class splitThread(threading.Thread):
    def __init__(self, threadID, name, argv):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.argv = argv

    def run(self):
        mySplit.Main(self.argv)

三、load每个csv文件

如果数据库字段与csv文件列不是一一对应关系,需要在load时做简单转换,例如下面的 “per_profit” 列,可使用set关键字设值,具体可以参考mysql语法教程

def load_csv(csv_file_path, table_name, database=db):
    conn = pool.connection()
    cur = conn.cursor()

    data_sql = "LOAD DATA LOCAL INFILE '%s' INTO TABLE %s CHARACTER SET utf8 FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'  (sale_date, goods_no, producer_no, agent_no,sale_amount,per_cost_price,per_sales_price) set per_profit = per_sales_price - per_cost_price;" % (
        csv_file_path, table_name)
    # 使用数据库
    cur.execute('use %s' % database)
    cur.execute(' SET GLOBAL local_infile = 1 ;')
    # 执行data_sql,导入数据
    cur.execute(data_sql)
    conn.commit()
    # 关闭连接
    cur.close()
    conn.close()
    print('导入【' + csv_file_path + '】完成...', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))

四、统计Top5

针对每个表,从中聚合获取每一天的总销售额

def queryAll(num):
    # 获取数据库连接
    conn = pool.connection()
    cursor = conn.cursor()
    # 使用数据库
    cursor.execute('use %s' % db)
    sql = "select sale_date as date,sum(CAST(sale_amount AS DECIMAL(18,2))*CAST(per_profit AS DECIMAL(18,2))) as amount from sale_info_" + str(num) + " group by sale_date"
    cursor.execute(sql)
    d1 = dict(list(cursor.fetchall()))
    dataArr.append(d1)
    cursor.close()
    conn.close()

待所有查询完后,合并20个结果集,按天累加并排序得到Top5
----Tip:此处应用到python字典类型的按key累加

    result = {}
    for dict_ in dataArr:
        x, y = Counter(result), Counter(dict_)
        result = dict(x + y)
    list_ = sorted(result.items(), key=lambda kv: (kv[1], kv[0]))
    print('top5为:')
    for i in range(5):
        list__ = list(list_[list_.__len__() - 1 - i])
        print ("%s,%s" % (list__[0],list__[1].quantize(Decimal('0.00'))))

完整代码

README.md

#------安装python3环境,如果存在可跳过

wget https://www.python.org/ftp/python/3.8.0/Python-3.8.0.tgz

tar zxf Python-3.8.0.tgz

yum update -y

yum groupinstall -y 'Development Tools'

yum install -y gcc openssl-devel bzip2-devel libffi-devel

./configure prefix=/usr/local/python3 --enable-optimizations

make && make install
#------安装python3环境结束

#如果原设备存在python2.7版本,可使用如下命令临时添加环境变量

PATH=/usr/local/python3/bin:$PATH

export PATH

#使用pip3安装必要连接驱动等依赖

pip3 install mysqlclient

pip3 install DBUtils==1.3

#执行步骤

0、执行ddl.sql中建表脚本并初始化py文件中数据库连接信息

修改MyThread.py中数据库连接信息

修改parallelQuery.py中数据库连接信息

1、导入csv到数据库

shell命令: python3.8 myThread.py 20 source_csv_data_file target_temp_dir

参数说明:source_csv_data_file csv数据包文件地址;target_temp_dir 任意可写文件夹,用于缓存拆分后csv文件

可参考执行脚本:sh startLoad.sh

2、查询结果

shell命令: python3.8 parallelQuery.py

可参考执行脚本:sh startQuery.sh

ddl.sql

create database if not exists tournament;

use tournament;

create table if not exists `sale_info_1`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_2`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_3`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_4`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_5`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_6`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_7`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_8`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_9`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_10`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_11`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_12`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_13`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_14`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_15`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_16`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_17`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_18`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_19`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

create table if not exists `sale_info_20`
(
    `ID` bigint(20) not null auto_increment COMMENT '自增ID',
    `sale_date` date not null COMMENT '销售日期',
    `goods_no` varchar(11) not null COMMENT '货物编号',
    `producer_no` varchar(8) not null COMMENT '生产厂家编号',
    `agent_no` varchar(8) not null COMMENT '代理商编号',
    `sale_amount` int(4) not null COMMENT '售卖件数',
    `per_cost_price` varchar(10) not null COMMENT '单件成本价(元)',
    `per_sales_price` varchar(10) not null COMMENT '单件销售价(元)',
    `per_profit` varchar(10) not null COMMENT '单件利润(元)',
    primary key (`ID`)
	using BTREE
) ENGINE = InnoDB;

mySplit.py

# -*- coding: utf-8 -*-
"""
Created on Mon Mar 30 10:24:03 2020

"""
import os
import sys
from datetime import datetime


def Main(argv):
    print('拆分{}个文件'.format(argv[0]))
    # 写清楚存储路径,否则该路径默认为程序运行路径
    if not os.path.exists(argv[2]):
        os.makedirs(argv[2])
    # 计数器
    flag = 0
    # 文件名
    name = 1
    # 存放数据
    dataList = []

    # print("拆分文件开始时间")
    # print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))

    with open(argv[1], 'rb') as f_source:
        print('跳过首行;' + f_source.readline().decode())
        for line in f_source:
            flag += 1
            dataList.append(line)
            if flag == 100000000 / int(argv[0]):  # 切割为1亿行 / 80一份的文件
                with open(argv[2] + "/data_split_" + str(name) + ".csv", 'wb+') as f_target:
                    for data in dataList:
                        f_target.write(data)
                name += 1
                flag = 0
                dataList = []

    print("拆分文件完成时间")
    print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))


if __name__ == "__main__":
    Main(sys.argv[1:])

myThread.py

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import os
import sys
import threading
import time
import MySQLdb
from DBUtils.PooledDB import PooledDB

import mySplit

host = 'localhost'
user = 'root'
passwd = '123456'
db = 'tournament'

pool = PooledDB(MySQLdb, 20, host=host, user=user, passwd=passwd, db=db,
                port=3306)


class myThread(threading.Thread):
    def __init__(self, threadID, name, fileName, tableName):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.fileName = fileName
        self.tableName = tableName

    def run(self):
        load_csv(self.fileName, self.tableName)


class splitThread(threading.Thread):
    def __init__(self, threadID, name, argv):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.argv = argv

    def run(self):
        mySplit.Main(self.argv)


# load_csv函数,参数分别为csv文件路径,表名称,数据库名称
def load_csv(csv_file_path, table_name, database=db):
    conn = pool.connection()
    cur = conn.cursor()

    data_sql = "LOAD DATA LOCAL INFILE '%s' INTO TABLE %s CHARACTER SET utf8 FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'  (sale_date, goods_no, producer_no, agent_no,sale_amount,per_cost_price,per_sales_price) set per_profit = per_sales_price - per_cost_price;" % (
        csv_file_path, table_name)
    # 使用数据库
    cur.execute('use %s' % database)
    cur.execute(' SET GLOBAL local_infile = 1 ;')
    # 执行data_sql,导入数据
    cur.execute(data_sql)
    conn.commit()
    # 关闭连接
    cur.close()
    conn.close()
    print('导入【' + csv_file_path + '】完成...', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))


def Main(argv):
    # 开启拆分csv线程
    thread = splitThread(1, "Thread-split", argv)
    thread.start()
    # 开启新线程
    print('开始监听导入数据...', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    threadList = []
    n = 0
    # 导入多少个文件
    target = int(argv[0])
    while (n < target):
        fileName = argv[2] + '/data_split_' + str(n + 1) + '.csv'
        if (os.path.exists(fileName)):
            threadList.append(myThread(n + 1, "Thread-" + str(n + 1), fileName, 'sale_info_' + str((n % 20 + 1))))
            threadList[n].start()
            n += 1
        else:
            time.sleep(1)
    for i in range(target):
        if (threadList[i].is_alive()):
            threadList[i].join()

    # print("退出主线程")
    print('导入结束...', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    pool.close()


if __name__ == '__main__':
    Main(sys.argv[1:])

parallelQuery.py

# -*- coding: utf-8 -*-
"""
Created on Mon Mar 30 10:24:03 2020

"""
from concurrent.futures._base import wait
from concurrent.futures.thread import ThreadPoolExecutor
from datetime import datetime
from collections import Counter
from decimal import Decimal

import MySQLdb
from DBUtils.PooledDB import PooledDB

host = 'localhost'
user = 'root'
passwd = '123456'
db = 'tournament'
pool = PooledDB(MySQLdb, 20, host=host, user=user, passwd=passwd, db=db,
                port=3306)


def startBatchQuery():
    executor = ThreadPoolExecutor(20)
    future_list = [executor.submit(takePartIn, i + 1) for i in range(20)]
    # 使用wait方法等待所有任务完成
    wait(future_list)
    # for f in future_list:
    #     r = f.result()
    executor.shutdown()


def takePartIn(num):
    queryAll(num)
    return num


dataArr = []


def queryAll(num):
    # 获取数据库连接
    conn = pool.connection()
    cursor = conn.cursor()
    # 使用数据库
    cursor.execute('use %s' % db)
    sql = "select sale_date as date,sum(CAST(sale_amount AS DECIMAL(18,2))*CAST(per_profit AS DECIMAL(18,2))) as amount from sale_info_" + str(num) + " group by sale_date"
    cursor.execute(sql)
    d1 = dict(list(cursor.fetchall()))
    dataArr.append(d1)
    cursor.close()
    conn.close()


def Main():
    print("查询开始时间")
    print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    startBatchQuery()
    result = {}
    for dict_ in dataArr:
        x, y = Counter(result), Counter(dict_)
        result = dict(x + y)
    list_ = sorted(result.items(), key=lambda kv: (kv[1], kv[0]))
    print('top5为:')
    for i in range(5):
        list__ = list(list_[list_.__len__() - 1 - i])
        print ("%s,%s" % (list__[0],list__[1].quantize(Decimal('0.00'))))
    print("查询结束时间")
    print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))


if __name__ == "__main__":
    Main()

startLoad.sh

#!/bin/bash

current=`date "+%Y-%m-%d %H:%M:%S"`
timeStamp=`date -d "$current" +%s`
currentTimeStamp=$((timeStamp*1000+10#`date "+%N"`/1000000)) #将current转换为时间戳,精确到毫秒
echo 开始时间:$current

#PATH=/usr/local/python3/bin:$PATH
#export PATH
python3.8 myThread.py 20 /home/data.csv /home/tempDir

current1=`date "+%Y-%m-%d %H:%M:%S"`
timeStamp1=`date -d "$current1" +%s`
currentTimeStamp1=$((timeStamp1*1000+10#`date "+%N"`/1000000)) #将current1转换为时间戳,精确到毫秒
echo 结束时间:$current1

span=$[currentTimeStamp1-currentTimeStamp]
echo 导入耗时:$[span/1000]s

exit

startQuery.sh

#!/bin/bash

current=`date "+%Y-%m-%d %H:%M:%S"`
timeStamp=`date -d "$current" +%s`
currentTimeStamp=$((timeStamp*1000+10#`date "+%N"`/1000000)) #将current转换为时间戳,精确到毫秒
echo 开始时间:$current

#PATH=/usr/local/python3/bin:$PATH
#export PATH
python3.8 parallelQuery.py

current1=`date "+%Y-%m-%d %H:%M:%S"`
timeStamp1=`date -d "$current1" +%s`
currentTimeStamp1=$((timeStamp1*1000+10#`date "+%N"`/1000000)) #将current1转换为时间戳,精确到毫秒
echo 结束时间:$current1

span=$[currentTimeStamp1-currentTimeStamp]
echo 导入耗时:$[span/1000]s

exit

总结

受机器cpu以及磁盘性能影响,连接池和线程池参数需要做对应调优才能实现最佳效果。当然有时候最快的也不一定是最好的,还是需要根据后续数据真实场景是如何被使用,设计好表结构甚至一些中间字段。实际情况中最合适,最便捷的的才是最好的。

致谢❀❀❀

文中实现方案有参考很多其他优秀博客,此处就不一一列举致谢了,统一对原作者表示感谢和真诚的respect。也是希望集各家之所成,共同搭建美好源区,内容如有不妥,欢迎留言斧正。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值