阿里云天池龙珠计划SQL训练营学习:https://tianchi.aliyun.com/specials/promotion/aicampsql
1 数据集导入的优化
因为阿里天池给的数据集脚本中使用了大量的INSERT INTO语句,所以在第一次导入1-7market data.sql时耗费了大量时间。
目前采用了两种优化方式:
①一条SQL语句插入多条数据
INSERT INTO `market data` (`SECURITY_ID`, `TICKER_SYMBOL`, `END_DATE`, `CLOSE_PRICE`, `TURNOVER_VOL`, `TURNOVER_VALUE`, `MARKET_VALUE`, `TYPE_ID`, `TYPE_NAME_EN`, `TYPE_NAME_CN`)
VALUES ('2', '000001', '2018-05-31', '10.18', '1826717783', '19652776154.27', '174794788052', '010303210101', 'Bank', '银行'),
('2', '000001', '2018-04-27', '10.85', '2349599053', '26556911800.599998', '186298963690', '010303210101', 'Bank', '银行')
...;
不过这种方法容易遇到[Err] 2006 - MySQL server has gone away,这是因为执行SQL语句太长,超过了max_allowed_packet的大小,这可以使用如下命令加大max_allowed_packet的值。
mysql> show global variables like 'max_allowed_packet';
+--------------------+---------+
| Variable_name | Value |
+--------------------+---------+
| max_allowed_packet | 4194304 |
+--------------------+---------+
1 row in set
mysql> set global max_allowed_packet = 100*1024*1024;
Query OK, 0 rows affected
mysql> show global variables like 'max_allowed_packet';
+--------------------+-----------+
| Variable_name | Value |
+--------------------+-----------+
| max_allowed_packet | 104857600 |
+--------------------+-----------+
1 row in set
修改max_allowed_packet即可顺利插入记录 333138行,耗时 29.827s。
不过有人指出需要修改配置文件,因为如果不修改配置文件,当重启数据库时,max_allowed_packet会恢复原始值。
修改配置文件--Windows
我们刚才的修改是针对运行环境,但如果我们的配置文件不变,则重启数据库后,这个值又恢复为原始值了,对于 Windows 下的 MySQL,我们需要修改其配置文件 my.ini , 这个文件在 MySQL 的安装根目录下。
注意: MySQL 配置文件分为几个段 (Section), 包括 [mysql], [mysqld], [mysqldump] 等, 我们需要在 [mysqld] 段下添加配置项 :
max_allowed_packet = 100M
修改配置文件--Linux
在 Linux 中,MySQL 对应的配置文件是 my.cnf , 我们在Linux终端输入如下命令 :
mysql --help | grep my.cnf
可以查看 MySQL 所使用的 my.cnf 列表(优先级列表,先找到的配置文件生效),MySQL 会逐个目录寻找这个文件,直到找到为止!
我们逐个目录找到我们系统 MySQL 所使用的 my.cnf 配置文件,找到对应的配置项,修改为 100M 即可。
②使用事务执行百万级的INSERT INTO语句
START TRANSACTION;
INSERT INTO `market data` (`SECURITY_ID`, `TICKER_SYMBOL`, `END_DATE`, `CLOSE_PRICE`, `TURNOVER_VOL`, `TURNOVER_VALUE`, `MARKET_VALUE`, `TYPE_ID`, `TYPE_NAME_EN`, `TYPE_NAME_CN`) VALUES ('2', '000001', '2018-05-31', '10.18', '1826717783', '19652776154.27', '174794788052', '010303210101', 'Bank', '银行');
INSERT INTO `market data` (`SECURITY_ID`, `TICKER_SYMBOL`, `END_DATE`, `CLOSE_PRICE`, `TURNOVER_VOL`, `TURNOVER_VALUE`, `MARKET_VALUE`, `TYPE_ID`, `TYPE_NAME_EN`, `TYPE_NAME_CN`) VALUES ('2', '000001', '2018-04-27', '10.85', '2349599053', '26556911800.599998', '186298963690', '010303210101', 'Bank', '银行');
INSERT INTO `market data` (`SECURITY_ID`, `TICKER_SYMBOL`, `END_DATE`, `CLOSE_PRICE`, `TURNOVER_VOL`, `TURNOVER_VALUE`, `MARKET_VALUE`, `TYPE_ID`, `TYPE_NAME_EN`, `TYPE_NAME_CN`) VALUES ('2', '000001', '2018-03-30', '10.9', '2312996915', '26925601133.240002', '187157484260', '010303210101', 'Bank', '银行');
...;
COMMIT;
虽然使用事务提高了INSERT INTO效率,但效果不高,在执行记录 333138行,耗时已经超过30min。
③使用存储过程处理INSERT INTO
待更新
④使用多线程处理INSERT INTO
待更新
⑤使用source 命令执行SQL脚本
因为遇到一个2G的5-8-10ccf_online_stage1_train.sql脚本文件,使用一般文件无法打开,所以尝试使用source命令,但执行效率依旧很慢,每条INSERT INTO语句执行需要0.05s不等的时间。
> CD wamp\bin\mysql\mysql5.6.17\bin
> mysql -u root -p /*然后回车 会让你输入mysql的密码,如果mysql有密码就输入密码,如果没有密码就不用输入直接回车。*/
mysql> show databases;
mysql> use database;
mysql> source 路径\5-8-10ccf_online_stage1_train.sql
⑥将脚本文件转换为csv文件后导入表单
所有使用INSERT INTO语句插入的耗时太久,对于2G的5-8-10ccf_online_stage1_train.sql转换为csv文件后大约500M,导入My SQL耗时3小时。
2 练习题
练习题1:多表联查
因为涉及到三表联查,为了提高查询效率,首先对涉及到了三表以TICKER_SYMBOL属性建立普通索引,最后执行耗时15.046s。
ALTER TABLE `Market Data` ADD INDEX TICKER_SYMBOL (TICKER_SYMBOL);
ALTER TABLE `income statement` ADD INDEX TICKER_SYMBOL (TICKER_SYMBOL);
ALTER TABLE `company operating` ADD INDEX TICKER_SYMBOL (TICKER_SYMBOL);
SELECT t1.TICKER_SYMBOL, t1.END_DATE, t1.T_REVENUE, t1.T_COGS, t1.N_INCOME, t2.TICKER_SYMBOL, t2.END_DATE, t2.CLOSE_PRICE, t3.TICKER_SYMBOL, t3.INDIC_NAME_EN,t3.END_DATE, t3.VALUE
FROM `Market Data` AS t2
INNER JOIN `Income Statement` AS t1 ON t2.TICKER_SYMBOL = t1.TICKER_SYMBOL
INNER JOIN `Company Operating` AS t3 ON t3.TICKER_SYMBOL = t1.TICKER_SYMBOL
WHERE t2.TICKER_SYMBOL = '600383' OR t2.TICKER_SYMBOL = '600048';
练习题2:排序题
因为My SQL版本较低,没有RANK()窗口函数,所以使用用户定义变量,进行排名。
SELECT PH
,DENSE_RANK() OVER (ORDER BY `citric acid`) AS dense_ranking
FROM `winequality-red`
WHERE PH = '3.03';
set @rank:=0;
set @temp_score:=NULL;
SELECT PH
, `citric acid`
, CASE
WHEN @temp_score = `citric acid` THEN @rank
WHEN @temp_score := `citric acid` THEN @rank :=@rank + 1
END `rank`
FROM `winequality-red` w
WHERE PH = '3.03'
ORDER BY `citric acid` ASC;
练习题3:最大值
/*优惠券数量最多的商家*/
SELECT Merchant_id , count(Coupon_id) AS Number
FROM ccf_offline_stage1_test_revised
WHERE Date_received BETWEEN '2016-07-01' AND '2016-07-31'
GROUP BY Merchant_id
ORDER BY Number DESC
LIMIT 0,1;
/*优惠券金额最多的商家*/
SELECT Merchant_id , sum(SUBSTRING_INDEX(`discount_rate`,':',-1)) as sum_money
FROM ccf_offline_stage1_test_revised
WHERE Date_received BETWEEN '2016-07-01' AND '2016-07-31'
GROUP BY Merchant_id
ORDER BY sum_money DESC
LIMIT 0,1;
但是这种代码不能求出并列最大值,待更新。
练习题4:计算
/*全社会用电量:第一产业:当月值的记录值为name_cn=Total Electricity Consumption: Primary Industry*/
SELECT *
FROM `macro industry`
WHERE name_cn LIKE '%Elec%';
SELECT MONTH(PERIOD_DATE) AS "用电最高峰月"
FROM `macro industry`
WHERE name_cn = "Total Electricity Consumption: Primary Industry"
AND YEAR(PERIOD_DATE) = 2015
ORDER BY DATA_VALUE DESC
LIMIT 0,1;
SELECT ROUND(((T1.now_value - T2.first_value)/T2.first_value)*100,2) AS "用电量增长百分比"
FROM (SELECT indic_id, PERIOD_DATE, DATA_VALUE AS now_value
FROM `macro industry`
WHERE name_cn = "Total Electricity Consumption: Primary Industry"
AND (YEAR(PERIOD_DATE) = 2015 AND MONTH(PERIOD_DATE) = 8)) AS T1
LEFT JOIN (SELECT indic_id, PERIOD_DATE, DATA_VALUE AS first_value
FROM `macro industry`
WHERE name_cn = "Total Electricity Consumption: Primary Industry"
AND (YEAR(PERIOD_DATE) = 2014 AND MONTH(PERIOD_DATE) = 8)) AS T2
ON T1.indic_id = T2.indic_id
练习题5:购物券弃用率
/*练习题5:方法1-19.747s*/
SELECT T1.NUM_NO_USE/T2.NUM_USE AS "弃用率"
FROM (SELECT COUNT(Coupon_id) AS NUM_NO_USE
FROM ccf_online_stage1_train
WHERE Coupon_id IS NOT NULL AND Date IS NULL AND Date_received between '2016-06-01' and '2016-06-30') AS T1
LEFT JOIN (SELECT COUNT(Coupon_id) AS NUM_USE
FROM ccf_online_stage1_train
WHERE Coupon_id IS NOT NULL AND Date_received between '2016-06-01' and '2016-06-30') AS T2
ON TRUE;
/*练习题5:方法2-10.238s*/
select
sum(case when Coupon_id is not null then 1 else 0 end) as '领取的优惠券数量',
sum(case when Date is null and Coupon_id is not null then 1 else 0 end) as '弃用的优惠券数量',
sum(case when Date is null and Coupon_id is not null then 1 else 0 end)/sum(case when Coupon_id is not null then 1 else 0 end) as '弃用率'
FROM `ccf_online_stage1_train`
where Date_received
between '2016-06-01' and '2016-06-30';
练习题6--英式排名,并列第1
SELECT pH, `residual sugar`, RANK() OVER (ORDER BY `residual sugar`) AS "排名"
FROM `winequality-white`
WHERE pH= 3.63;
练习题7--每组选前三名
/*练习题7*/
-- 2018年底市值最大的三个行业
SELECT TYPE_NAME_CN as "行业名称",
SUM(MARKET_VALUE) as "行业市值",
END_DATE as "时间"
FROM `market data`
WHERE YEAR(END_DATE)= 2018 AND MONTH(END_DATE)=5
GROUP BY TYPE_NAME_CN
ORDER BY SUM(MARKET_VALUE) DESC
LIMIT 3;
/*每个行业三个公司*/
SELECT *
FROM (
SELECT TYPE_NAME_CN AS "行业名称",
TICKER_SYMBOL AS "证券代码",
MARKET_VALUE AS "公司市值",
END_DATE AS "时间"
FROM `market data`
WHERE TYPE_NAME_CN = '银行' AND YEAR(END_DATE)= 2018 AND MONTH(END_DATE)=5
GROUP BY TICKER_SYMBOL
ORDER BY CAST(MARKET_VALUE AS SIGNED INTEGER) DESC
LIMIT 0,3) AS T1
UNION
SELECT *
FROM (
SELECT TYPE_NAME_CN AS "行业名称",
TICKER_SYMBOL AS "证券代码",
MARKET_VALUE AS "公司市值",
END_DATE AS "时间"
FROM `market data`
WHERE TYPE_NAME_CN = '非银金融' AND YEAR(END_DATE)= 2018 AND MONTH(END_DATE)=5
GROUP BY TICKER_SYMBOL
ORDER BY CAST(MARKET_VALUE AS SIGNED INTEGER) DESC
LIMIT 0,3) AS T2
UNION
SELECT *
FROM (
SELECT TYPE_NAME_CN AS "行业名称",
TICKER_SYMBOL AS "证券代码",
MARKET_VALUE AS "公司市值",
END_DATE AS "时间"
FROM `market data`
WHERE TYPE_NAME_CN = '医药生物' AND YEAR(END_DATE)= 2018 AND MONTH(END_DATE)=5
GROUP BY TICKER_SYMBOL
ORDER BY CAST(MARKET_VALUE AS SIGNED INTEGER) DESC
LIMIT 0,3) AS T3
练习题8
/*练习题8*/
ALTER TABLE `ccf_offline_stage1_train` ADD INDEX User_id (User_id);
ALTER TABLE `ccf_online_stage1_train` ADD INDEX User_id (User_id);
SELECT t3.User_id, COUNT(t3.Date) AS "使用次数"
FROM (SELECT t1.User_id, t1.Date
FROM ccf_offline_stage1_train as t1
WHERE SUBSTRING(t1.Date FROM 1 FOR 6) = 201606 AND( Coupon_id != 'NULL' AND Date != 'NULL') /*因数据导入方法不同,这里空值为字符串NULL*/
UNION
SELECT t2.User_id, t2.Date
FROM ccf_online_stage1_train as t2
WHERE YEAR(t2.Date) = 2016 AND MONTH(t2.Date) = 6 AND( Coupon_id is NOT NULL AND Date is NOT NULL)) as t3
GROUP BY User_id
ORDER BY COUNT(t3.Date)
LIMIT 0,1;
练习题9
SELECT *
FROM (SELECT QUARTER(END_DATE) AS "季度",YEAR(END_DATE) AS "年份", TICKER_SYMBOL AS "股票代码",SUM(`VALUE`) AS "吞吐量"
FROM `company operating`
WHERE INDIC_NAME_EN = "Baiyun Airport:Aircraft take-off and landing times"
GROUP BY TICKER_SYMBOL,YEAR(END_DATE),QUARTER(END_DATE)
ORDER BY SUM(`VALUE`)
LIMIT 0,1) AS T1
LEFT JOIN
(SELECT TICKER_SYMBOL,
YEAR(END_DATE) AS "年份",
QUARTER(END_DATE) AS "季度",
SUM(N_INCOME) AS "收入"
FROM `income statement`
GROUP BY TICKER_SYMBOL,YEAR(END_DATE),QUARTER(END_DATE) ) t2
ON T1.TICKER_SYMBOL = t2.TICKER_SYMBOL AND T1.Year = t2.Year
练习题10
/*练习题10*/
SELECT Merchant_id ,SUM(T1.number)AS number
FROM (SELECT Merchant_id ,SUM(SUBSTRING(Discount_rate,':',-1))AS number
FROM ccf_online_stage1_train
WHERE YEAR(Date) =2016 AND MONTH(Date) = 6 AND( Coupon_id is not NULL AND Date is not NULL)
GROUP BY Merchant_id
UNION
SELECT Merchant_id ,SUM(SUBSTRING(Discount_rate,':',-1))AS number
FROM ccf_offline_stage1_train
WHERE SUBSTRING(Date FROM 1 FOR 6)="201606" AND( Coupon_id != 'NULL' AND Date != 'NULL')
GROUP BY Merchant_id) AS T1
GROUP BY Merchant_id
ORDER BY SUM(T1.number) DESC
LIMIT 0,3;