I am working through statistics exercises in R, using Beijing environmental monitoring data as the experimental dataset. The data can be downloaded from "https://quotsoft.net/air/", which currently provides readings from December 2013 to the present for 35 monitoring stations in Beijing, covering PM2.5, PM10, SO2, NO2, O3, CO, and AQI. The data format is as follows:
| date | hour | type | 东四 | 天坛 | 官园 | 万寿西宫 | 奥体中心 | 农展馆 |
|------|------|------|------|------|----------|----------|----------|--------|
| 20160101 | 0 | PM2.5 | 250 | 193 | 183 | 195 | 209 | 218 |
| 20160101 | 0 | PM2.5_24h | 70 | 64 | 63 | 68 | 65 | 68 |
| 20160101 | 0 | PM10 | 209 | 209 | 244 | 200 | 268 | |
| 20160101 | 0 | PM10_24h | 110 | 139 | 123 | 148 | 189 | |
| 20160101 | 0 | AQI | 94 | 95 | 87 | 100 | 120 | 92 |
| 20160101 | 1 | PM2.5 | 193 | 183 | 193 | 190 | 211 | 251 |
| 20160101 | 1 | PM2.5_24h | 77 | 72 | 70 | 76 | 73 | 77 |
| 20160101 | 1 | PM10 | 203 | 187 | 233 | 203 | 248 | |
| 20160101 | 1 | PM10_24h | 116 | 143 | 132 | 153 | 195 | |
Each monitored item is sampled once per hour, and the files are standard CSVs, more than 700 per year. For this study I plan to use one full year of data and compute daily means for each monitored item. With that many files, processing them one by one and then merging the results would be a lot of work, so I decided to load them all into a MySQL database and do the processing there.
MySQL offers two ways to import data files:
- the LOAD DATA INFILE SQL statement
- the mysqlimport command-line tool (essentially another interface to LOAD DATA)
I went with the LOAD DATA statement because the files contain empty values, which I could not get mysqlimport to handle (or at least I found no way to do so). If the command-line tool supported this, Java would not have been needed at all; a simple PowerShell script would have done the job.
The LOAD DATA syntax is:
LOAD DATA
[LOW_PRIORITY | CONCURRENT] [LOCAL]
INFILE 'file_name'
[REPLACE | IGNORE]
INTO TABLE tbl_name
[PARTITION (partition_name [, partition_name] ...)]
[CHARACTER SET charset_name]
[{FIELDS | COLUMNS}
[TERMINATED BY 'string']
[[OPTIONALLY] ENCLOSED BY 'char']
[ESCAPED BY 'char']
]
[LINES
[STARTING BY 'string']
[TERMINATED BY 'string']
]
[IGNORE number {LINES | ROWS}]
[(col_name_or_user_var
[, col_name_or_user_var] ...)]
[SET col_name={expr | DEFAULT}
[, col_name={expr | DEFAULT}] ...]
The Java program that imports the data files is as follows:
package loadData;

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class loadData {
    public static void main(String[] args) throws SQLException {
        Connection conn = null;
        Statement st = null;
        // Default data directory; can be overridden by the first argument
        String strDataDirectory = "D:\\Work\\data\\2016";
        if (args.length > 0) {
            strDataDirectory = args[0];
        }
        try {
            // Connect to MySQL
            Class.forName("com.mysql.cj.jdbc.Driver");
            conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/test?useSSL=false&allowPublicKeyRetrieval=true&serverTimezone=UTC",
                    "root", "passw0rd");
            st = conn.createStatement();
            Long rt;

            // Read the list of files in the data directory
            File f = new File(strDataDirectory);
            File[] fList = f.listFiles();
            if (fList == null) {
                System.err.println("Not a directory: " + strDataDirectory);
                return;
            }

            // Load each data file into the database
            for (int i = 0; i < fList.length; i++) {
                // LOAD DATA expects forward slashes in the file path
                String strName = fList[i].getAbsolutePath().replace("\\", "/").trim();
                String strSql = "LOAD DATA infile '";
                strSql = strSql + strName;
                strSql = strSql + "' into table bj_2016 fields terminated by ',' lines terminated by '\\r\\n' ignore 1 lines ";
                strSql = strSql + " (jcrq,jcsj,jcxm,@zd_ds,@zd_tt,@zd_gy,@zd_wsxg,@zd_atzx,@zd_nzg,@zd_wl,@zd_bbxq,@zd_zwy,@zd_fthy,@zd_yg,@zd_gc,@zd_fs,@zd_dx,@zd_yz,@zd_tz,@zd_sy,@zd_cp,@zd_mtg,@zd_pg,@zd_hr,@zd_my,@zd_yq,@zd_dl,@zd_bdl,@zd_mysk,@zd_dgc,@zd_yld,@zd_yf,@zd_llh,@zd_qm,@zd_ydmn,@zd_xzmb,@zd_nsh,@zd_dsh) ";
                strSql = strSql + " set zd_ds=if(@zd_ds='',null,@zd_ds),zd_tt=if(@zd_tt='',null,@zd_tt),zd_gy=if(@zd_gy='',null,@zd_gy),zd_wsxg=if(@zd_wsxg='',null,@zd_wsxg),zd_atzx=if(@zd_atzx='',null,@zd_atzx),zd_nzg=if(@zd_nzg='',null,@zd_nzg),zd_wl=if(@zd_wl='',null,@zd_wl),zd_bbxq=if(@zd_bbxq='',null,@zd_bbxq),";
                strSql = strSql + " zd_zwy=if(@zd_zwy='',null,@zd_zwy),zd_fthy=if(@zd_fthy='',null,@zd_fthy),zd_yg=if(@zd_yg='',null,@zd_yg),zd_gc=if(@zd_gc='',null,@zd_gc),zd_fs=if(@zd_fs='',null,@zd_fs),zd_dx=if(@zd_dx='',null,@zd_dx),zd_yz=if(@zd_yz='',null,@zd_yz),zd_tz=if(@zd_tz='',null,@zd_tz),";
                strSql = strSql + " zd_sy=if(@zd_sy='',null,@zd_sy),zd_cp=if(@zd_cp='',null,@zd_cp),zd_mtg=if(@zd_mtg='',null,@zd_mtg),zd_pg=if(@zd_pg='',null,@zd_pg),zd_hr=if(@zd_hr='',null,@zd_hr),zd_my=if(@zd_my='',null,@zd_my),zd_yq=if(@zd_yq='',null,@zd_yq),zd_dl=if(@zd_dl='',null,@zd_dl),";
                strSql = strSql + " zd_bdl=if(@zd_bdl='',null,@zd_bdl),zd_mysk=if(@zd_mysk='',null,@zd_mysk),zd_dgc=if(@zd_dgc='',null,@zd_dgc),zd_yld=if(@zd_yld='',null,@zd_yld),zd_yf=if(@zd_yf='',null,@zd_yf),zd_llh=if(@zd_llh='',null,@zd_llh),zd_qm=if(@zd_qm='',null,@zd_qm),";
                strSql = strSql + " zd_ydmn=if(@zd_ydmn='',null,@zd_ydmn),zd_xzmb=if(@zd_xzmb='',null,@zd_xzmb),zd_nsh=if(@zd_nsh='',null,@zd_nsh),zd_dsh=if(@zd_dsh='',null,@zd_dsh)";
                System.out.println("Start loading file " + strName);
                rt = st.executeLargeUpdate(strSql);
                System.out.println("Successfully loaded " + rt + " records");
            }
        } catch (ClassNotFoundException e) {
            System.err.println("MySQL JDBC driver not found");
            e.printStackTrace();
        } catch (SQLException se) {
            se.printStackTrace();
        } finally {
            if (st != null) st.close();
            if (conn != null) conn.close();
        }
    }
}
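The long hand-concatenated column list and SET clause above are tedious to write and easy to get wrong. As a sketch (not part of the original program, but reusing the `zd_*` column names and table name from it), the statement could instead be generated from an array of station suffixes:

```java
import java.util.StringJoiner;

public class LoadSqlBuilder {
    // Station column suffixes as used in table bj_2016 (from the program above)
    static final String[] STATIONS = {
        "ds", "tt", "gy", "wsxg", "atzx", "nzg", "wl", "bbxq", "zwy", "fthy",
        "yg", "gc", "fs", "dx", "yz", "tz", "sy", "cp", "mtg", "pg",
        "hr", "my", "yq", "dl", "bdl", "mysk", "dgc", "yld", "yf", "llh",
        "qm", "ydmn", "xzmb", "nsh", "dsh"
    };

    // Build the full LOAD DATA statement for one file
    static String buildLoadSql(String fileName) {
        StringJoiner cols = new StringJoiner(",", "(jcrq,jcsj,jcxm,", ")");
        StringJoiner sets = new StringJoiner(",", " set ", "");
        for (String s : STATIONS) {
            cols.add("@zd_" + s);
            // Map empty strings to NULL, exactly as in the hand-written version
            sets.add("zd_" + s + "=if(@zd_" + s + "='',null,@zd_" + s + ")");
        }
        return "LOAD DATA infile '" + fileName
                + "' into table bj_2016 fields terminated by ','"
                + " lines terminated by '\\r\\n' ignore 1 lines "
                + cols + sets;
    }

    public static void main(String[] args) {
        // Hypothetical file path for illustration
        System.out.println(buildLoadSql("D:/Work/data/2016/beijing_20160101.csv"));
    }
}
```

This keeps the station list in one place, so adding or renaming a station touches a single array entry instead of two very long strings.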
Notes on the main LOAD DATA clauses used here:
- fields terminated by: the column delimiter in the data file; a comma for CSV files;
- lines terminated by: the line delimiter; for a CSV produced on Windows this is '\r\n'. This matters: I initially used '\n', and rows whose last column was empty failed to import;
- ignore 1 lines: skips the header row;
- (jcrq,jcsj,jcxm,@zd_ds,@zd_tt,…): the list of target columns; if a column may contain empty values or other values needing special handling, read it into a user variable instead, i.e. @ plus the column name;
- set zd_ds=if(@zd_ds='',null,@zd_ds),…: preprocesses the raw values via the user variables (the @-prefixed names). This is especially useful when the file contains empty or special values; here each field is checked against the empty string, inserting NULL if it is empty and the actual value otherwise.
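For comparison, the empty-string-to-NULL rule that the SET clause expresses can be mirrored in plain Java, which would be needed if rows were ever inserted through JDBC batches instead of LOAD DATA. A small illustration (the sample line is hypothetical):

```java
public class EmptyToNull {
    // Parse the numeric fields of one CSV row, turning empty fields into null
    static Integer[] parseReadings(String line) {
        // limit -1 keeps trailing empty fields, which matters for rows
        // whose last station value is missing
        String[] fields = line.split(",", -1);
        Integer[] values = new Integer[fields.length];
        for (int i = 0; i < fields.length; i++) {
            String f = fields[i].trim();
            // Same rule as if(@col='',null,@col) in the SET clause
            values[i] = f.isEmpty() ? null : Integer.valueOf(f);
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical PM10 row with the last station value missing
        Integer[] v = parseReadings("209,209,244,200,268,");
        System.out.println(v[5] == null); // prints "true"
    }
}
```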
The target table is bj_2016, with the following structure:
mysql> desc bj_2016;
+---------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| jcrq | varchar(20) | YES | | NULL | |
| jcsj | varchar(2) | YES | | NULL | |
| jcxm | varchar(20) | YES | | NULL | |
| zd_ds | int | YES | | NULL | |
| zd_tt | int | YES | | NULL | |
| zd_gy | int | YES | | NULL | |
| zd_wsxg | int | YES | | NULL | |
| zd_atzx | int | YES | | NULL | |
| zd_nzg | int | YES | | NULL | |
| zd_wl | int | YES | | NULL | |
| zd_bbxq | int | YES | | NULL | |
| zd_zwy | int | YES | | NULL | |
| zd_fthy | int | YES | | NULL | |
| zd_yg | int | YES | | NULL | |
| zd_gc | int | YES | | NULL | |
| zd_fs | int | YES | | NULL | |
| zd_dx | int | YES | | NULL | |
| zd_yz | int | YES | | NULL | |
| zd_tz | int | YES | | NULL | |
| zd_sy | int | YES | | NULL | |
| zd_cp | int | YES | | NULL | |
| zd_mtg | int | YES | | NULL | |
| zd_pg | int | YES | | NULL | |
| zd_hr | int | YES | | NULL | |
| zd_my | int | YES | | NULL | |
| zd_yq | int | YES | | NULL | |
| zd_dl | int | YES | | NULL | |
| zd_bdl | int | YES | | NULL | |
| zd_mysk | int | YES | | NULL | |
| zd_dgc | int | YES | | NULL | |
| zd_yld | int | YES | | NULL | |
| zd_yf | int | YES | | NULL | |
| zd_llh | int | YES | | NULL | |
| zd_qm | int | YES | | NULL | |
| zd_ydmn | int | YES | | NULL | |
| zd_xzmb | int | YES | | NULL | |
| zd_nsh | int | YES | | NULL | |
| zd_dsh | int | YES | | NULL | |
+---------+-------------+------+-----+---------+-------+
The first three columns are the monitoring date, hour, and item; the remaining 35 columns are the individual stations. With this in place, several hundred files import in under a minute, quick and painless.
Because the raw layout puts stations in columns and monitored items in rows, it does not suit the planned analyses of distributions, comparisons, and trends across stations. The data needs to be pivoted so that monitored items become columns and stations become rows, with the readings averaged by day.
This pivot can be done with aggregate functions (such as Max, Min, Sum, Avg) combined with a CASE expression or the IF function. Concrete examples follow.
In these examples the results are inserted directly into a table (bj_2016_by_item).
1. Using a CASE expression
insert into bj_2016_by_item (jcrq,jczd,pm2_5,pm10,so2,no2,o3,co,aqi)
select jcrq,'东四',
avg(case jcxm when 'PM2.5' then zd_ds else null end) as 'pm2.5',
avg(case jcxm when 'PM10' then zd_ds else null end) as 'pm10',
avg(case jcxm when 'SO2' then zd_ds else null end) as 'so2',
avg(case jcxm when 'NO2' then zd_ds else null end) as 'no2',
avg(case jcxm when 'O3' then zd_ds else null end) as 'o3',
avg(case jcxm when 'CO' then zd_ds else null end) as 'co',
avg(case jcxm when 'AQI' then zd_ds else null end) as 'aqi'
from bj_2016
group by jcrq ;
2. Using the IF function
insert into bj_2016_by_item (jcrq,jczd,pm2_5,pm10,so2,no2,o3,co,aqi)
select jcrq,'东四',
avg(if(jcxm='PM2.5',zd_ds,null)) as 'pm2.5',
avg(if(jcxm='PM10',zd_ds,null)) as 'pm10',
avg(if(jcxm='SO2',zd_ds,null)) as 'so2',
avg(if(jcxm='NO2',zd_ds,null)) as 'no2',
avg(if(jcxm='O3',zd_ds,null)) as 'o3',
avg(if(jcxm='CO',zd_ds,null)) as 'co',
avg(if(jcxm='AQI',zd_ds,null)) as 'aqi'
from bj_2016
group by jcrq;
The examples above pivot only a single station. You can write one such statement per station and chain them together with UNION ALL. The final result looks like this:
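Writing 35 nearly identical SELECT blocks by hand is tedious, so here is a sketch (my own addition, shown for three stations only) that generates the combined UNION ALL statement; the station names and `zd_*` columns are taken from the table described earlier:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class PivotSqlBuilder {
    // Station display name -> column name in bj_2016 (first three stations,
    // for brevity; the real map would hold all 35)
    static final Map<String, String> STATIONS = new LinkedHashMap<>();
    static {
        STATIONS.put("东四", "zd_ds");
        STATIONS.put("天坛", "zd_tt");
        STATIONS.put("官园", "zd_gy");
    }

    // Monitored items in the order of the target columns of bj_2016_by_item
    static final String[] ITEMS = {"PM2.5", "PM10", "SO2", "NO2", "O3", "CO", "AQI"};

    static String buildPivotSql() {
        StringJoiner union = new StringJoiner("\nUNION ALL\n");
        for (Map.Entry<String, String> e : STATIONS.entrySet()) {
            // One per-station SELECT, same shape as the IF-based example above
            StringBuilder sel = new StringBuilder("select jcrq,'" + e.getKey() + "'");
            for (String item : ITEMS) {
                sel.append(",avg(if(jcxm='").append(item)
                   .append("',").append(e.getValue()).append(",null))");
            }
            sel.append(" from bj_2016 group by jcrq");
            union.add(sel.toString());
        }
        return "insert into bj_2016_by_item (jcrq,jczd,pm2_5,pm10,so2,no2,o3,co,aqi)\n"
                + union;
    }

    public static void main(String[] args) {
        System.out.println(buildPivotSql());
    }
}
```

In MySQL, the GROUP BY in each branch applies to that branch alone, so the per-station SELECTs can be concatenated with UNION ALL directly, without wrapping parentheses.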
mysql> select * from bj_2016_by_item limit 10;
+----------+------+-------+------+-----+-----+----+----+-----+
| jcrq | jczd | pm2_5 | pm10 | so2 | no2 | o3 | co | aqi |
+----------+------+-------+------+-----+-----+----+----+-----+
| 20160101 | 东四 | 179 | 207 | 35 | 101 | 4 | 3 | 167 |
| 20160102 | 东四 | 278 | 349 | 41 | 118 | 5 | 4 | 277 |
| 20160103 | 东四 | 246 | 530 | 26 | 70 | 11 | 3 | 368 |
| 20160104 | 东四 | 46 | 13 | 5 | 30 | 46 | 1 | 175 |
| 20160105 | 东四 | 36 | 93 | 9 | 39 | 29 | 1 | 35 |
| 20160106 | 东四 | 26 | 38 | 8 | 38 | 31 | 1 | 65 |
| 20160107 | 东四 | 11 | 18 | 3 | 24 | 47 | 0 | 26 |
| 20160108 | 东四 | 35 | 51 | 8 | 37 | 34 | 1 | 25 |
| 20160109 | 东四 | 81 | 117 | 18 | 66 | 9 | 2 | 86 |
| 20160110 | 东四 | 49 | 63 | 19 | 55 | 16 | 2 | 95 |
+----------+------+-------+------+-----+-----+----+----+-----+
10 rows in set (0.09 sec)
With the import and transformation done, the real work of statistical analysis can begin.