高级数据库之MySQL导入海量数据总结

最新推荐文章于 2024-07-11 17:19:55 发布

littleBoy777

最新推荐文章于 2024-07-11 17:19:55 发布

阅读量2.4k

点赞数 2

分类专栏：数据库原理文章标签：数据库

本文链接：https://blog.csdn.net/li123_123_/article/details/96104804

版权

数据库原理专栏收录该内容

4 篇文章 1 订阅

订阅专栏

使用MySQL来存储海量数据

1.背景：

将10个G的数据导入MySQL中，数据来自于北京出租车公司的GPS数据，字段描述：

数据以ASCII文本表示，以逗号为分隔符，以回车换行符(0x0D 0x0A)结尾。数据项及顺序：车辆标识、触发事件、运营状态、GPS时间、GPS经度、GPS纬度,、GPS速度、GPS方向、GPS状态
车辆标识：6个字符
触发事件：0=变空车，1=变载客，2=设防，3=撤防，4=其它
运营状态：0=空车，1=载客，2=驻车，3=停运，4=其它
GPS时间：格式yyyymmddhhnnss，北京时间
GPS经度：格式ddd.ddddddd，以度为单位。
GPS纬度：格式dd.ddddddd，以度为单位。
GPS速度：格式ddd，取值000-255内整数，以公里/小时为单位。
GPS方位：格式ddd，取值000-360内整数，以度为单位。
GPS状态：0=无效，1=有效
结束串:回车符+换行符

任务一：将10G的数据导入到MySQL中，并进行一些数据分析。

导入至少选择两种方式：

load data 命令
GUI操作方式
程序插入方式基于
对比不同插入方式的异同和效率
对出租车轨迹数据分析

任务二：基于Hbase/Phoenix的数据分析

自行安装Hbase/Phoenix环境
•http://phoenix.apache.org/
•https://hbase.apache.org/
实现数据从MySQL数据库导入Hbase数据库
对比SQL语句与Hbase查询语句的区别
使用Hbase查询统计记录数和出租车数
对比使用Phoenix组件和原生Hbase查询的时间性能
基于Phoenix组件实现对Hbase数据库的查询分析
尝试编程实现对Hbase数据库的访问

2.任务一的完成

数据库的创建

根据数据段的描述创建一张数据表：

CREATE DATABASE `taxi` DEFAULT CHARACTER SET utf8;  
 
create table taxi_infor(  
infor_id bigint unsigned not null auto_increment comment '信息的id' primary key,  
taxi_id char(6) not null comment '车辆标示',  
taxi_event tinyint not null DEFAULT '0' comment '0=变空车，1=变载客，2=设防，3=撤防，4=其他',  
taxi_status tinyint not null DEFAULT '0' comment '0=空车，1=载客，2=驻车，3=停运，4=其他',  
taxi_gps_time datetime not null DEFAULT current_timestamp comment 'Gps时间' ,  
taxi_longitude double comment 'Gps经度信息',  
taxi_dimension double comment 'Gps维度信息',  
taxi_speed int not null DEFAULT 0 comment 'Gps速度',  
taxi_orient int not null DEFAULT 0 comment 'Gps方位',  
taxi_gps_status tinyint not null DEFAULT 0 comment 'Gps状态'  
)engine=InnoDB;

首先使用python将数据进行导入：
因为数据有10个G大，无法将数据文件一次全部读入，考虑将文件分块读取，使用MySQLdb模块将数据插入到数据库中。
关键代码：

  f = pd.read_csv(path,iterator=True,header=None,infer_datetime_format=True,parse_dates=[3])  # infer_datetime_format和parse_dates来直接将格式转化为日期格式  
      print isinstance(f,Iterable)  
      print f.get_chunk(2)  
      try:  
          while True:  
              temp = f.get_chunk(100000)  
              data2mysql(temp)  
    
      except Exception, e:  
         print 'stop!'  

sql_insert = "INSERT INTO taxi_infor(taxi_id,taxi_event,taxi_status,taxi_gps_time,taxi_longitude,taxi_dimension,taxi_speed,taxi_orient,taxi_gps_status) " \  
                      "VALUES ('%s','%d','%d','%s','%f','%f','%d','%d','%d')" % (str(u[0]),u[1],u[2],str(u[3]),u[4],u[5],u[6],u[7],u[8])  
         print "check_insert_available:"+ sql_insert  
         cursor.execute(sql_insert)  #执行SQL语句

当引擎是InnoDB时使用select count(*) from taxi_infor;效率很低，消耗很长时间。在MySQL中创建表和数据库时默认是使用的InnoDB存储引擎，InnoDB具有较好的事务支持。
使用load data命令进行数据的导入：
这次将表命名为了taxi_infor_test，考虑到对于本表后期涉及的选择操作比较多，存储引擎选择使用的是MyISAM。
命令：

load data local infile "F:/workFile/2012.csv" into table taxi_infor_test fields terminated by ',' (taxi_id,taxi_event,taxi_status,taxi_gps_time,taxi_longitude,taxi_dimension,taxi_speed,taxi_orient,taxi_gps_status);

插入的结果：
load data导入

效率提升了不是一点点，仅仅用了24分钟就将所有数据都导入了数据库。
对比：不同方法导入异同的分析
不同点：
使用python导入数据时对数据的可操作性比较强，可以对插入的数据进行一些中间操作，格式的转换等等。但是效率比较低。
使用load data命令时效率很高，但是无法对数据进行中间操作，只能将数据安装原来的大小和格式进行导入。
相同点：
都可以将数据完整的插入数据库。

轨迹数据的分析

① 统计记录的总数，统计出租车数量
记录的总数：

1.  select count(*) from taxi_infor_test;

结果：
查询总数

当存储引擎是InnoDB时执行此命令要两分分钟，因为没有存储表的总数信息，要对表进行一次全表的扫描。当使用MyISAM时效率较高。
统计出租车总数：

结果：
出租车数量统计
可以看到，出租车的数量是12522辆，但是第一次查询所花费的时间为28.3秒，第二次仅仅用了0.38秒，数据库像是保存了上次的查询结果。

② 编写SQL语句进行条件查询。
查询某时刻有多少量车是空车：

1.  select count(distinct taxi_id) as taxi_num from taxi_infor_test where taxi_status=0 and taxi_gps_time like "2012-11-07 00:15%";

在这里插入图片描述
查询某时段有多少量车是载客：

1.  select count(distinct taxi_id) as taxi_num from taxi_infor_test where taxi_status=1 and taxi_gps_time between "2012-11-07 00:15:00" and "2012-11-07 00:16:00";

某时段载客数
查询有多少辆车载客路过了某个地区：

1.  select count(distinct taxi_id) from taxi_infor_test where  taxi_status=1  and taxi_longitude like "116.4508%" and taxi_dimension like "39.9594%";

在这里插入图片描述
③ 编程实现数据库连接，获取数据和简单数据分析
使用Python在程序中导入MySQLdb库，就可以建立数据库交互对象，建立数据库连接对象，就可以进行数据库的连接和操作。
关键步骤：

  conn = MySQLdb.Connect(host='127.0.0.1',  
                             port=3306,  
                             user='root',  
                             passwd='',  
                             db='taxi',  
                            charset='utf8')  # 建立一个数据库连接对象  
     cursor = conn.cursor()  # 建立一个数据库交互对象cursor  
     queryInforCount = "select count(*) from taxi_infor_test"  
     queryTaxiCount = "select count(distinct taxi_id) as taxi_num from taxi_infor_test"  
     cursor.execute(queryInforCount)  
     print cursor.fetchone()  
     cursor.execute(queryTaxiCount)  
     print cursor.fetchone()  
     cursor.close()  
     conn.close()