最近在做一个交通流的数据分析,需求是对于海量的城市交通数据,需要使用MapReduce清洗后导入到HBase中存储,然后使用Hive外部表关联HBase,对HBase中数据进行查询、统计分析,将分析结果保存在一张Hive表中,最后使用Sqoop将该表中数据导入到MySQL中。整个流程大概如下:
一、HBase数据库表
下面我主要介绍Hive关联HBase表——Sqoop导出Hive表到MySQL这些流程,原始数据集收集、MapReduce清洗及WEB界面展示此处不介绍。
创建一个名叫“transtable”的HBase表,列族是:“jtxx”。HBase中的部分数据如下:hbase(main):003:0> list TABLE transtable 1 row(s) in 0.0250 seconds => ["transtable"] hbase(main):004:0> describe 'transtable' DESCRIPTION ENABLED 'transtable', {NAME => 'jtxx', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VER true SIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLO CKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} 1 row(s) in 0.0480 seconds
hbase(main):008:0> get 'transtable','32108417000000013220140317000701' COLUMN CELL jtxx:cdbh timestamp=1429597736296, value=03 jtxx:clbj timestamp=1429597736296, value=0 jtxx:cllb timestamp=1429597736296, value=0 jtxx:cllx timestamp=1429597736296, value=3 jtxx:clsd timestamp=1429597736296, value=127.00 jtxx:hphm timestamp=1429597736296, value=\xE8\x8B\x8FKYV152 jtxx:wflx timestamp=1429597736296, value=0 jtxx:xsfx timestamp=1429597736296, value=03 8 row(s) in 0.1550 seconds
二、创建Hive外部表关联HBase表
create external table transJtxx_Hbase
(
clxxbh string,
xsfx string,
cdbh string,
hphm string,
clsd string,
cllx string,
clbj string,
cllb string,
wflx string
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" =":key,jtxx:xsfx,jtxx:cdbh,jtxx:hphm,jtxx:clsd,jtxx:cllx,jtxx:clbj,jtxx:cllb,jtxx:wflx") TBLPROPERTIES ("hbase.table.name" = "transtable");
hbase.columns.mapping要对应hbase数据库transtable表中列族下的列限定符。此处一定要是外部表
查看是否关联成功,如何执行一条语句能够查询出HBase表中数据,则关联成功。
hive> select * from transjtxx_hbase where clxxbh like '321084170000000132%';
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1428394594787_0007, Tracking URL = http://secondmgt:8088/proxy/application_1428394594787_0007/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1428394594787_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-04-21 17:27:18,136 Stage-1 map = 0%, reduce = 0%
2015-04-21 17:27:35,029 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.31 sec
MapRed