Background

HBase is genuinely inconvenient to query. Apart from fetching by rowkey, or scanning over a given startkey/stopkey range, there is no more effective way to query it. If you want to filter on a column value, a full table scan is the only option, and anything like group by or order by (unless your rowkey was designed for it) is simply out of reach. Queries that are effortless in a traditional MySQL/Oracle database feel crippled on HBase. You can, of course, solve this by writing a Hadoop job, but writing a job just to run a query is rather expensive. This is where Hive comes in.
There are two ways to integrate Hive and HBase:

- Use HBaseStorageHandler, which operates on HBase directly and may therefore affect the online cluster.
- Periodically export the HBase table to HDFS and access the HDFS files through Hive.

The second method is described in detail below; a sketch of the first is included right after this list for reference.
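For reference, here is a minimal sketch of the first method, assuming a hypothetical HBase table named arctic_task with an info column family holding status and result qualifiers; all of these names are illustrative and not taken from the setup described below.

```bash
# Minimal sketch of method 1: map an HBase table straight into Hive via
# HBaseStorageHandler. The HBase table name 'arctic_task' and the column
# family/qualifiers below are assumptions for illustration only.
hive -e "
CREATE EXTERNAL TABLE task_hbase (rowkey string, status string, result string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:status,info:result')
TBLPROPERTIES ('hbase.table.name' = 'arctic_task');
"
```

Queries against such a table are turned into HBase scans at query time, which is exactly why this method can put extra load on the online HBase cluster.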
Importing into HDFS

- Use datax to export the HBase table to HDFS, for example to /group/wireless-arctic/task/arctic_task.
- Create an external table in Hive, so that no data has to be loaded into Hive:

```
CREATE EXTERNAL TABLE task_history (
  biz_type string,
  cid string,
  content string,
  ctime string,
  gmt_create string,
  hostName string,
  item string,
  mtime string,
  otags string,
  priority string,
  retry string,
  result string,
  srcImages string,
  src_url string,
  status string,
  summary string,
  task_type string,
  title string,
  userId string,
  userNick string,
  utags string,
  writer string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
LOCATION '/group/wireless-arctic/task';
```

  LOCATION is the directory of the exported files on Yunti (云梯, the HDFS cluster).

- Test it:

```
select cid,result from task_history limit 10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Selecting distributed mode: Input Size (= 2578823293 = 2 gigabytes 411 megabytes 366 kilobytes 125 bytes) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728 = 128 megabytes 0 kilobytes 0 bytes)
Starting Job = job_201311281255_6734353, Tracking URL = http://hdpjt2.alibaba-inc.com/jobdetails.jsp?jobid=job_201311281255_6734353
Kill Command = /home/hadoop/hadoop-current/bin/../bin/hadoop job -Dmapred.job.tracker=hdpjt:9001 -kill job_201311281255_6734353
Hadoop job information for Stage-1: number of mappers: 10; number of reducers: 0
2013-12-19 18:53:02,891 Stage-1 map = 0%, reduce = 0%
2013-12-19 18:53:11,017 Stage-1 map = 50%, reduce = 0%
2013-12-19 18:53:12,033 Stage-1 map = 90%, reduce = 0%
2013-12-19 18:53:19,394 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201311281255_6734353
OK
200011928538 success
200011928538 success
200011909281 success
200011928474 success
200011909281 success
200011928474 success
110010569498 failure:userId:1782836127,contentId:110010569498 ImageFlow,call error and ret:1
110010523403 success
110010523921 success
110010524299 success
Time taken: 23.137 seconds = 23 seconds 137 milliseconds
```
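Before moving on, it can be worth confirming that the export really landed where the external table points. A minimal sketch using standard Hadoop/Hive commands, assuming the paths from the example above:

```bash
# List the exported files under the directory the external table points at;
# an empty listing means the datax export did not land where Hive expects it.
hadoop fs -ls /group/wireless-arctic/task

# Show how Hive resolved the table: location, input format, SerDe, columns.
hive -e "DESCRIBE FORMATTED task_history;"
```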
Adding partitions and automating the process

- With the steps above you can already query the data, but you now face a problem: what about data updates?
- A fairly common practice is to run a scheduled job every day that dumps the HBase table to HDFS, i.e. one snapshot per day.
- Each day's snapshot can be stored in a directory named after the date; that way multiple snapshots are kept, which also makes it easier to trace problems.
- How does Hive make use of these daily snapshots? With Hive partitions.
- Partitions were originally meant for splitting up large data sets, but that is not how we use them here; we use partitions to tell the snapshots apart.
- Drop the previous table:

```
drop table task_history;
```

- Create a partitioned table. It is the earlier CREATE statement with one extra line, PARTITIONED BY (dt string):

```
CREATE EXTERNAL TABLE task_history (
  biz_type string,
  cid string,
  content string,
  ctime string,
  gmt_create string,
  hostName string,
  item string,
  mtime string,
  otags string,
  priority string,
  retry string,
  result string,
  srcImages string,
  src_url string,
  status string,
  summary string,
  task_type string,
  title string,
  userId string,
  userNick string,
  utags string,
  writer string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
LOCATION '/group/wireless-arctic/task';
```

- Add a partition:

```
ALTER TABLE task_history ADD PARTITION(dt='20131223') LOCATION '/group/wireless-arctic/task/20131223';
```
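If several daily snapshot directories already exist, the matching partitions can be registered in a loop and then listed. A minimal sketch, assuming date-named directories like the 20131223 one above (the dates are examples):

```bash
# Register a few existing date-named snapshot directories as partitions.
for dt in 20131221 20131222 20131223; do
  hive -e "ALTER TABLE task_history ADD IF NOT EXISTS PARTITION (dt='${dt}') LOCATION '/group/wireless-arctic/task/${dt}';"
done

# Check which partitions Hive now knows about.
hive -e "SHOW PARTITIONS task_history;"
```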
How to automate this:

- Use a tool such as datax or another export tool to dump the HBase table to HDFS, one directory per day named after the date, as described above.
- Mount each day's data directory as a Hive partition:

```
hive -e "ALTER TABLE task_history ADD PARTITION (dt='$(date -d yesterday +%Y%m%d)') LOCATION '/group/wireless-arctic/task/$(date -d yesterday +%Y%m%d)';"
```

- Put the scripts for these two steps into crontab and the snapshot runs automatically every day (a sketch of such a script follows this list).

Finally, to query a particular snapshot, add the partition predicate, e.g. dt='20131223':

```
select * from task_history where dt='20131223' limit 10;
```
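A minimal sketch of what the daily job could look like, assuming a hypothetical wrapper script daily_snapshot.sh; the datax step is left as a placeholder comment because its invocation depends on your own job configuration:

```bash
#!/bin/bash
# daily_snapshot.sh -- hypothetical daily HBase-to-Hive snapshot job.
set -e

DT=$(date -d yesterday +%Y%m%d)
TARGET="/group/wireless-arctic/task/${DT}"

# Step 1: export yesterday's HBase snapshot into a date-named HDFS directory.
# (Run your datax HBase-to-HDFS export job here, writing into ${TARGET}.)

# Step 2: mount the new directory as a Hive partition.
hive -e "ALTER TABLE task_history ADD PARTITION (dt='${DT}') LOCATION '${TARGET}';"
```

Scheduled from crontab, for example with an entry such as `0 4 * * * /path/to/daily_snapshot.sh >> /path/to/daily_snapshot.log 2>&1` (the time and paths are arbitrary), this reproduces the two steps described above.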