电商项目实战Hive实现
一、 外部表和内部表在Hive中的区别
- MANAGED_TABLE:内部表
- 删除表:HDFS和MYSQL中数据都被删除
- EXTERNAL_TABLE
- 删除表:HDFS上的数据不被删除,MYSQL中数据被删除
二、track_info分区表的创建
create external table trackinfo(
ip string,
country string,
province string,
city string,
url string,
page string
) partitioned by (day string)
row format delimited fields terminated by '\t'
location '/project/emp/trackinfo/';
三、将ETL的数据加载到Hive
-
数据格式
106.3.114.42 中国 北京市 null http://www.yihaodian.com/2/?tracker_u=10325451727&tg=boomuserlist%3A%3A2463680&pl=www.61baobao.com&creative=30392663360&kw=&gclid=CPC2idPRv7gCFQVZpQodFhcABg&type=2 2013-07-21 11:24:56 - 58.219.82.109 中国 江苏省 无锡市 http://www.yihaodian.com/5/?tracker_u=2225501&type=4 2013-07-21 13:57:11 - 58.219.82.109 中国 江苏省 无锡市 http://search.yihaodian.com/s2/c0-0/k%25E7%25A6%258F%25E4%25B8%25B4%25E9%2597%25A8%25E9%2587%2591%25E5%2585%25B8%25E7%2589%25B9%25E9%2580%2589%25E4%25B8%259C%25E5%258C%2597%25E7%25B1%25B35kg%2520%25E5%259B%25BD%25E4%25BA%25A7%25E5%25A4%25A7%25E7%25B1%25B3%2520%25E6%2599%25B6%25E8%258E%25B9%25E5%2589%2594%25E9%2580%258F%2520%25E8%2587%25AA%25E7%2584%25B6%2520%2520/5/ 2013-07-21 13:50:48 - 58.219.82.109 中国 江苏省 无锡市 http://search.yihaodian.com/s2/c0-0/k%25E8%258C%25B6%25E8%258A%25B1%25E8%2582%25A5%25E7%259A%2582%25E7%259B%2592%25202213%2520%25E5%258D%25AB%25E7%2594%259F%25E7%259A%2582%25E7%259B%2592%2520%25E9%25A6%2599%25E7%259A%2582%25E7%259B%2592%2520%25E9%25A2%259C%25E8%2589%25B2%25E9%259A%258F%25E6%259C%25BA%2520%2520/5/ 2013-07-21 13:57:16 - 58.219.82.109 中国 江苏省 无锡市 http://www.yihaodian.com/5/?tracker_u=2225501&type=4 2013-07-21 13:50:13 - ....
-
将数据上传至hdfs中
-
将hdfs数据加载到表中
load data inpath 'hdfs://192.168.64.147:9000/project/input/etl/' overwrite into table trackinfo partition(day='2013-07-21');
-
根据分区查询表数据
select *from trackinfo where day='2013-07-21' limit 3;
四、使用Hive完成统计分析功能
-
数据统计
select count(*) from trackinfo where day='2013-07-21';
-
省份统计
select province,count(*) from trackinfo where day='2013-07-21' group by province;
-
创建省份统计表
create external table trackinfo_province_stat( country string, cnt bigint ) partitioned by (day string) row format delimited fields terminated by '\t';
-
从现有总表中查询数据添加到省份统计表
insert overwrite table trackinfo_province_stat partition(day='2013-07-21') select province,count(*) from trackinfo where day='2013-07-21' group by province;
后续统计表表的数据可以直接从这个表中查询也可以将hive表的数据导入到RDBMS(sqoop)