# Hive 在多维统计分析中的应用 & 技巧总结

2626人阅读 评论(0) 收藏 举报
分类：

1、同属性的多维组合统计

（1）问题：

https://cwiki.apache.org/confluence 0 1 8 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 1 23 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 1 25 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
https://cwiki.apache.org/confluence 0 5 18 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 5 118 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 3 98 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 3 8 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 5 81 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 9 8 {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}

（2）需求：

0 1 23 1 1
0 1 25 1 1
0 1 8 1 1
0 1 ALL 3 3
0 3 8 1 1
0 3 98 1 1
0 3 ALL 2 1
0 5 118 1 1
0 5 18 1 1
0 5 81 1 1
0 5 ALL 3 2
0 ALL ALL 8 3
ALL ALL ALL 8 3

（3）解决思路：
hive 中同属性多维统计问题通常用 union all 组合出各种维度然后 group by 进行求解：

 01 create EXTERNAL table IF NOT EXISTS t_log (
 02     url string, c0 string, c1 string, c2 string, unitparams string
 03 )  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/tmp/decli/1';
 04
 05 select * from (
 06         select host, c0, c1, c2 from t_log t0
 07         LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
 08         where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9  09  union all  10  select host, c0, c1, 'ALL' c2 from t_log t0  11  LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host  12  where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
 13     union all
 14         select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0
 15         LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
 16         where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9  17  union all  18  select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0  19  LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host  20  where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
 21 ) test;
 22
 23 select c0, c1, c2, count(host) PV, count(distinct(host)) UV from (
 24         select host, c0, c1, c2 from t_log t0
 25         LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
 26         where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9  27  union all  28  select host, c0, c1, 'ALL' c2 from t_log t0  29  LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host  30  where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
 31     union all
 32         select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0
 33         LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
 34         where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9  35  union all  36  select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0  37  LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host  38  where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
 39 ) test group by c0, c1, c2;

2、不同属性的多维组合统计

Making Multiple Passes over the Same Data
Hive has a special syntax for producing multiple aggregations from a single pass
through a source of data, rather than rescanning it for each aggregation. This change
can save considerable processing time for large input data sets. We discussed the details
previously in Chapter 5.
For example, each of the following two queries creates a table from the same source
table, history:
hive> INSERT OVERWRITE TABLE sales
> SELECT * FROM history WHERE action='purchased';
hive> INSERT OVERWRITE TABLE credits
> SELECT * FROM history WHERE action='returned';
This syntax is correct, but inefficient. The following rewrite achieves the same thing,
but using a single pass through the source history table:
hive> FROM history
> INSERT OVERWRITE sales   SELECT * WHERE action='purchased'
> INSERT OVERWRITE credits SELECT * WHERE action='returned';

 1 FROM pv_users
 2 INSERT OVERWRITE TABLE pv_gender_sum
 3     SELECT pv_users.gender, count_distinct(pv_users.userid)
 4     GROUP BY pv_users.gender
 5
 6 INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
 7     SELECT pv_users.age, count_distinct(pv_users.userid)
 8     GROUP BY pv_users.age;
https://cwiki.apache.org/confluence/display/Hive/Tutorial

1、hive union all 的用法：不支持 top level，以及各个select字段名称、属性必须严格一致

2、结果的顺序问题，可以自己加字符控制排序

3、多重insert和union all一样也只扫描一次，但因为要insert到多个分区，所以做了很多其他的事情，导致消耗的时间非常长，其会产生多个job，union all 本身只有一个job

set hive.exec.parallel=true;   //打开任务并行执行
http://superlxw1234.iteye.com/blog/1703713

4、当前HIVE 不支持 not in 中包含查询子句的语法，形如如下的HQ语句是不被支持的:

select a.key from a where key not in(select key from b)  该语句在hive中不支持

select a.key from a left outer join b on a.key=b.key where b.key1 is null

5、left out join 不能连续3个以上使用，必须2个一组，2个一组包装起来使用。

 01 select p.ssi,p.pv,p.uv,p.nuv,p.visits,'2012-06-19 17:00:00' from (
 02     select * from (
 03         select * from (select ssi,count(1) pv,sum(visits) visits from FactClickAnalysis
 04         where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi ) p1
 05         left outer join
 06         (
 07         select ssi,count(1) uv from (select ssi,cookieid from FactClickAnalysis
 08         where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi,cookieid ) t1 group by ssi
 09         ) p2 on p1.ssi=p2.ssi
 10     ) p3
 11     left outer join
 12     (
 13         select ssi, count(1) nuv from FactClickAnalysis
 14         where logTime = insertTime and logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi
 15     ) p4 on p3.ssi=p4.ssi
 16 ) p

6、hive本地执行mr

7、hive动态分区创建过多遇到的一个错误

8、hive中巧用正则表达式的贪婪匹配

9、hive匹配全中文字段

name rlike '^[\\u4e00-\\u9fa5]+$' 判断一个字段是否全数字： select mobile from woa_login_log_his where pt = '2012-01-10' and mobile rlike '^\\d+$' limit 50;

11、hive优化之------控制hive任务中的map数和reduce数

12、hive中转义\$等特殊字符

13、日期处理：

select from_unixtime(unix_timestamp('20111102','yyyyMMdd') - N*86400,'yyyyMMdd') from t_lxw_test1 limit 1;

select ( unix_timestamp('2011-11-02','yyyy-MM-dd')-unix_timestamp('2011-11-01','yyyy-MM-dd') ) / 86400  from t_lxw_test limit 1;

14、删除 Hive 临时文件 hive.exec.scratchdir

REF：

http://superlxw1234.iteye.com/blog/1536440
http://liubingwwww.blog.163.com/blog/static/3048510720125201749323/
http://blog.csdn.net/azhao_dn/article/details/6921429

• 本文已收录于以下专栏：

## Hive 在多维统计分析中的应用 & 技巧总结

• wisgood
• 2013年12月07日 12:00
• 5567

• scut1135
• 2013年11月24日 16:59
• 1641

## Hive日志分析实践例子

• lzlchangqi
• 2014年01月23日 22:03
• 1767

## Hive 在多维统计分析中的应用 & 技巧总结

Hive 在多维统计分析中的应用 & 技巧总结 多维统计一般分两种，我们看看 Hive 中如何解决： 1、同属性的多维组合统计 （1）问题： 有如下数据，字段内容分别为：url, cate...
• nysyxxg
• 2015年05月14日 17:13
• 474

## 多维数据统计分析2

• 2017年07月21日 11:54
• 50MB
• 下载

## 多维数据统计分析3

• 2017年07月21日 11:56
• 50MB
• 下载

## 多维数据统计分析4

• 2017年07月21日 11:58
• 48.56MB
• 下载

## 多维数据统计分析

• 2017年07月21日 11:50
• 50MB
• 下载

## 005 执行mapreduce加强，利用hive统计分析电商网站用户行为指标

• shenfuli
• 2015年09月24日 18:00
• 1324

## 2001-使用Hive+MR统计分析网站指标

• shenfuli
• 2015年11月25日 11:02
• 2409

举报原因： 您举报文章：Hive 在多维统计分析中的应用 色情 政治 抄袭 广告 招聘 骂人 其他 (最多只允许输入30个字)