分类:
多维统计一般分两种,我们看看 Hive 中如何解决:
1、同属性的多维组合统计
(1)问题:
有如下数据,字段内容分别为:url, catePath0, catePath1, catePath2, unitparams
http://my.oschina.net/leejun2005/blog/83058 0 1 23 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 1 25 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
https://cwiki.apache.org/confluence 0 5 18 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 5 118 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 3 98 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 3 8 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 5 81 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 9 8 {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
(2)需求:
计算 catePath0, catePath1, catePath2 这三种维度组合下,各个 url 对应的 pv、uv,如:
0 1 25 1 1
0 1 8 1 1
0 1 ALL 3 3
0 3 8 1 1
0 3 98 1 1
0 3 ALL 2 1
0 5 118 1 1
0 5 18 1 1
0 5 81 1 1
0 5 ALL 3 2
0 ALL ALL 8 3
ALL ALL ALL 8 3
(3)解决思路:
hive 中同属性多维统计问题通常用 union all 组合出各种维度然后 group by 进行求解:
01 |
create EXTERNAL table IF NOT EXISTS t_log ( |
02 |
url string, c0 string, c1 string, c2 string, unitparams string |
03 |
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/tmp/decli/1' ; |
04 |
05 |
select * from ( |
06 |
select host, c0, c1, c2 from t_log t0 |
07 |
LATERAL VIEW parse_url_tuple(url, 'HOST' ) t1 as host |
08 |
where get_json_object(t0.unitparams, '$.store.fruit[0].weight' ) != 9 |
09 |
union all |
10 |
select host, c0, c1, 'ALL' c2 from t_log t0 |
11 |
LATERAL VIEW parse_url_tuple(url, 'HOST' ) t1 as host |
12 |
where get_json_object(t0.unitparams, '$.store.fruit[0].weight' |