需求: 用户访问web的top统计,比如用户1访问的top 3 web: web1, web2, web3
1) 准备测试数据/home/hadoop/hive-1.1.0/test/web.txt:
user1,web1
user1,web2
user1,web2
user1,web3
user1,web3
user1,web3
user1,web4
user1,web4
user1,web4
user1,web4
user2,web1
user2,web2
user2,web3
user2,web3
2) 编写UDF做rank排名
在/home/hadoop/hive-1.1.0/test/java下创建package目录com/example/hive/udf,并在该目录下创建Rank.java,内容如下:
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
public final class Rank extends UDF{
private int counter;
private Stringlast_key;
public intevaluate(final String key){
if ( !key.equalsIgnoreCase(this.last_key)) {
this.counter= 0;
this.last_key= key;
}
return this.counter++;
}
}
在/home/hadoop/hive-1.1.0/test/java下
执行javac –cp .:/home/hadoop/hive-1.1.0/lib/* com/example/hive/udf/Rank.java进行编译,
执行jar –cvf Rank.jar com/example/hive/udf/Rank.class 打jar包在hive下加载udf:
3) 建库建表加载数据:
4) 编写hql,统计每个用户top3的web
SELECT user, concat_ws(',',collect_set(web)) as topweb
FROM(
SELECT *
FROM
(
SELECT *,rank(user) as rank
FROM(
SELECT user,web,sum
FROM(
SELECT user,web,count(1) as sum from web group by user,web
)A
DISTRIBUTE BY user
SORT BY user,sum desc
)B
)C
WHERE rank<3
ORDER BY user,rank
)D
GROUP BY user
执行结果如下: