该面试题的原文地址:http://blog.csdn.net/zolalad/article/details/10819749#
解决思路:根据用户ID算出访问次数,然后根据访问次数算出fromurl和tourl
难点主要为计算用户访问次数,原文的计算方法看着有点复杂,于是就简单写了一个
import java.util.HashMap;
import org.apache.hadoop.hive.ql.exec.UDF;
public class UdfTest extends UDF {
HashMap<Integer,Integer> hm = new HashMap<Integer, Integer>();
public int evaluate(int id){
Integer count = hm.get(id);
if (count==null){
count=0;
}
count++;
hm.put(id, count);
return count;
}
}
把用户的ID及访问次数count写入map集合,最后返回count
打包上传,在hive中执行add jar /usr/local/udf.jar ,CREATE TEMPORARY FUNCTION num AS "udf.UdfTest";
SELECT t1.platform,t1.user_id,t1.n,t2.click_url FROM_URL,t1.click_url TO_URL FROM
(select *,num(USER_ID) n from trlog)t1
LEFT OUTER JOIN
(select *,num(USER_ID) n from trlog)t2
on t1.user_id = t2.user_id and t1.n = t2.n+1;
注释:当访问次数为1时,fromurl为null,此时t1.n为1,t2中应不存在次数为1的,所以t2中应该n+1
进行连表查询,刚开始报错java.io.FileNotFoundException(File does not exist/usr/local/......),于是手动把jar包传到hdfs,成功执行
注:
最近发现仅用hive的分析函数就可实现:ROW_NUMBER+LAG
select platform,
user_id,
CLICK_TIME,
ROW_NUMBER() OVER(PARTITION BY platform,user_id order by CLICK_TIME) AS rn,
lag(click_url,1) over(partition by platform,user_id order by CLICK_TIME) as from_url,
click_url from trlog;