|
给你个JAVA写的RowNumber方法
-
- public class RowNumber extends org.apache.hadoop.hive.ql.exec.UDF {
-
- private static int MAX_VALUE = 50;
- private static String comparedColumn[] = new String[MAX_VALUE];
- private static int rowNum = 1;
-
- public int evaluate(Object... args) {
- String columnValue[] = new String[args.length];
- for (int i = 0; i < args.length; i++)
- columnValue[i] = args[i].toString();
- if (rowNum == 1)
- {
-
- for (int i = 0; i < columnValue.length; i++)
- comparedColumn[i] = columnValue[i];
- }
-
- for (int i = 0; i < columnValue.length; i++)
- {
-
- if (!comparedColumn[i].equals(columnValue[i]))
- {
- for (int j = 0; j < columnValue.length; j++)
- {
- comparedColumn[j] = columnValue[j];
- }
- rowNum = 1;
- return rowNum++;
- }
- }
- return rowNum++;
- }
- }
-
复制代码
把这个JAVA打包,编译成JAR包,比如RowNumber.jar。这个你总会吧~~~ 然后放到HIVE的机器上 在HIVE SHELL里执行下面两条语句:
- add jar /root/RowNumber.jar;
- #把RowNumber.jar加载到HIVE的CLASSPATH中
- create temporary function row_number as 'RowNumber';
- #在HIVE里创建一个新函数,叫row_number ,引用的CLASS 就是JAVA代码里的RowNumber
-
复制代码
提示成功后,执行下面这条HIVE SQL
-
- #INSERT OVERWRITE TABLE ALLOG 如果要写入ALLOG表,可以把注释去掉
- SELECT t1.platform,t1.user_id,row_number(t1.user_id)seq,t2.click_url FROM_URL,t1.click_url TO_URL FROM
- (select *,row_number(user_id)seq from trlog)t1
- LEFT OUTER JOIN
- (select *,row_number(user_id)seq from trlog)t2
- on t1.user_id = t2.user_id and t1.seq = t2.seq + 1;
-
复制代码
第一题中的RN貌似是HIVE转译SQL的BUG,你可以把外层的ROW_NUMBER去掉,用T1的SEQ,就能发现问题了。 第二题:
-
- INSERT OVERWRITE TABLE ALLOG
- SELECT t1.platform,t1.user_id,t1.seq,t2.click_url FROM_URL,t1.click_url TO_URL FROM
- (SELECT platform,user_id,click_time,click_url,count(1) seq FROM (SELECT a.*,b.click_time click_time1,b.click_url click_url2 FROM trlog a left outer join trlog b on a.user_id = b.user_id)t WHERE click_time>=click_time1 GROUP BY platform,user_id,click_time,click_url)t1
- LEFT OUTER JOIN
- (SELECT platform,user_id,click_time,click_url,count(1) seq FROM (SELECT a.*,b.click_time click_time1,b.click_url click_url2 FROM trlog a left outer join trlog b on a.user_id = b.user_id)t WHERE click_time>=click_time1 GROUP BY platform,user_id,click_time,click_url )t2
- on t1.user_id = t2.user_id and t1.seq = t2.seq + 1;
-
复制代码
|
|