Problem: the Hive external table test contains duplicate records, i.e. several rows share the same values on a certain set of columns. For each such group of rows, only one row should be kept and the rest discarded.
Solution approaches:
(1) GROUP BY
First create a new table test_pure with exactly the same schema as test, then GROUP BY the columns on which the duplicates are defined. Normally a Hive SQL SELECT cannot reference columns that are not in the GROUP BY clause, so for each such column we use collect_set(non-group-by column)[0], which gathers that column's values within the group into a set and takes one element from it. Obtain all the non-GROUP-BY columns this way, then INSERT the complete rows into the new table. A small illustration of the collect_set trick follows below.
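A minimal sketch of what collect_set(col)[0] does, run against the same test table (the alias names here are just for illustration): it returns one arbitrary host per (content, time) group, together with the number of rows that the group would collapse into a single record:

select
    content,
    time,
    collect_set(host)[0] as any_host,   -- an arbitrary host value from the group
    count(*)             as dup_cnt     -- how many rows share this (content, time)
from test
group by content, time;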
a. Create the target table
set mapred.job.name = 'chachong';        -- job name
set mapred.job.map.capacity = 300;       -- job-level cap on concurrent map tasks
set mapred.job.reduce.capacity = 300;    -- job-level cap on concurrent reduce tasks
set hive.exec.reducers.max = 300;        -- upper bound on the number of reducers
set hive.mapred.mode=nonstrict;          -- lift strict-mode restrictions (e.g. full partition scans)
set hive.map.aggr=true;                  -- enable map-side aggregation
set hive.groupby.skewindata=true;        -- two-stage aggregation to handle skewed group-by keys
set mapred.job.priority=VERY_HIGH;       -- scheduling priority
use me;                                  -- switch to the target database (named "me" here)
drop table if exists test_pure;
create external table if not exists test_pure(
content_id string,
content string,
host string,
subject string,
brand string,
industry string,
critic string,
time string
)
partitioned by (dt string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
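Before running the deduplicating insert, it is worth confirming which groups actually contain duplicates. A sketch using the same filter as the insert below:

select
    content,
    time,
    count(*) as cnt                      -- number of rows sharing this (content, time)
from test
where
    industry = "aaa"
    and time like '%201501%'
    and dt >= "20150520"
    and dt <= "20151027"
group by content, time
having count(*) > 1;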
b. Deduplicate and insert
insert into table test_pure
partition (dt='${hivevar:new_dt}')
select
collect_set(content_id)[0],
content,
collect_set(host)[0],
collect_set(subject)[0],
collect_set(brand)[0],
collect_set(industry)[0],
collect_set(critic)[0],
time
from
test
where
industry="aaa"
and time like '%201501%'
and dt >= "20150520"
and dt <= "20151027"
group by
content,time;
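A quick sanity check after the insert (a sketch, assuming the partition value and filter above): the row count of the new partition should equal the number of distinct (content, time) groups in the source:

select count(*) from test_pure where dt = '${hivevar:new_dt}';

select count(distinct content, time)
from test
where
    industry = "aaa"
    and time like '%201501%'
    and dt >= "20150520"
    and dt <= "20151027";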
(2) Deduplicating with row_number
Hive 0.11.0 added built-in support for row_number(), but 0.8.0 (the version in use here) still lacks it, so we implement the same behaviour ourselves as a UDF. The steps are as follows:
(1) Download hive-exec-1.0.1.jar (the UDF below compiles against it);
(2) Write a user-defined function (UDF) implementing row_number:
package com.blue.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class RowNumber extends UDF {
    // maximum number of grouping columns supported
    private static final int MAX_VALUE = 50;
    // grouping-column values of the previously scanned row
    private static String[] comparedColumn = new String[MAX_VALUE];
    // running row number within the current group
    private static int rowNum = 1;

    public int evaluate(Object... args) {
        String[] columnValue = new String[args.length];
        for (int i = 0; i < args.length; i++) {
            columnValue[i] = args[i].toString();
        }
        // very first row: initialise the comparison columns
        if (rowNum == 1) {
            for (int i = 0; i < columnValue.length; i++) {
                comparedColumn[i] = columnValue[i];
            }
        }
        // if any grouping column differs from the previous row, a new group starts
        for (int i = 0; i < columnValue.length; i++) {
            if (!comparedColumn[i].equals(columnValue[i])) {
                for (int j = 0; j < columnValue.length; j++) {
                    comparedColumn[j] = columnValue[j];
                }
                rowNum = 1;
                return rowNum++;   // first row of the new group is numbered 1
            }
        }
        return rowNum++;           // same group as the previous row
    }
}
This class extends Hive's UDF class. It groups records by the values of the given columns and numbers the records within each group. evaluate() is called once per scanned row; its logic is: the comparedColumn array holds the grouping-column values of the previous record. If the current record has the same values on those columns as the previous one, rowNum is incremented; as soon as a value differs, the current record's values are copied into comparedColumn as the start of a new group and the numbering restarts at 1, and so on until all records have been scanned. For example, if three rows arrive with grouping values (a, t1), (a, t1), (b, t1), evaluate returns 1, 2, 1. Note this only works when all rows of a group land on the same reducer and arrive contiguously, which is what the distribute by / sort by in the query below ensures.
(3) Package the class into a jar. When building the jar, do not bundle hive-exec-1.0.1.jar into it; then upload the jar to a directory on the Hive machine (e.g. with rz -be);
(4) Use the function as follows:
add jar /home/work/zcc/data_checksame/blueudf.jar;
create temporary function row_number as 'com.blue.hive.udf.RowNumber';
select
    t.content_id, t.content, t.host, t.subject, t.brand, t.industry, t.critic, t.time
from (
    select
        content_id, content, host, subject, brand, industry, critic, time,
        row_number(content, time) as rn      -- UDF: running number within each (content, time) group
    from (
        select
            content_id, content, host, subject, brand, industry, critic, time
        from test
        where
            industry = "${hivevar:industry}"
            and time like '%${hivevar:month}%'
            and dt in ("20150521")
        distribute by content, time          -- all rows of a group go to the same reducer
        sort by content, time                -- and arrive there contiguously
    ) s
) t
where t.rn = 1;                              -- keep exactly one row per group
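On Hive 0.11.0 or later the custom UDF is unnecessary; the built-in windowing function does the same job. A sketch with the same columns and filters (the order by clause decides which row in each group is kept):

select
    content_id, content, host, subject, brand, industry, critic, time
from (
    select
        content_id, content, host, subject, brand, industry, critic, time,
        row_number() over (partition by content, time order by time) as rn
    from test
    where
        industry = "${hivevar:industry}"
        and time like '%${hivevar:month}%'
        and dt in ("20150521")
) t
where t.rn = 1;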
Summary:
(1) how to get at non-GROUP-BY columns in Hive (collect_set(col)[0]);
(2) deduplication strategies for Hive tables;
(3) how to use a UDF in Hive queries;
(4) how to resolve the exception "Execution Error, return code [-101] (unknow error) from org.apache.hadoop.hive.ql.exec.FunctionTask": make sure the JDK version used to compile the UDF matches the JDK version Hive runs on, since a class compiled with a newer JDK cannot be loaded by an older JVM.
A follow-up will compare the performance of the different deduplication approaches.