hive数据表去重方法

最新推荐文章于 2024-07-24 20:46:25 发布

zcc_0015

最新推荐文章于 2024-07-24 20:46:25 发布

阅读量2.9w

点赞数 1

分类专栏：数据挖掘文章标签： hive表数据查重 hivesql获取非group by字段 hive如何建外部表 exec.FunctionTask异常解

本文链接：https://blog.csdn.net/zcc_0015/article/details/50151307

版权

数据挖掘专栏收录该内容

13 篇文章 0 订阅

订阅专栏

1、hive 0.8.0数据表去重方法

问题描述：hive的外部表test中，在若干字段上存在重复现象，现在需要将若干字段上值相同的多条记录，只保其中留一条，
          舍弃其余的。

解决思路：
   （1）group by的方法

     首先新建与test表完全相同的新表test_pure,然后利用group by在有相同值的若干字段上进行分组统计，正常情况下，hive sql的select是无法取到非group by的字段，采用
collect_set(非group by字段)[0]，从分组统计的非group by字段值集合中，取到其中一个，按此方式取得所有非group by字段，然后将
所有字段的值，insert 到新建的表中。

    a、建表
       set mapred.job.name = 'chachong';
           set mapred.job.map.capacity = 300;
       set mapred.job.reduce.capacity = 300;
           set hive.exec.reducers.max = 300;
           set hive.mapred.mode=nonstrict;
           set hive.map.aggr=true;
           set hive.groupby.skewindata=true;
       set mapred.job.priority=VERY_HIGH;

           db = database me;
           use db;
           drop table if exists yuliao_pure;
       create external table if not exists yuliao_pure(
           content_id string,
           content string,
           host string,
           subject string,
           brand string,
           industry string,
           critic string,
           time string
               )
               partitioned by (dt string)
               row format delimited
               fields terminated by '\t'
               lines terminated by '\n'
               stored as textfile;

    b、去重

insert into table test_pure
partition (dt='${hivevar:new_dt}')
select
    collect_set(content_id)[0],
    content,
    collect_set(host)[0],
    collect_set(subject)[0],
    collect_set(brand)[0],
    collect_set(industry)[0],
    collect_set(critic)[0],
    time
from
    test
where
    industry="aaa"
    and time like '%201501%'
    and dt >= "20150520"
    and dt <= "20151027"
group by
    content,time;

2、利用row_number去重

由于的hive.0.11.0版本加入了对row_number()的内建函数的支持，但是0.8.0仍然不支持，因此通过编写UDF内建函数来完成，具体如下：

(1)下载hive-exec-1.0.1.jar;

(2)编写row_number的用户自定义的UDF；

package com.blue.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
public class RowNumber extends UDF {
   private static int MAX_VALUE = 50;
    private static String comparedColumn[] = new String[MAX_VALUE];
    private static int rowNum = 1;

    public int evaluate(Object... args)
    {
        String columnValue[] = new String[args.length];
        for (int i = 0; i < args.length; i++)
            columnValue[i] = args[i].toString();
        if (rowNum == 1)
        {
            for (int i = 0; i < columnValue.length; i++)
                comparedColumn[i] = columnValue[i];
        }
        for (int i = 0; i < columnValue.length; i++)
        {
            if (!comparedColumn[i].equals(columnValue[i]))
            {
                for (int j = 0; j < columnValue.length; j++)
                {
                    comparedColumn[j] = columnValue[j];
                }
                rowNum = 1;
                return rowNum++;
            }
        }
        return rowNum++;
    }
}

该类继承自hive的UDF类，该类是按若干列的值进行分组，在相同组内对记录进行编号。每扫描一行记录，该类的evaluate方法就会被执行，evaluate方法的内部逻辑是：comparedColumn数组用于跟踪上一条记录的分组列的值，碰到上一条记录在若干列上的值与当前记录的若干列的值相同，便将rowNumber++，一旦碰到值不同的

便将当前记录的若干列的值赋给comparedColumn数组，作为新的编号起点，依次下去，直到扫描完所有记录为止。

（3）将该类生成jar文件，打jar包时，不用将hive-exec-1.0.1.jar一块儿打入，rz -be 上传该jar文件到某一路径下；

（4）内建函数的使用方法，如下：

add jar /home/work/zcc/data_checksame/blueudf.jar;
create temporary function row_number as 'com.blue.hive.udf.RowNumber';

insert into table test_pure
select
t.content_id,t.content,t.host,t.subject,t.brand,t.industry,t.critic,t.time
from
( select
     content_id,content,host,subject,brand,industry,critic,time,
     row_number() over (distribute by content,time sort by time) as rn
     from yuliao
     where
     industry="${hivevar:industry}"
     and time like '%${hivevar:month}%'
     and dt in ("20150521")

) t where t.rn=1;

总结：(1)hive如何获取非group by字段；

(2)hive表的查重策略；

(3)hive表的UDF使用方法；

(4) Execution Error, return code [-101] (unknow error) from org.apache.hadoop.hive.ql.exec.FunctionTask异常解决方法，务必保证UDF的编译jdk版本与运行hive的jdk版本保持一致。

后续提供hive表的不同查重方法的优化性能。