Hive: Deleting Data from a Table

Latest recommended article published 2024-05-14 04:17:05

乐乐今天没bug

Categories: Big Data, hive. Tags: hive, hql

Article link: https://blog.csdn.net/qq_41826265/article/details/102951793

Background

Table structure and data:
I have the following table:

hive> select * from t2;
OK
t2.id	t2.name	t2.addr	t2.n1	t2.n2	t2.n3
1	NULL	NULL	NULL	NULL	NULL
2	NULL	NULL	NULL	NULL	NULL
3	3n1	3n2	3n3	NULL	NULL
4	NULL	NULL	4n1	4n2	4n3
5	NULL	NULL	NULL	NULL	NULL
6	NULL	NULL	NULL	NULL	NULL
7	7n1	7n2	7n3	NULL	NULL
8	8n1	8n2	8n3	NULL	NULL
9	NULL	NULL	NULL	NULL	NULL
10	NULL	NULL	NULL	NULL	NULL

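Writing a UDF that returns the number of NULL values
I want to delete every row that has more than three NULL values. My first idea was a UDF, so I wrote one and packaged it as a jar. The UDF is shown below; all it does is return the number of NULL values in a row.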

package com.hiveUDF;

import org.apache.hadoop.hive.ql.exec.UDF;

// Hive UDF that returns how many of a row's six columns are NULL.
public class add5 extends UDF {

    // Integer (not int) so that a NULL id can be passed in as well.
    public int evaluate(Integer id, String name, String addr, String n1, String n2, String n3) {
        return countNulls(id, name, addr, n1, n2, n3);
    }

    private int countNulls(Object... values) {
        int numNull = 0;
        for (Object value : values) {
            // Hive passes SQL NULL as Java null, so value.equals("NULL") would throw
            // a NullPointerException. The literal text "NULL" is still counted,
            // as in the first version of this function.
            if (value == null || "NULL".equals(value)) {
                numNull++;
            }
        }
        return numNull;
    }
}
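
A point worth checking in a UDF like this is how NULL arrives in Java. Hive passes a SQL NULL string column to evaluate as a Java null, not as the text "NULL", so the comparison has to be null-safe. Below is a minimal standalone sketch of that check with no Hive dependency. The names NullCountDemo, countNulls and naiveCheckThrows are hypothetical and are not part of the UDF above.

```java
// Standalone sketch; NullCountDemo and its methods are hypothetical names.
final class NullCountDemo {

    // Counts values that are Java null (how Hive passes SQL NULL)
    // or the literal text "NULL".
    static int countNulls(Object... values) {
        int count = 0;
        for (Object value : values) {
            // Putting the constant first means a null value is never dereferenced.
            if (value == null || "NULL".equals(value)) {
                count++;
            }
        }
        return count;
    }

    // Returns true if value.equals("NULL") throws, which happens when value is null.
    static boolean naiveCheckThrows(String value) {
        try {
            value.equals("NULL");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(countNulls(3, "3n1", "3n2", "3n3", null, null)); // 2
        System.out.println(countNulls(1, null, null, null, null, null));    // 5
        System.out.println(naiveCheckThrows(null));                          // true
    }
}
```

The two counts match what the function is expected to return for the rows with id 3 and id 1 in t2.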

Adding the jar and creating the function
Add the jar to the Hive session and create a temporary function f3:

hive> add jar /data/f3.jar;
Added [/data/f3.jar] to class path
Added resources: [/data/f3.jar]
hive> create temporary function f3 as "com.hiveUDF.add5";
OK
Time taken: 0.021 seconds

Verifying the function

hive> select f3(*) from t2;
OK
5
5
2
2
5
5
2
2
5
5
Time taken: 0.674 seconds, Fetched: 10 row(s)

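The problem

I tried several forms of DELETE and none of them did what I needed, so I stepped back to think.
Then I remembered that Hive does not seem to support a plain "delete from table;" operation, but I stubbornly tried it anyway: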

hive> delete from t2 where id=1;
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.

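My English is poor, but I could see at a glance that it says the operation is not supported. The message means the configured transaction manager cannot do UPDATE or DELETE: Hive allows DELETE only on transactional (ACID) tables, and t2 is not one. Fine, I will have to take the roundabout route.

Solution

First I create a new table that holds the source data of t2 together with the result of f3(*), which gives me t3: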

hive> create table t3(f3 int,id int,name String,addr String,n1 String,n2 String,n3 String);
OK
Time taken: 0.316 seconds

Insert the data:

hive> insert into t3 select f3(*),* from t2;

Check that the insert into t3 succeeded:

hive> select * from t3;
OK
5	1	NULL	NULL	NULL	NULL	NULL
5	2	NULL	NULL	NULL	NULL	NULL
2	3	3n1	3n2	3n3	NULL	NULL
2	4	NULL	NULL	4n1	4n2	4n3
5	5	NULL	NULL	NULL	NULL	NULL
5	6	NULL	NULL	NULL	NULL	NULL
2	7	7n1	7n2	7n3	NULL	NULL
2	8	8n1	8n2	8n3	NULL	NULL
5	9	NULL	NULL	NULL	NULL	NULL
5	10	NULL	NULL	NULL	NULL	NULL
Time taken: 0.078 seconds, Fetched: 10 row(s)

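To preserve the original table t2, create a new table t4 to receive the filtered data; otherwise t2 can be overwritten directly. Here I chose to create a new table t4.
Create t4 with the same structure as t2 (this copies only the table structure):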

hive> create table t4 like t2;
OK
Time taken: 0.178 seconds

Load the rows that meet the requirement into t4:

hive> insert into t4 select id,name,addr,n1,n2,n3 from t3 where f3<=3;

Check the data in t4:

hive> select * from t4;
OK
3	3n1	3n2	3n3	NULL	NULL
4	NULL	NULL	4n1	4n2	4n3
7	7n1	7n2	7n3	NULL	NULL
8	8n1	8n2	8n3	NULL	NULL
Time taken: 0.087 seconds, Fetched: 4 row(s)
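
As a sanity check on the threshold, the same filter can be replayed in plain Java over the ten sample rows of t2. This is only an illustrative sketch: FilterReplay, sampleRows and keptIds are hypothetical names, and nothing here talks to Hive. It assumes the NULL cells shown in the query output are real NULLs.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java replay of the t2 -> t4 filter; all names here are hypothetical.
final class FilterReplay {

    // The ten rows of t2 as (id, name, addr, n1, n2, n3), with SQL NULL as Java null.
    static List<Object[]> sampleRows() {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {1, null, null, null, null, null});
        rows.add(new Object[] {2, null, null, null, null, null});
        rows.add(new Object[] {3, "3n1", "3n2", "3n3", null, null});
        rows.add(new Object[] {4, null, null, "4n1", "4n2", "4n3"});
        rows.add(new Object[] {5, null, null, null, null, null});
        rows.add(new Object[] {6, null, null, null, null, null});
        rows.add(new Object[] {7, "7n1", "7n2", "7n3", null, null});
        rows.add(new Object[] {8, "8n1", "8n2", "8n3", null, null});
        rows.add(new Object[] {9, null, null, null, null, null});
        rows.add(new Object[] {10, null, null, null, null, null});
        return rows;
    }

    // Number of null cells in one row.
    static int countNulls(Object[] row) {
        int count = 0;
        for (Object value : row) {
            if (value == null) {
                count++;
            }
        }
        return count;
    }

    // Ids (column 0) of the rows that have at most maxNulls null cells.
    static List<Integer> keptIds(List<Object[]> rows, int maxNulls) {
        List<Integer> ids = new ArrayList<>();
        for (Object[] row : rows) {
            if (countNulls(row) <= maxNulls) {
                ids.add((Integer) row[0]);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Same condition as "where f3<=3"
        System.out.println(keptIds(sampleRows(), 3)); // [3, 4, 7, 8]
    }
}
```

The surviving ids agree with the four rows that ended up in t4.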