1. Working with external tables
1. Convert an external table to a managed (internal) table
alter table xxx.xxx set tblproperties ('EXTERNAL' = 'false');
2. Drop Hive table partitions
alter table xxx.xxx drop if exists partition (dp='bkactive',dt<'xxx');
3. Convert a managed table to an external table
alter table xxx.xxx set tblproperties ('EXTERNAL' = 'true');
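To confirm which mode a table ended up in, describe formatted reports the table type; the table name below is the same placeholder as above:
describe formatted xxx.xxx;
-- the "Table Type:" line shows MANAGED_TABLE or EXTERNAL_TABLE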
2. Set memory size
set spark.executor.memory=10g;
3. Enable non-strict mode
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.support.quoted.identifiers=None;
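The first two settings enable dynamic-partition inserts, where the partition value comes from the data rather than a literal; the third allows regex column expressions in the select list. A minimal sketch of a dynamic-partition insert, assuming hypothetical tables tmp.t_src and tmp.t_part (partitioned by dt):
-- hypothetical tables; the partition column must come last in the select list
insert overwrite table tmp.t_part partition (dt)
select uid, ctime, dt
from tmp.t_src;
Without nonstrict mode, Hive requires at least one static partition value in the partition clause.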
4. Set the execution engine
set hive.execution.engine=mr;
set hive.execution.engine=spark;
set hive.execution.engine=tez;
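Running set with only the property name prints its current value, a quick way to check which engine is active:
set hive.execution.engine;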
5. union all error
The column KEY._col1:0._col0 is not in the vectorization context column map
Add this parameter:
set hive.vectorized.execution.enabled=false;
6. Convert a yyyy-MM-dd HH:mm:ss timestamp to yyyyMMdd or yyyyMMddHH
FROM_UNIXTIME(unix_timestamp(ctime),'yyyyMMdd') dt
FROM_UNIXTIME(unix_timestamp(ctime),'yyyyMMddHH') dh
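A self-contained check with a literal timestamp (the value is only for illustration):
select FROM_UNIXTIME(unix_timestamp('2021-03-01 12:34:56'),'yyyyMMdd');    -- 20210301
select FROM_UNIXTIME(unix_timestamp('2021-03-01 12:34:56'),'yyyyMMddHH');  -- 2021030112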
7. Convert a Unix timestamp to yyyyMMdd / yyyy-MM-dd
FROM_UNIXTIME(cast(ctime as int),'yyyyMMdd') dt
FROM_UNIXTIME(cast(ctime as int),'yyyy-MM-dd') dt
8. Refresh table metadata in Impala
invalidate metadata ads.ads_cps_yg_referral_push_report
9. CREATE TABLE statement for a CSV file
-- create the table in the tmp database; the table name can be anything, i.e. tmp.<table_name>
CREATE EXTERNAL TABLE `tmp.tmp_xhd_csv`(
-- columns from the CSV
`uid` string COMMENT '',
`days_acl` string COMMENT '',
`days_arrange` string COMMENT '',
`days_ac` string COMMENT '',
`days_nonac` string COMMENT '',
`dt_max` string COMMENT '',
`days_acavg` string COMMENT '',
`avtice` string COMMENT '')
COMMENT '& incremental'
-- CSV fields are separated by ',' by default
row format delimited
fields terminated by ','
-- point LOCATION at a directory, e.g. /user/xiehuadong/questionnaires
LOCATION
'hdfs://nameservice1/user/hive/warehouse/tmp.db/zhd_csv'
-- the first CSV line is usually a header; this property skips it
TBLPROPERTIES('skip.header.line.count'='1')
;
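After the table is created, the CSV file still has to land in the LOCATION directory. One way, assuming a hypothetical local file path, is LOAD DATA:
load data local inpath '/path/to/local_file.csv' into table tmp.tmp_xhd_csv;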
10. Compute the difference in minutes
Take the absolute value of the difference between two timestamps, in minutes.
select abs((unix_timestamp('2021-03-01 00:00:00','yyyy-MM-dd HH:mm:ss') - unix_timestamp('2021-03-02 00:01:00','yyyy-MM-dd HH:mm:ss'))/60);
11. Explode a column into multiple rows
explode works on array and map types.
select app_id,app_name
from table_a
lateral view explode(app_names) table_view as app_name;
Split on '|':
select app_id,app_name
from table_a
lateral view explode(split(app_names,'\\|')) table_view as app_name;
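Note that split takes a regular expression, which is why the '|' must be escaped. A self-contained check with a literal value:
select app_name
from (select 'a|b|c' as app_names) t
lateral view explode(split(app_names,'\\|')) t_view as app_name;
-- returns three rows: a, b, c (an unescaped '|' would split between every character)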
12. Pitfalls of coalesce / nvl
(nvl behaves like coalesce: it only skips NULL values, not empty strings.)
Example 1:
select imei,oaid
from (
select case when length(imei) > 0 then imei end imei,
       case when (length(trim(imei)) = 0 or imei is null
                  or imei = 'NULL' or imei = 'null')
                 and length(oaid) > 0
            then oaid end oaid
from dwd.dwd_tf_tf_h5_dsp_log
where dt='20210413'
) s
group by imei,oaid;
Example 2:
select imei from (
select coalesce(imei,oaid) imei
from dwd.dwd_tf_tf_h5_dsp_log
where dt='20210413'
) s
group by imei;
Example 2 returns fewer rows than Example 1. To see why, suppose the source data looks like this:
imei oaid
'' 1
NULL 2
3 4
'' 5
6 7
The coalesce in Example 2 only skips NULL, not empty strings, so coalesce(imei,oaid) yields:
imei
''
2
3
''
6
After deduplicating (group by imei), only the following remain:
imei
''
2
3
6
4 rows in total.
The CASE WHEN version avoids this because it checks length() > 0. Before deduplication the result is:
imei  oaid
NULL  1
NULL  2
3     NULL
NULL  5
6     NULL
After deduplication:
imei  oaid
NULL  1
NULL  2
3     NULL
NULL  5
6     NULL
5 rows in total, one more than the 4 rows coalesce produced.
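A quick literal test that reproduces the difference (the fallback value is arbitrary):
select coalesce('', 'fallback');                              -- '' : an empty string is not NULL, so it wins
select nvl('', 'fallback');                                   -- '' : same behavior
select case when length('') > 0 then '' else 'fallback' end;  -- 'fallback'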
13. Add a column
alter table table_name add columns(new_column string comment 'newly added column');
14. Add a partition
alter table al.dwd_rec_clt_user_read_chpts add partition (dt='20210701');
15. Redirect beeline output to a file
beeline --silent=true \
-u jdbc:hive2://192.168.1.124:10000/default \
-n uname -p 123456 \
--hiveconf mapreduce.job.queuename=root.yarn.hdfs \
--hiveconf spark.app.name="clt_abtest" \
-e "${sql}" >> data/report_2021_location.csv
16. Set the queue name
set mapreduce.job.queuename=root.yarn.rectask;