HIVE工作中使用技巧

mn_kw

于 2023-01-04 17:29:06 发布

阅读量658

点赞数

分类专栏： HIVE 文章标签： hive hadoop 数据仓库

本文链接：https://blog.csdn.net/mn_kw/article/details/128551549

版权

HIVE 专栏收录该内容

2 篇文章 0 订阅

订阅专栏

HIVE工作中使用技巧

1. HIVE解析日志

1.1 使用正则表达式解析http中的日志信息

    trim(split(regexp_extract(u.uri_path,'app_uid=(.*)',1),'&')[0]) as app_uid

1.2 使用正则表达式解析数组中的信息

    regexp_extract(app_uid,'^\\["(.*)\\"]$',1) as app_uid

1.3 使用反射的方式来解析uri中的字段

    reflect("java.net.URLDecoder", "decode", trim(parse_url(concat('http://?',u.uri_path),'QUERY','app_uid')), "UTF-8") as app_uid

1.4 get_json_object() 来解析json中的某个字段

    get_json_object(data_list,'$.SpuID')        as supply_id

1.5 使用lateral view json_tuple() 将json中的字段打平为多个字段

    lateral view json_tuple(t.source,'logid','msg','level') v1 as logid,msg,level

2. HIVE行转列

2.1 使用以下函数进行行转列 collect_set/collect_list + concat_ws() + group by

    select
        t.app_uid as app_uid
        ,concat_ws(',',collect_set(t.tag)) as tags
    from
    (
        select
            app_uid
            ,case
                when type=1 then '1'
                when type=2 then '2'
                when type=3 then '3'
                when type=4 then '4'
                else '其他'
            end as tag
        from ymtods.test
        where day='{ptdate}'
        and status=1
    ) t
    group by t.app_uid
    ;

2.2 collect_list 和 collect_set的区别

    都是返回array类型字段,但是前者不去重,后者去重

2.3 使用case when + 聚合函数 + group by进行行转列

    例如：学生号      科目  成绩
            01      语文  80
            01      英语  90
            02      语文  60
    需要展示 学生号 语文 英语 数学
            01    80  90   70
    select 
        st_no 
        ,sum(case when 科目 = '语文' then 成绩 end) as 语文
        ,sum(case when 科目 = '数学' then 成绩 end) as 数学
        ,sum(case when 科目 = '英语' then 成绩 end) as 英语
    from table 
    group by 学生号

3. HIVE列转行

3.1 多列转俩列多行 LATERAL VIEW explode(map())

    例如：订单信息中有多个字段 created_time,buyer_pay_time,seller_send_goods_time,buyer_recieve_goods_time,order_refund_time
    LATERAL VIEW explode (map(
        '100',created_time
        ,'200',created_time
        ,'300',buyer_pay_time
        ,'410',seller_send_goods_time
        ,'500',buyer_recieve_goods_time
        ,'501',order_refund_time
    )) b as order_status,order_status_time

3.2 多列转单列多行 concat_ws() LATERAL VIEW explode()

    例如：一个供应有多个类目，一级类目，二级类目，三级类目，需要多个类目下求聚合指标
    select
        e.category_id
        ,c.logid
    from
    (
        select
            concat_ws(',',cast(parsed_cid1_id as string),cast(parsed_cid2_id as string),cast(parsed_product_id as string),cast(parsed_breed_id as string)) as category_id
            ,logid
        from ymtcube.dwd_search_impress_track_h
        where pthour = from_unixtime(unix_timestamp('${pthour}','yyyyMMddHH')-3600,'yyyyMMddHH')
    ) c
    lateral view explode(split(c.category_id,",")) e as category_id
    where e.category_id > 0

3.3 单列转单列多行 LATERAL VIEW explode() 数组类型的直接打开

    lateral view explode(split(c.category_id,",")) e as category_id

3.4 使用union进行列转行

    例如：
    数据 学生号 语文 英语 数学
        01    80  90   70
    学生号   科目  成绩
    01      语文  80
    01      英语  90
    02      语文  60

    select 
        select 
            学生号
            ,'语文' as 科目
            ,语文 as 成绩 
        union 
            学生号
            ,'英语' as 科目
            ,英语 as 成绩 
        union 
            学生号
            ,'数学' as 科目
            ,数学 as 成绩  
    from table

3.5 使用posexplode()函数实现多列转多行

    a. 比如数据如下：
        id	    tim	           
        a,b,c,d	2:00,3:00,4:00,5:00
        a,b,c,d	2:00,3:00,4:00,5:00
        a,b,c,d	2:00,3:00,4:00,5:00
        a,b,c,d	2:00,3:00,4:00,5:00
        f,b,c,d	1:10,2:20,3:30,4:40
        f,b,c,d	1:10,2:20,3:30,4:40
        f,b,c,d	1:10,2:20,3:30,4:40
        f,b,c,d	1:10,2:20,3:30,4:40
    b. 我们要实现 id的第一列对应tim的第一列,以此类推,可以用以下函数posexplode(),可以将数据和index都取出来
        select id,tim,single_id_index,single_id from test.a 
        lateral view posexplode(split(id,',')) t as single_id_index, single_id;
            id	    tim	single_id_index	single_id
            a,b,c,d	2:00,3:00,4:00,5:00	0	a
            a,b,c,d	2:00,3:00,4:00,5:00	1	b
            a,b,c,d	2:00,3:00,4:00,5:00	2	c
            a,b,c,d	2:00,3:00,4:00,5:00	3	d
            f,b,c,d	1:10,2:20,3:30,4:40	0	f
            f,b,c,d	1:10,2:20,3:30,4:40	1	b
            f,b,c,d	1:10,2:20,3:30,4:40	2	c
            f,b,c,d	1:10,2:20,3:30,4:40	3	d
    c. 使用两次posexplode()函数实现多列转多行
        select id,tim,single_id,single_tim from test.a 
        lateral view posexplode(split(id,',')) t as single_id_index, single_id
        lateral view posexplode(split(tim,',')) t as single_yim_index, single_tim
        where single_id_index = single_yim_index;
            id	tim	single_id	single_tim
            a,b,c,d	2:00,3:00,4:00,5:00	a	2:00
            a,b,c,d	2:00,3:00,4:00,5:00	b	3:00
            a,b,c,d	2:00,3:00,4:00,5:00	c	4:00
            a,b,c,d	2:00,3:00,4:00,5:00	d	5:00
            f,b,c,d	1:10,2:20,3:30,4:40	f	1:10
            f,b,c,d	1:10,2:20,3:30,4:40	b	2:20
            f,b,c,d	1:10,2:20,3:30,4:40	c	3:30
            f,b,c,d	1:10,2:20,3:30,4:40	d	4:40

4. HIVE GROUP高级用法

4.1 group by with cube

    为指定表达式集的每个可能组合创建分组集。首先会对(A、B、C)进行group by,
    然后依次是(A、B),(A、C),(A),(B、C),(B),(C),最后对全表进行group by操作。

4.2 group by with rollup

    rollup 是根据维度在数据结果集中进行的聚合操作。
    group by A,B,C with rollup 首先会对(A、B、C)进行group by,
    然后对(A、B)进行group by,然后是(A)进行group by,
    最后对全表进行group by操作

    With rollup函数:主要针对层级维度的组合处理,假如现在的组合是省+市+区三个维度进行组合,
    with rollup 就会自动的一级一级往上卷,变成省+市,最后是省

4.3 grouping sets

    这个可以由用户自己决定需要哪些维度组合,将自己需要的维度组合都写在Grouping sets函数后
    select class, avg(score) from tmp.tmp_student group by class;
    union 
    select sex, avg(score) from tmp.tmp_student group by sex;
    union 
    select class, sex, avg(score) from tmp.tmp_student group by class, sex;
    可以写成以下
    select
        grouping__id, -- 内置变量，只要使用grouping sets就可以调用
        class,
        sex,
        avg(score) as avg_score
    from
        tmp.tmp_student
    group by
        class,
        sex 
    grouping sets(
        class, 
        sex,
        (class, sex)
    )

5. HIVE开窗函数

5.1 开窗函数拿到第一个值 first() over()

    first_value(source) over (partition by callid,customer_id,seller_id order by click_time) as source_first

5.2 开窗函数排序

    row_number() over () 
        对相等的值不做区别,相等的值对应的排名相同,序号从1到n连续 
            比如91,90,90的排名是1,2,3。
    rank() over () 
        相等的值排名相同,但若有相等的值,则序号从1到n是间断的 
            比如91,90,90,89 的排名是1,2,2,4
    dense_rank() over()
        对相等的值排名相同,但序号从1到n连续。 
            比如91,90,90,89 的排名是1,2,2,3

6. HIVE中UNION和UNION ALL

6.1 UNION

    union 会将俩个表的信息合并，并且会去重

6.2 UNION ALL

    union all 会将俩个表信息合并，但是不会去重

7. with as 和临时表

7.1 with as

7.1.1 with as 原理：

当我们书写一些结构相对复杂的SQL语句时，可能某个子查询在多个层级多个地方存在重复使用的情况，这个时候我们可以使用 with as 语句将其独立出来，极大提高SQL可读性，简化SQL~
注：目前 oracle、sql server、hive等均支持 with as 用法，但 mysql并不支持！

with as 也叫做子查询部分，首先定义一个sql片段，该sql片段会被整个sql语句所用到，为了让sql语句的可读性更高些，作为提供数据的部分，也常常用在union等集合操作中。
with as就类似于一个视图或临时表，可以用来存储一部分的sql语句作为别名，不同的是with as 属于一次性的，而且必须要和其他sql一起使用才可以！其最大的好处就是适当的提高代码可读性，
而且如果with子句在后面要多次使用到，这可以大大的简化SQL；更重要的是：一次分析，多次使用，这也是为什么会提供性能的地方，达到了“少读”的目标。

7.1.2 简单使用：

	WITH 
	t1 as (SELECT * FROM Table1),
	t2 as (SELECT * FROM Table2) 
	SELECT t1.*
	FROM t1
	LEFT JOIN t2 ON t1.id=t2.id

Hive可以通过with查询来提高查询性能，因为先通过with语法将数据查询到内存(在内存，如果说数据量太大，就会占用大量的资源，可能会内存溢出)，然后后边其他查询可以直接使用，这种方法与创建临时表类似，但是不需要创建临时实体表，内存中的子查询结果在会话结束后自动删除。作用和视图或临时表类似

7.1.4 hive中with as 相关参数

hive.optimize.cte.materialize.threshold
默认这个参数是-1关闭，当当开启（大于0），比如设置为2，则如果with..as语句被引用2次及以上时，会把with..as语句生成的table物化，从而做到with..as语句只执行一次，来提高效率。

7.2 临时表

7.2.1 创建临时表语法

语句:create temporary table 。。。

7.2.2 有效范围

hive中的临时表只对当前session有效，session退出后，表会自动删除注:若创建的临时表表名已经存在，那么当前session应用该表名时使用的是你创建的临时表，只有删除或者修改临时表才能使用原先存在的表。

7.2.3 临时表限制

临时表不支持分区字段和创建索引

8. (NULL)空值处理函数

8.1 NVL

8.1.1 基本语法 NVL（表达式 1，表达式 2）

	如果表达式 1 为空值，NVL 返回值为表达式 2 的值，否则返回表达式 1 的值。 
	该函数的目的是把一个空值（null）转换成一个实际的值。
	其表达式的值可以是数字型、 字符型和日期型。但是表达式 1 和表达式 2 的数据类型必须为同一个类型。

8.1.2 案例实操

	hive (gmall)> select nvl(1,0); 1 
	hive (gmall)> select nvl(null,"hello"); hello

8.2 coalesce

8.2.1 基本语法 coalesce(表达式1，表达式2…表达式n)

    coalesce 函数语法为 coalesce(表达式1，表达式2....表达式n),coalesce函数的返回结果是第一个非空表达式，如果全是空则 返回 空.使用时需注意：所有表达式值是同一类型(转换同一类型亦可)

8.2.2 案例实操

    select coalesce(c1, c2, c3, c4) null_1, coalesce(c1, c4) null_allfrom temp.jc_test_coalesce_nvl;

9. HIVE 内部表和外部表

9.1 HIVE内部表

	Hive 创建内部表时，会将数据移动到数据仓库指向的路径；在删除表的时候，内部表的元数据和数据会被一起删除。
	又称为管理表：Hive 默认情况下会将这些表的数据存储在由配置项 hive.metastore.warehouse.dir(例如，/user/hive/warehouse)所定义的目录的子目录下

9.2 HIVE外部表

	EXTERNAL 关键字可以让用户创建一个外部表，在建表的同时指定一个指向实际数据的路径（LOCATION），若创建外部表，仅记录数据所在的路径，不对数据的位置做任何改变。
	在删除表的时候，外部表只删除元数据，不删除数据。

9.3 管理表和外部表的使用场景

    每天将收集到的网站日志定期流入 HDFS 文本文件。在外部表（原始日志表）的基础上做大量的统计分析，用到的中间表、结果表使用内部表存储，数据通过 SELECT+INSERT 进入内部表。

9.4 内部表和外部表互相转化

	修改内部表 student2 为外部表：
		alter table student2 set tblproperties('EXTERNAL'='TRUE'); 
	修改外部表 student2 为内部表：
		alter table student2 set tblproperties('EXTERNAL'='FALSE');
	注意：('EXTERNAL'='TRUE')和('EXTERNAL'='FALSE')为固定写法，区分大小写

9.5 生产中为什么建议使用外部表

    1. 数据安全，外部表在删除表的时候，只是删除表元数据，不删除表数据
	2. 外部表不会加载数据到hive，减少数据传输，数据还能共享

10. 四个排序by以及区别

10.1 全部排序（order by）

	全局排序，只有一个reducer，适合数据量不大的请求，数据量较大会有oom风险

10.2 每个 MapReduce 内部排序（SortBy）

	每个 Reducer 内部进行排序，对全局结果集来说不是排序。

10.3 分区排序（DistributeBy）

	DistributeBy：类似 MR 中 partition，进行分区，结合 sortby 使用。 
	注意，Hive 要求 DISTRIBUTEBY 语句要写在 SORTBY 语句之前。 
	distribute by 采集hash算法，在map端将查询结果中hash值相同的结果分发到对应的reduce文件中。
	hive会根据distribute by后面列，对应reduce的个数进行分发，默认是采用hash算法

10.4 ClusterBy

	当 distribute by 和 sortsby 字段相同时，可以使用 clusterby 方式。 
	clusterby 除了具有 distribute by 的功能外还兼具 sortby 的功能。
	但是排序只能是升序排序，不能指定排序规则为 ASC 或者 DESC。 
		1）以下两种写法等价 
			hive (default)> select * from emp cluster by deptno; 
			hive (default)> select * from emp distribute by deptno sort by deptno; 
			注意：按照部门编号分区，不一定就是固定死的数值，可以是 20 号和 30 号部门分到一个分区里面去。

10.5 group by

	和distribute by类似 都是按key值划分数据，都使用reduce操作
	唯一不同的是，distribute by只是单纯的分散数据，distribute by col – 按照col列把数据分散到不同的reduce。
	group by把相同key的数据聚集到一起，后续必须是聚合操作

10.6 partition by

	类似hive的建表，分区的意思；

11. hive -e 和 hive -f区别

11.1 “-e”不进入 hive 的交互窗口执行 sql 语句

	[hive]$ bin/hive -e "select id from student;"

11.2 “-f”执行脚本中 sql 语句

	（1）在/opt/module/datas 目录下创建 hivef.sql 文件 
		[datas]$ touch hivef.sql 文件中写入正确的 sql 语句 select*fromstudent; 
	（2）执行文件中的 sql 语句
		[hive]$ bin/hive -f /opt/module/datas/hivef.sql 
	（3）执行文件中的 sql 语句并将结果写入文件中 
		[hive]$ bin/hive -f /opt/module/datas/hivef.sql > /opt/module/datas/hive_result.txt