Big Data Development with Hive, Case 2: Handling the Hive Complex Type array

1. Requirements

1.1 Preparing the data

-- Table definition
drop table if exists gcable_look_offline;
CREATE TABLE gcable_look_offline(
  client_type string COMMENT 'terminal type', 
  client_id string COMMENT 'terminal id', 
  region_code string COMMENT 'terminal region code',   
  network_id string COMMENT 'network id', 
  service_param array<string> COMMENT 'service parameters')
PARTITIONED BY (stadate int)
row format delimited fields terminated by '|'
collection items terminated by "\^";

-- Sample data
08|109001212201363151|3060||2021-09-13T08:58:45Z^04^3001^茂名公共^3001^3001^3001^0^1
08|109001212201363151|3060||2021-09-13T08:58:45Z^04^1715^卫生健康^1715^1715^1715^0^1
08|109001210401297979|3201||2021-09-13T08:58:45Z^01^com.suma.midware.dmp.play.PlayActivity^直播全屏^com.suma.midware.ihome.third.Main3DActivity^首页
08|279001205200190933|3201||2021-09-13T08:58:47Z^01^com.suma.midware.ihome.third.Main3DActivity^首页^^
08|299001204200473398|3201||2021-09-13T08:58:45Z^03^GDZX2220210107121454^中国兄弟连12^11003^GDZX0920190517000004_CPBQ00006^07^2194^1^2813^1^0^0
08|299001204200468045|3220||2021-09-13T08:58:45Z^09^00700100^01^01^file:///data/adv/pic/1602815006348.jpg^null

-- Load the data from the shell
v_date=20211122
filename=/root/20211123.txt
hive -e "load data local inpath '${filename}' overwrite into table test.gcable_look_offline partition(stadate = ${v_date})"

Check the data:

    > select * from gcable_look_offline;
OK
08      109001212201363151      3060            ["2021-09-13T08:58:45Z","04","3001","茂名公共","3001","3001","3001","0","1"]    20211122
08      109001212201363151      3060            ["2021-09-13T08:58:45Z","04","1715","卫生健康","1715","1715","1715","0","1"]    20211122
08      109001210401297979      3201            ["2021-09-13T08:58:45Z","01","com.suma.midware.dmp.play.PlayActivity","直播全屏","com.suma.midware.ihome.third.Main3DActivity","首页"]  20211122
08      279001205200190933      3201            ["2021-09-13T08:58:47Z","01","com.suma.midware.ihome.third.Main3DActivity","首页","",""]        20211122
08      299001204200473398      3201            ["2021-09-13T08:58:45Z","03","GDZX2220210107121454","中国兄弟连12","11003","GDZX0920190517000004_CPBQ00006","07","2194","1","2813","1","0","0"] 20211122
08      299001204200468045      3220            ["2021-09-13T08:58:45Z","09","00700100","01","01","file:///data/adv/pic/1602815006348.jpg","null"]      20211122
Time taken: 0.841 seconds, Fetched: 6 row(s)
hive> 

1.2 The requirement

We need to pull the first and second elements out of the complex array column service_param and expose them as new columns of the table.

2. Solution

Hive provides two UDTFs, explode and posexplode, to help process complex types such as array.

-- Without the position (pos)
select region_code, new_param from gcable_look_offline lateral view explode(service_param) gcable_look_offline as new_param;

-- With the position (pos)
select region_code, pos, new_param from gcable_look_offline lateral view posexplode(service_param) gcable_look_offline as pos, new_param;
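For the original requirement (promoting the first and second elements to columns), a lateral view is not strictly necessary: Hive arrays support zero-based subscript indexing, so the elements can be selected directly. A minimal sketch (the aliases event_time and param_type are illustrative names, not taken from the source data dictionary):

```sql
-- service_param[0] and service_param[1] are the 1st and 2nd elements
-- (array subscripts are zero-based in Hive)
select region_code,
       service_param[0] as event_time,   -- illustrative alias
       service_param[1] as param_type    -- illustrative alias
from gcable_look_offline;
```

Out-of-range subscripts return NULL rather than raising an error, so short arrays are handled gracefully.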

Test log:

hive> use test;
OK
Time taken: 1.17 seconds
hive> select region_code,new_param from  gcable_look_offline lateral view explode(service_param) gcable_look_offline as new_param;
Query ID = root_20211123165539_5b007f70-0826-495a-8b84-0702ae9c0ead
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1637634703570_0005, Tracking URL = http://hp1:8088/proxy/application_1637634703570_0005/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1637634703570_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2021-11-23 16:55:49,502 Stage-1 map = 0%,  reduce = 0%
2021-11-23 16:55:54,726 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.83 sec
MapReduce Total cumulative CPU time: 1 seconds 830 msec
Ended Job = job_1637634703570_0005
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.83 sec   HDFS Read: 6923 HDFS Write: 1530 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 830 msec
OK
3060    2021-09-13T08:58:45Z
3060    04
3060    3001
3060    茂名公共
3060    3001
3060    3001
3060    3001
3060    0
3060    1
3060    2021-09-13T08:58:45Z
3060    04
3060    1715
3060    卫生健康
3060    1715
3060    1715
3060    1715
3060    0
3060    1
3201    2021-09-13T08:58:45Z
3201    01
3201    com.suma.midware.dmp.play.PlayActivity
3201    直播全屏
3201    com.suma.midware.ihome.third.Main3DActivity
3201    首页
3201    2021-09-13T08:58:47Z
3201    01
3201    com.suma.midware.ihome.third.Main3DActivity
3201    首页
3201
3201
3201    2021-09-13T08:58:45Z
3201    03
3201    GDZX2220210107121454
3201    中国兄弟连12
3201    11003
3201    GDZX0920190517000004_CPBQ00006
3201    07
3201    2194
3201    1
3201    2813
3201    1
3201    0
3201    0
3220    2021-09-13T08:58:45Z
3220    09
3220    00700100
3220    01
3220    01
3220    file:///data/adv/pic/1602815006348.jpg
3220    null
Time taken: 17.069 seconds, Fetched: 50 row(s)
hive> select region_code,pos,new_param from  gcable_look_offline lateral view posexplode(service_param) gcable_look_offline as pos,new_param;
Query ID = root_20211123165607_ab382c18-7409-4527-a0b4-2eab00e0f348
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1637634703570_0006, Tracking URL = http://hp1:8088/proxy/application_1637634703570_0006/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1637634703570_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2021-11-23 16:56:14,368 Stage-1 map = 0%,  reduce = 0%
2021-11-23 16:56:20,564 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.84 sec
MapReduce Total cumulative CPU time: 1 seconds 840 msec
Ended Job = job_1637634703570_0006
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.84 sec   HDFS Read: 7075 HDFS Write: 1633 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 840 msec
OK
3060    0       2021-09-13T08:58:45Z
3060    1       04
3060    2       3001
3060    3       茂名公共
3060    4       3001
3060    5       3001
3060    6       3001
3060    7       0
3060    8       1
3060    0       2021-09-13T08:58:45Z
3060    1       04
3060    2       1715
3060    3       卫生健康
3060    4       1715
3060    5       1715
3060    6       1715
3060    7       0
3060    8       1
3201    0       2021-09-13T08:58:45Z
3201    1       01
3201    2       com.suma.midware.dmp.play.PlayActivity
3201    3       直播全屏
3201    4       com.suma.midware.ihome.third.Main3DActivity
3201    5       首页
3201    0       2021-09-13T08:58:47Z
3201    1       01
3201    2       com.suma.midware.ihome.third.Main3DActivity
3201    3       首页
3201    4
3201    5
3201    0       2021-09-13T08:58:45Z
3201    1       03
3201    2       GDZX2220210107121454
3201    3       中国兄弟连12
3201    4       11003
3201    5       GDZX0920190517000004_CPBQ00006
3201    6       07
3201    7       2194
3201    8       1
3201    9       2813
3201    10      1
3201    11      0
3201    12      0
3220    0       2021-09-13T08:58:45Z
3220    1       09
3220    2       00700100
3220    3       01
3220    4       01
3220    5       file:///data/adv/pic/1602815006348.jpg
3220    6       null
Time taken: 14.502 seconds, Fetched: 50 row(s)
hive> 
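If the new columns must instead be derived from the posexplode output above (for example, when downstream logic already consumes the exploded rows), the rows can be pivoted back into columns with a conditional aggregate. A sketch, assuming client_id plus region_code identifies a source row (with the sample data, where client_id repeats, a more specific key would be needed; the aliases elem0/elem1 are illustrative):

```sql
-- Pivot the exploded rows back: pos 0 and pos 1 are promoted to columns
-- (the grouping key here is illustrative, not guaranteed unique)
select client_id, region_code,
       max(case when pos = 0 then new_param end) as elem0,
       max(case when pos = 1 then new_param end) as elem1
from gcable_look_offline
     lateral view posexplode(service_param) t as pos, new_param
group by client_id, region_code;
```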

References:

  1. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)
  2. https://blog.csdn.net/yahahassr/article/details/97911676