1. The composite data type STRUCT. Fields inside a struct are accessed with dot (.) notation: if a table column c has type STRUCT<a:STRING, b:INT>, field a is read as c.a.
Syntax:
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ... >
1. Prepare test data
[root@node04 data]# cd /opt/software/data/
[root@node04 data]# vi struct_student
1,小红1,18-female-tianjin
2,小明,16-man-xizhang
3,豆豆,28-female1-lasha
4,小康,22-female-changsha
5,聪聪,21-man-dalian
6,阿萌,26-man-chongqing
2. Create the table. Top-level fields are separated by commas, and the items inside the struct column are separated by '-'.
create table student_contain_struct
(
id int,
name String,
info struct<age:int,sex:String,address:String>
)
row format delimited
fields terminated by ','
collection items terminated by '-';
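Conceptually, the two delimiter clauses give a two-level parse: each row is first split on ',' into top-level fields, and the info field is then split on '-' into the struct's items. A minimal Python sketch of that idea (parse_row is a hypothetical helper for illustration, not Hive code):

```python
# Two-level parse mirroring the table's delimiters:
# fields terminated by ',', collection items terminated by '-'.
def parse_row(line):
    id_, name, info = line.split(",")
    age, sex, address = info.split("-")
    return {"id": int(id_), "name": name,
            "info": {"age": int(age), "sex": sex, "address": address}}

row = parse_row("1,小红1,18-female-tianjin")
print(row["info"]["age"])  # 18, the equivalent of info.age in HiveQL
```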
3. Load the data
hive> load data local inpath '/opt/software/data/struct_student' into table student_contain_struct;
4. Query the data
hive> select * from student_contain_struct;
OK
1 小红1 {"age":18,"sex":"female","address":"tianjin"}
2 小明 {"age":16,"sex":"man","address":"xizhang"}
3 豆豆 {"age":28,"sex":"female1","address":"lasha"}
4 小康 {"age":22,"sex":"female","address":"changsha"}
5 聪聪 {"age":21,"sex":"man","address":"dalian"}
6 阿萌 {"age":26,"sex":"man","address":"chongqing"}
Time taken: 1.706 seconds, Fetched: 6 row(s)
hive> select id,name,info.age,info.address from student_contain_struct;
OK
1 小红1 18 tianjin
2 小明 16 xizhang
3 豆豆 28 lasha
4 小康 22 changsha
5 聪聪 21 dalian
6 阿萌 26 chongqing
Time taken: 0.323 seconds, Fetched: 6 row(s)
hive>
2. Word count
Loading text data into a table
1. Prepare test data
[root@node04 data]# vi wordcount
chongqing man terminated female1 collection address
student terminated female1 address
collection chongqing terminated address
2. Create the table. Because the lines are free-form and variable-length, use a SerDe to serialize and deserialize them into a single string column.
create table wordcount
(
line String
)
row format SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
3. Load the data into the table
hive> load data local inpath '/opt/software/data/wordcount' into table wordcount;
Loading data to table default.wordcount
OK
Time taken: 0.799 seconds
4. Query the loaded data
hive> select * from wordcount;
OK
chongqing man terminated female1 collection address
student terminated female1 address
collection chongqing terminated address
Time taken: 1.798 seconds, Fetched: 4 row(s)
hive>
Method 1: transform the data in the table: split each line into an array with split(), then turn the array elements into rows with the built-in explode() function.
1. Split into an array. The delimiter passed to split() is a Java regular expression; common patterns: \t tab, \n newline, \r carriage return, \s any whitespace. The [""] row below comes from a blank line at the end of the file (which is also why earlier queries fetched 4 rows).
hive> select split(line,' ') as word from wordcount;
OK
["chongqing","man","terminated","female1","collection","address"]
["student","terminated","female1","address"]
["collection","chongqing","terminated","address"]
[""]
Time taken: 0.208 seconds, Fetched: 4 row(s)
hive>
2. Create the result table
create table wordcount_result
(
word String,
count int
);
3. Insert the query results into the table
from (select explode(split(line,' ')) word from wordcount) t
insert into wordcount_result select word,count(word) group by word;
hive> from (select explode(split(line,' ')) word from wordcount) t
> insert into wordcount_result select word,count(word) group by word;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210627184212_5e3927fb-c5cd-4627-b5d6-e416a2963b4c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1624790236634_0002, Tracking URL = http://node03:8088/proxy/application_1624790236634_0002/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job -kill job_1624790236634_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-06-27 18:42:24,131 Stage-1 map = 0%, reduce = 0%
2021-06-27 18:42:35,416 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.18 sec
2021-06-27 18:42:45,851 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.31 sec
MapReduce Total cumulative CPU time: 4 seconds 310 msec
Ended Job = job_1624790236634_0002
Loading data to table default.wordcount_result
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.31 sec HDFS Read: 9502 HDFS Write: 157 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 310 msec
OK
Time taken: 35.242 seconds
hive>
4. Query the inserted results
hive> select * from wordcount_result;
OK
1
address 3
chongqing 2
collection 2
female1 2
man 1
student 1
terminated 3
Time taken: 0.231 seconds, Fetched: 8 row(s)
hive>
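The explode-then-aggregate pipeline above is the classic word count. As a sanity check on the result table, the same logic in plain Python, with the file's trailing blank line included to show where the empty-word row with count 1 comes from:

```python
from collections import Counter

lines = [
    "chongqing man terminated female1 collection address",
    "student terminated female1 address",
    "collection chongqing terminated address",
    "",  # the file's trailing blank line
]
# explode(split(line, ' ')): one output row per token
words = [w for line in lines for w in line.split(" ")]
# group by word / count(word)
counts = Counter(words)
for word in sorted(counts):
    print(word, counts[word])
```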
Method 2: a nested subquery, with no intermediate result table. The ORDER BY forces an extra single-reducer MapReduce stage for the global sort, which is why this run launches two jobs.
select word,count(word) from
(select explode(split(line,' ')) as word from wordcount) ws group by word order by word;
hive> select word,count(word) from
> (select explode(split(line,' ')) as word from wordcount) ws group by word order by word;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210627184513_6c9f0c96-6121-4698-a8d8-5e6281aed6a4
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1624790236634_0003, Tracking URL = http://node03:8088/proxy/application_1624790236634_0003/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job -kill job_1624790236634_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-06-27 18:45:23,900 Stage-1 map = 0%, reduce = 0%
2021-06-27 18:45:35,231 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.93 sec
2021-06-27 18:45:43,430 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.46 sec
MapReduce Total cumulative CPU time: 3 seconds 460 msec
Ended Job = job_1624790236634_0003
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1624790236634_0004, Tracking URL = http://node03:8088/proxy/application_1624790236634_0004/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job -kill job_1624790236634_0004
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2021-06-27 18:45:59,325 Stage-2 map = 0%, reduce = 0%
2021-06-27 18:46:07,522 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.27 sec
2021-06-27 18:46:15,720 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.66 sec
MapReduce Total cumulative CPU time: 2 seconds 660 msec
Ended Job = job_1624790236634_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.46 sec HDFS Read: 8127 HDFS Write: 301 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.66 sec HDFS Read: 5742 HDFS Write: 260 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 120 msec
OK
1
address 3
chongqing 2
collection 2
female1 2
man 1
student 1
terminated 3
Time taken: 63.494 seconds, Fetched: 8 row(s)
hive>