1. The composite data type STRUCT. Fields inside a struct are accessed with dot (.) notation: if a table column c has type STRUCT<a:STRING, b:INT>, field a is read as c.a.
Syntax:
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ... >
1. Prepare test data
[root@node04 data]# cd /opt/software/data/
[root@node04 data]# vi struct_student
1,小红1,18-female-tianjin
2,小明,16-man-xizhang
3,豆豆,28-female1-lasha
4,小康,22-female-changsha
5,聪聪,21-man-dalian
6,阿萌,26-man-chongqing
2. Create the table. Top-level fields are separated by commas, and the items inside the struct column are separated by '-'.
create table student_contain_struct
(
id int,
name String,
info struct<age:int,sex:String,address:String>
)
row format delimited
fields terminated by ','
collection items terminated by '-';
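Conceptually, the two delimiter clauses give a two-level parse: each row is first split on ',' into top-level fields, and the info field is then split on '-' into the struct's items. A minimal Python sketch of that idea (parse_row is a hypothetical helper for illustration, not Hive code):

```python
# Two-level parse mirroring the table's delimiters:
# fields terminated by ',', collection items terminated by '-'.
def parse_row(line):
    id_, name, info = line.split(",")
    age, sex, address = info.split("-")
    return {"id": int(id_), "name": name,
            "info": {"age": int(age), "sex": sex, "address": address}}

row = parse_row("1,小红1,18-female-tianjin")
print(row["info"]["age"])  # 18, the equivalent of info.age in HiveQL
```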
3. Load the data
hive> load data local inpath '/opt/software/data/struct_student' into table student_contain_struct;
4. Query the data
hive> select * from student_contain_struct;
OK
1 小红1 {"age":18,"sex":"female","address":"tianjin"}
2 小明 {"age":16,"sex":"man","address":"xizhang"}
3 豆豆 {"age":28,"sex":"female1","address":"lasha"}
4 小康 {"age":22,"sex":"female","address":"changsha"}
5 聪聪 {"age":21,"sex":"man","address":"dalian"}
6 阿萌 {"age":26,"sex":"man","address":"chongqing"}
Time taken: 1.706 seconds, Fetched: 6 row(s)
hive> select id,name,info.age,info.address from student_contain_struct;
OK
1 小红1 18 tianjin
2 小明 16 xizhang
3 豆豆 28 lasha
4 小康 22 changsha
5 聪聪 21 dalian
6 阿萌 26 chongqing
Time taken: 0.323 seconds, Fetched: 6 row(s)
hive>
2. Word count
Loading text data into a table
1. Prepare test data
[root@node04 data]# vi wordcount
chongqing man terminated female1 collection address
student terminated female1 address
collection chongqing terminated address
2. Create the table. Because the lines are free-form and variable-length, use a SerDe to serialize and deserialize them into a single string column.
create table wordcount
(
line String
)
row format SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
3. Load the data into the table
hive> load data local inpath '/opt/software/data/wordcount' into table wordcount;
Loading data to table default.wordcount
OK
Time taken: 0.799 seconds
4. Query the loaded data
hive> select * from wordcount;
OK
chongqing man terminated female1 collection address
student terminated female1 address
collection chongqing terminated address
Time taken: 1.798 seconds, Fetched: 4 row(s)
hive>
Method 1: transform the data in the table: split each line into an array with split(), then turn the array elements into rows with the built-in explode() function.
1. Split into an array. The delimiter passed to split() is a Java regular expression; common patterns: \t tab, \n newline, \r carriage return, \s any whitespace. The [""] row below comes from a blank line at the end of the file (which is also why earlier queries fetched 4 rows).
hive> select split(line,' ') as word from wordcount;
OK
["chongqing","man","terminated","female1","collection","address"]
["student","terminated","female1","address"]
["collection","chongqing","terminated","address"]
[""]
Time taken: 0.208 seconds, Fetched: 4 row(s)
hive>
2. Create the result table
create table wordcount_result
(
word String,
count int
);
3. Insert the query results into the table
from (select explode(split(line,' ')) word from wordcount) t
insert into wordcount_result select word,count(word) group by word;
hive> from (select explode(split(line,' ')) word from wordcount) t
> insert into wordcount_result select word,count(word) group by word;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210627184212_5e3927fb-c5cd-4627-b5d6-e416a2963b4c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1624790236634_0002, Tracking URL = http://node03:8088/proxy/application_1624790236634_0002/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job -kill job_1624790236634_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-06-27 18:42:24,131 Stage-1 map = 0%, reduce = 0%
2021-06-27 18:42:35,416 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.18 sec
2021-06-27 18:42:45,851 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.31 sec
MapReduce Total cumulative CPU time: 4 seconds 310 msec
Ended Job = job_1624790236634_0002
Loading data to table default.wordcount_result
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.31 sec HDFS Read: 9502 HDFS Write: 157 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 310 msec
OK
Time taken: 35.242 seconds
hive>
4. Query the inserted results
hive> select * from wordcount_result;
OK
1
address 3
chongqing 2
collection 2
female1 2
man 1
student 1
terminated 3
Time taken: 0.231 seconds, Fetched: 8 row(s)
hive>
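The explode-then-aggregate pipeline above is the classic word count. As a sanity check on the result table, the same logic in plain Python, with the file's trailing blank line included to show where the empty-word row with count 1 comes from:

```python
from collections import Counter

lines = [
    "chongqing man terminated female1 collection address",
    "student terminated female1 address",
    "collection chongqing terminated address",
    "",  # the file's trailing blank line
]
# explode(split(line, ' ')): one output row per token
words = [w for line in lines for w in line.split(" ")]
# group by word / count(word)
counts = Counter(words)
for word in sorted(counts):
    print(word, counts[word])
```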
Method 2: a nested subquery, with no intermediate result table. The ORDER BY forces an extra single-reducer MapReduce stage for the global sort, which is why this run launches two jobs.
select word,count(word) from
(select explode(split(line,' ')) as word from wordcount) ws group by word order by word;
hive> select word,count(word) from
> (select explode(split(line,' ')) as word from wordcount) ws group by word order by word;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210627184513_6c9f0c96-6121-4698-a8d8-5e6281aed6a4
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1624790236634_0003, Tracking URL = http://node03:8088/proxy/application_1624790236634_0003/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job -kill job_1624790236634_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-06-27 18:45:23,900 Stage-1 map = 0%, reduce = 0%
2021-06-27 18:45:35,231 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.93 sec
2021-06-27 18:45:43,430 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.46 sec
MapReduce Total cumulative CPU time: 3 seconds 460 msec
Ended Job = job_1624790236634_0003
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1624790236634_0004, Tracking URL = http://node03:8088/proxy/application_1624790236634_0004/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job -kill job_1624790236634_0004
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2021-06-27 18:45:59,325 Stage-2 map = 0%, reduce = 0%
2021-06-27 18:46:07,522 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.27 sec
2021-06-27 18:46:15,720 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.66 sec
MapReduce Total cumulative CPU time: 2 seconds 660 msec
Ended Job = job_1624790236634_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.46 sec HDFS Read: 8127 HDFS Write: 301 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.66 sec HDFS Read: 5742 HDFS Write: 260 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 120 msec
OK
1
address 3
chongqing 2
collection 2
female1 2
man 1
student 1
terminated 3
Time taken: 63.494 seconds, Fetched: 8 row(s)
hive>