Hive Data Types

黑夜中奔跑

Article link: https://blog.csdn.net/ysy_1_2/article/details/106426062

1. Primitive data types
[Figure: table of Hive primitive data types]
2. Complex data types

I. The array type
Case 1. Source data:
[Figure: sample source data]
Create the table:

create external table ex(vals array<int>) row format delimited fields terminated by '\t'
collection items terminated by ',' location '/ex';

Load the data:

load data local inpath '/usr/wenjian/array1.txt' overwrite into table ex;

Query the number of elements in each row's array:

select size(vals) from ex;

Query the first element of each row's array:

select vals[0] from ex;

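Note: individual elements are read by zero-based index, as in vals[0], and the built-in array_contains function tests whether an array holds a given value. More elaborate lookups, such as finding the position of a value, may need a custom function, but such requirements are rare in practice.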
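As a small sketch of element access on the ex table above: the index 1 and the value 3 below are arbitrary illustrative choices, not values taken from the original data.

```sql
-- Second element of each row's array (indexes are zero-based).
select vals[1] from ex;

-- Rows whose array contains the value 3 (3 is an arbitrary example value).
select vals from ex where array_contains(vals, 3);
```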

Case 2. Source data:
[Figure: sample source data]
Create the table:

 create external table ex1(info1 array<int>,info2 array<string>) row format delimited fields terminated by '\t' collection items terminated by ',' location '/ex';

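When a table has two array columns, each column is measured and indexed independently. A minimal sketch, assuming the ex1 table above has been created and its data is in place:

```sql
-- Element counts of both arrays, plus the first string of info2.
select size(info1), size(info2), info2[0] from ex1;
```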

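II. The map type
Case 1. Source data:
[Figure: sample source data]
Create the table.
With a map column, use '\t' as the field delimiter so that it cannot clash with the map key delimiter.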

 create external table m1(vals map<string,int>) row format delimited fields terminated by '\t' map keys terminated by ',' location '/map';

Load map.txt into the m1 table.
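Once the data is loaded, the map column can be inspected with Hive's built-in collection functions. A brief sketch against the m1 table above:

```sql
-- Keys, values and number of entries of each row's map.
select map_keys(vals), map_values(vals), size(vals) from m1;
```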

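Case 2: find which websites the person tom visited, omitting null values.
Source data (space-delimited):
[Figure: sample source data]
Create the table.
Note: with a map column, the field delimiter must be '\t' here, so that it does not clash with the space that separates key and value.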

create external table ex3(vals map<string,string>) row format delimited fields terminated by '\t' map keys terminated by ' ' location '/ex3';

Load the file and run the query:
[Figure: load and query statements]
To remove duplicates from the result, use the distinct keyword.
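The original query survives only as a screenshot. A query that fits the stated requirement could look like the sketch below, assuming each row's map uses the person's name as the key and the website as the value:

```sql
-- Websites visited by tom; rows belonging to other people yield NULL and are filtered out.
select vals['tom'] from ex3 where vals['tom'] is not null;

-- The same, with duplicate websites removed.
select distinct vals['tom'] from ex3 where vals['tom'] is not null;
```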

III. The struct type (object-like type)
Source data:
[Figure: sample source data]
Create the table:

create external table ex4(vals struct<name:string,age:int>) row format delimited collection items terminated by ' ' location '/ex4';

Load the data.
Query the data:
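The query itself is missing from the source. Struct fields are read with dot notation, so a query along these lines should work against ex4; the age threshold of 20 is an arbitrary illustrative value:

```sql
-- Read both fields of the struct.
select vals.name, vals.age from ex4;

-- Filter on a struct field (20 is an arbitrary example threshold).
select vals.name from ex4 where vals.age > 20;
```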

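IV. Hive collect_set
collect_set is an aggregate function that gathers the values of a column into an array, dropping duplicates.

Case 1: deduplicating values
[Figure: sample source data]
Steps:
Create the external table:
Run: create external table ex5(num int) location '/cset';
Call the collect_set function.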
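The call itself is not shown in the source. Against the ex5 table defined above it would presumably be:

```sql
-- One result row: an array holding each distinct num exactly once.
select collect_set(num) from ex5;
```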

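Case 2: deduplicating grouped data
Source data (name and IP are separated by a space):
[Figure: sample source data]
Requirement: list the IP addresses each person visited, with duplicates removed.
Steps: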

 create external table ex1(name string,ip string) row format delimited fields terminated by ' ' location '/cset';

Call collect_set together with a group by clause.
Run: select name,collect_set(ip) from ex1 group by name;
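For contrast, collect_list (available in reasonably recent Hive versions) aggregates the same way but keeps duplicates, which makes the deduplicating effect of collect_set easy to see side by side. A sketch on the same ex1 table:

```sql
-- collect_list keeps repeated ips; collect_set keeps each ip once.
select name, collect_list(ip), collect_set(ip) from ex1 group by name;
```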

Case 3: count how many distinct IP addresses each person visited
Run: select name,size(ip) from (select name,collect_set(ip) ip from ex1 group by name) t;
The same result without a subquery: select name,size(collect_set(ip)) from ex1 group by name;
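If only the count is needed and not the array itself, a plain distinct count is an alternative worth knowing. This sketch swaps collect_set for count(distinct ...) on the same ex1 table:

```sql
-- Number of distinct ips per person, without building an array first.
select name, count(distinct ip) from ex1 group by name;
```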
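V. Hive explode
explode is a table-generating function that turns one row into several: it takes an array (here produced by split) and emits one row per element.
Case 1: using split to define the splitting rule
Given the following data:
[Figure: two rows of comma-separated numbers]
Split the two rows above on commas into multiple rows (one number per row).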

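Steps
1. Prepare the source data.
2. Upload it to HDFS and create the corresponding external table.
Run: create external table ex5(num string) location '/ex';
Note: when splitting with explode this way, the table has a single column and that column is of type string, because split only operates on strings.
Upload the file.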

3. Split the rows with explode.
Run: select explode(split(num,',')) from ex5;
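A bare explode in the select list cannot be combined with other columns. When the original row is needed next to each exploded value, Hive's lateral view is the usual tool. A sketch on the same ex5 table; the aliases t and n are illustrative names, not from the original:

```sql
-- Each output row pairs the original string with one of its numbers.
select num, n
from ex5
lateral view explode(split(num, ',')) t as n;
```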

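VI. Word count in Hive

Source data in 1.txt (tab-delimited):
[Figure: sample source data]
Steps:
1. Create an external table over the data (if the data is not already on HDFS, a managed table can be used instead).
Run: create external table textlines(text string) location '/word';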

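2. For the word count we currently have a single column, and each row holds data of the form: hello world. We therefore need a second table that stores the split data, in this form:
hello
world
hello
hadoop
…
So first create that table:
[Figure: create statement for the words table]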
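The create statement was only a screenshot in the original. Judging by the table and column names used in the following steps (words and word), it was presumably close to this sketch:

```sql
-- Managed table with one string column, one word per row.
create table words(word string);
```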
3. Split the text column of textlines and write the resulting words into the words table.
Run: insert overwrite table words select explode(split(text,'[ \t]+')) word from textlines;

4. Count the words with group by.
Run: select word,count(*) from words group by word;
The final result:
[Figure: query result]
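Steps 2 to 4 can also be folded into a single query by using the split result as a subquery, which avoids the intermediate words table. A sketch under the same table definition for textlines; the aliases w and cnt are illustrative names:

```sql
-- Split each line into words, then count occurrences per word.
select word, count(*) as cnt
from (
  select explode(split(text, '[ \t]+')) as word
  from textlines
) w
group by word
order by cnt desc;
```

The order by clause is optional and only puts the most frequent words first.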
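5. Conclusion:
As this shows, Hive offers no advantage for word counts on small data: the whole process takes on the order of a few minutes. Hive is meant for statistics over large data sets.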