Hive教程（二）

最新推荐文章于 2023-01-10 17:48:48 发布

辛聪明

最新推荐文章于 2023-01-10 17:48:48 发布

阅读量1.2k

点赞数 3

分类专栏： Hadoop生态圈文章标签： hive

本文链接：https://blog.csdn.net/qq_41861558/article/details/102620437

版权

Hadoop生态圈专栏收录该内容

29 篇文章 0 订阅

订阅专栏

一、hive文件格式

存储文本文件的表：

create table t_access_text(ip string,url string,access_time string)
row format delimited fields terminated by ','
stored as textfile;

存储sequence file文件的表：

create table t_access_seq(ip string,url string,access_time string)
stored as sequencefile;

存储parquet file文件的表：

create table t_access_parq(ip string,url string,access_time string)
stored as parquetfile;

二、数据类型

1、整形：

整型数据可以指定使用整型数据类型，INT。当数据范围超过INT的范围，需要使用BIGINT，如果数据范围比INT小，使用SMALLINT。 TINYINT比SMALLINT小。

类型	后缀	示例
TINYINT	Y	10Y
SMALLINT	S	10S
INT	-	10
BIGINT	L	10L

create table t_test(a string ,b int,c bigint,d float,e double,f tinyint,g smallint)

2、字符串类型

字符串类型的数据类型可以使用单引号('')或双引号(“”)来指定。它包含两个数据类型：VARCHAR和CHAR。

3、日期

DATE值在年/月/日的格式形式描述 {{YYYY-MM-DD}}.

4、时间戳：

它支持传统的UNIX时间戳可选纳秒的精度。它支持的java.sql.Timestamp格式“YYYY-MM-DD HH:MM:SS.fffffffff”和格式“YYYY-MM-DD HH:MM:ss.ffffffffff”。

5、小数点
在Hive 小数类型与Java大十进制格式相同。它是用于表示不可改变任意精度。语法和示例如下：

DECIMAL(precision, scale)
decimal(10,0)

6、联合类型

联合是异类的数据类型的集合。可以使用联合创建的一个实例。语法和示例如下：

{index：数据类型} 前一个数表示uniontype定义数据类型的下标

UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1} 
{1:2.0} 
{2:["three","four"]} 
{3:{"a":5,"b":"five"}} 
{2:["six","seven"]} 
{3:{"a":8,"b":"eight"}} 
{0:9} 
{1:10.0}

7、浮点类型：有小数点的数字。通常，这种类型的数据组成DOUBLE数据类型。

8、复杂类型

（1）数组：

假如有以下数据，actor主演属性用arry数组比较方便（相同数据类型<>）

战狼2,吴京:吴刚:龙母,2017-08-16
三生三世十里桃花,刘亦菲:痒痒,2017-08-20

 ARRAY<data_type>

//建表
create table t_movie(moive_name string,actors array<string>,first_show date)
row format delimited fields terminated by ','
collection items terminated by ':';

//查询
select moive_name,actors[0] from t_movie;
select moive_name,actors from t_movie where array_contains(actors,'吴刚');       
select moive_name,size(actors) from t_movie;

（2）映射

假如有以下数据，家庭成员用map类型比较合适

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

MAP<primitive_type, data_type>

//建表语句：
create table t_person(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

//查询
## 取map字段的指定key的值
select id,name,family_members['father'] as father from t_person;

## 取map字段的所有key
select id,name,map_keys(family_members) as relation from t_person;

## 取map字段的所有value
select id,name,map_values(family_members) from t_person;
select id,name,map_values(family_members)[0] from t_person;

（3）结构体

假如有以下数据：个人信息包括整形：字符串：字符串，用一个类型可以使用struct

1,zhangsan,18:male:beijing
2,lisi,28:female:shanghai

STRUCT<col_name : data_type [COMMENT col_comment], ...>

//建表：
create table t_person_struct(id int,name string,info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';

//查询
select * from t_person_struct;
select id,name,info.age from t_person_struct;

三、修改表定义

修改表名：

ALTER TABLE table_name RENAME TO new_table_name

修改分区名：

alter table t_partition partition(department='xiangsheng',sex='male',howold=20) rename to partition(department='1',sex='1',howold=20);

添加分区

alter table t_partition add partition (department='2',sex='0',howold=40);

删除分区

alter table t_partition drop partition (department='2',sex='2',howold=24);

修改表的文本格式定义

ALTER TABLE table_name [PARTITION partitionSpec] SET FILEFORMAT file_format

修改表列名定义

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENTcol_comment] [FIRST|(AFTER column_name)]  

alter table t_user change price jiage float first;

增加替换列

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type[COMMENT col_comment], ...)  

alter table t_user add columns (sex string,addr string);
alter table t_user replace columns (id string,age int,price float);

四、hive查询语法

1、本地查询

对数据量比较小的操作，就可以在本地执行，这样要比提交任务到集群执行效率要快很多，开启Hive的本地模式：

hive> set hive.exec.mode.local.auto=true;(默认为false)

当一个job满足如下条件才能真正使用本地模式：
1.job的输入数据大小必须小于参数：hive.exec.mode.local.auto.inputbytes.max(默认128MB)
2.job的map数必须小于参数：hive.exec.mode.local.auto.tasks.max(默认4)
3.job的reduce数必须为0或者1

2、条件查询

select * from t_access where access_time<'2017-08-06 15:30:20'

select * from t_access where access_time<'2017-08-06 16:30:20' and ip>'192.168.33.3';

3、聚合函数

1）求总行数（count） 在没有group by 和over()情况下，不可与纯字段一起查询

hive (default)> select count(*) cnt from emp;

2）求工资的最大值（max）

hive (default)> select max(sal) max_sal from emp;

3）求工资的最小值（min）

hive (default)> select min(sal) min_sal from emp;

4）求工资的总和（sum）

hive (default)> select sum(sal) sum_sal from emp;

5）求工资的平均值（avg）

hive (default)> select avg(sal) avg_sal from emp;

运算符：

1、A<=>B        A=B，且AB为null时也返回true，任一为null时返回null
select 123 where null <=> null;

2、 A <> B, A != B
select 8 where 8 <> 7;

> < >= ....

3、A [NOT] BETWEEN B AND C       C>A>B   其中B<C
select 8 where 8 between 5 and 9;  

4、is null    is not null

5、 in 匹配列表中任一满足即可
select 888 where 'li' in ('zhang','san','li');

6、A [NOT] LIKE B      % 代表零个或多个字符(任意个字符)  _ 代表一个字符。
select 888 where 'li' like 'li%';

7、A RLIKE B, A REGEXP B    其中B是JAVA的正则表达式
select * from emp where sal RLIKE '[2]';       #查找薪水中带2的数据

8、逻辑运算符  and or not

4、limit子句

hive (default)> select * from emp limit 5;

5、分组查询 group by

一旦有group by子句，那么，在select子句中就不能有（分组字段，聚合函数）以外的字段(select后只能接分组字段聚合函数)

//计算 emp 表每个部门的平均工资

hive (default)> select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;

//计算 emp 每个部门中每个岗位的最高薪水

hive (default)> select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno,t.job;

select dt,count(*),max(ip) as cnt from t_access group by dt having dt>'20170804';

Having 语句
1．having 与 where 不同点
（1）where 针对表中的列发挥作用，查询数据；having 针对查询结果中的列发挥作用，
筛选数据。
（2）where 后面不能写聚合函数，而 having 后面可以使用聚合函数。
（3）having 只用于 group by 分组统计语句。

6、Join查询只支持等值连接，不支持非等值连接(连连接谓词持中不支持 or)

（1）inner join（join）

内连接：只有进行连接的两个表中都存在与连接条件相匹配的数据才会被保留下来。

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
join t_b b
on a.name=b.name

（2）left outer join（left join）

左外连接：JOIN 操作符左边表中符合 WHERE 子句的所有记录将会被返回。

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
left outer join t_b b
on a.name=b.name

（3）right outer join（right join）

右外连接：JOIN 操作符右边表中符合 WHERE 子句的所有记录将会被返回

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
right outer join t_b b
on a.name=b.name

（4）full outer join（full join）

满外连接：将会返回所有表中符合 WHERE 语句条件的所有记录。如果任一表的指定
字段没有符合条件的值的话，那么就使用 NULL 值替代。

select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
full join t_b b
on a.name=b.name;

结果：

（5）left semi join

hive中不支持exist/IN子查询，可以用left semi join来实现同样的效果：

select 
a.name as aname,
a.numb as anumb
from t_a a
left semi join t_b b
on a.name=b.name;

left semi join的 select子句中，不能有右表的字段

(6)多表连接

注意：连接 n 个表，至少需要 n-1 个连接条件。例如：连接三个表，至少需要两个连接条件。

SELECT e.ename, d.deptno, l.loc_name
FROM emp e
JOIN dept d
ON d.deptno = e.deptno
JOIN location l
ON d.loc = l.loc;

大多数情况下，Hive 会对每对 JOIN 连接对象启动一个 MapReduce 任务。本例中会首先启动一个 MapReduce job 对表 e 和表 d 进行连接操作，然后会再启动一个 MapReduce job将第一个 MapReduce job 的输出和表 l;进行连接操作。注意：为什么不是表 d 和表 l 先进行连接操作呢？这是因为 Hive 总是按照从左到右的顺序执行的。

（7）笛卡尔积

笛卡尔集会在下面条件下产生
（1）省略连接条件
（2）连接条件无效
（3）所有表中的所有行互相连接

hive (default)> select empno, dname from emp, dept;

7、子查询

select id,name,father 
from 
(select id,name,family_members['brother'] as father from t_person) tmp
where father is not null;

8、排序

（1）全局排序 Order By （asc 升序，默认 desc降序）

注意：Order By：全局排序，一个 Reducer

Sort By：每个 Reducer 内部进行排序，对全局结果集来说不是排序。

#按照员工薪水的 2 倍排序
select ename, sal*2 twosal from emp order by twosal;

#多个列排序,按照部门和工资升序排序
select ename, deptno, sal from emp order by deptno,sal ;

（2）分区排序（Distribute By ）

Distribute By：类似 MR 中 partition，进行分区，结合 sort by 使用。

注意，Hive 要求 DISTRIBUTE BY 语句要写在 SORT BY 语句之前。对于 distribute by 进行测试，一定要分配多 reduce 进行处理，否则无法看到 distribute by的效果。

group by只可以查询聚合函数和分组列，distribute by可以查看每个区（不是组）所有数据

案例：先按照部门编号分区，再按照员工编号降序排序

hive (default)> set mapreduce.job.reduces=3;
hive (default)> insert  overwrite  local  directory
'/opt/module/datas/distribute-result'  select  *  from  emp
distribute by deptno sort by empno desc;

select * from dept distribute by groupid sort by deptno asc;

（3）Cluster By排序

当 distribute by 和 sorts by 字段相同时，可以使用 cluster by 方式。cluster by 除了具有 distribute by 的功能外还兼具 sort by 的功能。但是排序只能是升序排序，不能指定排序规则为 ASC 或者 DESC。

以下两种写法等价

hive (default)> select * from emp cluster by deptno;
hive (default)> select * from emp distribute by deptno sort by deptno;

注意：按照部门编号分区，不一定就是固定死的数值，可以是 20 号和 30 号部门分到一个分区里面去。

9、分桶及抽样查询

(1) 分桶表数据存储

分区针对的是数据的存储路径；分桶针对的是数据文件。

也就是说，分区是表内在分几个路径，分桶是表内分几个数据文件，分区提供一个隔离数据和优化查询的便利方式。不过，并非所有的数据集都可形成合理的分区，特别是之前所提到过的要确定合适的划分大小这个疑虑。分桶是将数据集分解成更容易管理的若干部分的另一个技术。

#创建分桶表
create table stu_buck(id int, name string)
clustered by(id) into 4 buckets
row format delimited fields terminated by '\t';

#设置分桶属性
hive (default)> set hive.enforce.bucketing=true;
hive (default)> set mapreduce.job.reduces=-1;
hive (default)> insert into table stu_buck
select id, name from stu;

（2）分桶抽样查询

对于非常大的数据集，有时用户需要使用的是一个具有代表性的查询结果而不是全部结果。Hive 可以通过对表进行抽样来满足这个需求。

select * from stu_buck tablesample(bucket 1 out of 4 on id);

注：tablesample 是抽样语句，语法：TABLESAMPLE(BUCKET x OUT OF y) 。y 必须是 table 总 bucket 数的倍数或者因子。hive 根据 y 的大小，决定抽样的比例。例如，table 总共分了 4 份，当 y=2 时，抽取(4/2=)2 个 bucket 的数据，当 y=8 时，抽取(4/8=)1/2个 bucket 的数据。

x 表示从哪个 bucket 开始抽取，如果需要取多个bucket ，以后的桶号为x+y的倍数。例如，table 总 bucket 数为 4，tablesample(bucket 1 out of 2)，表示总共抽取（4/2=）2 个bucket 的数据，抽取第 1个和第 3个 bucket 的数据。

注意：x 的值必须小于等于 y 的值，因为查询最后一个桶是(Z/Y-1)Y+X即x-y+z，如果x>y，则大于总桶数，否则报错：
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck

10、其他常用查询函数

（1）NVL：给值为 NULL 的数据赋值，它的格式是 NVL( string1, replace_with)。

#如果员工的 job 为 NULL，则用 1 代替
select id,job,nvl(job,1) from stu_bucket3;

#如果员工的 job 为 NULL，则用 id 代替
select id,job,nvl(job,id) from stu_bucket3;

（2）时间类

#格式化时间查询  要格式化的时间必须用“-”分隔开yyyy-MM-dd
select date_format('2019-06-29','yyyy/MM/dd');
OK
_c0
2019/06/29

#转换当前时间
select date_format(current_date,'yyyy.MM.dd');

#date_add:时间跟天数相加
hive (default)> select date_add('2019-06-29',5);
OK
_c0
2019-07-04
hive (default)> select date_add('2019-06-29',-5);
OK
_c0
2019-06-24
select date_format(date_add('2020-04-10',2),'yyyy,MM,dd');
2020,04,12

#date_sub:时间跟天数相减
hive (default)> select date_sub('2019-06-29',5);
OK
_c0
2019-06-24
hive (default)> select date_sub('2019-06-29 12:12:12',5);
OK
_c0
2019-06-24
hive (default)> select date_sub('2019-06-29',-5);
OK
_c0
2019-07-04

#datediff:两个时间相减
hive (school)> select datediff('2020-04-10','2019-04-10');
OK
_c0
366
hive (school)> select datediff('2018-04-10','2019-04-10');
OK
_c0
-365

（3）case when

#职业为student显示1，其他为0
select id,name,score,job,case job when 'student' then 1 else 0 end as test from stu_bucket3;

#求出不同部门男女各多少人。结果如下：
select
dept_id,
sum(case sex when '男' then 1 else 0 end) male_count,
sum(case sex when '女' then 1 else 0 end) female_count
from
emp_sex
group by
dept_id;

五、hive函数

1、类型转换函数 cast

select cast("5" as int)；
select cast("2017-08-03" as date) ;
select cast(current_timestamp as date);

2、数学运算函数

select round(5.4);   ## 5   四舍五入
select round(5.1345,3);  ##5.135     四舍五入保留三位有效数字
select ceil(5.4); // select ceiling(5.4);   ## 6 大于5.4最小整数 
select floor(5.4);  ## 5     小于5.4最大整数
select abs(-5.4);  ## 5.4    绝对值
select greatest(3,5,6);  ## 6   最大的
select least(3,5,6);    ##3  最小

select max(age) from t_person;    聚合函数
select min(age) from t_person;    聚合函数

3、字符串函数

(1)截取

substr(string, int start)   ## 截取子串
substring(string, int start)     ##从start位置开始，0 1表示第一个位置
示例：select substr("abcdefg",2);   --> bcdefg


substr(string, int start, int len) 
substring(string, int start, int len)
示例：select substr("abcdefg",2,3);     -->bcd

（2）拼接

## 拼接字符串
concat(string A, string B...)  
hive (school)> select concat('i',' love',' you');
i love you

## 拼接字符串，每个字符串中间加分隔符
concat_ws(string SEP, string A, string B...)
select concat_ws('*','i','love','you');
i*love*you

## COLLECT_SET(col)：函数只接受基本数据类型，它的主要作用是将某字段的值进行去
重汇总，产生 array 类型字段。
select collect_set(score) from stu_bucket3;
[46.0,87.9,30.0,59.0,90.3,100.0,97.0]

select concat_ws('|',collect_set(name)) from stu_bucket3;
wuli|xixi|chen|xiaohong|xin|song|wang

（3）长度

length(string A)
示例：select length("192.168.33.44")

（4）切分

split(string str, string pat)

select split("192.168.33.44","\\.");   ##\\.对.进行转义

（5）大小写

select upper(string str) ##转大写
lower()      #转小写

4、时间函数

获取当前时间

select current_timestamp;          -->2019-10-21 12:26:43.872
select current_date;               -->2019-10-21

获取当前时间的毫秒数时间戳

select unix_timestamp();

unix时间戳转字符串

from_unixtime(bigint unixtime[, string format])

示例：select from_unixtime(unix_timestamp());
select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");   ##规定格式

字符串转unix时间戳

unix_timestamp(string date, string pattern)

示例： select unix_timestamp("2017-08-10 17:50:30");

select unix_timestamp("2017/08/10 17:50:30","yyyy/MM/dd HH:mm:ss");
## 将字符串转成日期date
select to_date("2017-09-17 16:58:32");

5、表生成函数

（1）使用explode()对数组字段“炸裂”

select explode(subject) from student;

炸裂去重

select distinct tmp.sub
from 
(select explode(subjects) as sub from t_stu_subject) tmp;

（2）表生成函数 lateral view

select id,name,tmp.sub 
from score lateral view explode(subject) tmp as sub;

理解： lateral view 相当于两个表在join

左表：是原表

右表：是explode(某个集合字段)之后产生的表

而且：这个join只在同一行的数据间进行

6、集合函数

（1）array_contains(Array<T>, value) 返回boolean值示例：

select moive_name,array_contains(actors,'吴刚') from t_movie;
select array_contains(array('a','b','c'),'c');

（2）sort_array(Array<T>) 返回排序后的数组

select sort_array(array('c','b','a')); -->"a","b","c"

（3）size(Array<T>) 返回一个int值

select moive_name,size(actors) as actor_number from t_movie;
-->返回每一行数组的长度

size(Map<K.V>)  返回一个int值
map_keys(Map<K.V>)  返回一个数组
map_values(Map<K.V>) 返回一个数组

7、CASE WHEN 函数

语法：
CASE   [ expression ]
       WHEN condition1 THEN result1
       WHEN condition2 THEN result2
       ...
       WHEN conditionn THEN resultn
       ELSE result
END

示例：
select id,name,
case
when age<28 then 'youngth'
when age>27 and age<40 then 'zhongnian'
else 'old'
end
from t_user;

8、if 函数

select id,if(age>25,'working','worked') from t_user;

select moive_name,if(array_contains(actors,'吴刚'),'好电影','rom t_movie;

9、json解析函数生成新表

json_tuple函数

select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_rating_json;

10、分析函数

row_number函数，对表中的数据按照性别分组，按照年龄倒序排序并进行标记

select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber

辛聪明

关注

3
点赞
踩
6

收藏

觉得还不错? 一键收藏
0
评论
Hive教程（二）

目录一、hive文件格式二、数据类型三、修改表定义四、hive查询语法五、hive函数一、hive文件格式存储文本文件的表：create table t_access_text(ip string,url string,access_time string)row format delimited fields terminated by ','s...
复制链接

扫一扫

专栏目录