hive基础总结

最新推荐文章于 2022-11-19 09:40:17 发布

打雷同学

最新推荐文章于 2022-11-19 09:40:17 发布

阅读量654

点赞数

文章标签： hive

本文链接：https://blog.csdn.net/weixin_45498137/article/details/109722533

版权

1：创建外部表

create external table if not exists employee_external(
name string,
work_palce array<string>,
sex_age struct<sex:string,age:int>,
depart_title map<string,array<string>>
)
comment 'this is an external table'
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
stored as textfile
location '/wj/employee';

注意：文件地址必须为：hdfs上的文件路径

2：创建内部表

create table if not exists employee(
name string,
work_palce array<string>,
sex_age struct<sex:string,age:int>,
skills_score map<string,int>,
depart_title map<string,array<string>>
)
comment 'this is an inner table'
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
stored as textfile;

注意：在创建内部表的时候，需要将hdfs上的/user文件夹权限改为777，否则无法写入

3：CTAS – as select方式建表

注：只能创建内部表，将查询的结果映射成一张内部表，不可创建外部表，不可创建分桶表，不可创建分区表

create table ctas_employee as select * from employee;

4：cte创建表

注：这个方法是将两个查询语句查询出来的信息组成两个表，然后将两个表用union all组合起来

create table cte_employee as
with
r1 as (select name from (select name from employee where sex_age.sex='Male') r2 where name='Michael'),
r3 as (select name from employee where sex_age.sex='Female')
select * from r1 union all select * from r3;

r2：将employee中的男性查出来

r1：在r1查出来的男性中找出来名叫Michael的，形成一个表

r3：将employee中的女性查出来

r1和r3的内容用union all进行组合

5：like创建表

create table like_employee like employee;

注意：复制表，只是复制表结构，如果需要表数据，则需要将原始数据传入指定的Hdfs位置

6：创建临时表

create temporary table tmp_employee like employee;

注：创建方式与上面几个相同，但是需要在table前加修饰：temporary，并且：临时表值存在session中，在重启beeline后，临时表消失

7：查看表结构

desc employee;

8：查看表信息

desc formatted employee;

9：删除表

drop table if exists like_employee;

10：清空表

truncate table ctas_employee;

11：分区（静态分区，动态分区）

11.1：静态分区：

11.1.1：创建静态分区：

create table employee_partitioned(
name string,
work_palce array<string>,
sex_age struct<sex:string,age:int>,
skills_score map<string,int>,
depart_title map<string,array<string>>
)
partitioned by(year int,month int)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':';

11.1.2：向分区表中添加分区：

alter table employee_partitioned add
partition (year=2019,month=09);

11.1.3：删除某分区：

alter table employee_partitioned drop
partition (year=2019,month=08);

静态分区表相当于一个内部表，只不过内部表是将所有的数据放在同一个文件夹内。而分区表示将表文件夹又重新建子文件夹，然后将文件放入其中某一个子文件夹。如果在两个子文件夹下有两个相同的数据，则都可以查出来。:

11.2：动态分区：

动态分区：按照原始数据的字段内容创建分区

首先，设置：原因（hive的动态分区默认不开启，以为如果选择的字段不合理，可能会造成出现成百上千个分区）

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert into table employee_partitioned partition(year, month)
select name,array('Toronto') as work_place,
named_struct("sex","male","age",30) as sex_age,
map("python",90) as skills_score,
map("r&d", array('developer')) as depart_title,
year(start_date) as year,month(start_date) as month
from employee_hr eh ;

12：分桶

定义：分桶就是将一个原始表拆分成多个子表进行存储

优点：查询快

12…1：创建原始表：

CREATE TABLE employee_id                         
(
  name string,
  employee_id int,
  work_place ARRAY<string>,
  sex_age STRUCT<sex:string,age:int>,
  skills_score MAP<string,int>,
  depart_title MAP<STRING,ARRAY<STRING>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

12.2：创建分桶表：

create table employee_id_buckets
(
	name string,
	employee_id int,
	work_place array<string>,
	sex_age struct<sex:string,age:int>,
	skills_score map<string,int>,
	depart_title map<string,array<string>>
)
clustered by (employee_id) into 2 buckets
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':';

12.3：向分桶表中添加数据：（由于分桶是动态的，所以只能用insert语句进行操作）

设置：

set hive.enforce.bucketing=true;
set mapreduce.job.reduces=-1;

insert  overwrite table employee_id_buckets 
select * from employee_id;

13：视图

定义：将表通过hql语句隐藏部分信息，形成一个新的不可修改操作新表

面试题：如何知道查询的是表还是视图？

答：通过desc formatted 名字，查询出来的结果左下角可以看得到是view还是table。

13.1：创建视图：

create view view_employee_id as 
select * from employee_id;

13.2：查找视图

show tables;

13.3：查看视图定义（创建过程）

show create table view_employee_id;

14：侧视图

作用：将数据中含有数组或者Jason的字段炸开，行转列

select name,wps,score from employee
lateral view 
explode (work_palce) work_palce_single as wps
lateral view
explode (skills_score) sks as skill,score;

15：select查询

select * from tablename;

16：CTE查询与嵌套查询

16.1：CTE查询

with 
t1 as 
( select * from employee where sex_age.age>=30) 
select * from t1 where sex_age.age>50;

注释：将( select * from employee where sex_age.age>=30) 的查询结果作为子表

16.2：嵌套查询

select * from (select * from employee where sex_age.age>=30) a;

注意：此处字表查询结果必须命名，不然会报错

17：匹配正则表达式

首先，开启正则匹配

SET hive.support.quoted.identifiers = none;

匹配正则：

select `^emp.*` from employee_id;

注意：正则表达式的括号是esc下面的点，并且“^”不写也是ok的。

18：关联查询

18.1：内连接查询

select * from emp_basic eb inner join emp_psn ep on eb.emp_id=ep.emp_id limit 8;

注释：即将左表和右表的满足条件列取出来

18.2：外连接查询

18.2.1：左外连接查询

列出左表的所有行以及右表所有满足条件的行

select * from emp_basic eb left join emp_psn ep on eb.emp_id=ep.emp_id ;

18.2.2：右外连接查询

与左外连接查询相反，列出右表的所有行以及左表的符合条件的行

select * from emp_basic eb right join emp_psn ep on eb.emp_id=ep.emp_id ;

18.3：隐式连接查询

不需要多谈

select * from emp_basic eb,emp_psn ep where eb.emp_id=ep.emp_id;

19：union操作

19.1：union

将相同字段的两张表合并，并去除相同行

select * from emp_basic eb where eb.company = 'Roomm'
union
select * from emp_basic eb where eb.company = 'Cogilith'
order by emp_id;

19.2：union all

select * from emp_basic eb where eb.company = 'Roomm'
union all
select * from emp_basic eb where eb.company = 'Cogilith'
order by emp_id;

20：LOAD移动数据

20.1：移动本地的文件

load data local inpath '/opt/data/employee.txt' overwrite into table employee;

20.2：移动hdfs上的文件

load data inpath '/wj/employee/employee.txt' overwrite into table employee;

说明：当移动本地文件的时候，不会删除源文件；当移动hdfs上的文件的时候，会删除源文件

21：insert插入

21.1：insert into

INSERT INTO table_name VALUES

21.2：insert overwrite

21.2.1：insert overwrite 多插入

from employee
insert overwrite table test select *
insert overwrite table test1 select *;

说明：将employee中的数据插入test和test1中，并覆盖原数据

21.2.2：insert overwrite 插入分区（原始表为分区表）

from employee_partitioned
insert overwrite table test3 partition(year,month)
select * ;

说明：1：原始表A为分区表，B表必须为分区表

2：B表的表结构必须与A表相同

3：不需要在B表中创建分区，开启动态分区后，自动创建分区

4：如果A表有4个分区，但是只有2个分区有数据，则B表中只会创建对应的2个分区用来存放对应分区中的数据

5：开启动态分区的命令：

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

21.2.3：insert overwrite 插入分区（原始表为普通表）

from employee
insert overwrite table test3 partition(year,month)
select * ,'2019','01;

说明：1：原始表A为普通表，B表为分区表；

2：B表的数据结构必须与A表相同；

3：不需要在B表中创建分区，开启动态分区后，自动创建分区；

4：select后的数字为指定分区，将A表数据存入B表的指定分区内；

22：insert语句插入/导出数据

22.1：将表内容导入到本地，hdfs，其他表

from employee
insert overwrite local directory '/opt/data/aa' select *
insert overwrite directory '/tmp/wj' select *
insert overwrite table test5 select*;

说明：没什么好说的，注意写入位置的权限问题就好了

22.2：将表内容导出值文件夹，并按某字符进行字段分割

insert overwrite local directory '/opt/data/aa'
row format delimited fields terminated by ','
select * from employee;

23：IMPORT/EXPORT数据导入导出

23.1：export

EXPORT TABLE employee TO '/tmp/output3';
EXPORT TABLE employee_partitioned partition (year=2014, month=11) TO '/tmp/output5';

23.2：export

import TABLE employee FROM '/tmp/output3';
import TABLE employee_partitioned partition (year=2014, month=11) FROM '/tmp/output5';

24：hive排序

24.1：：order by

select * from employee_id order by employee_id ;

说明：按照id正序排列

select * from employee_id order by employee_id desc;

24.2：SORT BY/DISTRIBUTE BY

select department_id,name,employee_id,evaluation_score
from employee_hr
distribute by department_id sort by evaluation_score desc;

说明：

1：将employee_hr按照部门分组，然后组内按照evaluation_score进行倒序

2：sort by的列必须存在于select中

3：设置reducer个数：

set mapreduce.job.reduces=3;

说明：sort by 与order by的区别

1：都是排序

2：order by是全局排序，只启动一个reduce，sort by 在没有设置的情况下，启用一个reduce，进行全局排序

如果启用了多个reduce，则进行分区，然后分区内排序

3：order by 与group by组合使用，进行分组后排序

4：sort by 与distribute by使用，将数据分区后进行排序

24.3：CLUSTER BY

SELECT name, employee_id FROM employee_hr CLUSTER BY name;

说明：

1：不支持ASC|DESC

2：排序列必须出现在SELECT column列表中

3：CLUSTER BY = DISTRIBUTE BY + SORT BY

24.4：补充：hive排序：distribute by 、sort by 、cluster by 、order by 区别

1：order by 只有一个reduce负责对所有的数据进行排序，若大数据量，则需要较长的时间。建议在小的数据集中使用order by 进行排序。

2：sort by 基本上不受hive.mapred.mode影响，可以通过mapred.reduce.task 指定reduce个数，查询后的数据被分发到相关的reduce中。

3：sort by 的数据在进入reduce前就完成排序，如果要使用sort by 是行排序，并且设置map.reduce.tasks>1,则sort by 才能保证每个reducer输出有序，不能保证全局数据有序。

4： sort by 的数据在进入reduce前就完成排序，如果要使用sort by 是行排序，并且设置map.reduce.tasks>1,则sort by 才能保证每个reducer输出有序，不能保证全局数据有序。

5：distribute by 采集hash算法，在map端将查询的结果中hash值相同的结果分发到对应的reduce文件中。

6：distribute by 可以使用length方法会根据string类型的长度划分到不同的reduce中，最终输出到不同的文件中。 length 是内建函数，也可以指定其他的函数或这使用自定义函数。

7：cluster by 除了distribute by 的功能外，还会对该字段进行排序，所以cluster by = distribute by +sort by 。

25：group by 分组

set hive.groupby.orderby.position.alias=true;

select category, max(offervalue) from offers group by category;

select if(category > 4000, 'GOOD', 'BAD') as newcat,max(offervalue) 
from offers 
group by category if(category > 4000, 'GOOD', 'BAD');

select sex_age.age from employee group by sex_age.age having count(*) <= 1;

说明：

group by 与distribute by的区别

group by是分组，将同一字段内的相同数据分组，然后课组内聚合

distribute by是分区，将同一字段内的相同数据分组，然后用sort by 按照某一个字段进行排序，可以保证组内有序，但是返回的是表内所有数据，可以保证区内有序

27：其他函数

27.1：NVL（字段，值）：如果字段为null的时候，自动转换为值，类似MySQL中的ifnull（）；

27.2：date_format(日期，‘yyyy-MM-dd’)，时间格式函数，说明：日期格式必须为：2019-08-18这样的格式

27.3：字符串替换函数：regexp_replace(str1,str2,str3)，将str1中的str2替换成str3

27.4：date_add（日期，int），时间跟天数相加，如果int为负数，则为相减

27.5：datediff（日期1，日期2），日期1-日期2，算出时间差

27.6：if(Boolean类型表达式，结果1，结果2)，如果Boolean表达式为true，则取结果1，否则取结果2

27.7：concat（str1,str2,str3…）：将各个str进行拼接，然后返回一个字符串

27.8：concat_ws(str1,str2,str3…)：将str1插入在后面的字符串中，然后返回总字符串

27.9：collect_set：将几个字段的内容组合成set集合

28：常用的窗口函数

ROW_NUMBER()：对所有数值输出不同的序号，序号唯一连续

29：hive关键字的执行顺序

from => join on => where => group by => select => having => order by => limit

函数：

1：字符串函数

1.1：concat(string|binary A, string|binary B…)

将字符串或者二级制字符进行连接

1.2：instr(string a, string b)

查找字符a在b中出现的位置

1.3：length(string A)

返回字符串A的长度

1.4：locate(string a, string b[, int pos])

返回字符串a在字符串b中第一次出现的位置

例：locate（‘a’,‘noda’），返回值是4

1.5：lower(string A) /upper(string A)

将字符串大小写转换

1.6：regexp_replace(string a, string b, string c)

将a中的b全部用c替代

例：regexp_replace(‘2019/02/15’,’/’,’-’)，返回值是2019-02-15

1.7：split(string str, string pat)

按照pat来将str分割成字符串数组

例：select split(‘a-a-a-a’,’-’) 返回值是：[‘a’,‘a’,‘a’,‘a’]

1.8：substr(string|binary A, int start, int len)

从字符串A中，从start开始截取len长度的字符串，返回新的字符串

例：select substr(‘abcd’,2,3);，返回bcd

1.9：trim(string A)

取出字符串A前后的空格，返回新的字符串

例：select str_to_map(‘aaa:11&bbb:22’, ‘&’, ‘:’)，返回：{“bbb”:“22”,“aaa”:“11”}

1.10：encode(string src, string charset)

用指定字符集charset将字符串编码成二进制值**（此函数待定，没有试出来）**

1.11：cast(expr as )

将expr转换成type类型如：cast(“1” as BIGINT) 将字符串1转换成了BIGINT类型

1.12：binary(string|binary)

将输入值转换成二进制

此处有疑惑，为何返回值扔然是本身

2：数学函数

2.1：round(DOUBLE a)，

对小数进行四舍五入，去bigint值

例：select round(2.13);返回值是2.0

2.2：round(DOUBLE a, INT d)，

对a取d位小数

例：select round(2.13，1);返回值是2.1

2.3：floor(DOUBLE a) 向下取整

例1：select floor(2.13);返回值是2

例2：select floor(2.9);返回值是2

2.4：rand(INT seed)

说明：直接用rand（）则返回0-1之间的随机数，如果加seed，则每次返回的随机数都是一个数

如果想要0-10之间的随机数则可以*10，想要整数则可以用round（rand（））进行取整

2.5：power(DOUBLE a, DOUBLE p)

去a的p次方值

例：select power(2,2);返回值是4

2.6：abs(DOUBLE a)

计算a的绝对值

3：日期函数

3.1：from_unixtime(时间戳，‘格式’)

将时间戳转换为时间格式

例：select from_unixtime(unix_timestamp(),‘yyyy-MM-dd’)

返回值：2019-09-20

3.2：unix_timestamp()

获取当前系统时间的时间戳

例：select unix_timestamp()；

返回：1569035493

例：select from_unixtime(unix_timestamp(),‘yyyy-MM-dd’);

返回当前的系统时间：2019-09-20

3.3：to_date(string timestamp)

返回字符串中的日期部分

例1：select to_date(‘2019-02-15aa’);返回值：2019-02-15

例2：select to_date(‘2019/02/15aa’);返回值：NULL

用法：如果切出来的字符串中含有日期以及其他的字符，则可以用这个函数将日期切出来

3.4：year(string date)
month/day/hour/minute/second/weekofyear

将字符串中的年月日时分秒以及当天是本年的第几周截取出来，但是日期格式必须是yyyy-MM-dd的格式

例：select year(‘2019-05-02’);，返回值是2019

例：select weekofyear(‘2019-05-02’)，返回值是18，这一天是2019年的第18周

3.5：datediff(string enddate, string startdate)

计算两个日期之间相差的天数

例：select datediff(‘2019-05-02’,‘2019-05-12’);返回值是-10，

注意，第一个时间是结束的时间，第二个时间才是开始的时间

3.6：date_add(string startdate, int days)

在原有的时间上加上几天后，返回新的日期，days可以是负数

例1：select date_add(‘2019-09-20’,1);返回值是2019-09-21

例2：select date_add(‘2019-09-20’,-1);返回值是2019-09-19

3.7：current_date

获取时间

例：select current_date();返回值：2019-09-21

3.8：current_timestamp

返回当前时间戳，可以用from_timestamp()转换成日期格式

例：select from_unixtime(unix_timestamp(),‘yyyy-MM-dd’)，返回是当前的日期

3.9：date_format(date/timestamp/string ts, string fmt)

格式化日期

例：select date_format(‘2019-09-21’,‘yyyy-MM’)；返回值是2019-09

date_format('日期‘，’u‘)；返回值是当前的周几

4：集合函数

4.1：size(Map<K.V>)

返回map的大小

例：有点问题，需要解释

4.2：size(Array)

返回数组的大小

例：select size(Array(1,2,3,4));返回值：4

4.3：map_keys(Map<K.V>)

返回map中所有的key

4.4：map_values(Map<K.V>)

返回map中所有的value

4.5：array_contains(Array, value)

查看数组中是否含有某值，如果含有，则返回true，如果没有，则返回false

4.6：sort_array(Array)

对数组进行排序

5：条件函数

5.1：if(条件，值1，值2)

如果条件返回值为true，则函数返回值是值1，如果不成立，则返回值2

例1：select if(1=2,true,false);返回值是：false

例2：select if(1=1,true,false);返回值是：true

5.2：NVL（字段，值）：如果字段为null的时候，自动转换为值，类似MySQL中的ifnull（）；

例：select nvl(null,2);返回值：2

5.3：coalesce（值1，值2，值3…）

返回值中的第一个不为空的值，如果都为空，则返回null

例1：select coalesce(null,null,2,null);返回值：2

例2：select coalesce(null,null,null,null);返回值：null

5.4：CASE a WHEN b THEN c
[WHEN d THEN e]* [ELSE f] END

当a与b相等的时候则返回c，如果不相等则返回f

例1：select case 1 when 1 then ‘1=1’ else ‘1!=1’ end;返回值：1=1

例2：select case 1 when 2 then ‘1=1’ else ‘1!=1’ end;返回值：1！=1

5.5：isnull( a )

判断值是否为空

例1：select isnull(null);返回值：true

例2：select isnull(2);返回值：2

5.6：isnotnull ( a )

判断值是否不是空，不是空返回true，是空则返回false

6：聚合函数及表生成函数

6.1：count、sum、max、min、avg

与MySQL中用法一致

6.2：var_samp

计算方差

6.3：explode（数组）

将数组裂开成行

例：select explode(Array(1,2,3));，

返回值：

+------+--+
| col  |
+------+--+
| 1    |
| 2    |
| 3    |
+------+--+

7：

join的时候，MySQL是小表在前，hive是大表在前

（现在高版本的hive据说是已经优化过了，无论是大表还是小表在前，都优化了join）

打雷同学

关注

0
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
hive基础总结

1：创建外部表create external table if not exists employee_external(name string,work_palce array<string>,sex_age struct<sex:string,age:int>,depart_title map<string,array<string>>)comment 'this is an external table'row format delimi
复制链接

扫一扫