- Altering a database
- Syntax

```
## Syntax
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)

## Example
hive (default)> alter database bigdate_hive3 set dbproperties ('create_dt'="2022-11-29");
OK
Time taken: 0.321 seconds
hive (default)> desc database bigdate_hive3;
OK
db_name  comment  location  owner_name  owner_type  parameters
bigdate_hive3    hdfs://bigdate20:9000/user/hive/warehouse/bigdate_hive3.db  hadoop  USER
Time taken: 0.072 seconds, Fetched: 1 row(s)
hive (default)> desc database extended bigdate_hive3;
OK
db_name  comment  location  owner_name  owner_type  parameters
bigdate_hive3    hdfs://bigdate20:9000/user/hive/warehouse/bigdate_hive3.db  hadoop  USER  {create_dt=2022-11-29, creator=doublehappy}
Time taken: 0.077 seconds, Fetched: 1 row(s)
```
- Dropping a database
- Syntax

```
## Syntax
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

## Example
hive (default)> DROP DATABASE bigdate_hive4;
-- if the database does not exist, Hive reports:
FAILED: SemanticException [Error 10072]: Database does not exist: bigdate_hive4
hive (default)> DROP DATABASE bigdate_hive3;
OK
Time taken: 0.535 seconds
DROP DATABASE bigdata_hive2 CASCADE;  -- drops the database together with all its tables ("drop the DB and run")
```
- Viewing table information:
- Syntax

```
## Syntax
DESCRIBE [EXTENDED|FORMATTED] table_name

## Example
hive (bigdate_hive2)> desc EXTENDED test;
hive (bigdate_hive2)> desc FORMATTED test;

[hadoop@bigdate20 ~]$ hadoop fs -ls /datee/bigdata_hive2/
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2022-12-01 21:08 /datee/bigdata_hive2/test
```
- Tables
- Syntax

```
## Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [
   [ROW FORMAT row_format]
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]

## Example
create table test(
  id string comment 'user id',
  name string,
  age bigint
) comment 'test table'
row format delimited fields terminated by ','
stored as textfile;

hive (bigdate_hive)> create table test(
                   >   id string comment 'user id',
                   >   name string,
                   >   age bigint
                   > ) comment 'test table' row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 1.417 seconds
hive (bigdate_hive)> show tables;
OK
tab_name
test
Time taken: 0.289 seconds, Fetched: 1 row(s)
```
- data_type: commonly used data types:
- Numeric:
  - Integer: int, bigint
  - Decimal: float, double, decimal
- String: string
- Date/time:
  - Date: DATE, format YYYY-MM-DD
  - Timestamp: TIMESTAMP, format YYYY-MM-DD HH:MM:SS
- Question: why must a delimiter be specified when creating a table?
- The metastore (MySQL) holds the table's metadata, e.g.:
  id int
  name string
  age int
  A Hive table = data on HDFS + metadata in the metastore.
  So the column delimiter for each line of the HDFS file must be specified.
  Delimiters:
  1. Each line in the HDFS file ends with the row delimiter `\n`.
  2. Within a line, the columns also need a delimiter, e.g. `,`.
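Concretely, for the `test` table above (`fields terminated by ','`), the data file is plain text where each line is one row and commas split the columns; the values below are illustrative:

```
1001,zs,18
1002,ls,20
```

Each line maps to one row of `test(id string, name string, age bigint)`; if the file's delimiter does not match the table's `row format`, the columns come back as NULL rather than failing.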
- Loading data:
- Syntax
- LOCAL: load data from the local filesystem into the Hive table.
  Without LOCAL: load data from HDFS into the Hive table.

```
## Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

## Example
hive (bigdate_hive)> load data local inpath '/home/hadoop/tmp/test.txt' into TABLE test;
Loading data to table bigdate_hive.test
OK
Time taken: 1.041 seconds
```
- Exercise: 1. create the emp table and load data into it

```
-- DDL
CREATE TABLE emp (
  empno decimal(4,0),
  ename string,
  job string,
  mgr decimal(4,0),
  hiredate string,
  sal decimal(7,2),
  comm decimal(7,2),
  deptno decimal(2,0)
)
row format delimited fields terminated by ','
stored as textfile;

-- prepare the data
[hadoop@bigdate20 tmp]$ vim emp.txt
-------------------------------------------
hive (bigdate_hive)> CREATE TABLE emp (
                   >   empno decimal(4,0),
                   >   ename string,
                   >   job string,
                   >   mgr decimal(4,0),
                   >   hiredate string,
                   >   sal decimal(7,2),
                   >   comm decimal(7,2),
                   >   deptno decimal(2,0)
                   > )
                   > row format delimited fields terminated by ','
                   > stored as textfile;
OK
Time taken: 0.182 seconds
hive (bigdate_hive)> load data local inpath '/home/hadoop/tmp/emp.txt' into table emp;
Loading data to table bigdate_hive.emp
OK
Time taken: 2.836 seconds
hive (bigdate_hive)> select * from emp;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH   CLERK      7902  1980-12-17 14:00   800.00  NULL     20
7499  ALLEN   SALESMAN   7698  1981-2-20 14:00   1600.00  300.00   30
7521  WARD    SALESMAN   7698  1981-2-22 14:00   1250.00  500.00   30
7566  JONES   MANAGER    7839  1981-4-2 14:00    2975.00  NULL     20
7654  MARTIN  SALESMAN   7698  1981-9-28 13:00   1250.00  1400.00  30
7698  BLAKE   MANAGER    7839  1981-5-1 13:00    2850.00  NULL     30
7782  CLARK   MANAGER    7839  1981-6-9 13:00    2450.00  NULL     10
7788  SCOTT   ANALYST    7566  1982-12-9 14:00   3000.00  NULL     20
7839  KING    PRESIDENT  NULL  1981-11-17 14:00  5000.00  NULL     10
7844  TURNER  SALESMAN   7698  1981-9-8 13:00    1500.00  0.00     30
7876  ADAMS   CLERK      7788  1983-1-12 14:00   1100.00  NULL     20
7900  JAMES   CLERK      7698  1981-12-3 14:00    950.00  NULL     30
7902  FORD    ANALYST    7566  1981-12-3 14:00   3000.00  NULL     20
7934  MILLER  CLERK      7782  1982-1-23 14:00   1300.00  NULL     10
Time taken: 0.328 seconds, Fetched: 14 row(s)
```
- Creating Hive tables:
- 1. A regular CREATE TABLE statement
- 2. CREATE TABLE emp2 like emp; => copies only the table structure [the DDL]
- 3. create table emp3 as select * from emp; => structure + data
  CTAS:
  typical use case: materializing the result of an ad-hoc query into a table
- DML:
- 1. load
  - 1. load local data
  - 2. load HDFS data
  - [OVERWRITE replaces the data already in the table]
- Syntax

```
## Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
```
- 1.1 Load local data

```
hive (bigdate_hive)> load data local inpath '/home/hadoop/tmp/emp.txt' INTO TABLE emp;
Loading data to table bigdate_hive.emp
OK
Time taken: 1.268 seconds
```
- 1.2 Load HDFS data

```
[hadoop@bigdate20 tmp]$ hadoop fs -ls /data
[hadoop@bigdate20 tmp]$ hadoop fs -put ./emp.txt /data
[hadoop@bigdate20 tmp]$ hadoop fs -put ./emp.txt /data
put: `/data/emp.txt': File exists
[hadoop@bigdate20 tmp]$ hadoop fs -ls /data
Found 1 items
-rw-r--r--   1 hadoop supergroup        687 2022-12-01 23:06 /data/emp.txt
```
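The file staged on HDFS above is then loaded by omitting the LOCAL keyword; a sketch, matching the syntax shown earlier:

```sql
-- Without LOCAL, 'filepath' is resolved on HDFS. Note that Hive MOVES the
-- file into the table's warehouse directory, so it disappears from /data.
load data inpath '/data/emp.txt' into table emp;
```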
- 1.3 Overwrite the data in the table

```
hive (bigdate_hive)> load data local inpath '/home/hadoop/tmp/emp.txt' OVERWRITE INTO TABLE emp;
Loading data to table bigdate_hive.emp
OK
Time taken: 1.016 seconds
hive (bigdate_hive)> select * from emp;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH   CLERK      7902  1980-12-17 14:00   800.00  NULL     20
7499  ALLEN   SALESMAN   7698  1981-2-20 14:00   1600.00  300.00   30
7521  WARD    SALESMAN   7698  1981-2-22 14:00   1250.00  500.00   30
7566  JONES   MANAGER    7839  1981-4-2 14:00    2975.00  NULL     20
7654  MARTIN  SALESMAN   7698  1981-9-28 13:00   1250.00  1400.00  30
7698  BLAKE   MANAGER    7839  1981-5-1 13:00    2850.00  NULL     30
7782  CLARK   MANAGER    7839  1981-6-9 13:00    2450.00  NULL     10
7788  SCOTT   ANALYST    7566  1982-12-9 14:00   3000.00  NULL     20
7839  KING    PRESIDENT  NULL  1981-11-17 14:00  5000.00  NULL     10
7844  TURNER  SALESMAN   7698  1981-9-8 13:00    1500.00  0.00     30
7876  ADAMS   CLERK      7788  1983-1-12 14:00   1100.00  NULL     20
7900  JAMES   CLERK      7698  1981-12-3 14:00    950.00  NULL     30
7902  FORD    ANALYST    7566  1981-12-3 14:00   3000.00  NULL     20
7934  MILLER  CLERK      7782  1982-1-23 14:00   1300.00  NULL     10
Time taken: 0.218 seconds, Fetched: 14 row(s)
```
- 2. insert
- 1. Inserting data into Hive tables from queries
  insert into|OVERWRITE table tablename selectQuery
  - into: appends data
  - overwrite: replaces the existing data
- 2. Inserting values into tables from SQL [not recommended]
  INSERT INTO TABLE tablename
  VALUES values_row [, values_row ...]
  1. Each inserted row triggers a MapReduce job — far too slow
- Example
```
-- emp2
hive (bigdate_hive)> insert into table emp2 select * from emp;
hive (bigdate_hive)> desc emp2;
OK
col_name  data_type  comment
empno     decimal(4,0)
ename     string
job       string
mgr       decimal(4,0)
hiredate  string
sal       decimal(7,2)
comm      decimal(7,2)
deptno    decimal(2,0)
Time taken: 0.132 seconds, Fetched: 8 row(s)
-----------------------------------------------------
hive (bigdate_hive)> insert overwrite table emp2 select * from emp where deptno=10;
hive (bigdate_hive)> select * from emp2;
OK
emp2.empno  emp2.ename  emp2.job  emp2.mgr  emp2.hiredate  emp2.sal  emp2.comm  emp2.deptno
7782  CLARK   MANAGER    7839  1981-6-9 13:00    2450.00  NULL  10
7839  KING    PRESIDENT  NULL  1981-11-17 14:00  5000.00  NULL  10
7934  MILLER  CLERK      7782  1982-1-23 14:00   1300.00  NULL  10
Time taken: 0.285 seconds, Fetched: 3 row(s)
```
- 3. update / delete => do not use them in Hive; they are inefficient
  use insert OVERWRITE instead
- 4. Truncating a table
- [Syntax] TRUNCATE [TABLE] table_name

```
hive (bigdate_hive)> truncate table emp2;
OK
Time taken: 1.393 seconds
```
- 5. select
- Syntax

```
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT [offset,] rows]
```
- 1. where — filter conditions
  where_condition operators:
  `<`, `>`, `=`, `<>` / `!=`, `and`, `or`, `in`, `not in`, `between ... and`, `is null`, `is not null`
- Requirement: query the rows where deptno is 20 or 10
```
hive (bigdate_hive)> select
                   >   *
                   > from emp
                   > where deptno=20 or deptno=10;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH   CLERK      7902  1980-12-17 14:00   800.00  NULL  20
7566  JONES   MANAGER    7839  1981-4-2 14:00    2975.00  NULL  20
7782  CLARK   MANAGER    7839  1981-6-9 13:00    2450.00  NULL  10
7788  SCOTT   ANALYST    7566  1982-12-9 14:00   3000.00  NULL  20
7839  KING    PRESIDENT  NULL  1981-11-17 14:00  5000.00  NULL  10
7876  ADAMS   CLERK      7788  1983-1-12 14:00   1100.00  NULL  20
7902  FORD    ANALYST    7566  1981-12-3 14:00   3000.00  NULL  20
7934  MILLER  CLERK      7782  1982-1-23 14:00   1300.00  NULL  10
Time taken: 0.508 seconds, Fetched: 8 row(s)

hive (bigdate_hive)> select
                   >   *
                   > from emp
                   > where deptno in (10,20);
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH   CLERK      7902  1980-12-17 14:00   800.00  NULL  20
7566  JONES   MANAGER    7839  1981-4-2 14:00    2975.00  NULL  20
7782  CLARK   MANAGER    7839  1981-6-9 13:00    2450.00  NULL  10
7788  SCOTT   ANALYST    7566  1982-12-9 14:00   3000.00  NULL  20
7839  KING    PRESIDENT  NULL  1981-11-17 14:00  5000.00  NULL  10
7876  ADAMS   CLERK      7788  1983-1-12 14:00   1100.00  NULL  20
7902  FORD    ANALYST    7566  1981-12-3 14:00   3000.00  NULL  20
7934  MILLER  CLERK      7782  1982-1-23 14:00   1300.00  NULL  10
Time taken: 0.421 seconds, Fetched: 8 row(s)

hive (bigdate_hive)> select
                   >   *
                   > from emp
                   > where deptno <> 20;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7499  ALLEN   SALESMAN   7698  1981-2-20 14:00   1600.00  300.00   30
7521  WARD    SALESMAN   7698  1981-2-22 14:00   1250.00  500.00   30
7654  MARTIN  SALESMAN   7698  1981-9-28 13:00   1250.00  1400.00  30
7698  BLAKE   MANAGER    7839  1981-5-1 13:00    2850.00  NULL     30
7782  CLARK   MANAGER    7839  1981-6-9 13:00    2450.00  NULL     10
7839  KING    PRESIDENT  NULL  1981-11-17 14:00  5000.00  NULL     10
7844  TURNER  SALESMAN   7698  1981-9-8 13:00    1500.00  0.00     30
7900  JAMES   CLERK      7698  1981-12-3 14:00    950.00  NULL     30
7934  MILLER  CLERK      7782  1982-1-23 14:00   1300.00  NULL     10
Time taken: 0.221 seconds, Fetched: 9 row(s)
```
- 2. order by — sorting
- 1. asc (ascending) is the default
- 2. desc for descending

```
hive (bigdate_hive)> select
                   >   sal
                   > from emp
                   > order by sal desc;
sal
5000.00
3000.00
3000.00
2975.00
2850.00
2450.00
1600.00
1500.00
1300.00
1250.00
1250.00
1100.00
950.00
800.00
```
- 3. like — pattern matching
- 1. `_` matches exactly one character
- 2. `%` matches any sequence of characters
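A short sketch against the emp table loaded earlier (the patterns are illustrative):

```sql
-- '%' matches any number of characters, '_' matches exactly one
select ename from emp where ename like 'S%';   -- names starting with S (SMITH, SCOTT)
select ename from emp where ename like '_A%';  -- second letter is A (WARD, MARTIN, JAMES)
```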
- 4. Combining tables
- 1. union — deduplicates
- 2. union all — keeps duplicates
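The difference is easiest to see by combining a table with itself; a sketch using the emp table from earlier:

```sql
-- union deduplicates the combined result: each deptno appears once
select deptno from emp
union
select deptno from emp;

-- union all keeps duplicates: every row appears twice
select deptno from emp
union all
select deptno from emp;
```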
- 5. Handling NULL
- 1. Filtering
  where xxx is not null
  (`a is null` and the NULL-safe comparison `a <=> null` behave the same)
- 2. ETL-style replacement
  ifnull => does not exist in Hive
  coalesce => returns the first non-NULL argument
  nvl => nvl(value, default_value)
- Extras
- Checking which functions Hive supports:
  y = f(x)

```
SHOW FUNCTIONS [LIKE "<pattern>"];
show functions like "nvl";  -- check whether a function exists in Hive
desc function nvl;          -- show how a specific function is used
```
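Putting the two NULL-handling approaches together, a sketch against the emp table (comm is NULL for most non-salesman rows):

```sql
-- 1. filtering: keep only rows with a commission
select * from emp where comm is not null;

-- 2. ETL replacement: both return the first non-NULL argument
select ename, nvl(comm, 0)      as comm from emp;
select ename, coalesce(comm, 0) as comm from emp;
```

`coalesce` accepts any number of arguments, while `nvl` takes exactly two; for a simple default value they are interchangeable.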
- 6. Grouping, aggregate functions, join
- 1. group by
  1. used together with aggregate functions
  2. groups by one or more columns
  3. the non-aggregated columns in select must match the group by columns
- having: filters on the aggregated result
  can only be used after group by
- 2. Aggregate functions
sum, max, min, avg, count

```
hive (bigdate_hive)> select
                   >   sum(sal) as sal_sum,
                   >   max(sal) as sal_max,
                   >   min(sal) as sal_min,
                   >   avg(sal) as sal_avg,
                   >   count(1) as cnt
                   > from emp;
sal_sum   sal_max  sal_min  sal_avg                         cnt
29025.00  5000.00  800.00   2073.21428571428571428571428571  14
Time taken: 57.577 seconds, Fetched: 1 row(s)
-------------------------------------------------------------------------------
hive (bigdate_hive)> select
                   >   job,
                   >   sum(sal) as sal_sum,
                   >   max(sal) as sal_max,
                   >   min(sal) as sal_min,
                   >   avg(sal) as sal_avg,
                   >   count(1) as cnt
                   > from emp
                   > group by job
                   > having sal_sum > 6000;
job      sal_sum  sal_max  sal_min  sal_avg      cnt
MANAGER  8275.00  2975.00  2450.00  2758.333333  3
Time taken: 40.858 seconds, Fetched: 1 row(s)

-- equivalent subquery:
select
  job, sal_sum, sal_max, sal_min, sal_avg, cnt
from (
  select
    job,
    sum(sal) as sal_sum,
    max(sal) as sal_max,
    min(sal) as sal_min,
    avg(sal) as sal_avg,
    count(1) as cnt
  from emp
  group by job
) as a
where sal_sum > 6000;
```
- 1.group by
- 7. join
- 1. inner join [join]

```sql
select a.*, b.*
from (select * from a) as a
join (select * from b) as b
on a.id = b.id;
```

- 2. left join

```sql
select a.*, b.*
from (select * from a) as a
left join (select * from b) as b
on a.id = b.id;
```

- 3. right join

```sql
select a.*, b.*
from (select * from a) as a
right join (select * from b) as b
on a.id = b.id;
```
- Requirements against emp
- Approach:
- table:
  dimensions: the group by columns
  metrics: aggregate functions, plain columns
  where: filters
- 1. Number of hires per department per year
- table: emp
  dimensions: department, year
  metrics: head count
  where: none

```
-- select department, year, head count from emp group by department, year;
-- year => derived from hiredate via ETL
select
  deptno, date_format(hiredate,'YYYY') as year,
  count(1) as cnt
from emp
group by deptno, date_format(hiredate,'YYYY');

-- 1980-12-17 00:00:00 => use a date function for the ETL: date_format
-- (note: in SimpleDateFormat patterns 'yyyy' is the calendar year; 'YYYY'
--  is the week-based year and can differ around New Year)
hive (bigdate_hive)> select
                   >   deptno, date_format(hiredate,'YYYY') as year,
                   >   count(1) as cnt
                   > from emp
                   > group by deptno, date_format(hiredate,'YYYY');
deptno  year  cnt
10  1981  2
10  1982  1
20  1980  1
20  1981  2
20  1982  1
20  1983  1
30  1981  6
Time taken: 40.421 seconds, Fetched: 7 row(s)
```
- 2. Company-wide hires per year and month
- dimensions: year, month
  metrics: head count
  hiredate: 1980-12-17 00:00:00 => 1980-12

```
select
  date_format(hiredate,'YYYY-MM') as ymoth,
  count(1) as cnt
from emp
group by date_format(hiredate,'YYYY-MM');

hive (bigdate_hive)> select
                   >   date_format(hiredate,'YYYY-MM') as ymoth,
                   >   count(1) as cnt
                   > from emp
                   > group by date_format(hiredate,'YYYY-MM');
Total MapReduce CPU Time Spent: 6 seconds 610 msec
OK
ymoth    cnt
1980-12  1
1981-02  2
1981-04  1
1981-05  1
1981-06  1
1981-09  2
1981-11  1
1981-12  2
1982-01  1
1982-12  1
1983-01  1
Time taken: 34.947 seconds, Fetched: 11 row(s)
```
- 1. Window functions
- Aggregate functions collapse multiple rows into one row by some rule;
  in principle, rows after aggregation <= rows before aggregation.
- Requirement: show both the pre-aggregation rows and the aggregated result?
  e.g.:
  id  name  sal
  1   zs    3w
  2   ls    2.5w
  3   ww    2w
- Requirement: sort by salary descending and also show each row's rank
  id  name  sal   rank
  1   zs    3w    1
  2   ls    2.5w  2
  3   ww    2w    3
- Window functions: window + function
- Window: the range of rows the function computes over
  Function: the function being evaluated
- Syntax:
- function over([partition by xxx,...] [order by xxx,...])
  over: opens the window
  partition by: the grouping columns
  order by: the ordering columns
  function: a window function or an aggregate function
- Example

```
-- data: mt.txt
haige,2022-11-10,1
haige,2022-11-11,5
haige,2022-11-12,7
haige,2022-11-13,3
haige,2022-11-14,2
haige,2022-11-15,4
haige,2022-11-16,4

-- create a table
create table user_mt (
  name string,
  dt string,
  cnt int
) row format delimited fields terminated by ',';

-- load the data
load data local inpath '/home/hadoop/tmp/mt.txt' overwrite into table user_mt;

-- requirement: running total — each user's cumulative take-out orders per day
-- [partition by xxx,...] [order by xxx,...]
select
  name,
  dt,
  cnt,
  sum(cnt) over(partition by name order by dt) as sum_cnt
from user_mt;

hive (bigdate_hive)> select
                   >   name,
                   >   dt,
                   >   cnt,
                   >   sum(cnt) over(partition by name order by dt) as sum_cnt
                   > from user_mt;
Total MapReduce CPU Time Spent: 7 seconds 810 msec
OK
name   dt          cnt  sum_cnt
haige  2022-11-10  1    1
haige  2022-11-11  5    6
haige  2022-11-12  7    13
haige  2022-11-13  3    16
haige  2022-11-14  2    18
haige  2022-11-15  4    22
haige  2022-11-16  4    26
Time taken: 39.138 seconds, Fetched: 7 row(s)
```
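The salary-ranking requirement above (show every row plus its rank) is handled by ranking window functions rather than sum; a sketch against the emp table from earlier:

```sql
select
  ename,
  sal,
  rank()       over(order by sal desc) as rk,   -- gaps after ties: 1,2,2,4
  dense_rank() over(order by sal desc) as drk,  -- no gaps:         1,2,2,3
  row_number() over(order by sal desc) as rn    -- unique sequence: 1,2,3,4
from emp;
```

The three functions differ only in how they number tied salaries (emp has two rows with sal = 3000.00), so comparing their outputs side by side is the quickest way to pick the right one.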
- 2. Command line
- 1. hive shell
- 2. jdbc => HiveServer2
- Extras:
  beeline => connects to hive => hdfs
  requires a proxy-user setting on HDFS
- Environment configuration needed
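The proxy-user setting mentioned above lives in Hadoop's core-site.xml; a sketch, assuming HiveServer2 runs as the user hadoop (as in the sessions above) — HDFS must be restarted after the change:

```xml
<!-- core-site.xml: allow the user running HiveServer2 ("hadoop") to
     impersonate the users who connect through beeline/JDBC -->
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>
```

The `*` values are permissive and fine for a study cluster; production setups usually restrict them to specific hosts and groups.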
Hive-day02