- Altering a database
- Syntax

```
## Syntax
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)

## Example
hive (default)> alter database bigdate_hive3 set dbproperties ('create_dt'="2022-11-29");
OK
Time taken: 0.321 seconds
hive (default)> desc database bigdate_hive3;
OK
db_name  comment  location  owner_name  owner_type  parameters
bigdate_hive3    hdfs://bigdate20:9000/user/hive/warehouse/bigdate_hive3.db  hadoop  USER
Time taken: 0.072 seconds, Fetched: 1 row(s)
hive (default)> desc database extended bigdate_hive3;
OK
db_name  comment  location  owner_name  owner_type  parameters
bigdate_hive3    hdfs://bigdate20:9000/user/hive/warehouse/bigdate_hive3.db  hadoop  USER  {create_dt=2022-11-29, creator=doublehappy}
Time taken: 0.077 seconds, Fetched: 1 row(s)
```
- Dropping a database
- Syntax

```
## Syntax
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

## Example
hive (default)> DROP DATABASE bigdate_hive4;
-- if the database does not exist, Hive reports:
FAILED: SemanticException [Error 10072]: Database does not exist: bigdate_hive4
hive (default)> DROP DATABASE bigdate_hive3;
OK
Time taken: 0.535 seconds
DROP DATABASE bigdata_hive2 CASCADE;  -- drops the database together with all its tables ("drop the DB and run")
```
- Viewing table information:
- Syntax

```
## Syntax
DESCRIBE [EXTENDED|FORMATTED] table_name

## Example
hive (bigdate_hive2)> desc EXTENDED test;
hive (bigdate_hive2)> desc FORMATTED test;

[hadoop@bigdate20 ~]$ hadoop fs -ls /datee/bigdata_hive2/
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2022-12-01 21:08 /datee/bigdata_hive2/test
```
- Tables
- Syntax

```
## Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [
   [ROW FORMAT row_format]
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]

## Example
create table test(
  id string comment 'user id',
  name string,
  age bigint
) comment 'test table'
row format delimited fields terminated by ','
stored as textfile;

hive (bigdate_hive)> create table test(
                   >   id string comment 'user id',
                   >   name string,
                   >   age bigint
                   > ) comment 'test table' row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 1.417 seconds
hive (bigdate_hive)> show tables;
OK
tab_name
test
Time taken: 0.289 seconds, Fetched: 1 row(s)
```
- data_type: commonly used data types:
- Numeric:
  - Integer: int, bigint
  - Decimal: float, double, decimal
- String: string
- Date/time:
  - Date: DATE, format YYYY-MM-DD
  - Timestamp: TIMESTAMP, format YYYY-MM-DD HH:MM:SS
- Question: why must a delimiter be specified when creating a table?
- The metastore (MySQL) holds the table's metadata, e.g.:
  id int
  name string
  age int
  A Hive table = data on HDFS + metadata in the metastore.
  So the column delimiter for each line of the HDFS file must be specified.
  Delimiters:
  1. Each line in the HDFS file ends with the row delimiter `\n`.
  2. Within a line, the columns also need a delimiter, e.g. `,`.
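Concretely, for the `test` table above (`fields terminated by ','`), the data file is plain text where each line is one row and commas split the columns; the values below are illustrative:

```
1001,zs,18
1002,ls,20
```

Each line maps to one row of `test(id string, name string, age bigint)`; if the file's delimiter does not match the table's `row format`, the columns come back as NULL rather than failing.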
- Loading data:
- Syntax
- LOCAL: load data from the local filesystem into the Hive table.
  Without LOCAL: load data from HDFS into the Hive table.

```
## Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

## Example
hive (bigdate_hive)> load data local inpath '/home/hadoop/tmp/test.txt' into TABLE test;
Loading data to table bigdate_hive.test
OK
Time taken: 1.041 seconds
```
- Exercise: 1. create the emp table and load data into it

```
-- DDL
CREATE TABLE emp (
  empno decimal(4,0),
  ename string,
  job string,
  mgr decimal(4,0),
  hiredate string,
  sal decimal(7,2),
  comm decimal(7,2),
  deptno decimal(2,0)
)
row format delimited fields terminated by ','
stored as textfile;

-- prepare the data
[hadoop@bigdate20 tmp]$ vim emp.txt
-------------------------------------------
hive (bigdate_hive)> CREATE TABLE emp (
                   >   empno decimal(4,0),
                   >   ename string,
                   >   job string,
                   >   mgr decimal(4,0),
                   >   hiredate string,
                   >   sal decimal(7,2),
                   >   comm decimal(7,2),
                   >   deptno decimal(2,0)
                   > )
                   > row format delimited fields terminated by ','
                   > stored as textfile;
OK
Time taken: 0.182 seconds
hive (bigdate_hive)> load data local inpath '/home/hadoop/tmp/emp.txt' into table emp;
Loading data to table bigdate_hive.emp
OK
Time taken: 2.836 seconds
hive (bigdate_hive)> select * from emp;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH   CLERK      7902  1980-12-17 14:00   800.00  NULL     20
7499  ALLEN   SALESMAN   7698  1981-2-20 14:00   1600.00  300.00   30
7521  WARD    SALESMAN   7698  1981-2-22 14:00   1250.00  500.00   30
7566  JONES   MANAGER    7839  1981-4-2 14:00    2975.00  NULL     20
7654  MARTIN  SALESMAN   7698  1981-9-28 13:00   1250.00  1400.00  30
7698  BLAKE   MANAGER    7839  1981-5-1 13:00    2850.00  NULL     30
7782  CLARK   MANAGER    7839  1981-6-9 13:00    2450.00  NULL     10
7788  SCOTT   ANALYST    7566  1982-12-9 14:00   3000.00  NULL     20
7839  KING    PRESIDENT  NULL  1981-11-17 14:00  5000.00  NULL     10
7844  TURNER  SALESMAN   7698  1981-9-8 13:00    1500.00  0.00     30
7876  ADAMS   CLERK      7788  1983-1-12 14:00   1100.00  NULL     20
7900  JAMES   CLERK      7698  1981-12-3 14:00    950.00  NULL     30
7902  FORD    ANALYST    7566  1981-12-3 14:00   3000.00  NULL     20
7934  MILLER  CLERK      7782  1982-1-23 14:00   1300.00  NULL     10
Time taken: 0.328 seconds, Fetched: 14 row(s)
```
- Creating Hive tables:
- 1. A regular CREATE TABLE statement
- 2. CREATE TABLE emp2 like emp; => copies only the table structure [the DDL]
- 3. create table emp3 as select * from emp; => structure + data
  CTAS:
  typical use case: materializing the result of an ad-hoc query into a table
- DML:
- 1. load
  - 1. load local data
  - 2. load HDFS data
  - [OVERWRITE replaces the data already in the table]
- Syntax

```
## Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
```
- 1.1 Load local data

```
hive (bigdate_hive)> load data local inpath '/home/hadoop/tmp/emp.txt' INTO TABLE emp;
Loading data to table bigdate_hive.emp
OK
Time taken: 1.268 seconds
```
- 1.2 Load HDFS data

```
[hadoop@bigdate20 tmp]$ hadoop fs -ls /data
[hadoop@bigdate20 tmp]$ hadoop fs -put ./emp.txt /data
[hadoop@bigdate20 tmp]$ hadoop fs -put ./emp.txt /data
put: `/data/emp.txt': File exists
[hadoop@bigdate20 tmp]$ hadoop fs -ls /data
Found 1 items
-rw-r--r--   1 hadoop supergroup        687 2022-12-01 23:06 /data/emp.txt
```
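The file staged on HDFS above is then loaded by omitting the LOCAL keyword; a sketch, matching the syntax shown earlier:

```sql
-- Without LOCAL, 'filepath' is resolved on HDFS. Note that Hive MOVES the
-- file into the table's warehouse directory, so it disappears from /data.
load data inpath '/data/emp.txt' into table emp;
```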
- 1.3 Overwrite the data in the table

```
hive (bigdate_hive)> load data local inpath '/home/hadoop/tmp/emp.txt' OVERWRITE INTO TABLE emp;
Loading data to table bigdate_hive.emp
OK
Time taken: 1.016 seconds
hive (bigdate_hive)> select * from emp;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH   CLERK      7902  1980-12-17 14:00   800.00  NULL     20
7499  ALLEN   SALESMAN   7698  1981-2-20 14:00   1600.00  300.00   30
7521  WARD    SALESMAN   7698  1981-2-22 14:00   1250.00  500.00   30
7566  JONES   MANAGER    7839  1981-4-2 14:00    2975.00  NULL     20
7654  MARTIN  SALESMAN   7698  1981-9-28 13:00   1250.00  1400.00  30
7698  BLAKE   MANAGER    7839  1981-5-1 13:00    2850.00  NULL     30
7782  CLARK   MANAGER    7839  1981-6-9 13:00    2450.00  NULL     10
7788  SCOTT   ANALYST    7566  1982-12-9 14:00   3000.00  NULL     20
7839  KING    PRESIDENT  NULL  1981-11-17 14:00  5000.00  NULL     10
7844  TURNER  SALESMAN   7698  1981-9-8 13:00    1500.00  0.00     30
7876  ADAMS   CLERK      7788  1983-1-12 14:00   1100.00  NULL     20
7900  JAMES   CLERK      7698  1981-12-3 14:00    950.00  NULL     30
7902  FORD    ANALYST    7566  1981-12-3 14:00   3000.00  NULL     20
7934  MILLER  CLERK      7782  1982-1-23 14:00   1300.00  NULL     10
Time taken: 0.218 seconds, Fetched: 14 row(s)
```
- 2. insert
- 1. Inserting data into Hive tables from queries
  insert into|OVERWRITE table tablename selectQuery
  - into: appends data
  - overwrite: replaces the existing data
- 2. Inserting values into tables from SQL [not recommended]
  INSERT INTO TABLE tablename
  VALUES values_row [, values_row ...]
  1. Each inserted row triggers a MapReduce job — far too slow
- Example
```
-- emp2
hive (bigdate_hive)> insert into table emp2 select * from emp;
hive (bigdate_hive)> desc emp2;
OK
col_name  data_type  comment
empno     decimal(4,0)
ename     string
job       string
mgr       decimal(4,0)
hiredate  string
sal       decimal(7,2)
comm      decimal(7,2)
deptno    decimal(2,0)
Time taken: 0.132 seconds, Fetched: 8 row(s)
-----------------------------------------------------
hive (bigdate_hive)> insert overwrite table emp2 select * from emp where deptno=10;
hive (bigdate_hive)> select * from emp2;
OK
emp2.empno  emp2.ename  emp2.job  emp2.mgr  emp2.hiredate  emp2.sal  emp2.comm  emp2.deptno
7782  CLARK   MANAGER    7839  1981-6-9 13:00    2450.00  NULL  10
7839  KING    PRESIDENT  NULL  1981-11-17 14:00  5000.00  NULL  10
7934  MILLER  CLERK      7782  1982-1-23 14:00   1300.00  NULL  10
Time taken: 0.285 seconds, Fetched: 3 row(s)
```
- 3. update / delete => do not use them in Hive; they are inefficient
  use insert OVERWRITE instead
- 4. Truncating a table
- [Syntax] TRUNCATE [TABLE] table_name

```
hive (bigdate_hive)> truncate table emp2;
OK
Time taken: 1.393 seconds
```
- 5. select
- Syntax

```
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT [offset,] rows]
```
- 1. where — filter conditions
  where_condition operators:
  `<`, `>`, `=`, `<>` / `!=`, `and`, `or`, `in`, `not in`, `between ... and`, `is null`, `is not null`
- Requirement: query the rows where deptno is 20 or 10
```
hive (bigdate_hive)> select
                   >   *
                   > from emp
                   > where deptno=20 or deptno=10;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH   CLERK      7902  1980-12-17 14:00   800.00  NULL  20
7566  JONES   MANAGER    7839  1981-4-2 14:00    2975.00  NULL  20
7782  CLARK   MANAGER    7839  1981-6-9 13:00    2450.00  NULL  10
7788  SCOTT   ANALYST    7566  1982-12-9 14:00   3000.00  NULL  20
7839  KING    PRESIDENT  NULL  1981-11-17 14:00  5000.00  NULL  10
7876  ADAMS   CLERK      7788  1983-1-12 14:00   1100.00  NULL  20
7902  FORD    ANALYST    7566  1981-12-3 14:00   3000.00  NULL  20
7934  MILLER  CLERK      7782  1982-1-23 14:00   1300.00  NULL  10
Time taken: 0.508 seconds, Fetched: 8 row(s)

hive (bigdate_hive)> select
                   >   *
                   > from emp
                   > where deptno in (10,20);
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH   CLERK      7902  1980-12-17 14:00   800.00  NULL  20
7566  JONES   MANAGER    7839  1981-4-2 14:00    2975.00  NULL  20
7782  CLARK   MANAGER    7839  1981-6-9 13:00    2450.00  NULL  10
7788  SCOTT   ANALYST    7566  1982-12-9 14:00   3000.00  NULL  20
7839  KING    PRESIDENT  NULL  1981-11-17 14:00  5000.00  NULL  10
7876  ADAMS   CLERK      7788  1983-1-12 14:00   1100.00  NULL  20
7902  FORD    ANALYST    7566  1981-12-3 14:00   3000.00  NULL  20
7934  MILLER  CLERK      7782  1982-1-23 14:00   1300.00  NULL  10
Time taken: 0.421 seconds, Fetched: 8 row(s)

hive (bigdate_hive)> select
                   >   *
                   > from emp
                   > where deptno <> 20;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7499  ALLEN   SALESMAN   7698  1981-2-20 14:00   1600.00  300.00   30
7521  WARD    SALESMAN   7698  1981-2-22 14:00   1250.00  500.00   30
7654  MARTIN  SALESMAN   7698  1981-9-28 13:00   1250.00  1400.00  30
7698  BLAKE   MANAGER    7839  1981-5-1 13:00    2850.00  NULL     30
7782  CLARK   MANAGER    7839  1981-6-9 13:00    2450.00  NULL     10
7839  KING    PRESIDENT  NULL  1981-11-17 14:00  5000.00  NULL     10
7844  TURNER  SALESMAN   7698  1981-9-8 13:00    1500.00  0.00     30
7900  JAMES   CLERK      7698  1981-12-3 14:00    950.00  NULL     30
7934  MILLER  CLERK      7782  1982-1-23 14:00   1300.00  NULL     10
Time taken: 0.221 seconds, Fetched: 9 row(s)
```
- 2. order by — sorting
- 1. asc (ascending) is the default
- 2. desc for descending

```
hive (bigdate_hive)> select
                   >   sal
                   > from emp
                   > order by sal desc;
sal
5000.00
3000.00
3000.00
2975.00
2850.00
2450.00
1600.00
1500.00
1300.00
1250.00
1250.00
1100.00
950.00
800.00
```
- 3. like — pattern matching
- 1. `_` matches exactly one character
- 2. `%` matches any sequence of characters
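A short sketch against the emp table loaded earlier (the patterns are illustrative):

```sql
-- '%' matches any number of characters, '_' matches exactly one
select ename from emp where ename like 'S%';   -- names starting with S (SMITH, SCOTT)
select ename from emp where ename like '_A%';  -- second letter is A (WARD, MARTIN, JAMES)
```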
- 4. Combining tables
- 1. union — deduplicates
- 2. union all — keeps duplicates
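The difference is easiest to see by combining a table with itself; a sketch using the emp table from earlier:

```sql
-- union deduplicates the combined result: each deptno appears once
select deptno from emp
union
select deptno from emp;

-- union all keeps duplicates: every row appears twice
select deptno from emp
union all
select deptno from emp;
```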
- 5. Handling NULL
- 1. Filtering
  where xxx is not null
  (`a is null` and the NULL-safe comparison `a <=> null` behave the same)
- 2. ETL-style replacement
  ifnull => does not exist in Hive
  coalesce => returns the first non-NULL argument
  nvl => nvl(value, default_value)
- Extras
- Checking which functions Hive supports:
  y = f(x)

```
SHOW FUNCTIONS [LIKE "<pattern>"];
show functions like "nvl";  -- check whether a function exists in Hive
desc function nvl;          -- show how a specific function is used
```
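Putting the two NULL-handling approaches together, a sketch against the emp table (comm is NULL for most non-salesman rows):

```sql
-- 1. filtering: keep only rows with a commission
select * from emp where comm is not null;

-- 2. ETL replacement: both return the first non-NULL argument
select ename, nvl(comm, 0)      as comm from emp;
select ename, coalesce(comm, 0) as comm from emp;
```

`coalesce` accepts any number of arguments, while `nvl` takes exactly two; for a simple default value they are interchangeable.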
- 6. Grouping, aggregate functions, join
- 1. group by
  1. used together with aggregate functions
  2. groups by one or more columns
  3. the non-aggregated columns in select must match the group by columns
- having: filters on the aggregated result
  can only be used after group by
- 2. Aggregate functions
sum, max, min, avg, count

```
hive (bigdate_hive)> select
                   >   sum(sal) as sal_sum,
                   >   max(sal) as sal_max,
                   >   min(sal) as sal_min,
                   >   avg(sal) as sal_avg,
                   >   count(1) as cnt
                   > from emp;
sal_sum   sal_max  sal_min  sal_avg                         cnt
29025.00  5000.00  800.00   2073.21428571428571428571428571  14
Time taken: 57.577 seconds, Fetched: 1 row(s)
-------------------------------------------------------------------------------
hive (bigdate_hive)> select
                   >   job,
                   >   sum(sal) as sal_sum,
                   >   max(sal) as sal_max,
                   >   min(sal) as sal_min,
                   >   avg(sal) as sal_avg,
                   >   count(1) as cnt
                   > from emp
                   > group by job
                   > having sal_sum > 6000;
job      sal_sum  sal_max  sal_min  sal_avg      cnt
MANAGER  8275.00  2975.00  2450.00  2758.333333  3
Time taken: 40.858 seconds, Fetched: 1 row(s)

-- equivalent subquery:
select
  job, sal_sum, sal_max, sal_min, sal_avg, cnt
from (
  select
    job,
    sum(sal) as sal_sum,
    max(sal) as sal_max,
    min(sal) as sal_min,
    avg(sal) as sal_avg,
    count(1) as cnt
  from emp
  group by job
) as a
where sal_sum > 6000;
```
- 1.group by
- 7. join
- 1. inner join [join]

```sql
select a.*, b.*
from (select * from a) as a
join (select * from b) as b
on a.id = b.id;
```

- 2. left join

```sql
select a.*, b.*
from (select * from a) as a
left join (select * from b) as b
on a.id = b.id;
```

- 3. right join

```sql
select a.*, b.*
from (select * from a) as a
right join (select * from b) as b
on a.id = b.id;
```
- Requirements against emp
- Approach:
- table:
  dimensions: the group by columns
  metrics: aggregate functions, plain columns
  where: filters
- 1. Number of hires per department per year
- table: emp
  dimensions: department, year
  metrics: head count
  where: none

```
-- select department, year, head count from emp group by department, year;
-- year => derived from hiredate via ETL
select
  deptno, date_format(hiredate,'YYYY') as year,
  count(1) as cnt
from emp
group by deptno, date_format(hiredate,'YYYY');

-- 1980-12-17 00:00:00 => use a date function for the ETL: date_format
-- (note: in SimpleDateFormat patterns 'yyyy' is the calendar year; 'YYYY'
--  is the week-based year and can differ around New Year)
hive (bigdate_hive)> select
                   >   deptno, date_format(hiredate,'YYYY') as year,
                   >   count(1) as cnt
                   > from emp
                   > group by deptno, date_format(hiredate,'YYYY');
deptno  year  cnt
10  1981  2
10  1982  1
20  1980  1
20  1981  2
20  1982  1
20  1983  1
30  1981  6
Time taken: 40.421 seconds, Fetched: 7 row(s)
```
- 2. Company-wide hires per year and month
- dimensions: year, month
  metrics: head count
  hiredate: 1980-12-17 00:00:00 => 1980-12

```
select
  date_format(hiredate,'YYYY-MM') as ymoth,
  count(1) as cnt
from emp
group by date_format(hiredate,'YYYY-MM');

hive (bigdate_hive)> select
                   >   date_format(hiredate,'YYYY-MM') as ymoth,
                   >   count(1) as cnt
                   > from emp
                   > group by date_format(hiredate,'YYYY-MM');
Total MapReduce CPU Time Spent: 6 seconds 610 msec
OK
ymoth    cnt
1980-12  1
1981-02  2
1981-04  1
1981-05  1
1981-06  1
1981-09  2
1981-11  1
1981-12  2
1982-01  1
1982-12  1
1983-01  1
Time taken: 34.947 seconds, Fetched: 11 row(s)
```
- 1. Window functions
- Aggregate functions collapse multiple rows into one row by some rule;
  in principle, rows after aggregation <= rows before aggregation.
- Requirement: show both the pre-aggregation rows and the aggregated result?
  e.g.:
  id  name  sal
  1   zs    3w
  2   ls    2.5w
  3   ww    2w
- Requirement: sort by salary descending and also show each row's rank
  id  name  sal   rank
  1   zs    3w    1
  2   ls    2.5w  2
  3   ww    2w    3
- Window functions: window + function
- Window: the range of rows the function computes over
  Function: the function being evaluated
- Syntax:
- function over([partition by xxx,...] [order by xxx,...])
  over: opens the window
  partition by: the grouping columns
  order by: the ordering columns
  function: a window function or an aggregate function
- Example

```
-- data: mt.txt
haige,2022-11-10,1
haige,2022-11-11,5
haige,2022-11-12,7
haige,2022-11-13,3
haige,2022-11-14,2
haige,2022-11-15,4
haige,2022-11-16,4

-- create a table
create table user_mt (
  name string,
  dt string,
  cnt int
) row format delimited fields terminated by ',';

-- load the data
load data local inpath '/home/hadoop/tmp/mt.txt' overwrite into table user_mt;

-- requirement: running total — each user's cumulative take-out orders per day
-- [partition by xxx,...] [order by xxx,...]
select
  name,
  dt,
  cnt,
  sum(cnt) over(partition by name order by dt) as sum_cnt
from user_mt;

hive (bigdate_hive)> select
                   >   name,
                   >   dt,
                   >   cnt,
                   >   sum(cnt) over(partition by name order by dt) as sum_cnt
                   > from user_mt;
Total MapReduce CPU Time Spent: 7 seconds 810 msec
OK
name   dt          cnt  sum_cnt
haige  2022-11-10  1    1
haige  2022-11-11  5    6
haige  2022-11-12  7    13
haige  2022-11-13  3    16
haige  2022-11-14  2    18
haige  2022-11-15  4    22
haige  2022-11-16  4    26
Time taken: 39.138 seconds, Fetched: 7 row(s)
```
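The salary-ranking requirement above (show every row plus its rank) is handled by ranking window functions rather than sum; a sketch against the emp table from earlier:

```sql
select
  ename,
  sal,
  rank()       over(order by sal desc) as rk,   -- gaps after ties: 1,2,2,4
  dense_rank() over(order by sal desc) as drk,  -- no gaps:         1,2,2,3
  row_number() over(order by sal desc) as rn    -- unique sequence: 1,2,3,4
from emp;
```

The three functions differ only in how they number tied salaries (emp has two rows with sal = 3000.00), so comparing their outputs side by side is the quickest way to pick the right one.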
- 2. Command line
- 1. hive shell
- 2. jdbc => HiveServer2
- Extras:
  beeline => connects to hive => hdfs
  requires a proxy-user setting on HDFS
- Environment configuration needed
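The proxy-user setting mentioned above lives in Hadoop's core-site.xml; a sketch, assuming HiveServer2 runs as the user hadoop (as in the sessions above) — HDFS must be restarted after the change:

```xml
<!-- core-site.xml: allow the user running HiveServer2 ("hadoop") to
     impersonate the users who connect through beeline/JDBC -->
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>
```

The `*` values are permissive and fine for a study cluster; production setups usually restrict them to specific hosts and groups.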
Hive-day02