Preface:
Recap of the previous lesson: the core of big-data development is low-level optimization and architecture optimization; writing code is a smaller part of it. With platformization, once a platform is built you just drag and drop on a page, because the underlying data platform is already packaged up. Knowing only one isolated feature is merely the entry level of big-data development, the work of a junior programmer. Oracle is almost never used in big data; only some lightweight databases are.
We covered DDL and DML, and the abstraction of Hive's data model: the top layer is the database, below it the table, then the partition, and at the bottom the bucket. Every layer above the bucket corresponds to a directory on HDFS, because Hive is a data warehouse built on top of Hadoop: Hive's data lives on HDFS, while Hive's metadata lives in a relational database such as MySQL.
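The model layers above the bucket can be pictured as a directory layout. A minimal Python sketch, assuming the default warehouse path; the database and table names here are invented for illustration:

```python
# Sketch: how Hive's model layers map to HDFS directories under the
# default warehouse path (hive.metastore.warehouse.dir).
WAREHOUSE = "/user/hive/warehouse"

def table_dir(db, table):
    # each database is a <db>.db folder; each table a folder below it
    return f"{WAREHOUSE}/{db}.db/{table}"

def partition_dir(db, table, **parts):
    # each partition is a key=value subfolder under the table folder
    suffix = "/".join(f"{k}={v}" for k, v in parts.items())
    return f"{table_dir(db, table)}/{suffix}"

print(table_dir("test1", "orders"))
print(partition_dir("test1", "orders", event_month="2014-05"))
```

The actual files (the "bucket" level and below) then sit inside the partition directory.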
DDL: creating databases and creating tables
Table operations:
1. create a table; 2. copy a table's structure; 3. copy another table's data with CREATE TABLE ... AS SELECT
insert overwrite
When inserting, if a column's position does not line up with the source columns, you get an error or scrambled data.
So be careful when inserting: spell out the number of columns, their types, and their order.
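A toy Python illustration of why the order matters: Hive binds INSERT ... SELECT columns by position, not by name, so a swapped pair silently corrupts rows. The table and column names here are invented:

```python
# Hive INSERT ... SELECT matches columns strictly by position,
# so swapping two source columns corrupts rows without an error.
target_columns = ("empno", "ename", "salary")

def insert_row(values):
    # values are bound to target columns purely by position
    return dict(zip(target_columns, values))

good = insert_row((7369, "SMITH", 800.0))
bad  = insert_row(("SMITH", 7369, 800.0))   # empno/ename swapped by mistake

print(good)   # empno holds 7369, as intended
print(bad)    # empno now holds 'SMITH': wrong data, no error raised
```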
Managed tables vs. external tables:
Hive has two kinds of tables: Managed Tables (the default) and External Tables (created with the EXTERNAL keyword). A managed table's data is managed by Hive itself; an external table's data is managed on HDFS.
The main difference: when we DROP a table, a Managed Table deletes both the data (stored on HDFS) and the metadata (stored in MySQL), while an External Table deletes only the metadata. A managed table's data lives under hive.metastore.warehouse.dir (default: /user/hive/warehouse); an external table's data location is specified by you. Changes to a managed table are synced to the metadata directly, while after changing an external table's structure or partitions you need a repair (MSCK REPAIR TABLE table_name;). Wrap this in a business scenario of your own to explain it.
In production, external tables are used 99% of the time, because a copy of the data survives: if the table gets dropped, re-load it with LOAD DATA LOCAL INPATH and you are back.
Managed tables cover the remaining 1% of production cases: for data shipped in from outside, you can also build an external table on it, and if the table is dropped you simply import another copy.
DML: loading data. In 95% of cases, loading a file with LOAD is enough; it supports loading from the local filesystem and from HDFS, and in over 95% of cases you use OVERWRITE, i.e. replace the table's data on load (the mode used in production). Hive has supported INSERT INTO since version 0.14, but it is a clunky feature; the only time to use it is to fabricate a little data while testing. Not recommended.
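The difference between OVERWRITE and INSERT INTO can be sketched in Python with a list standing in for the table's contents (purely illustrative):

```python
# OVERWRITE replaces a table's contents; INSERT INTO appends to them.
table = ["old_row_1", "old_row_2"]
new_data = ["row_a", "row_b"]

# LOAD DATA ... OVERWRITE INTO TABLE: existing contents are replaced
table = list(new_data)

# INSERT INTO: appends a row (the notes call this useful only for tests)
table += ["test_row"]
print(table)  # ['row_a', 'row_b', 'test_row']
```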
This lesson covers:
SELECT (queries):
hive> select * from emp where deptno=10;
OK
7782 CLARK MANAGER 7839 1981/6/9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981/11/17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982/1/23 1300.0 NULL 10
Time taken: 1.144 seconds, Fetched: 3 row(s)
hive> select * from emp where empno <= 7800;
OK
7369 SMITH CLERK 7902 1980/12/17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981/2/20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981/2/22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981/4/2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981/9/28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981/5/1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981/6/9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987/4/19 3000.0 NULL 20
Time taken: 0.449 seconds, Fetched: 8 row(s)
hive> select * from emp where salary between 1000 and 1500;
OK
7521 WARD SALESMAN 7698 1981/2/22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981/9/28 1250.0 1400.0 30
7844 TURNER SALESMAN 7698 1981/9/8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987/5/23 1100.0 NULL 20
7934 MILLER CLERK 7782 1982/1/23 1300.0 NULL 10
Time taken: 0.178 seconds, Fetched: 5 row(s)
hive> select * from emp limit 5;
OK
7369 SMITH CLERK 7902 1980/12/17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981/2/20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981/2/22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981/4/2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981/9/28 1250.0 1400.0 30
Time taken: 0.47 seconds, Fetched: 5 row(s)
hive> select * from emp where empno in(7566,7499);
OK
7499 ALLEN SALESMAN 7698 1981/2/20 1600.0 300.0 30
7566 JONES MANAGER 7839 1981/4/2 2975.0 NULL 20
Time taken: 0.4 seconds, Fetched: 2 row(s)
hive> select * from emp where comm is not null;
OK
7499 ALLEN SALESMAN 7698 1981/2/20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981/2/22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981/9/28 1250.0 1400.0 30
7844 TURNER SALESMAN 7698 1981/9/8 1500.0 0.0 30
Time taken: 0.262 seconds, Fetched: 4 row(s)
Aggregate functions:
max/min/count/sum/avg. Their hallmark: many in, one out. Many records go in, a single record comes out.
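The many-in-one-out behaviour can be mimicked in plain Python over a handful of salary values (numbers invented for illustration):

```python
# Five input rows collapse into a single summary row, just like
# max/min/count/sum/avg in a Hive query with no GROUP BY.
salaries = [800.0, 1600.0, 1250.0, 2975.0, 1250.0]

result = {
    "max": max(salaries),
    "min": min(salaries),
    "count": len(salaries),
    "sum": sum(salaries),
    "avg": sum(salaries) / len(salaries),
}
print(result)  # one row out, regardless of how many rows went in
```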
-- How many records does department 10 have?
hive> select count(1) from ruozedata_emp where deptno=10;
Query ID = hadoop_20180608111010_cdb33a86-762b-4a72-8f56-c94c9f719b7d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0001, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0001/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528425516016_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-08 11:18:19,145 Stage-1 map = 0%, reduce = 0%
2018-06-08 11:18:26,615 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.08 sec
2018-06-08 11:18:33,001 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.97 sec
MapReduce Total cumulative CPU time: 4 seconds 970 msec
Ended Job = job_1528425516016_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.97 sec HDFS Read: 8417 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 970 msec
OK
3 (department 10 has three records)
Time taken: 28.056 seconds, Fetched: 1 row(s)
-- Verify that it really is three records
hive> select * from ruozedata_emp where deptno=10;
OK
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 0.107 seconds, Fetched: 3 row(s)
-- Max salary, min salary, average salary, and total salary
hive> select max(salary),min(salary),avg(salary),sum(salary) from ruozedata_emp;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0002, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528425516016_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-08 11:27:54,739 Stage-1 map = 0%, reduce = 0%
2018-06-08 11:28:00,082 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.46 sec
2018-06-08 11:28:06,527 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.18 sec
MapReduce Total cumulative CPU time: 3 seconds 180 msec
Ended Job = job_1528425516016_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.18 sec HDFS Read: 10128 HDFS Write: 41 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 180 msec
OK
10300.0 (max salary) 800.0 (min salary) 2621.6666666666665 (average salary) 39325.0 (total salary)
Time taken: 21.745 seconds, Fetched: 1 row(s)
Grouping: GROUP BY
-- Average salary per department
hive> select deptno,avg(salary) from ruozedata_emp group by deptno;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0003, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0003/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528425516016_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-08 11:31:49,358 Stage-1 map = 0%, reduce = 0%
2018-06-08 11:31:53,622 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2018-06-08 11:31:59,910 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.72 sec
MapReduce Total cumulative CPU time: 2 seconds 720 msec
Ended Job = job_1528425516016_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.72 sec HDFS Read: 8490 HDFS Write: 65 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 720 msec
OK
NULL 10300.0
10 2916.6666666666665
20 2175.0
30 1566.6666666666667
Time taken: 18.807 seconds, Fetched: 4 row(s)
A common error:
hive> select ename,deptno,avg(salary) from ruozedata_emp group by deptno;
FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'ename'
Any column in the SELECT list that is not wrapped in a group/aggregate function must appear in the GROUP BY clause, otherwise you get this error.
After adding ename to the GROUP BY, the error goes away. So apart from aggregate functions, every column that appears in the SELECT must be added to GROUP BY.
hive> select ename,deptno,avg(salary) from ruozedata_emp group by ename,deptno;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0004, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0004/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528425516016_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-08 11:38:54,714 Stage-1 map = 0%, reduce = 0%
2018-06-08 11:38:59,953 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
2018-06-08 11:39:06,262 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.55 sec
MapReduce Total cumulative CPU time: 2 seconds 550 msec
Ended Job = job_1528425516016_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.55 sec HDFS Read: 8851 HDFS Write: 238 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 550 msec
OK
ADAMS 20 1100.0
ALLEN 30 1600.0
BLAKE 30 2850.0
CLARK 10 2450.0
FORD 20 3000.0
HIVE NULL 10300.0
JAMES 30 950.0
JONES 20 2975.0
KING 10 5000.0
MARTIN 30 1250.0
MILLER 10 1300.0
SCOTT 20 3000.0
SMITH 20 800.0
TURNER 30 1500.0
WARD 30 1250.0
Time taken: 18.978 seconds, Fetched: 15 row(s)
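What GROUP BY computes can be sketched in Python: rows are bucketed by the grouping key, and every non-key SELECT column must be aggregated, because one bucket holds many values for it. The sample rows below are an abridged, invented subset of the emp data:

```python
# GROUP BY deptno with avg(salary), modelled with a dict of buckets.
from collections import defaultdict

rows = [
    {"ename": "CLARK", "deptno": 10, "salary": 2450.0},
    {"ename": "KING",  "deptno": 10, "salary": 5000.0},
    {"ename": "SMITH", "deptno": 20, "salary": 800.0},
]

groups = defaultdict(list)
for r in rows:
    groups[r["deptno"]].append(r["salary"])      # bucket by the key

avg_by_dept = {d: sum(s) / len(s) for d, s in groups.items()}
print(avg_by_dept)  # {10: 3725.0, 20: 800.0}
```

Selecting ename here would be ambiguous: bucket 10 holds both CLARK and KING, which is exactly why Hive rejects the query above.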
Max salary per department (deptno) and job: straightforward, just mind the columns after GROUP BY
hive> select deptno,job,max(salary) from ruozedata_emp group by deptno,job;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0005, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0005/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528425516016_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-08 13:06:16,651 Stage-1 map = 0%, reduce = 0%
2018-06-08 13:06:22,003 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.62 sec
2018-06-08 13:06:28,306 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.28 sec
MapReduce Total cumulative CPU time: 3 seconds 280 msec
Ended Job = job_1528425516016_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.28 sec HDFS Read: 8488 HDFS Write: 177 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 280 msec
OK
NULL PROGRAM 10300.0
10 CLERK 1300.0
10 MANAGER 2450.0
10 PRESIDENT 5000.0
20 ANALYST 3000.0
20 CLERK 1100.0
20 MANAGER 2975.0
30 CLERK 950.0
30 MANAGER 2850.0
30 SALESMAN 1600.0
Time taken: 21.136 seconds, Fetched: 10 row(s)
Departments whose average salary exceeds 2000: the usual case of filtering after grouping, supported since Hive 0.11.
[Ruozedata big-data interview question]
The difference between WHERE and HAVING: WHERE filters individual records;
HAVING filters the results after grouping.
hive> select deptno avg(salary) from ruozedata_emp group by deptno where avg(salary) >2000 ;
FAILED: ParseException line 1:17 missing EOF at '(' near 'avg' (a syntax error: WHERE must come before GROUP BY)
Since we filter on the aggregated result (average salary above 2000), we need HAVING:
hive> select deptno,avg(salary) from ruozedata_emp group by deptno having avg(salary) >2000;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0006, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0006/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528425516016_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-08 13:17:37,251 Stage-1 map = 0%, reduce = 0%
2018-06-08 13:17:42,490 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.17 sec
2018-06-08 13:17:48,706 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.29 sec
MapReduce Total cumulative CPU time: 3 seconds 290 msec
Ended Job = job_1528425516016_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.29 sec HDFS Read: 8929 HDFS Write: 43 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 290 msec
OK
NULL 10300.0
10 2916.6666666666665
20 2175.0
Time taken: 18.912 seconds, Fetched: 3 row(s)
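The WHERE-before-grouping versus HAVING-after-aggregation order can be modelled in a few lines of Python (rows invented for illustration):

```python
# WHERE filters individual rows before grouping;
# HAVING filters whole groups after aggregation.
from collections import defaultdict

rows = [
    {"deptno": 10, "salary": 2450.0},
    {"deptno": 10, "salary": 5000.0},
    {"deptno": 30, "salary": 950.0},
    {"deptno": 30, "salary": 1600.0},
]

# WHERE salary > 1000: drops rows before any grouping happens
kept = [r for r in rows if r["salary"] > 1000]

groups = defaultdict(list)
for r in kept:
    groups[r["deptno"]].append(r["salary"])

# HAVING avg(salary) > 2000: drops whole groups after aggregation
result = {d: sum(s) / len(s) for d, s in groups.items()
          if sum(s) / len(s) > 2000}
print(result)  # {10: 3725.0}: only deptno 10 survives
```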
CASE WHEN ... THEN ... END: derive a handy label from an existing column:
hive> select ename, salary,
> case
> when salary > 1 and salary <= 1000 then 'LOWER'
> when salary > 1000 and salary <= 2000 then 'MIDDLE'
> when salary > 2000 and salary <= 4000 then 'HIGH'
> ELSE 'HIGHEST'
> end
> from ruozedata_emp;
OK
SMITH 800.0 LOWER
ALLEN 1600.0 MIDDLE
WARD 1250.0 MIDDLE
JONES 2975.0 HIGH
MARTIN 1250.0 MIDDLE
BLAKE 2850.0 HIGH
CLARK 2450.0 HIGH
SCOTT 3000.0 HIGH
KING 5000.0 HIGHEST
TURNER 1500.0 MIDDLE
ADAMS 1100.0 MIDDLE
JAMES 950.0 LOWER
FORD 3000.0 HIGH
MILLER 1300.0 MIDDLE
HIVE 10300.0 HIGHEST
Time taken: 0.134 seconds, Fetched: 15 row(s)
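The CASE WHEN above is just a banding function; a literal Python translation of the same thresholds:

```python
# Same bands as the CASE WHEN: (1,1000] LOWER, (1000,2000] MIDDLE,
# (2000,4000] HIGH, everything else HIGHEST.
def salary_band(salary):
    if 1 < salary <= 1000:
        return "LOWER"
    elif 1000 < salary <= 2000:
        return "MIDDLE"
    elif 2000 < salary <= 4000:
        return "HIGH"
    return "HIGHEST"

print(salary_band(800.0))    # LOWER
print(salary_band(5000.0))   # HIGHEST
```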
-- JOIN:
[hadoop@hadoop000 data]$ vi join_a.txt
1 ruoze
2 j
3 k
[hadoop@hadoop000 data]$ vi join_b.txt
1 30
2 29
4 21
-- Create table a and table b:
hive> create table a(
> id int, name string
> ) row format delimited fields terminated by '\t';
OK
Time taken: 0.307 seconds
hive> create table b(
> id int, age int
> ) row format delimited fields terminated by '\t';
OK
Time taken: 0.082 seconds
-- Load the data into tables a and b
load data local inpath '/home/hadoop/data/join_a.txt' overwrite into table a;
load data local inpath '/home/hadoop/data/join_b.txt' overwrite into table b;
-- Why does the query return NULLs? The delimiter declared when creating a Hive table must match the delimiter in the loaded file, or the query result is wrong
hive> select * from a;
OK
NULL NULL
NULL NULL
NULL NULL
Time taken: 0.051 seconds, Fetched: 3 row(s)
Each line of /home/hadoop/data/join_a.txt must separate its fields (id, name) with a single Tab. After fixing the file and reloading:
hive> select * from a;
OK
1 ruoze
2 j
3 k
hive> select * from b;
OK
1 30
2 29
4 21
INNER JOIN (table a JOIN table b): only rows whose keys match in both tables are returned
hive> select a.id,a.name,b.age from a join b on a.id=b.id;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Execution log at: /tmp/hadoop/hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52.log
2018-06-08 04:42:56 Starting to launch local task to process map join; maximum memory = 477626368
2018-06-08 04:42:57 Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/hadoop/079e5638-99ad-41be-8fa3-ccdb1f0c780a/hive_2018-06-08_16-42-52_389_8099852311780202647-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable
2018-06-08 04:42:57 Uploaded 1 File to: file:/tmp/hadoop/079e5638-99ad-41be-8fa3-ccdb1f0c780a/hive_2018-06-08_16-42-52_389_8099852311780202647-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable (320 bytes)
2018-06-08 04:42:57 End of local task; Time Taken: 1.225 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1528425516016_0007, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0007/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528425516016_0007
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-06-08 16:43:03,901 Stage-3 map = 0%, reduce = 0%
2018-06-08 16:43:10,137 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.6 sec
MapReduce Total cumulative CPU time: 2 seconds 600 msec
Ended Job = job_1528425516016_0007
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 2.6 sec HDFS Read: 5564 HDFS Write: 18 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 600 msec
OK
1 ruoze 30
2 j 29
Time taken: 18.935 seconds, Fetched: 2 row(s)
Outer joins: LEFT JOIN takes the left table as the base (all its rows appear, with NULL where there is no match); RIGHT JOIN does the same with the right table as the base, as in this example:
hive> select a.id,a.name,b.age from a right join b on a.id=b.id;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Execution log at: /tmp/hadoop/hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52.log
2018-06-08 04:56:03 Starting to launch local task to process map join; maximum memory = 477626368
2018-06-08 04:56:04 Dump the side-table for tag: 0 with group count: 3 into file: file:/tmp/hadoop/079e5638-99ad-41be-8fa3-ccdb1f0c780a/hive_2018-06-08_16-56-00_679_234800897436568174-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile20--.hashtable
2018-06-08 04:56:04 Uploaded 1 File to: file:/tmp/hadoop/079e5638-99ad-41be-8fa3-ccdb1f0c780a/hive_2018-06-08_16-56-00_679_234800897436568174-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile20--.hashtable (327 bytes)
2018-06-08 04:56:04 End of local task; Time Taken: 0.905 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1528425516016_0009, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0009/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528425516016_0009
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-06-08 16:56:11,359 Stage-3 map = 0%, reduce = 0%
2018-06-08 16:56:16,538 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.36 sec
MapReduce Total cumulative CPU time: 1 seconds 360 msec
Ended Job = job_1528425516016_0009
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 1.36 sec HDFS Read: 5343 HDFS Write: 27 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 360 msec
OK
1 ruoze 30
2 j 29
NULL NULL 21
Time taken: 16.918 seconds, Fetched: 3 row(s)
FULL join: keep everything from both sides; missing values are filled with NULL
hive> select a.id,a.name,b.age from a full join b on a.id=b.id;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0010, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0010/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528425516016_0010
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2018-06-08 17:00:02,522 Stage-1 map = 0%, reduce = 0%
2018-06-08 17:00:09,974 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.65 sec
2018-06-08 17:00:11,010 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.11 sec
2018-06-08 17:00:16,220 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.55 sec
MapReduce Total cumulative CPU time: 4 seconds 550 msec
Ended Job = job_1528425516016_0010
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 4.55 sec HDFS Read: 10932 HDFS Write: 34 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 550 msec
OK
1 ruoze 30
2 j 29
3 k NULL
NULL NULL 21
Time taken: 20.559 seconds, Fetched: 4 row(s)
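The join flavours above can be replayed on the same toy a/b data in Python, with None standing in for Hive's NULL:

```python
# inner / left / right / full joins on the a and b tables from the text.
a = {1: "ruoze", 2: "j", 3: "k"}      # id -> name
b = {1: 30, 2: 29, 4: 21}             # id -> age

inner = [(i, a[i], b[i]) for i in a if i in b]
left  = [(i, a[i], b.get(i)) for i in a]                 # all of a's rows
right = [(i, a.get(i), b[i]) for i in b]                 # all of b's rows
full  = [(i, a.get(i), b.get(i)) for i in sorted(set(a) | set(b))]

print(inner)  # [(1, 'ruoze', 30), (2, 'j', 29)]
print(full)   # id 3 has no age, id 4 has no name
```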
LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery. Hive (at the time of these notes) does not implement IN/EXISTS subqueries, so you can rewrite such subqueries with LEFT SEMI JOIN.
The restriction on LEFT SEMI JOIN: the right-hand table may only be referenced in the ON clause; filtering it in the WHERE clause, the SELECT list, or anywhere else is not allowed.
hive> select * from a Left Semi Join b on a.id=b.id;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Execution log at: /tmp/hadoop/hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52.log
2018-06-08 05:12:45 Starting to launch local task to process map join; maximum memory = 477626368
2018-06-08 05:12:46 Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/hadoop/079e5638-99ad-41be-8fa3-ccdb1f0c780a/hive_2018-06-08_17-12-42_081_4100879749531493003-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile31--.hashtable
2018-06-08 05:12:46 Uploaded 1 File to: file:/tmp/hadoop/079e5638-99ad-41be-8fa3-ccdb1f0c780a/hive_2018-06-08_17-12-42_081_4100879749531493003-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile31--.hashtable (314 bytes)
2018-06-08 05:12:46 End of local task; Time Taken: 1.322 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1528425516016_0011, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0011/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528425516016_0011
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-06-08 17:12:54,459 Stage-3 map = 0%, reduce = 0%
2018-06-08 17:12:59,721 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.21 sec
MapReduce Total cumulative CPU time: 2 seconds 210 msec
Ended Job = job_1528425516016_0011
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 2.21 sec HDFS Read: 5197 HDFS Write: 12 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 210 msec
OK
1 ruoze
2 j
Time taken: 18.701 seconds, Fetched: 2 row(s)
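LEFT SEMI JOIN semantics fit in one line of Python: keep the rows of a whose key exists in b, return only a's columns, and never duplicate a row even when b has several matches:

```python
# LEFT SEMI JOIN a -> b on id: an existence check, not a row pairing.
a = [(1, "ruoze"), (2, "j"), (3, "k")]
b_ids = {1, 2, 4, 2}          # duplicate match ids collapse in the set

semi = [row for row in a if row[0] in b_ids]
print(semi)  # [(1, 'ruoze'), (2, 'j')]: only a's columns come back
```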
1. Background
When a Hive table grows large, every query scans the whole table, which is clearly a heavy burden on the machine. Can we categorize the data so a query only scans the relevant category? That would solve the problem, and it is exactly why Hive introduces partitions under tables.
One partition of a partitioned table corresponds to one directory on HDFS.
Partitioned tables come in static and dynamic varieties, distinguished by whether partitions are created automatically. A multi-level partitioned table declares several keys at creation, e.g. PARTITIONED BY (event_month string, loc string); directories are created in cascade following that order, event_month=XXX/loc=XXX. Everything else works like a single-level partitioned table.
2. Static partitioned tables
Create a static partitioned table and load data:
-- Create a partitioned table:
create database test1;
use test1;
hive> create table order_partition(
> ordernumber string,
> eventtime string
> )
> partitioned by (event_month string)
> row format delimited fields terminated by '\t';
-- The CREATE statement declares the partition column event_month; it is a pseudo-column -- when data is loaded into this partition, a subdirectory named event_month=2014-05 is created on HDFS
hive> desc formatted order_partition;
OK
# col_name data_type comment
ordernumber string
eventtime string
# Partition Information
# col_name data_type comment
event_month string (the partition column: event month)
# Detailed Table Information
Database: test1
Owner: hadoop
CreateTime: Sat Jun 09 12:48:40 CST 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop000:9000/user/hive/warehouse/test1.db/order_partition
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1528519720
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.283 seconds, Fetched: 33 row(s)
-- First check whether the table has any rows:
hive> select * from order_partition;
OK
Time taken: 0.507 seconds
-- First convert the character set of the partition tables in the metastore
mysql> show databases;
mysql> use ruozedata_basic03
mysql> alter table PARTITIONS convert to character set latin1;
mysql> alter table PARTITION_KEYS convert to character set latin1;
-- Create a data file
[hadoop@hadoop000 data]$ cat order.txt
10703007267488 2014-05-01 06:01:12.334+01
10101043505096 2014-05-01 07:28:12.342+01
10103043509747 2014-05-01 07:50:12.33+01
10103043501575 2014-05-01 09:27:12.33+01
10104043514061 2014-05-01 09:03:12.324+01
-- Load the data into the table
LOAD DATA LOCAL INPATH "/home/hadoop/data/order.txt"
OVERWRITE INTO TABLE order_partition
PARTITION (event_month='2014-05');
-- Check whether order_partition now has data
hive> use test1;
hive> show tables;
OK
order_partition
hive> select * from order_partition where event_month='2014-05';    (always filter on the partition column when querying a partitioned table)
OK
10703007267488 2014-05-01 2014-05
10101043505096 2014-05-01 2014-05
10103043509747 2014-05-01 2014-05
10103043501575 2014-05-01 2014-05
10104043514061 2014-05-01 2014-05
Time taken: 0.168 seconds, Fetched: 5 row(s)
-- One partition is now visible
[hadoop@hadoop000 data]$ hadoop fs -ls /user/hive/warehouse/test1.db/order_partition/
Found 1 items
2018-06-09 13:08 /user/hive/warehouse/test1.db/order_partition/event_month=2014-05
-- Create another partition directory, 2014-06, by hand
[hadoop@hadoop000 data]$ hadoop fs -mkdir /user/hive/warehouse/test1.db/order_partition/event_month=2014-06
-- Put the data file into event_month=2014-06
[hadoop@hadoop000 data]$ hadoop fs -put order.txt /user/hive/warehouse/test1.db/order_partition/event_month=2014-06
-- The local file is now on HDFS
[hadoop@hadoop000 data]$ hadoop fs -ls /user/hive/warehouse/test1.db/order_partition/event_month=2014-06/
Found 1 items
208 2018-06-09 13:22 /user/hive/warehouse/test1.db/order_partition/event_month=2014-06/order.txt
-- Querying, however, returns no rows for that partition
hive> select * from order_partition where event_month='2014-06';
OK
Time taken: 0.124 seconds
-- Check in MySQL whether the metadata arrived
mysql> use ruozedata_basic03
Database changed
mysql> select * from PARTITIONS;
+---------+-------------+------------------+---------------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME | SD_ID | TBL_ID |
+---------+-------------+------------------+---------------------+-------+--------+
| 1 | 1528520920 | 0 | event_month=2014-05 | 32 | 31 |
+---------+-------------+------------------+---------------------+-------+--------+
1 row in set (0.00 sec)
mysql> select * from PARTITION_KEY_VALS;
+---------+--------------+-------------+
| PART_ID | PART_KEY_VAL | INTEGER_IDX |
+---------+--------------+-------------+
| 1 | 2014-05 | 0 |
+---------+--------------+-------------+
1 row in set (0.00 sec)
-- We must repair the table's metadata in MySQL before Hive can return the matching rows
The command: MSCK REPAIR TABLE table_name;
hive> MSCK REPAIR TABLE order_partition;
OK
Partitions not in metastore: order_partition:event_month=2014-06
Repair: Added partition to metastore order_partition:event_month=2014-06 (the partition was added to the metadata)
Time taken: 0.245 seconds, Fetched: 2 row(s)
This command refreshes the 2014-06 information into the MySQL metadata tables, so the partition becomes queryable in Hive. But it has a fatal drawback: it re-scans and refreshes everything for the table. On a table that has accumulated years of partitions, the command can grind away without responding for ages. It is too brute-force; it is banned in production.
Add partition (adding a single partition)
data]$ hadoop fs -mkdir /user/hive/warehouse/test1.db/order_partition/event_month=2014-07
data]$ hadoop fs -put order.txt /user/hive/warehouse/test1.db/order_partition/event_month=2014-07
-- Check in Hive whether the table has the data
hive> select * from order_partition where event_month='2014-07';
OK
Time taken: 0.091 seconds
-- Refresh the MySQL metadata by adding one partition (this is the approach used in production: it refreshes only this one table's information)
hive> ALTER TABLE order_partition ADD IF NOT EXISTS
> PARTITION (event_month='2014-07') ;
OK
Time taken: 0.177 seconds
hive> select * from order_partition where event_month='2014-07';
OK
10703007267488 2014-05-01 2014-07
10101043505096 2014-05-01 2014-07
10103043509747 2014-05-01 2014-07
10103043501575 2014-05-01 2014-07
10104043514061 2014-05-01 2014-07
Time taken: 0.115 seconds, Fetched: 5 row(s)
The approaches above load local data as plain text files.
Summary:
select * from test1.order_partition;
-- full scan of the partitioned table; not recommended
select * from test1.order_partition where event_month='2014-05';
-- a WHERE clause on the partition column scans only that partition
-- Both SQLs show the event_month column in their output
-- but hdfs dfs -cat on the data file does not show that column: the partition column is a pseudo-column
-- it is materialized as the partition directory in HDFS
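A sketch of what "the partition column is really a directory" means: the value can be recovered from the path alone. The helper name here is invented for illustration:

```python
# Recover partition pseudo-column values from an HDFS path: they live
# in the key=value directory segments, not in the data file itself.
def partition_values(path):
    # pick out key=value path segments, in order
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

p = "/user/hive/warehouse/test1.db/order_partition/event_month=2014-05/order.txt"
print(partition_values(p))  # {'event_month': '2014-05'}
```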
-- Create a new (non-partitioned) source table
create table order_4_partition( ordernumber string, eventtime string ) row format delimited fields terminated by '\t';
-- Load the local file into it
hive> load data local inpath '/home/hadoop/data/order.txt' overwrite into table order_4_partition;
Loading data to table test1.order_4_partition
Table test1.order_4_partition stats: [numFiles=1, numRows=0, totalSize=208, rawDataSize=0]
OK
Time taken: 0.317 seconds
-- Insert its rows into a partition, then verify the data landed
hive> insert overwrite table order_partition
> partition(event_month='2014-08')
> select * from order_4_partition;
hive> select * from order_4_partition;
OK
10703007267488 2014-05-01 06:01:12.334+01
10101043505096 2014-05-01 07:28:12.342+01
10103043509747 2014-05-01 07:50:12.33+01
10103043501575 2014-05-01 09:27:12.33+01
10104043514061 2014-05-01 09:03:12.324+01
-- The data is now in the partition
hive> select * from order_partition where event_month='2014-08';
OK
10703007267488 2014-05-01 2014-08
10101043505096 2014-05-01 2014-08
10103043509747 2014-05-01 2014-08
10103043501575 2014-05-01 2014-08
10104043514061 2014-05-01 2014-08
-- Check the MySQL metadata
mysql> select * from PARTITIONS;
+---------+-------------+------------------+-----------------------------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME | SD_ID | TBL_ID |
+---------+-------------+------------------+-----------------------------------+-------+--------+
| 1 | 1528520920 | 0 | event_month=2014-05 | 32 | 31 |
| 2 | 1528523750 | 0 | event_month=2014-07 | 33 | 31 |
| 3 | 1528525251 | 0 | event_month=2014-06 | 35 | 31 |
| 6 | 1528526037 | 0 | event_month=2014-08 | 42 | 31 |
| 11 | 1528531683 | 0 | event_month=2014-05/event_day=01 | 48 | 42 |
| 16 | 1528558645 | 0 | deptno=20 | 56 | 45 |
| 17 | 1528558874 | 0 | deptno=10 | 57 | 45 |
| 18 | 1528628193 | 0 | deptno=30 | 59 | 46 |
| 19 | 1528628193 | 0 | deptno=__HIVE_DEFAULT_PARTITION__ | 60 | 46 |
| 20 | 1528628193 | 0 | deptno=20 | 61 | 46 |
| 21 | 1528628193 | 0 | deptno=10 | 62 | 46 |
+---------+-------------+------------------+-----------------------------------+-------+--------+
-- How many partitions does Hive see?
hive> show partitions order_partition;
OK
event_month=2014-05
event_month=2014-06
event_month=2014-07
event_month=2014-08
Time taken: 0.121 seconds, Fetched: 4 row(s)
The above covers single-level static partitions. A partition column must not also appear among the regular columns.
Multi-level partitions (what production uses):
create table order_mulit_partition(
ordernumber string,
eventtime string
)
partitioned by (event_month string,event_day string)
row format delimited fields terminated by '\t';
-- Inspect the table
hive> desc formatted order_mulit_partition;
OK
# col_name data_type comment
ordernumber string
eventtime string
# Partition Information
# col_name data_type comment
event_month string (first partition key)
event_day string (second partition key)
# Detailed Table Information
Database: test1
Owner: hadoop
CreateTime: Sat Jun 09 15:46:27 CST 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop000:9000/user/hive/warehouse/test1.db/order_mulit_partition
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1528530387
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.111 seconds, Fetched: 34 row(s)
-- Load the data (creating the partition directories on HDFS)
hive> LOAD DATA LOCAL INPATH '/home/hadoop/data/order.txt'
> OVERWRITE INTO TABLE order_mulit_partition
> PARTITION(event_month='2014-05', event_day='01');
Loading data to table test1.order_mulit_partition partition (event_month=2014-05, event_day=01)
Partition test1.order_mulit_partition{event_month=2014-05, event_day=01} stats: [numFiles=1, numRows=0,
totalSize=208, rawDataSize=0]
OK
Time taken: 0.992 seconds
-- Inspect the multi-level partition directory structure on HDFS
[hadoop@hadoop000 data]$ hadoop fs -ls /user/hive/warehouse/test1.db/order_mulit_partition/event_month=2014-05
Found 1 items
/user/hive/warehouse/test1.db/order_mulit_partition/event_month=2014-05/event_day=01
/user/hive/warehouse/test1.db/order_mulit_partition/event_month=2014-05/event_day=01/order.txt
-- Query the partitioned table's records
hive> select * from order_mulit_partition where event_month='2014-05' and event_day='01';
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-05 01
10101043505096 2014-05-01 07:28:12.342+01 2014-05 01
10103043509747 2014-05-01 07:50:12.33+01 2014-05 01
10103043501575 2014-05-01 09:27:12.33+01 2014-05 01
10104043514061 2014-05-01 09:03:12.324+01 2014-05 01
Time taken: 0.294 seconds, Fetched: 5 row(s)
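Because event_day is nested under event_month on HDFS, you can also filter on the month alone and Hive prunes the scan to that subtree. A small sketch against the table loaded above (the `like` predicate is an illustrative assumption, matching the sample order numbers):

```sql
-- Filter on the first-level partition key only; Hive reads every
-- event_day subdirectory under event_month=2014-05 and nothing else.
select * from order_mulit_partition where event_month='2014-05';

-- Non-partition predicates still scan the selected partitions' files,
-- so constrain the partition keys first whenever possible.
select ordernumber from order_mulit_partition
where event_month='2014-05' and event_day='01'
  and ordernumber like '1010%';
```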
Dynamic partitioning:
-- For comparison, first build a static partition table
-- the column named in PARTITIONED by(deptno string) must not duplicate any regular column, or table creation fails
hive> create table ruozedata_static_emp
> (empno int, ename string, job string, mgr int, hiredate string, salary double, comm double)
> PARTITIONED by(deptno string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t' ;
-- There are now two tables: ruozedata_emp and ruozedata_static_emp
-- Copy the rows whose department number is 20 from ruozedata_emp into ruozedata_static_emp
hive> select * from ruozedata_emp;
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
8888 HIVE PROGRAM 7839 1988-1-23 10300.0 NULL NULL
-- First confirm the target table is empty
hive> select * from ruozedata_static_emp;
OK
Time taken: 0.52 seconds
-- Insert the query results
hive> insert into table ruozedata_static_emp partition(deptno='20')
> select empno,ename,job,mgr,hiredate,salary,comm from ruozedata_emp
> where deptno=20;
Query ID = hadoop_20180609165353_6aa814d6-fd99-4374-90f0-ff5a78ce59d1
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1528518767828_0002, Tracking URL = http://hadoop000:8088/proxy/application_1528518767828_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528518767828_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-06-09 23:37:16,744 Stage-1 map = 0%, reduce = 0%
2018-06-09 23:37:23,236 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.77 sec
MapReduce Total cumulative CPU time: 2 seconds 770 msec
Ended Job = job_1528518767828_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop000:9000/user/hive/warehouse/hive3.db/ruozedata_static_emp/deptno=20/.
hive-staging_hive_2018-06-09_23-37-07_657_1948946046232412047-1/-ext-10000
Loading data to table hive3.ruozedata_static_emp partition (deptno=20)
Partition hive3.ruozedata_static_emp{deptno=20} stats: [numFiles=1, numRows=5, totalSize=214, rawDataSize=209]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.77 sec HDFS Read: 4933 HDFS Write: 307 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 770 msec
OK
Time taken: 18.322 seconds
-- Verify the rows now in partition deptno=20
hive> select * from ruozedata_static_emp where deptno='20';
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
-- Likewise insert the rows for department 10
hive> insert into table ruozedata_static_emp partition(deptno='10')
> select empno,ename,job,mgr,hiredate,salary,comm from ruozedata_emp
> where deptno=10;
Query ID = hadoop_20180609165353_6aa814d6-fd99-4374-90f0-ff5a78ce59d1
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1528518767828_0003, Tracking URL = http://hadoop000:8088/proxy/application_1528518767828_0003/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528518767828_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-06-09 23:41:06,624 Stage-1 map = 0%, reduce = 0%
2018-06-09 23:41:12,920 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.04 sec
MapReduce Total cumulative CPU time: 2 seconds 40 msec
Ended Job = job_1528518767828_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop000:9000/user/hive/warehouse/hive3.db/ruozedata_static_emp/deptno=10/.
hive-staging_hive_2018-06-09_23-41-00_574_8181222805563183320-1/-ext-10000
Loading data to table hive3.ruozedata_static_emp partition (deptno=10)
Partition hive3.ruozedata_static_emp{deptno=10} stats: [numFiles=1, numRows=3, totalSize=130, rawDataSize=127]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.04 sec HDFS Read: 4933 HDFS Write: 223 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 40 msec
OK
Time taken: 13.9 seconds
-- The fatal drawback of static partitioning: loading data for many partition values means repeating this one INSERT per value
-- List the partitions created so far
hive> show partitions ruozedata_static_emp;
OK
deptno=10
deptno=20
Time taken: 0.065 seconds, Fetched: 2 row(s)
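To make the drawback concrete: covering the remaining department still takes yet another full statement, and so on for every distinct value. A sketch (deptno=30 is taken from the source data above; each such statement launches its own MapReduce job):

```sql
-- One more static INSERT, just for deptno=30 -- every distinct
-- partition value costs a separate statement and a separate job.
insert into table ruozedata_static_emp partition(deptno='30')
select empno,ename,job,mgr,hiredate,salary,comm from ruozedata_emp
where deptno=30;
```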
Dynamic partitioning in practice:
-- Create a dynamic partition table
hive> create table ruozedata_dynamic_emp
> (empno int, ename string, job string, mgr int, hiredate string, salary double, comm double)
> PARTITIONED by(deptno string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t' ;
OK
Time taken: 0.145 seconds
-- Confirm the dynamic partition table is empty
hive> select * from ruozedata_dynamic_emp;
OK
Time taken: 0.083 seconds
-- Partitioning by deptno, the source column is not declared as a regular column;
-- the pseudo-column created by PARTITIONED BY stands in for it
set hive.exec.dynamic.partition.mode=nonstrict; -- switch dynamic partitioning to non-strict mode
insert into table ruozedata_dynamic_emp partition(deptno)
select empno,ename,job,mgr,hiredate,salary,comm,deptno from ruozedata_emp ;
-- Notes on the dynamic-partition INSERT above:
-- write partition(deptno), not partition(deptno=XXX)
-- the SELECT list must match the target table's columns in number and order
-- the last column of the SELECT must be the partition column
-- no WHERE clause is needed
-- hive.exec.dynamic.partition.mode=nonstrict is set with Hive's standard set key=value form
set hive.exec.dynamic.partition.mode=nonstrict;
Syntax:
set key=value    -- assign a value
set key;         -- show the current value
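For example, the two dynamic-partition properties can be inspected and set the same way (both property names are real Hive settings; nonstrict is the usual production choice):

```sql
set hive.exec.dynamic.partition;                 -- show the current value
set hive.exec.dynamic.partition.mode;            -- show the current mode
set hive.exec.dynamic.partition.mode=nonstrict;  -- assign a new value for this session
```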
Adding partitions by hand-creating HDFS directories and files:
For a static partition table, if you create the partition directory and upload the file directly on HDFS, bypassing the partition-creation DDL and the LOAD command, the partition will not show up in queries until the metastore is refreshed (MSCK REPAIR TABLE table_name;). Adding partitions this way is not a supported path.
Dynamic partition table example:
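A sketch of that repair flow (the directory and file are hypothetical, modeled on the order_mulit_partition layout shown earlier):

```sql
-- Suppose a new event_day directory was created by hand with
-- `hadoop fs -mkdir` and its data file uploaded with `hadoop fs -put`.
-- The metastore knows nothing about it yet:
show partitions order_mulit_partition;        -- hand-made partition is missing

-- Sync the metastore with what actually exists on HDFS:
MSCK REPAIR TABLE order_mulit_partition;

show partitions order_mulit_partition;        -- partition is now visible
```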
-- Using the emp table created in an earlier exercise,
-- group its rows by department and load each group into the matching partition
create table emp_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double)
PARTITIONED BY (deptno int)
row format delimited fields terminated by '\t';
-- Partitioning by deptno, the source column is not declared as a regular column;
-- the pseudo-column created by PARTITIONED BY stands in for it
set hive.exec.dynamic.partition.mode=nonstrict; -- switch dynamic partitioning to non-strict mode
insert into table emp_partition partition(deptno)
select empno,ename,job,mgr,hiredate,sal,comm ,deptno from emp;
-- The resulting metadata, queried in MySQL
mysql> select * from PARTITION_KEY_VALS;
+---------+----------------------------+-------------+
| PART_ID | PART_KEY_VAL | INTEGER_IDX |
+---------+----------------------------+-------------+
| 1 | 2014-05 | 0 |
| 2 | 2014-07 | 0 |
| 3 | 2014-06 | 0 |
| 6 | 2014-08 | 0 |
| 11 | 2014-05 | 0 |
| 11 | 01 | 1 |
| 16 | 20 | 0 |
| 17 | 10 | 0 |
| 18 | 30 | 0 |
| 19 | __HIVE_DEFAULT_PARTITION__ | 0 |
| 20 | 20 | 0 |
| 21 | 10 | 0 |
+---------+----------------------------+-------------+
mysql> select * from PARTITIONS;
+---------+-------------+------------------+-----------------------------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME | SD_ID | TBL_ID |
+---------+-------------+------------------+-----------------------------------+-------+--------+
| 1 | 1528520920 | 0 | event_month=2014-05 | 32 | 31 |
| 2 | 1528523750 | 0 | event_month=2014-07 | 33 | 31 |
| 3 | 1528525251 | 0 | event_month=2014-06 | 35 | 31 |
| 6 | 1528526037 | 0 | event_month=2014-08 | 42 | 31 |
| 11 | 1528531683 | 0 | event_month=2014-05/event_day=01 | 48 | 42 |
| 16 | 1528558645 | 0 | deptno=20 | 56 | 45 |
| 17 | 1528558874 | 0 | deptno=10 | 57 | 45 |
| 18 | 1528628193 | 0 | deptno=30 | 59 | 46 |
| 19 | 1528628193 | 0 | deptno=__HIVE_DEFAULT_PARTITION__ | 60 | 46 |
| 20 | 1528628193 | 0 | deptno=20 | 61 | 46 |
| 21 | 1528628193 | 0 | deptno=10 | 62 | 46 |
+---------+-------------+------------------+-----------------------------------+-------+--------+
11 rows in set (0.00 sec)
hive> select * from ruozedata_dynamic_emp;
OK
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
8888 HIVE PROGRAM 7839 1988-1-23 10300.0 NULL __HIVE_DEFAULT_PARTITION__
Time taken: 0.817 seconds, Fetched: 15 row(s)
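The metastore rows in the MySQL PARTITIONS output above can also be confirmed from the Hive side; the single dynamic INSERT produced every partition in one pass:

```sql
show partitions ruozedata_dynamic_emp;
-- deptno=10
-- deptno=20
-- deptno=30
-- deptno=__HIVE_DEFAULT_PARTITION__   (rows whose deptno was NULL)
```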
Dynamic partitioning requires non-strict mode (production always runs non-strict):
hive> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=nonstrict
Hive built-in functions: SHOW FUNCTIONS; lists every built-in function
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;
hive> desc function upper;
OK
upper(str) - Returns str with all characters changed to uppercase
Time taken: 0.201 seconds, Fetched: 1 row(s)
hive> desc function EXTENDED upper;
OK
upper(str) - Returns str with all characters changed to uppercase
Synonyms: ucase
Example:
> SELECT upper('Facebook') FROM src LIMIT 1;
'FACEBOOK'
Time taken: 0.066 seconds, Fetched: 5 row(s)
-- View a function's exact syntax
hive> DESCRIBE FUNCTION case;
OK
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
hive> DESCRIBE FUNCTION EXTENDED case;
OK
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
Example:
SELECT
CASE deptno
WHEN 1 THEN Engineering
WHEN 2 THEN Finance
ELSE admin
END,
CASE zone
WHEN 7 THEN Americas
ELSE Asia-Pac
END
FROM emp_details
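Applied to the emp-style tables used throughout this lesson, a small sketch (note that real HiveQL requires the string literals to be quoted, which the built-in help text above elides; the department-name mapping is the classic scott/emp one and is assumed here):

```sql
select ename,
       case deptno
         when 10 then 'ACCOUNTING'
         when 20 then 'RESEARCH'
         when 30 then 'SALES'
         else 'UNKNOWN'
       end as dept_name
from ruozedata_emp;
```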