[Ruozedata Hands-On, Day 18] Hive: DML, Partitioned Tables, Built-in Functions

Preface:


Recap of the previous class: the core of big data development is low-level optimization and architecture optimization; that is what really matters, and raw coding is a smaller share of the work. With platformization, a platform can be built where jobs are assembled by drag-and-drop on a page, because the underlying data platform has already been encapsulated. Knowing only how to use one such feature is merely entry-level big data development, the kind of work junior programmers do. Oracle is almost never used in big data; only some lightweight databases are used.

We covered DDL and DML and the abstraction of Hive's data model: the top level is the Database, below that the Table, below that the Partition, and at the bottom the Bucket. Except for the bucket, each of these levels corresponds to a directory on HDFS, because Hive is a data warehouse built on top of Hadoop: Hive's data lives on HDFS, while Hive's metadata is stored in a relational database such as MySQL.
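To make that mapping concrete, a hypothetical HDFS layout (the database, table and partition names here are illustrative, not taken from this environment):

/user/hive/warehouse/mydb.db/                             -- database  -> directory
/user/hive/warehouse/mydb.db/orders/                      -- table     -> subdirectory
/user/hive/warehouse/mydb.db/orders/dt=2014-05/           -- partition -> subdirectory
/user/hive/warehouse/mydb.db/orders/dt=2014-05/000000_0   -- bucket / data -> file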

DDL: creating databases and creating tables

Operations on tables:

1. Create a table  2. Copy a table's structure  3. Copy another table's data with create table ... as select

insert overwrite

When inserting, if a column's position does not line up with the corresponding column in the source data, i.e. the columns are not written in the proper order, you get errors or scrambled data.

So be careful when inserting: spell out the number of columns, their types, and their order explicitly.
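A minimal sketch of the safe pattern (the table and column names here are assumptions for illustration): keep the SELECT list in exactly the same count, types and order as the target table's columns.

INSERT OVERWRITE TABLE emp_copy
SELECT empno, ename, job, mgr, hiredate, salary, comm, deptno   -- same count, types and order as emp_copy's columns
FROM emp;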

Internal tables vs. external tables:

Hive has two kinds of tables: Managed Tables (the default) and External Tables (created with the EXTERNAL keyword). A managed (internal) table's data is managed by Hive itself; an external table's data is managed on HDFS.
The main difference: when you drop a table, a Managed Table deletes both the data (stored on HDFS) and the metadata (stored in MySQL), while an External Table deletes only the metadata. A managed table's data is stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse), while an external table's data location is specified by you. Changes to a managed table are synchronized to the metadata directly, whereas after changing an external table's structure or partitions you need to repair it (MSCK REPAIR TABLE table_name;). Try to map this onto a business scenario of your own.

In production, external tables are used 99% of the time, because the data itself acts as a backup: if the table is dropped, just re-create it and LOAD DATA LOCAL INPATH the file again.
The remaining 1% of cases use internal tables. When data comes in from outside, you can build an external table over it; if the table is dropped, simply import the data again.
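A minimal sketch of an external table over an existing HDFS directory (the table name and path are assumptions for illustration):

create external table ext_emp(
empno int, ename string, job string, mgr int,
hiredate string, salary double, comm double, deptno int
)
row format delimited fields terminated by '\t'
location '/ruozedata/emp';   -- the data stays in this directory; DROP TABLE removes only the metadata in MySQL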

DML: importing data. In 95% of scenarios, loading a file with LOAD is enough. Loading supports both the local filesystem and HDFS as the source, and in more than 95% of cases OVERWRITE is used, i.e. existing data is overwritten when loading into the table (this is what production uses). Hive supports insert into from version 0.14, but it is a rather clumsy feature; the only situation where you would use insert into is to fabricate a bit of test data. It is not recommended.
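A minimal sketch of the two load paths (the file paths are assumptions for illustration):

-- from the local filesystem (the file is copied into the table's directory)
LOAD DATA LOCAL INPATH '/home/hadoop/data/emp.txt' OVERWRITE INTO TABLE emp;
-- from HDFS (the source file is moved into the table's directory)
LOAD DATA INPATH '/tmp/emp.txt' OVERWRITE INTO TABLE emp;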

This session covers the following.

Select (queries):

hive> select * from emp where deptno=10;
OK
7782    CLARK   MANAGER 7839    1981/6/9        2450.0  NULL    10
7839    KING    PRESIDENT       NULL    1981/11/17      5000.0  NULL    10
7934    MILLER  CLERK   7782    1982/1/23       1300.0  NULL    10
Time taken: 1.144 seconds, Fetched: 3 row(s)
hive> select * from emp where empno <= 7800;
OK
7369    SMITH   CLERK   7902    1980/12/17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981/2/20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981/2/22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981/4/2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981/9/28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981/5/1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981/6/9        2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987/4/19       3000.0  NULL    20
Time taken: 0.449 seconds, Fetched: 8 row(s)
hive> select * from emp where salary between 1000 and 1500;
OK
7521    WARD    SALESMAN        7698    1981/2/22       1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981/9/28       1250.0  1400.0  30
7844    TURNER  SALESMAN        7698    1981/9/8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987/5/23       1100.0  NULL    20
7934    MILLER  CLERK   7782    1982/1/23       1300.0  NULL    10
Time taken: 0.178 seconds, Fetched: 5 row(s)
hive> select * from emp limit 5;
OK
7369    SMITH   CLERK   7902    1980/12/17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981/2/20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981/2/22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981/4/2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981/9/28       1250.0  1400.0  30
Time taken: 0.47 seconds, Fetched: 5 row(s)

hive> select * from emp where empno in(7566,7499);
OK
7499    ALLEN   SALESMAN        7698    1981/2/20       1600.0  300.0   30
7566    JONES   MANAGER 7839    1981/4/2        2975.0  NULL    20
Time taken: 0.4 seconds, Fetched: 2 row(s)
hive> select * from emp where comm is not null;
OK
7499    ALLEN   SALESMAN        7698    1981/2/20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981/2/22       1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981/9/28       1250.0  1400.0  30
7844    TURNER  SALESMAN        7698    1981/9/8        1500.0  0.0     30
Time taken: 0.262 seconds, Fetched: 4 row(s)   

Aggregate functions:

max/min/count/sum/avg. Characteristic: many in, one out; many records go in, and only one record comes out.

-- How many records have department number 10?

hive> select count(1) from ruozedata_emp where deptno=10;
Query ID = hadoop_20180608111010_cdb33a86-762b-4a72-8f56-c94c9f719b7d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0001, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0001/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528425516016_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-08 11:18:19,145 Stage-1 map = 0%,  reduce = 0%
2018-06-08 11:18:26,615 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.08 sec
2018-06-08 11:18:33,001 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.97 sec
MapReduce Total cumulative CPU time: 4 seconds 970 msec
Ended Job = job_1528425516016_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.97 sec   HDFS Read: 8417 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 970 msec
OK
3  (department 10 has three records)
Time taken: 28.056 seconds, Fetched: 1 row(s)

-- Verify that there really are three records

hive> select * from ruozedata_emp where deptno=10;
OK
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
Time taken: 0.107 seconds, Fetched: 3 row(s)

-- Find the maximum salary, minimum salary, average salary, and sum of salaries

hive> select max(salary),min(salary),avg(salary),sum(salary) from ruozedata_emp;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0002, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528425516016_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-08 11:27:54,739 Stage-1 map = 0%,  reduce = 0%
2018-06-08 11:28:00,082 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.46 sec
2018-06-08 11:28:06,527 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.18 sec
MapReduce Total cumulative CPU time: 3 seconds 180 msec
Ended Job = job_1528425516016_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.18 sec   HDFS Read: 10128 HDFS Write: 41 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 180 msec
OK
10300.0 (max salary) 800.0 (min salary)   2621.6666666666665 (average salary)      39325.0 (sum of salaries)
Time taken: 21.745 seconds, Fetched: 1 row(s)

Grouping with group by

-- Compute the average salary per department

hive> select deptno,avg(salary) from ruozedata_emp group by deptno;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0003, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0003/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528425516016_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-08 11:31:49,358 Stage-1 map = 0%,  reduce = 0%
2018-06-08 11:31:53,622 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.05 sec
2018-06-08 11:31:59,910 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.72 sec
MapReduce Total cumulative CPU time: 2 seconds 720 msec
Ended Job = job_1528425516016_0003
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.72 sec   HDFS Read: 8490 HDFS Write: 65 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 720 msec
OK
NULL    10300.0
10      2916.6666666666665
20      2175.0
30      1566.6666666666667
Time taken: 18.807 seconds, Fetched: 4 row(s)

A common error message:

hive> select ename,deptno,avg(salary) from ruozedata_emp group by deptno;
FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'ename'       

Any column in the select list that is not wrapped in an aggregate/group function must also appear in the group by clause; otherwise the query fails with an error like the one above.
Once ename is added after group by, the error goes away. In other words, apart from columns inside aggregate functions,

every column appearing in the select list must be added to group by.

hive> select ename,deptno,avg(salary) from ruozedata_emp group by ename,deptno;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0004, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0004/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528425516016_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-08 11:38:54,714 Stage-1 map = 0%,  reduce = 0%
2018-06-08 11:38:59,953 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.16 sec
2018-06-08 11:39:06,262 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.55 sec
MapReduce Total cumulative CPU time: 2 seconds 550 msec
Ended Job = job_1528425516016_0004
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.55 sec   HDFS Read: 8851 HDFS Write: 238 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 550 msec
OK
ADAMS   20      1100.0
ALLEN   30      1600.0
BLAKE   30      2850.0
CLARK   10      2450.0
FORD    20      3000.0
HIVE    NULL    10300.0
JAMES   30      950.0
JONES   20      2975.0
KING    10      5000.0
MARTIN  30      1250.0
MILLER  10      1300.0
SCOTT   20      3000.0
SMITH   20      800.0
TURNER  30      1500.0
WARD    30      1250.0
Time taken: 18.978 seconds, Fetched: 15 row(s)

Find the highest salary per department (deptno) and job: straightforward, just pay attention to the columns listed after group by.

hive> select deptno,job,max(salary) from ruozedata_emp group by deptno,job;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0005, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0005/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528425516016_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-08 13:06:16,651 Stage-1 map = 0%,  reduce = 0%
2018-06-08 13:06:22,003 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.62 sec
2018-06-08 13:06:28,306 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.28 sec
MapReduce Total cumulative CPU time: 3 seconds 280 msec
Ended Job = job_1528425516016_0005
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.28 sec   HDFS Read: 8488 HDFS Write: 177 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 280 msec
OK
NULL    PROGRAM 10300.0
10      CLERK   1300.0
10      MANAGER 2450.0
10      PRESIDENT       5000.0
20      ANALYST 3000.0
20      CLERK   1100.0
20      MANAGER 2975.0
30      CLERK   950.0
30      MANAGER 2850.0
30      SALESMAN        1600.0
Time taken: 21.136 seconds, Fetched: 10 row(s)        
Find the departments whose average salary is greater than 2000.
We usually need to filter on the results after grouping; this is supported from Hive 0.11 onward.


[Ruozedata interview question]

The difference between where and having:
where filters individual records (before grouping),

having filters the grouped results (after aggregation).

hive> select deptno avg(salary) from ruozedata_emp group by deptno where avg(salary) >2000 ;
FAILED: ParseException line 1:17 missing EOF at '(' near 'avg'   (the query is a syntax error to begin with; in any case, where must come before group by, not after it)

Here we want to compute, on the grouped results, the departments whose average salary is greater than 2000, so having is used:

hive> select deptno,avg(salary) from ruozedata_emp group by deptno having avg(salary) >2000;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0006, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0006/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528425516016_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-08 13:17:37,251 Stage-1 map = 0%,  reduce = 0%
2018-06-08 13:17:42,490 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.17 sec
2018-06-08 13:17:48,706 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.29 sec
MapReduce Total cumulative CPU time: 3 seconds 290 msec
Ended Job = job_1528425516016_0006
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.29 sec   HDFS Read: 8929 HDFS Write: 43 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 290 msec
OK
NULL    10300.0
10      2916.6666666666665
20      2175.0
Time taken: 18.912 seconds, Fetched: 3 row(s)

case when ... then ... end: derive a commonly used label from an existing column:

hive> select ename, salary, 
    > case
    > when salary > 1 and salary <= 1000 then 'LOWER'
    > when salary > 1000 and salary <= 2000 then 'MIDDLE'
    > when salary > 2000 and salary <= 4000 then 'HIGH'
    > ELSE 'HIGHEST'
    > end
    > from ruozedata_emp;
OK
SMITH   800.0   LOWER
ALLEN   1600.0  MIDDLE
WARD    1250.0  MIDDLE
JONES   2975.0  HIGH
MARTIN  1250.0  MIDDLE
BLAKE   2850.0  HIGH
CLARK   2450.0  HIGH
SCOTT   3000.0  HIGH
KING    5000.0  HIGHEST
TURNER  1500.0  MIDDLE
ADAMS   1100.0  MIDDLE
JAMES   950.0   LOWER
FORD    3000.0  HIGH
MILLER  1300.0  MIDDLE
HIVE    10300.0 HIGHEST
Time taken: 0.134 seconds, Fetched: 15 row(s)  

-- JOIN:

[hadoop@hadoop000 data]$ vi join_a.txt

1       ruoze
2       j
3       k

[hadoop@hadoop000 data]$ vi join_b.txt

1       30
2       29
4       21

-- Create table a and table b:

hive> create table a(
    > id int, name string
    > ) row format delimited fields terminated by '\t';
OK
Time taken: 0.307 seconds
hive> create table b(
    > id int, age int
    > ) row format delimited fields terminated by '\t';
OK
Time taken: 0.082 seconds
-- Load the data into table a and table b
load data local inpath '/home/hadoop/data/join_a.txt' overwrite into table a;
load data local inpath '/home/hadoop/data/join_b.txt' overwrite into table b;
-- Why does the query return NULL values?

When creating a table in Hive and loading data, the delimiter declared for the table must match the delimiter used in the data file; otherwise the query results will be wrong (typically all NULLs).

hive> select * from a;
OK
NULL    NULL
NULL    NULL
NULL    NULL
Time taken: 0.051 seconds, Fetched: 3 row(s)
Each line of /home/hadoop/data/join_a.txt was then fixed so that id and name are separated by a single Tab.
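A quick way to check the raw file and the table's configured delimiter (a sketch using standard commands; cat -A prints a Tab as ^I):

[hadoop@hadoop000 data]$ cat -A /home/hadoop/data/join_a.txt
hive> desc formatted a;    -- check field.delim under Storage Desc Params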
hive> select * from a;
OK
1       ruoze
2       j
3       k 

hive> select * from b;
OK
1       30
2       29
4       21    
inner join = join

Table a join table b:

hive> select a.id,a.name,b.age from a join b on a.id=b.id;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Execution log at: /tmp/hadoop/hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52.log
2018-06-08 04:42:56     Starting to launch local task to process map join;      maximum memory = 477626368
2018-06-08 04:42:57     Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/hadoop/079e5638-99ad-41be-8fa3-ccdb1f0c780a/hive_2018-06-08_16-42-52_389_8099852311780202647-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable
2018-06-08 04:42:57     Uploaded 1 File to: file:/tmp/hadoop/079e5638-99ad-41be-8fa3-ccdb1f0c780a/hive_2018-06-08_16-42-52_389_8099852311780202647-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable (320 bytes)
2018-06-08 04:42:57     End of local task; Time Taken: 1.225 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1528425516016_0007, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0007/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528425516016_0007
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-06-08 16:43:03,901 Stage-3 map = 0%,  reduce = 0%
2018-06-08 16:43:10,137 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.6 sec
MapReduce Total cumulative CPU time: 2 seconds 600 msec
Ended Job = job_1528425516016_0007
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 2.6 sec   HDFS Read: 5564 HDFS Write: 18 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 600 msec
OK
1       ruoze   30
2       j       29
Time taken: 18.935 seconds, Fetched: 2 row(s)
Right outer join (right join)
With the right-hand table as the base, every row of the right table appears; where the left table has no match, NULL is filled in. (A left join sketch follows the output below.)

hive> select a.id,a.name,b.age from a right join b on a.id=b.id;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Execution log at: /tmp/hadoop/hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52.log
2018-06-08 04:56:03     Starting to launch local task to process map join;      maximum memory = 477626368
2018-06-08 04:56:04     Dump the side-table for tag: 0 with group count: 3 into file: file:/tmp/hadoop/079e5638-99ad-41be-8fa3-ccdb1f0c780a/hive_2018-06-08_16-56-00_679_234800897436568174-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile20--.hashtable
2018-06-08 04:56:04     Uploaded 1 File to: file:/tmp/hadoop/079e5638-99ad-41be-8fa3-ccdb1f0c780a/hive_2018-06-08_16-56-00_679_234800897436568174-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile20--.hashtable (327 bytes)
2018-06-08 04:56:04     End of local task; Time Taken: 0.905 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1528425516016_0009, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0009/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528425516016_0009
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-06-08 16:56:11,359 Stage-3 map = 0%,  reduce = 0%
2018-06-08 16:56:16,538 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.36 sec
MapReduce Total cumulative CPU time: 1 seconds 360 msec
Ended Job = job_1528425516016_0009
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 1.36 sec   HDFS Read: 5343 HDFS Write: 27 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 360 msec
OK
1       ruoze   30
2       j       29
NULL    NULL    21
Time taken: 16.918 seconds, Fetched: 3 row(s)
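For comparison, a left outer join (left join) keeps every row of the left table a. A minimal sketch with the same two tables; the result rows shown are inferred from the contents of a and b above, not from a captured run:

hive> select a.id,a.name,b.age from a left join b on a.id=b.id;
-- expected result:
-- 1       ruoze   30
-- 2       j       29
-- 3       k       NULL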
Full outer join (full join)
All rows from both sides are kept; missing values are filled with NULL.

hive> select a.id,a.name,b.age from a full join b on a.id=b.id;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1528425516016_0010, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0010/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528425516016_0010
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2018-06-08 17:00:02,522 Stage-1 map = 0%,  reduce = 0%
2018-06-08 17:00:09,974 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.65 sec
2018-06-08 17:00:11,010 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.11 sec
2018-06-08 17:00:16,220 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.55 sec
MapReduce Total cumulative CPU time: 4 seconds 550 msec
Ended Job = job_1528425516016_0010
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 4.55 sec   HDFS Read: 10932 HDFS Write: 34 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 550 msec
OK
1       ruoze   30
2       j       29
3       k       NULL
NULL    NULL    21
Time taken: 20.559 seconds, Fetched: 4 row(s)
LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery.
Hive (at least in these older versions) does not implement IN/EXISTS subqueries, so you can rewrite such subqueries with LEFT SEMI JOIN.

The restriction of LEFT SEMI JOIN is that the right-hand table may only appear in the filter conditions of the ON clause; it cannot be referenced in the WHERE clause, the SELECT clause, or anywhere else.
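A minimal sketch of the rewrite, using the same a and b tables (the IN form is shown only as a comment for comparison):

-- conceptually:  select * from a where a.id in (select id from b);
-- LEFT SEMI JOIN rewrite (only columns of a may appear in the select list):
hive> select a.* from a left semi join b on a.id = b.id;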

hive> select * from a Left Semi Join b on a.id=b.id;
Query ID = hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52
Total jobs = 1
Execution log at: /tmp/hadoop/hadoop_20180608112727_3ba0221f-580c-4031-ae00-92876417af52.log
2018-06-08 05:12:45     Starting to launch local task to process map join;      maximum memory = 477626368
2018-06-08 05:12:46     Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/hadoop/079e5638-99ad-41be-8fa3-ccdb1f0c780a/hive_2018-06-08_17-12-42_081_4100879749531493003-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile31--.hashtable
2018-06-08 05:12:46     Uploaded 1 File to: file:/tmp/hadoop/079e5638-99ad-41be-8fa3-ccdb1f0c780a/hive_2018-06-08_17-12-42_081_4100879749531493003-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile31--.hashtable (314 bytes)
2018-06-08 05:12:46     End of local task; Time Taken: 1.322 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1528425516016_0011, Tracking URL = http://hadoop000:8088/proxy/application_1528425516016_0011/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528425516016_0011
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-06-08 17:12:54,459 Stage-3 map = 0%,  reduce = 0%
2018-06-08 17:12:59,721 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.21 sec
MapReduce Total cumulative CPU time: 2 seconds 210 msec
Ended Job = job_1528425516016_0011
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 2.21 sec   HDFS Read: 5197 HDFS Write: 12 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 210 msec
OK
1       ruoze
2       j
Time taken: 18.701 seconds, Fetched: 2 row(s)

1. Background

When a table in Hive grows very large, every query scans the whole table, which is obviously a heavy burden on the machine. Can we classify the data so that a query only scans the relevant class? That would solve the problem, and it is exactly why Hive's table architecture has the concept of partitions.

One partition of a partitioned table corresponds to one directory on HDFS.

Partitioned tables come in two flavors, static and dynamic, distinguished by whether the partitions are created automatically.

Multi-level partitioned tables specify e.g. PARTITIONED BY (event_month string, loc string) at creation time; directories event_month=XXX/loc=XXX are created in that nested order, and everything else works the same as a single-level partitioned table.
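A minimal sketch of such a table, using the partition columns named above (the table name is an assumption for illustration):

create table order_multi_demo(
ordernumber string
)
partitioned by (event_month string, loc string)
row format delimited fields terminated by '\t';
-- loading into one leaf partition creates .../event_month=2014-05/loc=beijing/ on HDFS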

2. Static partitioned tables

Create a static partitioned table and load data:

-- Create a partitioned table:

create database test1;
use test1;
hive> create table order_partition(
    > ordernumber string,
    > eventtime string
    > )
    > partitioned by (event_month string)
    > row format delimited fields terminated by '\t';
-- The create statement names event_month as the partition column; it is a pseudo-column
-- when data is loaded into this partition, a subdirectory named event_month=2014-05 is created on HDFS

hive> desc formatted order_partition;
OK
# col_name              data_type               comment             
                 
ordernumber             string                                      
eventtime               string                                      
                 
# Partition Information  (information about the partitions)
# col_name              data_type               comment             
                 
event_month  (the partition column: the month of the event)           string (data type)

# Detailed Table Information             
Database:               test1                    
Owner:                  hadoop                   
CreateTime:             Sat Jun 09 12:48:40 CST 2018     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop000:9000/user/hive/warehouse/test1.db/order_partition       
Table Type:             MANAGED_TABLE            
Table Parameters:                
        transient_lastDdlTime   1528519720          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        serialization.format    \t                  
Time taken: 0.283 seconds, Fetched: 33 row(s)
-- First check whether the table has any data:
hive> select * from order_partition;
OK
Time taken: 0.507 seconds
-- First change the character set of the partition-related metastore tables (in MySQL)
mysql> show databases;
mysql> use ruozedata_basic03
mysql> alter table PARTITIONS convert to character set latin1;
mysql> alter table PARTITION_KEYS convert to character set latin1;
-- Prepare a data file
[hadoop@hadoop000 data]$ cat order.txt 
10703007267488  2014-05-01 06:01:12.334+01
10101043505096  2014-05-01 07:28:12.342+01
10103043509747  2014-05-01 07:50:12.33+01
10103043501575  2014-05-01 09:27:12.33+01
10104043514061  2014-05-01 09:03:12.324+01

-- Load the data into the table

LOAD DATA LOCAL INPATH "/home/hadoop/data/order.txt" 
OVERWRITE INTO TABLE order_partition
PARTITION (event_month='2014-05'); 

-- Check whether order_partition now has data

hive> use test1;
hive> show tables;
OK
order_partition
hive> select * from order_partition where event_month='2014-05'; (when querying a partitioned table, always include the partition predicate)
OK
10703007267488  2014-05-01      2014-05
10101043505096  2014-05-01      2014-05
10103043509747  2014-05-01      2014-05
10103043501575  2014-05-01      2014-05
10104043514061  2014-05-01      2014-05
Time taken: 0.168 seconds, Fetched: 5 row(s)                

-- We can now see one partition directory

[hadoop@hadoop000 data]$ hadoop fs -ls /user/hive/warehouse/test1.db/order_partition/
Found 1 items
 2018-06-09 13:08 /user/hive/warehouse/test1.db/order_partition/event_month=2014-05

-- Now create a 2014-06 partition directory by hand

[hadoop@hadoop000 data]$ hadoop fs -mkdir /user/hive/warehouse/test1.db/order_partition/event_month=2014-06

-- Put the data into event_month=2014-06

[hadoop@hadoop000 data]$ hadoop fs -put order.txt /user/hive/warehouse/test1.db/order_partition/event_month=2014-06

-- Confirm that the local data is now on HDFS

[hadoop@hadoop000 data]$ hadoop fs -ls /user/hive/warehouse/test1.db/order_partition/event_month=2014-06/
Found 1 items
208 2018-06-09 13:22 /user/hive/warehouse/test1.db/order_partition/event_month=2014-06/order.txt
-- But the query still shows no data for that partition
hive> select * from order_partition where event_month='2014-06';
OK
Time taken: 0.124 seconds
-- Check in MySQL whether metadata for the new partition was recorded
mysql> use ruozedata_basic03
Database changed
mysql> select * from PARTITIONS;
+---------+-------------+------------------+---------------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME           | SD_ID | TBL_ID |
+---------+-------------+------------------+---------------------+-------+--------+
|       1 |  1528520920 |                0 | event_month=2014-05 |    32 |     31 |
+---------+-------------+------------------+---------------------+-------+--------+
1 row in set (0.00 sec)


mysql> select * from PARTITION_KEY_VALS;
+---------+--------------+-------------+
| PART_ID | PART_KEY_VAL | INTEGER_IDX |
+---------+--------------+-------------+
|       1 | 2014-05      |           0 |
+---------+--------------+-------------+
1 row in set (0.00 sec)
-- We need to repair the metadata in MySQL before Hive can return the corresponding data
One option is the command: MSCK REPAIR TABLE table_name;
hive> MSCK REPAIR TABLE order_partition;
OK
Partitions not in metastore:    order_partition:event_month=2014-06
Repair: Added partition to metastore order_partition:event_month=2014-06  (this partition was added to the metastore)
Time taken: 0.245 seconds, Fetched: 2 row(s)
This command syncs the 2014-06 partition information into the MySQL metastore, so the data becomes queryable from Hive.
But it has a fatal drawback: it refreshes metadata for all of the table's partitions, not just the new one. If a table has accumulated several years of partitions,
this command can hang for ages before returning; it is too brute-force, so it is banned in production.

Add partition (adding a partition)
-- Create a new partition directory by hand and put the local data under event_month=2014-07 on HDFS
data]$ hadoop fs -mkdir /user/hive/warehouse/test1.db/order_partition/event_month=2014-07
data]$ hadoop fs -put  order.txt /user/hive/warehouse/test1.db/order_partition/event_month=2014-07
-- Check in Hive whether the partition shows any data
hive> select * from order_partition where event_month='2014-07'; 
OK
Time taken: 0.091 seconds
-- Add the partition to refresh the metadata in MySQL (this is the approach used in production: it refreshes only this one table's partition information)
hive> ALTER TABLE order_partition ADD IF NOT EXISTS
    > PARTITION (event_month='2014-07') ;
OK
Time taken: 0.177 seconds
hive> select * from order_partition where event_month='2014-07'; 
OK
10703007267488  2014-05-01      2014-07
10101043505096  2014-05-01      2014-07
10103043509747  2014-05-01      2014-07
10103043501575  2014-05-01      2014-07
10104043514061  2014-05-01      2014-07
Time taken: 0.115 seconds, Fetched: 5 row(s)
The approaches above load local data, as text files, into partitions.
Summary:
select * from test1.order_partition;
-- full scan of the partitioned table, not recommended
select * from test1.order_partition where event_month='2014-05';
-- use a where clause on the partition column, so only that partition is scanned
-- both of the SQL statements above return the event_month column
-- but hdfs dfs -cat on the data file does not show that column, which proves the partition column is a pseudo-column
-- it is really materialized as the partition directory on HDFS
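A quick way to see the pseudo-column behaviour for yourself (a sketch; the path matches the 2014-05 partition loaded above):

[hadoop@hadoop000 data]$ hdfs dfs -cat /user/hive/warehouse/test1.db/order_partition/event_month=2014-05/order.txt
-- the file holds only ordernumber and eventtime; event_month exists only as the directory name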

Adding a partition by inserting from an order table

-- Create a new (non-partitioned) staging table

create table order_4_partition( ordernumber string, eventtime string ) row format delimited fields terminated by '\t'; 
-- Load the local data into it
hive> load data local inpath '/home/hadoop/data/order.txt' overwrite into table order_4_partition; 
Loading data to table test1.order_4_partition 
Table test1.order_4_partition stats: [numFiles=1, numRows=0, totalSize=208, rawDataSize=0]
OK
Time taken: 0.317 seconds
-- Insert the data via a query, then check whether it landed

 hive> insert overwrite table order_partition      
 > partition(event_month='2014-08')     
 > select * from order_4_partition; 
 hive> select * from order_4_partition; 
 OK
 10703007267488  2014-05-01 06:01:12.334+01
 10101043505096  2014-05-01 07:28:12.342+01 
 10103043509747  2014-05-01 07:50:12.33+01 
 10103043501575  2014-05-01 09:27:12.33+01 
 10104043514061  2014-05-01 09:03:12.324+01 
-- The data is now in the 2014-08 partition

hive> select * from order_partition where event_month='2014-08';
 OK 
 10703007267488  2014-05-01      2014-08 
 10101043505096  2014-05-01      2014-08 
 10103043509747  2014-05-01      2014-08 
 10103043501575  2014-05-01      2014-08 
 10104043514061  2014-05-01      2014-08 
-- Check the metadata in MySQL
 mysql> select * from PARTITIONS; 
+---------+-------------+------------------+-----------------------------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME                         | SD_ID | TBL_ID |
+---------+-------------+------------------+-----------------------------------+-------+--------+
|       1 |  1528520920 |                0 | event_month=2014-05               |    32 |     31 |
|       2 |  1528523750 |                0 | event_month=2014-07               |    33 |     31 |
|       3 |  1528525251 |                0 | event_month=2014-06               |    35 |     31 |
|       6 |  1528526037 |                0 | event_month=2014-08               |    42 |     31 |
|      11 |  1528531683 |                0 | event_month=2014-05/event_day=01  |    48 |     42 |
|      16 |  1528558645 |                0 | deptno=20                         |    56 |     45 |
|      17 |  1528558874 |                0 | deptno=10                         |    57 |     45 |
|      18 |  1528628193 |                0 | deptno=30                         |    59 |     46 |
|      19 |  1528628193 |                0 | deptno=__HIVE_DEFAULT_PARTITION__ |    60 |     46 |
|      20 |  1528628193 |                0 | deptno=20                         |    61 |     46 |
|      21 |  1528628193 |                0 | deptno=10                         |    62 |     46 |
+---------+-------------+------------------+-----------------------------------+-------+--------+
-- Check how many partitions the table has in Hive
hive> show partitions order_partition;
OK
event_month=2014-05
event_month=2014-06
event_month=2014-07
event_month=2014-08
Time taken: 0.121 seconds, Fetched: 4 row(s)

The above covered single-level static partitioning; the partition column must not also appear among the regular columns.
Multi-level partitions (what real work uses):

-- Create a multi-level partitioned table (the only difference from a single-level one is that partitioned by lists additional columns)

create table order_mulit_partition(
ordernumber string,
eventtime string
)
partitioned by (event_month string,event_day string)
row format delimited fields terminated by '\t';
-- Inspect the table
hive> desc formatted order_mulit_partition;
OK
# col_name              data_type               comment             
                 
ordernumber             string                                      
eventtime               string                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
event_month             string   (partition column one)
event_day               string   (partition column two)
                 

# Detailed Table Information             
Database:               test1                    
Owner:                  hadoop                   
CreateTime:             Sat Jun 09 15:46:27 CST 2018     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop000:9000/user/hive/warehouse/test1.db/order_mulit_partition         
Table Type:             MANAGED_TABLE            
Table Parameters:                
        transient_lastDdlTime   1528530387          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        serialization.format    \t                  
Time taken: 0.111 seconds, Fetched: 34 row(s)
hive> 
-- Load the data into the partition (which puts it on HDFS)

hive> LOAD DATA LOCAL INPATH '/home/hadoop/data/order.txt' 
    > OVERWRITE INTO TABLE order_mulit_partition 
    > PARTITION(event_month='2014-05', event_day='01'); 
Loading data to table test1.order_mulit_partition partition (event_month=2014-05, event_day=01)
Partition test1.order_mulit_partition{event_month=2014-05, event_day=01} stats: [numFiles=1, numRows=0, 
totalSize=208, rawDataSize=0]
OK
Time taken: 0.992 seconds
-- Look at the multi-level partition directory structure on HDFS
[hadoop@hadoop000 data]$ hadoop fs -ls /user/hive/warehouse/test1.db/order_mulit_partition/event_month=2014-05
Found 1 items
 /user/hive/warehouse/test1.db/order_mulit_partition/event_month=2014-05/event_day=01
 /user/hive/warehouse/test1.db/order_mulit_partition/event_month=2014-05/event_day=01/order.txt  (the order file)
-- Query the partitioned table's records
hive> select * from order_mulit_partition where event_month='2014-05' and event_day='01';
OK
10703007267488  2014-05-01 06:01:12.334+01      2014-05 01
10101043505096  2014-05-01 07:28:12.342+01      2014-05 01
10103043509747  2014-05-01 07:50:12.33+01       2014-05 01
10103043501575  2014-05-01 09:27:12.33+01       2014-05 01
10104043514061  2014-05-01 09:03:12.324+01      2014-05 01
Time taken: 0.294 seconds, Fetched: 5 row(s)
Dynamic partitions:

-- The following exercise still uses static partitioning, to show how cumbersome and repetitive it is
-- First create a static partitioned table
-- Note: the column in PARTITIONED by(deptno string) must not duplicate any of the regular columns, otherwise an error is raised
hive> create table ruozedata_static_emp 
    > (empno int, ename string, job string, mgr int, hiredate string, salary double, comm double)
    > PARTITIONED by(deptno string) 
    > ROW FORMAT DELIMITED 
    > FIELDS TERMINATED BY '\t' ;
-- We now have two tables, ruozedata_emp and ruozedata_static_emp
-- Write the rows of ruozedata_emp whose department number is 20 into ruozedata_static_emp
hive> select * from ruozedata_emp;
OK
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
8888    HIVE    PROGRAM 7839    1988-1-23       10300.0 NULL    NULL
-- First check whether the target table has any data
hive> select * from ruozedata_static_emp;
OK
Time taken: 0.52 seconds
-- Insert the query results
hive> insert into table ruozedata_static_emp partition(deptno='20')
    > select empno,ename,job,mgr,hiredate,salary,comm from ruozedata_emp 
    > where deptno=20;
Query ID = hadoop_20180609165353_6aa814d6-fd99-4374-90f0-ff5a78ce59d1
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1528518767828_0002, Tracking URL = http://hadoop000:8088/proxy/application_1528518767828_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528518767828_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-06-09 23:37:16,744 Stage-1 map = 0%,  reduce = 0%
2018-06-09 23:37:23,236 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.77 sec
MapReduce Total cumulative CPU time: 2 seconds 770 msec
Ended Job = job_1528518767828_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop000:9000/user/hive/warehouse/hive3.db/ruozedata_static_emp/deptno=20/.
hive-staging_hive_2018-06-09_23-37-07_657_1948946046232412047-1/-ext-10000
Loading data to table hive3.ruozedata_static_emp partition (deptno=20)
Partition hive3.ruozedata_static_emp{deptno=20} stats: [numFiles=1, numRows=5, totalSize=214, rawDataSize=209]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.77 sec   HDFS Read: 4933 HDFS Write: 307 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 770 msec
OK
Time taken: 18.322 seconds
-- Check the rows in partition deptno=20 after the insert
hive> select * from ruozedata_static_emp where deptno='20';
OK
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
-- Then insert the rows for department 10 the same way
hive> insert into table ruozedata_static_emp partition(deptno='10')
    > select empno,ename,job,mgr,hiredate,salary,comm from ruozedata_emp 
    > where deptno=10;
Query ID = hadoop_20180609165353_6aa814d6-fd99-4374-90f0-ff5a78ce59d1
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1528518767828_0003, Tracking URL = http://hadoop000:8088/proxy/application_1528518767828_0003/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528518767828_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-06-09 23:41:06,624 Stage-1 map = 0%,  reduce = 0%
2018-06-09 23:41:12,920 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.04 sec
MapReduce Total cumulative CPU time: 2 seconds 40 msec
Ended Job = job_1528518767828_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop000:9000/user/hive/warehouse/hive3.db/ruozedata_static_emp/deptno=10/.
hive-staging_hive_2018-06-09_23-41-00_574_8181222805563183320-1/-ext-10000
Loading data to table hive3.ruozedata_static_emp partition (deptno=10)
Partition hive3.ruozedata_static_emp{deptno=10} stats: [numFiles=1, numRows=3, totalSize=130, rawDataSize=127]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.04 sec   HDFS Read: 4933 HDFS Write: 223 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 40 msec
OK
Time taken: 13.9 seconds
-- The fatal drawback of static partitioning: to load data for many partitions, you must repeat such an insert for each partition, one by one
-- Check how many partitions the table has

hive> show partitions ruozedata_static_emp;
OK
deptno=10
deptno=20
Time taken: 0.065 seconds, Fetched: 2 row(s)
Dynamic partitioning in practice:

-- Create a dynamic-partitioned table (the target table, ruozedata_dynamic_emp, used below)
hive> create table ruozedata_dynamic_emp
    > (empno int, ename string, job string, mgr int, hiredate string, salary double, comm double)
    > PARTITIONED by(deptno string)
    > ROW FORMAT DELIMITED 
    > FIELDS TERMINATED BY '\t' ;
OK
Time taken: 0.145 seconds
-- Check whether the dynamic-partitioned table has data
hive> select * from ruozedata_dynamic_emp;
OK
Time taken: 0.083 seconds
-- Partitioning is by department number, so there is no need to declare deptno as a regular column
-- it is replaced by the pseudo-column created for the partition

set hive.exec.dynamic.partition.mode=nonstrict;  -- set dynamic partition mode to nonstrict
insert into table ruozedata_dynamic_emp partition(deptno)
select empno,ename,job,mgr,hiredate,salary,comm,deptno from ruozedata_emp ;
-- the insert statement for a dynamic-partitioned table
-- partition(deptno), not partition(deptno=XXX)
-- the select clause must produce the same number and order of columns as the target table
-- the last column of the select clause must be the partition column
-- no where clause is needed

-- Set dynamic partition mode to nonstrict; set key=value is the standard way to configure Hive properties
set hive.exec.dynamic.partition.mode=nonstrict;
Syntax:
         set key=value   -- set a property
         set key;        -- show a property's current value

Adding a partition by manually creating the HDFS directory and file:
If, for a static partitioned table, you create the partition directory and upload the file by hand instead of using the partition-creation command or loading data into the partition, the table cannot see that partition until the metadata is refreshed; this way of adding partitions is not legitimate. A legitimate alternative is sketched below.
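A minimal sketch of registering a hand-made directory as a partition with ALTER TABLE ... ADD PARTITION (the partition value and path here are assumptions for illustration):

hive> ALTER TABLE order_partition ADD IF NOT EXISTS
    > PARTITION (event_month='2014-09')
    > LOCATION '/user/hive/warehouse/test1.db/order_partition/event_month=2014-09';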

Dynamic partitioned tables:

-- Using the emp table created in an earlier exercise
-- group emp's rows by department and load each group into its corresponding partition

create table emp_partition(
empno int, 
ename string, 
job string, 
mgr int, 
hiredate string, 
sal double, 
comm double)
PARTITIONED BY (deptno int)
row format delimited fields terminated by '\t';
-- Partitioning is by department number, so there is no need to declare deptno as a regular column
-- it is replaced by the pseudo-column created for the partition

set hive.exec.dynamic.partition.mode=nonstrict;  -- set dynamic partition mode to nonstrict
insert into table emp_partition partition(deptno)
select empno,ename,job,mgr,hiredate,sal,comm ,deptno from emp;
-- Query results in MySQL
mysql> select * from PARTITION_KEY_VALS;       
+---------+----------------------------+-------------+
| PART_ID | PART_KEY_VAL               | INTEGER_IDX |
+---------+----------------------------+-------------+
|       1 | 2014-05                    |           0 |
|       2 | 2014-07                    |           0 |
|       3 | 2014-06                    |           0 |
|       6 | 2014-08                    |           0 |
|      11 | 2014-05                    |           0 |
|      11 | 01                         |           1 |
|      16 | 20                         |           0 |
|      17 | 10                         |           0 |
|      18 | 30                         |           0 |
|      19 | __HIVE_DEFAULT_PARTITION__ |           0 |
|      20 | 20                         |           0 |
|      21 | 10                         |           0 |
+---------+----------------------------+-------------+
mysql> select * from PARTITIONS;       
+---------+-------------+------------------+-----------------------------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME                         | SD_ID | TBL_ID |
+---------+-------------+------------------+-----------------------------------+-------+--------+
|       1 |  1528520920 |                0 | event_month=2014-05               |    32 |     31 |
|       2 |  1528523750 |                0 | event_month=2014-07               |    33 |     31 |
|       3 |  1528525251 |                0 | event_month=2014-06               |    35 |     31 |
|       6 |  1528526037 |                0 | event_month=2014-08               |    42 |     31 |
|      11 |  1528531683 |                0 | event_month=2014-05/event_day=01  |    48 |     42 |
|      16 |  1528558645 |                0 | deptno=20                         |    56 |     45 |
|      17 |  1528558874 |                0 | deptno=10                         |    57 |     45 |
|      18 |  1528628193 |                0 | deptno=30                         |    59 |     46 |
|      19 |  1528628193 |                0 | deptno=__HIVE_DEFAULT_PARTITION__ |    60 |     46 |
|      20 |  1528628193 |                0 | deptno=20                         |    61 |     46 |
|      21 |  1528628193 |                0 | deptno=10                         |    62 |     46 |
+---------+-------------+------------------+-----------------------------------+-------+--------+
11 rows in set (0.00 sec)
hive> select * from ruozedata_dynamic_emp;
OK
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
8888    HIVE    PROGRAM 7839    1988-1-23       10300.0 NULL    __HIVE_DEFAULT_PARTITION__
Time taken: 0.817 seconds, Fetched: 15 row(s)
Dynamic partitioning requires nonstrict mode (production always runs in nonstrict mode):
hive> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=nonstrict
Hive built-in functions:
SHOW FUNCTIONS;   -- list all built-in functions
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;
hive> desc function upper;
OK
upper(str) - Returns str with all characters changed to uppercase
Time taken: 0.201 seconds, Fetched: 1 row(s)
hive> desc function EXTENDED upper;
OK
upper(str) - Returns str with all characters changed to uppercase
Synonyms: ucase
Example:
  > SELECT upper('Facebook') FROM src LIMIT 1;
  'FACEBOOK'
Time taken: 0.066 seconds, Fetched: 5 row(s)

-- Look at a function's exact syntax

hive> DESCRIBE FUNCTION case;
OK
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f


hive> DESCRIBE FUNCTION EXTENDED case;
OK
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
Example:
 SELECT
 CASE deptno
   WHEN 1 THEN Engineering
   WHEN 2 THEN Finance
   ELSE admin
 END,
 CASE zone
   WHEN 7 THEN Americas
   ELSE Asia-Pac
 END
 FROM emp_details





