1 Hive import and export
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
1.1 export
- The export target path here refers to a path inside HDFS
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
TO 'export_target_path' [ FOR replication('eventid') ]
export table emp to '/user/hive/warehouse/test_export/emp_exp';
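This section covers import as well. Per the linked manual, the matching IMPORT statement restores a table from an exported path. A minimal sketch (the target table name `emp_imported` is a hypothetical name chosen for illustration):

```sql
-- Syntax from the Hive manual:
-- IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
--   FROM 'source_path' [LOCATION 'import_target_location']

-- Re-create the exported emp table under a new name (emp_imported is hypothetical)
import table emp_imported from '/user/hive/warehouse/test_export/emp_exp';
```

If the table name is omitted, Hive imports the table under its original name, which fails if a table called emp already exists in the current database.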
2 Sort
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
order by
: global sort; always uses a single reducer
sort by
: sorts within each reducer; the result is not globally sorted
distribute by
: similar to partition in MapReduce; distributes rows across reducers, used together with sort by
cluster by
: usable when the distribute by and sort by columns are the same
2.1 order by
(use with caution)
- Sorts all data globally; the job always runs with a single reducer
SELECT * FROM emp order by empno desc;
hive (default)> SELECT * FROM emp order by empno desc;
Query ID = hadoop_20190117103838_eb7585e6-fbbc-4fc8-978a-fb5216cc70d0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1547692669080_0013, Tracking URL = http://node1:8088/proxy/application_1547692669080_0013/
Kill Command = /home/hadoop/apps/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1547692669080_0013
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-01-17 22:04:54,709 Stage-1 map = 0%, reduce = 0%
2019-01-17 22:04:58,936 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.34 sec
2019-01-17 22:05:05,106 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.78 sec
MapReduce Total cumulative CPU time: 2 seconds 780 msec
Ended Job = job_1547692669080_0013
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.78 sec HDFS Read: 8185 HDFS Write: 674 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 780 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7934 Miller clerk 7782 1982-1-23 1300.0 10.0 NULL
7902 Ford analyst 7566 1981-12-3 3000.0 20.0 NULL
7900 Jamas clerk 7698 1981-12-3 950.0 30.0 NULL
7876 Adams clerk 7788 1987-5-23 1100.0 20.0 NULL
7844 Turner salesman 7698 1981-9-8 1500.0 0.0 30
7839 King president NULL 5000.0 10.0 NULL NULL
7788 Scott analyst 7566 1987-4-19 3000.0 20.0 NULL
7782 Clark manager 7839 1981-6-9 2450.0 10.0 NULL
7698 Blake manager 7839 1981-5-1 2850.0 30.0 NULL
7654 Martin salesman 7698 1981-9-28 1250.0 1400.0 30
7566 Jones manager 7839 1981-4-2 2975.0 20.0 NULL
7521 Ward salsesman 7698 1981-2-22 1250.0 500.0 30
7499 Allen salesman 7698 1981-2-20 1600.0 300.0 30
7369 Smith clerk 7902 1980-12-17 800.0 20.0 NULL
Time taken: 17.23 seconds, Fetched: 14 row(s)
2.2 sort by
- Sorts the data within each reducer
set mapreduce.job.reduces=3;
INSERT overwrite local directory '/home/hadoop/tempdata/sortby_res'
SELECT * FROM emp sort by empno asc;
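With mapreduce.job.reduces=3, each of the three reducers writes its own output file into the target directory; rows are sorted by empno within each file, but not across files. A quick way to verify (file names like 000000_0 follow Hive's default reducer-output naming, so the exact names may vary):

```shell
# List the per-reducer output files written by the INSERT above
ls /home/hadoop/tempdata/sortby_res
# typically 000000_0, 000001_0, 000002_0 -- one file per reducer
```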
2.3 distribute by
- Partitioning (like partition in MapReduce): distributes rows across reducers; used together with sort by
- distribute by must appear before sort by
INSERT overwrite local directory '/home/hadoop/tempdata/distby_res'
SELECT * FROM emp distribute by deptno sort by empno asc;
2.4 cluster by
- Can be used when the distribute by and sort by columns are the same
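A sketch of the equivalence: when the same column is used for both distributing and sorting, the two statements below produce the same result. Note that cluster by only supports ascending order and does not accept asc/desc.

```sql
-- These two statements are equivalent:
SELECT * FROM emp distribute by empno sort by empno;
SELECT * FROM emp cluster by empno;
```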