Some Notes on DML in Hive

一抹米粒

Article link: https://blog.csdn.net/nj_hao/article/details/107535354

CTE and Nested Queries

CTE (Common Table Expression)

-- CTE syntax
with t1 as (select ...) select * from t1;

Nested CTEs
The statements do not have to be written in any particular order
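
To illustrate chaining CTEs, here is a minimal sketch that reuses the emp table from the join examples. The CTE names dept_cnt and big_dept and the threshold of 3 are made up for this example. The CTEs are written in dependency order here simply for readability.

```sql
-- First CTE: number of employees per department
with dept_cnt as (
  select deptno, count(*) as cnt
  from emp
  group by deptno
),
-- Second CTE builds on the first: departments with more than 3 employees
big_dept as (
  select deptno
  from dept_cnt
  where cnt > 3
)
-- Final query: employees who work in those departments
select e.*
from emp e
join big_dept b on e.deptno = b.deptno;
```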

Nested queries

select * from (select ...) [as] t1;

join

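Older Hive releases only support equi-joins; non-equality conditions in the ON clause are accepted starting with Hive 2.2.0

Q: How many kinds of joins are there?

Inner join, outer join, cross join, and equi-join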

Inner join

-- keeps only the rows for which both tables have a match on the join condition
select * from emp e join dept d on e.deptno=d.deptno;

Outer join

Left outer join

-- every row of the left table is kept; right-side columns are NULL where nothing matches
select * from emp e left join dept d on e.deptno=d.deptno;

Right outer join

-- every row of the right table is kept; left-side columns are NULL where nothing matches
select * from emp e right join dept d on e.deptno=d.deptno;

Full outer join

-- every row of both tables is kept; the side without a match is filled with NULL
select * from emp e full join dept d on e.deptno=d.deptno;

Cross join

A Cartesian-product join: every row of one table is paired with every row of the other
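
A minimal sketch of an explicit cross join on the emp and dept tables used above. The result has (rows in emp) x (rows in dept) rows, so it is normally only reasonable on small tables. Depending on the Hive version and its strict-mode settings, a Cartesian product may be rejected unless those checks are relaxed.

```sql
-- No ON clause: each emp row is combined with each dept row
select e.*, d.*
from emp e
cross join dept d;
```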

Implicit join
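
An implicit join lists the tables separated by commas and puts the join condition in where instead of on. As a sketch on the same emp and dept tables (comma-separated tables in from are accepted from Hive 0.13.0 onward):

```sql
-- Equivalent to: select ... from emp e join dept d on e.deptno = d.deptno
select e.*, d.*
from emp e, dept d
where e.deptno = d.deptno;
```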

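Map join and reduce join in Hive

hive.auto.convert.join

Defaults to true since version 0.11

hive.mapjoin.smalltable.filesize/hive.smalltable.filesize
This parameter sets the size threshold below which a table counts as the small table. The default is about 25 MB (25000000 bytes), very roughly 20,000 to 30,000 rows depending on row width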

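Map join

Joins a small table to a large table (small table on the left, large table on the right)
Non-equi join conditions can be used

Enabling map join (joins are converted to map joins automatically at run time)

set hive.auto.convert.join = true;  -- the default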

MAPJOIN is not supported in the following positions:
after UNION ALL, LATERAL VIEW, GROUP BY/JOIN/SORT BY/CLUSTER BY/DISTRIBUTE BY
before UNION, JOIN, or another MAPJOIN
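
Besides automatic conversion, a map join can be requested with a query hint. The sketch below assumes dept is the small table. Newer releases may ignore the hint unless hive.ignore.mapjoin.hint is set to false, so treat that line as an assumption to check against your version.

```sql
-- Assumption: let the hint take effect instead of relying on auto conversion
set hive.ignore.mapjoin.hint = false;

-- Ask Hive to load dept (alias d) into memory and join on the map side
select /*+ MAPJOIN(d) */ e.*, d.*
from emp e
join dept d on e.deptno = d.deptno;
```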

Union

All subsets must have the same column names and types
Other set operations (difference, intersection) can be implemented with JOIN/OUTER JOIN

union: removes duplicates
union all: keeps duplicates
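
The points above can be sketched on the emp and dept tables from the join section. Only the deptno column is taken from the original examples; which table plays which role is chosen for illustration. Note that union without all requires Hive 1.2.0 or later.

```sql
-- union all: plain concatenation, duplicates are kept
select deptno from emp
union all
select deptno from dept;

-- union: duplicates are removed
select deptno from emp
union
select deptno from dept;

-- Intersection via join: deptno values present in both tables
select distinct e.deptno
from emp e
join dept d on e.deptno = d.deptno;

-- Difference via outer join: deptno values in dept that no emp row uses
select d.deptno
from dept d
left join emp e on d.deptno = e.deptno
where e.deptno is null;
```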

insert

Inserting data into a table

-- INSERT supports OVERWRITE (replace) and INTO (append)
INSERT OVERWRITE/INTO TABLE tablename1
[PARTITION (partcol1=val1, partcol2=val2 ...)]
select fields,... from tb_other

The table keyword is optional with insert into

INSERT OVERWRITE TABLE test select 'hello'; -- a form that INSERT does not support
insert into employee select * from ctas_employee; -- insert from a query
-- multi-insert

from ctas_employee              -- better performance: the input data is scanned only once
insert overwrite table employee select *
insert overwrite table employee_internal select *;

-- insert into partitions
from ctas_patitioned 
insert overwrite table employee PARTITION (year, month)
select *,'2018','09';
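
The partition insert above names the partition columns without values, which makes it a dynamic-partition insert. When every partition column is dynamic, Hive's default strict mode rejects the statement, so a sketch of the full sequence would look like this (same tables as above; the static variant is added for comparison):

```sql
-- Allow dynamic partitions, including the case where no partition value is static
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

-- Dynamic: the last two select columns supply year and month
from ctas_patitioned
insert overwrite table employee partition (year, month)
select *, '2018', '09';

-- Static alternative: partition values are fixed in the partition clause
from ctas_patitioned
insert overwrite table employee partition (year = '2018', month = '09')
select *;
```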

-- insert into specified columns (insert into may omit the table keyword)
insert into employee(name) select 'John' from test limit 1;
							-- naming the columns helps when the data schema changes

-- insert literal values
insert into employee(name) values ('Judy'),('John');

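Inserting/exporting data to files

Exporting with insert

Only overwrite is supported
Multiple inserts from the same data source/table are supported
LOCAL: writes to the local file system
By default data is written as TEXT, with columns separated by ^A
Custom delimiters are supported, so files can be exported in other formats such as CSV or JSON

-- from one data source, write to a local directory, an HDFS directory, and a table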
from ctas_employee
insert overwrite local directory '/tmp/out1'  select *
insert overwrite directory '/tmp/out1' select *
insert overwrite table employee_internal select *;

-- write the data in a specified format
insert overwrite directory '/tmp/out3'
row format delimited fields terminated by ','
select * from ctas_employee;

-- another way to get files out of a table
hdfs dfs -getmerge <table_file_path> <local_file>

Exporting to the local file system with Hadoop commands

hive (default)> dfs -get /user/hive/warehouse/student/month=201709/000000_0
/opt/module/datas/export/student3.txt;

Exporting with the hive shell command

Basic syntax: hive -f/-e <script or statement> > file

[atguigu@hadoop102 hive]$ bin/hive -e 'select * from default.student;' >
 /opt/module/datas/export/student4.txt;

Exporting to HDFS with Export

Apart from whole databases, all data and metadata can be exported and imported

hive (default)> export table default.student to
 '/user/hive/warehouse/export/student';

-- import data
IMPORT TABLE employee FROM '/tmp/output3';
IMPORT TABLE employee_partitioned partition (year=2014, month=11) FROM '/tmp/output5';

Sqoop

I am not familiar with this yet

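Sorting data

order by

order by can sort on an expression. In the query below, the case expression assigns sort key 1 to ids above 95 and 0 to the rest, so with desc the ids 96~100 come first:

condition          employee_id range   sort key
(all rows)         1~100
employee_id>95     96~100              1
employee_id<=95    1~95                0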

select * from employee_hr
order by case when employee_id>95 then 1 else 0 end desc
limit 10;

sort by

distribute by

The distribute by clause must be written before sort by

cluster by
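
A brief sketch of how the three clauses differ, using the emp table from the join section. The empno column is an assumption made for this example. sort by orders rows only within each reducer, distribute by decides which reducer a row goes to, and cluster by is shorthand for distributing and sorting on the same columns.

```sql
-- sort by: each reducer's output is sorted, but there is no global order
select * from emp
sort by deptno;

-- distribute by + sort by: rows with the same deptno land on the same reducer,
-- then each reducer sorts its rows (distribute by comes first)
select * from emp
distribute by deptno
sort by deptno, empno;

-- cluster by deptno is equivalent to: distribute by deptno sort by deptno
select * from emp
cluster by deptno;
```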

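Grouping

Group by

Apart from aggregate functions, every selected column must appear in the group by clause

if(condition, value_if_true, value_if_false)

case when condition then value_if_true
	 when condition then value_if_true
     when condition then value_if_true
	 else value_if_none_match
end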

having

Filters the groups produced by group by
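
Putting group by, if, case when, and having together, a minimal sketch on the emp table. The sal column and the thresholds 3000 and 2 are assumptions made for this example.

```sql
select deptno,
       count(*) as headcount,
       -- conditional count written with if(...)
       sum(if(sal > 3000, 1, 0)) as high_paid,
       -- the same conditional count written with case when ... end
       sum(case when sal > 3000 then 1 else 0 end) as high_paid_case
from emp
group by deptno          -- deptno is the only non-aggregated column selected
having count(*) > 2;     -- keep only departments with more than 2 employees
```

where filters rows before grouping, while having filters the groups after the aggregates have been computed.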
