Hive Basics Summary: DML

sofency

Article link: https://blog.csdn.net/qq_43079376/article/details/108083716

Data Import

Basic syntax for loading data into a table

load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,....)]

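Parameter details

1. load data: loads data into a table
2. local: load from the local filesystem into the Hive table; without it, the path refers to HDFS (and the file is moved into the table's directory rather than copied)
3. inpath: the path of the data to load
4. overwrite: replace the data already in the table; without it, the new data is appended
5. into table: introduces the target table
6. student: the name of the target table
7. partition: load into the specified partition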
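
To make the role of each optional clause concrete, here is a minimal Python sketch that assembles a load data statement from the parameters listed above. The helper name build_load_statement and its arguments are hypothetical, chosen only for illustration; it is not part of Hive and does no escaping or validation.

```python
def build_load_statement(path, table, local=False, overwrite=False, partition=None):
    """Assemble a Hive LOAD DATA statement (hypothetical helper, no escaping)."""
    parts = ["load data"]
    if local:
        # "local" reads from the local filesystem instead of HDFS
        parts.append("local")
    parts.append(f"inpath '{path}'")
    if overwrite:
        # "overwrite" replaces existing data; otherwise data is appended
        parts.append("overwrite")
    parts.append(f"into table {table}")
    if partition:
        # partition is a dict such as {"month": "202009"}
        spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
        parts.append(f"partition ({spec})")
    return " ".join(parts) + ";"


local_stmt = build_load_statement("/opt/module/datas/student.txt", "student", local=True)
hdfs_stmt = build_load_statement(
    "/opt/module/datas/student.txt", "student",
    overwrite=True, partition={"month": "202009"},
)
print(local_stmt)
print(hdfs_stmt)
```

The first call produces an append from the local filesystem; the second produces an overwrite of one partition from an HDFS path.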

Inserting data into a table from a query
Insert the August data into the September partition

insert into table student partition(month='2020-09')
	   select id,name from student where month='2020-08';

Writing in overwrite mode

insert overwrite table student partition (month = '201708')
	   select id,name from student where month = '201709';

Multi-insert mode (scan the source table once and write several results): insert the September data into both the July and June partitions

from student
insert overwrite table student partition(month='201707')
select id,name where month = '201709'
insert overwrite table student partition(month='201706')
select id,name where month = '201709';

Creating a table from a query result
create table if not exists student1 as select id,name from student;

Creating a table with a specified location on HDFS

create external table if not exists student5(
id int,name string
)
row format delimited fields terminated by '\t'
location '/student';

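Importing data into a specified Hive table with import (the source directory must have been written by export)

import table student2 partition(month ='202009') from '/usr/hive/warehouse/export/student';  -- the path is on HDFS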

Data Export

Exporting Hive table data

Export a query result to the local directory /usr/local/student
insert overwrite local directory '/usr/local/student' select * from student;

Export the query result with explicit formatting to a local directory (or, without local, to HDFS)
insert overwrite [local] directory '/usr/local/student'
	row format delimited fields terminated by '\t'
	collection items terminated by "_"  
	map keys terminated by ":"
	select * from student;

Export to the local filesystem with a Hadoop command (run from the Hive CLI)
dfs -get /usr/local/data/hive/student/month=201708/00000_0 /usr/local/student.txt;
Export to the local filesystem from the shell with hive -e
bin/hive -e 'select * from default.student;' > /usr/local/hive/student.txt
Export to HDFS with export
export table default.student to '/usr/local/hive/student.txt';

Clearing Data

truncate table test; -- removes all rows from table test (only managed tables can be truncated, not external ones)

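Sorting

Global sort: order by
desc sorts in descending order, asc in ascending order (the default)
select * from student order by id asc;
Sorting within each reducer: sort by
order by sends all rows through a single reducer, so it is very slow on large data sets. When a global order is not needed, use sort by instead, which sorts the rows inside each reducer; it also accepts asc and desc.
Setting the number of reducers
set mapreduce.job.reduces=3; This only applies to the current session; the default value is restored once the session is closed.
Partitioning rows across reducers: distribute by
Rules:
1. distribute by takes the hash code of the distribution column modulo the number of reducers; rows with the same remainder go to the same reducer.
2. Hive requires the distribute by clause to appear before the sort by clause.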
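
The hash-and-modulo rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hive's actual code: it assumes an integer column hashes to its own value and that the hash is masked to a non-negative number before the modulo. The function name reducer_for and the sample ids are made up for the example.

```python
def reducer_for(key: int, num_reducers: int) -> int:
    """Return the reducer index a row would be routed to under distribute by."""
    # Assumption: the hash code of an integer key is the integer itself.
    hash_code = key
    # Mask to a non-negative value, then take the remainder.
    return (hash_code & 0x7FFFFFFF) % num_reducers


ids = [1001, 1002, 1003, 1004, 1005, 1006]
partitions = {}
for i in ids:
    # Rows with the same remainder end up on the same reducer.
    partitions.setdefault(reducer_for(i, 3), []).append(i)

print(partitions)
```

With three reducers, ids that differ by a multiple of 3 (for example 1001 and 1004) land on the same reducer.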
cluster by
When distribute by and sort by use the same column, cluster by can replace both, but it only sorts in ascending order.
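
As a rough illustration of what cluster by produces, the Python sketch below distributes rows by a key and then sorts each reducer's rows by that same key in ascending order. It assumes an integer key hashes to itself; the function name cluster_by and the sample rows are hypothetical.

```python
from collections import defaultdict


def cluster_by(rows, key, num_reducers):
    """Simulate cluster by: distribute rows by `key`, then sort each group ascending."""
    reducers = defaultdict(list)
    for row in rows:
        # Assumption: an integer key hashes to itself.
        reducers[row[key] % num_reducers].append(row)
    # Each reducer sorts only its own rows; there is no global order.
    return {r: sorted(group, key=lambda row: row[key])
            for r, group in sorted(reducers.items())}


rows = [{"id": i} for i in (5, 2, 4, 1, 3, 6)]
result = cluster_by(rows, "id", 3)
per_reducer_ids = {r: [row["id"] for row in group] for r, group in result.items()}
print(per_reducer_ids)
```

Each reducer's output is sorted, but concatenating the reducers' outputs does not give a globally sorted result, which is the difference from order by.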

Bucketing and Sampling Queries

Creating a bucketed table

create table school(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

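Remember to enable bucketing with the property below, or set it permanently in the configuration file (Hive 2.0 and later always enforce bucketing and no longer need it)
set hive.enforce.bucketing=true;

How rows are assigned to buckets
Hive hashes the value of the bucketing column and takes the remainder after dividing by the number of buckets; that remainder decides which bucket the row is stored in.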

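Sampling queries

For very large data sets, users sometimes need a representative subset of the results rather than all of them. Hive supports this by sampling the table.

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
Note: tablesample is the sampling clause.
Syntax: TABLESAMPLE(BUCKET x OUT OF y ON field)
y must be a multiple or a factor of the table's total number of buckets. Hive uses y to decide the sampling ratio.
For example, if the table has 4 buckets, y=2 samples (4/2=) 2 buckets of data,
and y=8 samples (4/8=) 1/2 of a bucket.
x is the bucket to start from, and x must not exceed y. When more than one bucket is sampled, each following bucket number is the previous one plus y.
For example, with 4 buckets in total, tablesample(bucket 1 out of 2) samples (4/2=) 2 buckets of data: bucket 1 (x) and bucket 3 (x+y).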
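
The arithmetic above can be checked with a small Python sketch. It is a simplified model, assuming that an integer id hashes to itself, that a row is stored in bucket (hash mod number of buckets) + 1, and that a row is sampled when hash mod y equals x - 1. The function names bucket_of and in_sample and the ids 1 to 16 are made up for the example.

```python
def bucket_of(value: int, num_buckets: int) -> int:
    """1-based bucket number a row is stored in (assumes the int hashes to itself)."""
    return (value & 0x7FFFFFFF) % num_buckets + 1


def in_sample(value: int, x: int, y: int) -> bool:
    """True when the row falls in the x-th of y hash slices."""
    return (value & 0x7FFFFFFF) % y == x - 1


ids = list(range(1, 17))  # 16 rows spread over 4 buckets, 4 rows each

# bucket 1 out of 2: two whole buckets are read
sample_y2 = [i for i in ids if in_sample(i, 1, 2)]
buckets_y2 = sorted({bucket_of(i, 4) for i in sample_y2})

# bucket 1 out of 8: half of a single bucket is read
sample_y8 = [i for i in ids if in_sample(i, 1, 8)]
buckets_y8 = sorted({bucket_of(i, 4) for i in sample_y8})

print(buckets_y2, sample_y2)
print(buckets_y8, sample_y8)
```

With y=2 the sampled rows come from buckets 1 and 3, and with y=8 they are half of the rows in bucket 1, which matches the description above.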
