Importing and exporting data in Hive, static partitions and dynamic partitions

Latest recommended article published on 2023-06-23 19:29:10

晓晓很可爱

Article link: https://blog.csdn.net/Fresh_man888/article/details/109059034

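0. Ways to import data:

1) Upload the data file directly into the table's directory: hdfs dfs -put <local file> <table directory in HDFS>;

2) Load a local file with a command. For local data, Hive uploads (copies) the file into the table's directory.

load data local inpath "local file path" into table table_name;

3) Load a file that is already in HDFS. For HDFS data, Hive moves the file into the table's directory.

load data inpath "HDFS file path" into table table_name;

4) Specify the location of the data files when creating the table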

create table tb_teacher(
name string,
friends array<string>,
children map<string, int>,
address struct<street:string, city:string>
)
row format delimited fields terminated by ','
location "HDFS path" ;

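5) Overwrite existing data on import

Overwrite with a local file:

load data local inpath "local file path" overwrite into table table_name;

Overwrite with an HDFS file:

load data inpath "HDFS file path" overwrite into table table_name;

6) import: only data that was produced by export can be imported

Export the data to HDFS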
export table tb_product to
 '/user/hive/warehouse/export/product';
 
Import the data
import table tb_product2  from
 '/user/hive/warehouse/export/product';

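7) Import data with insert

-- insert into tb_a values() ;  row-by-row inserts like this are not used in practice
insert into tb_a select id , name , age from tb_b ; -- direct insert; the target table must already exist
insert overwrite table tb_demo1 select * from tb_demo2 ; -- overwrite insert

8) create ... as select saves the query result directly in a new table. The table is created automatically, and its column structure matches the query result.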

create table if not exists tb_phone as select * from tb_product where cate = '手机' ;
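
The difference between methods 2) and 3) is easy to miss: a local load copies the source file, while an HDFS load moves it, so the HDFS source path is empty afterwards. The following is a minimal Python sketch that mimics this on the local filesystem. It does not call Hive; the function names load_local and load_hdfs and the directory names are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def load_local(src: Path, table_dir: Path) -> Path:
    """Mimic `load data local inpath`: the source file is copied."""
    table_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src, table_dir / src.name))


def load_hdfs(src: Path, table_dir: Path) -> Path:
    """Mimic `load data inpath`: the source file is moved."""
    table_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(src), str(table_dir / src.name)))


def demo() -> dict:
    """Load one file each way and report what is left behind."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        local_file = root / "local" / "a.txt"      # stands in for a local file
        staged_file = root / "staging" / "b.txt"   # stands in for a file already in HDFS
        for f in (local_file, staged_file):
            f.parent.mkdir(parents=True)
            f.write_text("1,demo\n")
        table_dir = root / "warehouse" / "tb_demo"  # hypothetical table directory
        load_local(local_file, table_dir)
        load_hdfs(staged_file, table_dir)
        return {
            "local_source_still_exists": local_file.exists(),
            "hdfs_source_still_exists": staged_file.exists(),
            "files_in_table_dir": sorted(p.name for p in table_dir.iterdir()),
        }


if __name__ == "__main__":
    print(demo())
```

In practice this means a file loaded with load data inpath should not be expected at its old HDFS path afterwards.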

1. Ways to export data:

1.1 Exporting with insert:

insert overwrite local  directory "/doit18/demo1"
select * from tb_demo1 ;  -- export to a directory on the local filesystem

insert overwrite  directory "/doit18/demo1"
select * from tb_demo1 ; -- export to a directory in HDFS

insert overwrite local  directory "/doit18/demo1"
row format delimited fields terminated by "," 
select * from tb_demo1 ;  -- export to the given directory and set the field delimiter

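1.2 Exporting by running commands from the shell

Shell approach
hive -e "SQL statement" >> local_file   (>> appends to the file)
hive -f sql_file > output_file   (> overwrites the file, >> appends to it). -f runs the SQL statements saved in a local file.
Scheduled job scripts (SQL) can be run with hive -e / -f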
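
For a scheduled job, the command line is often assembled by a script. The sketch below only builds the string for the hive -e form described above and quotes it for the shell; it does not run Hive. The function name hive_export_command and the output path /tmp/tb_demo1.txt are hypothetical.

```python
import shlex


def hive_export_command(sql: str, output_file: str, append: bool = False) -> str:
    """Return a shell command that runs `sql` with hive -e and redirects the output.

    `>` overwrites the output file, `>>` appends to it.
    """
    redirect = ">>" if append else ">"
    return f"hive -e {shlex.quote(sql)} {redirect} {shlex.quote(output_file)}"


if __name__ == "__main__":
    # /tmp/tb_demo1.txt is an assumed output path, used only for illustration.
    print(hive_export_command("select * from tb_demo1", "/tmp/tb_demo1.txt"))
    print(hive_export_command("select * from tb_demo1", "/tmp/tb_demo1.txt", append=True))
```

Quoting the SQL with shlex.quote keeps characters such as * from being expanded by the shell.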

1.3 Exporting with hdfs dfs -get, which fetches the table's files
hive> dfs -get <HDFS path of the table's files> <local directory>;

1.4 export:

export table tb_demo1 to
'/user/hive/warehouse/export/demo1';

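1.5 sqoop, a data migration tool

2. Partitioned tables, static partitions and dynamic partitions

2.1 What is a partitioned table:

The concept of data partitioning has existed for a long time. Partitioning is commonly used to spread load horizontally, to move data physically closer to the users who access it most, and to serve other purposes. In Hive, the data a table works on lives in one corresponding directory.

Hive's data is stored in HDFS. Consider select * from tb_name where dt='2020-06-18' ;
A query loads the data under the table's directory in HDFS. When that directory holds a lot of data, loading everything and only then filtering is clearly inefficient. A partitioned table in Hive simply organizes the data into directories by some dimension; when a query filters on that dimension, Hive loads data directly from the matching directory, which is more efficient.

Partitions bring important performance advantages, and partitioned tables also organize data in a logical way, for example hierarchical storage.

Partitioning comes in two forms: static partitions and dynamic partitions.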
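
The efficiency argument can be made concrete with a small model. The Python sketch below is not Hive code: it treats a table as a list of rows and a partitioned table as a dictionary keyed by partition value, then counts how many rows each kind of query has to read. The function names are made up, and the rows are a few of the sample records used later in this article.

```python
from collections import defaultdict

# (uid, logintime, amount) rows taken from the sample data in this article
ROWS = [
    (1, "2020-06-18", 200),
    (2, "2020-06-18", 200),
    (11, "2020-06-19", 20),
    (14, "2020-06-19", 200),
]


def full_scan(rows, dt):
    """Unpartitioned table: read every row, then filter."""
    rows_read = 0
    matched = []
    for row in rows:
        rows_read += 1
        if row[1] == dt:
            matched.append(row)
    return matched, rows_read


def build_partitions(rows):
    """Group rows by date, one 'directory' per distinct value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[1]].append(row)
    return dict(partitions)


def pruned_scan(partitions, dt):
    """Partitioned table: read only the rows in the matching 'directory'."""
    rows = partitions.get(dt, [])
    return list(rows), len(rows)


if __name__ == "__main__":
    print(full_scan(ROWS, "2020-06-18"))
    print(pruned_scan(build_partitions(ROWS), "2020-06-18"))
```

Both functions return the same rows, but the partitioned lookup reads 2 rows instead of 4. The gap grows with the number of partitions.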

2.2 Static partitioned tables: single-level static partitions

1) Specify the partition column when creating the table:

create table tb_name(
uid int,
logintime string,
enum int
)
partitioned by (dt string)  -- the partition column
row format delimited fields terminated by ",";

2) Prepare the structured data in local files

For example, the data is:

06-18.txt
01,2020-06-18,200
02,2020-06-18,200
03,2020-06-18,100
03,2020-06-18,200
04,2020-06-18,200
05,2020-06-18,20
06,2020-06-18,100
07,2020-06-18,200
08,2020-06-18,200
09,2020-06-18,100
10,2020-06-18,200
06-19.txt
11,2020-06-19,20
14,2020-06-19,200
15,2020-06-19,20
16,2020-06-19,100
17,2020-06-19,200
18,2020-06-19,200
19,2020-06-19,100
12,2020-06-19,200
13,2020-06-19,100
13,2020-06-19,200

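3) If the data is simply uploaded into the table's directory, running select * from tb_name loads every file under that directory, and a filtered query scans the whole table before filtering, which is inefficient. For example, select id , name from tb_user ; loads all the data in the directory.
Likewise, select id , name from tb_user where login_date="2020-10-13"; and select id , name from tb_user where login_date="2020-10-12"; both load all the data in the directory and then filter out the result.

4) Load the data into HDFS by partition column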

create table  tb_p_order(
oid int ,
dt string ,
cost double 
)
partitioned  by (dy string)
row format delimited fields terminated by "," ;
 
 
load data local inpath "/hive/data/06-18.txt" into table  tb_p_order  partition(dy="06-18");
load data local inpath "/hive/data/06-19.txt" into table  tb_p_order  partition(dy="06-19");
	
	0: jdbc:hive2://linux01:10000> select * from tb_p_order where  dy="06-18";
+-----------------+----------------+------------------+----------------+
| tb_p_order.oid  | tb_p_order.dt  | tb_p_order.cost  | tb_p_order.dy  |
+-----------------+----------------+------------------+----------------+
| 1               | 2020-06-18     | 200.0            | 06-18          |
| 2               | 2020-06-18     | 200.0            | 06-18          |
| 3               | 2020-06-18     | 100.0            | 06-18          |
| 3               | 2020-06-18     | 200.0            | 06-18          |
| 4               | 2020-06-18     | 200.0            | 06-18          |
| 5               | 2020-06-18     | 20.0             | 06-18          |
| 6               | 2020-06-18     | 100.0            | 06-18          |
| 7               | 2020-06-18     | 200.0            | 06-18          |
| 8               | 2020-06-18     | 200.0            | 06-18          |
| 9               | 2020-06-18     | 100.0            | 06-18          |
| 10              | 2020-06-18     | 200.0            | 06-18          |
+-----------------+----------------+------------------+----------------+

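5) View how the data is stored in HDFS

2.2 Two-level static partitions:

Data can be subdivided further by hierarchy, for example with year as the first-level partition and month partitions under each year.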

create table tb_partition2(
id int ,
name string ,
gender string ,
birthday string 
)
partitioned  by (y string , m string)
row format delimited fields terminated by "," ;
			   
a.txt
1001,ls,M,90-01-05
1002,zs,M,90-01-06
1003,ww,F,90-01-07
 
b.txt
1001,ls2,M,90-02-05
1002,zs2,M,90-02-06
1003,ww2,F,90-02-07
 
c.txt
1001,ls2,M,95-02-05
1002,zs2,M,95-02-06
1003,ww2,F,95-02-07
 
d.txt
1001,ls2,M,95-03-05
1002,zs2,M,95-03-06
1003,ww2,F,95-03-07
 
 
load data local  inpath "/hive/data/a.txt"  into  table tb_partition2 partition(y='90',m='01');
load data local  inpath "/hive/data/b.txt"  into  table tb_partition2 partition(y='90',m='02');
load data local  inpath "/hive/data/c.txt"  into  table tb_partition2 partition(y='95',m='02');
load data local  inpath "/hive/data/d.txt"  into  table tb_partition2 partition(y='95',m='03');

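2.3 Two-level static partitions: directory structure in HDFS:

The data is laid out in HDFS like this (Hive names each partition directory column=value):

/tb_partition2/
    /y=90/        -- partition for the year column
        /m=01     -- month partition under that year
        /m=02     -- month partition under that year
    /y=95/        -- partition for the year column
        /m=02     -- month partition under that year
        /m=03     -- month partition under that year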
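
The layout follows one rule: each partition column adds one directory level named column=value, in the order the columns appear in partitioned by. A minimal Python sketch of that rule is below. The function partition_path is hypothetical, and the warehouse prefix /user/hive/warehouse is an assumption about where tb_partition2 lives.

```python
def partition_path(table_dir: str, spec: dict) -> str:
    """Join a table directory with one column=value segment per partition column.

    `spec` must list the partition columns in the order used in `partitioned by`.
    """
    segments = [f"{column}={value}" for column, value in spec.items()]
    return "/".join([table_dir.rstrip("/")] + segments)


if __name__ == "__main__":
    # Assumed table location, for illustration only.
    table_dir = "/user/hive/warehouse/tb_partition2"
    print(partition_path(table_dir, {"y": "90", "m": "01"}))
    print(partition_path(table_dir, {"y": "95", "m": "03"}))
```

The partition(y='90',m='01') clause in the load statements is what selects the directory a file ends up in.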

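2.4 Dynamic partitioned tables

Everything above is static partitioning: the data was already stored in separate files by some dimension. If you want to partition by an attribute of the queried data itself, that is dynamic partitioning.

With the data below, to partition by gender or by location, the partitions are determined by the values of that attribute.

user.txt  ---> data source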
u001 ZSS 23 M beijing
u002 YMM 33 F nanjing
u003 LSS 43 M beijing
u004 ZY 23 F beijing
u005 ZM 23 M beijing
u006 CL 23 M dongjing
u007 LX 23 F beijing
u008 YZ 23 M beijing
u009 YM 23 F nanjing
u010 XM 23 M beijing
u011 XD 23 F beijing
u012 LH 23 M dongjing
u013 FTM 23 F dongjing

Steps to create dynamic partitions:

1) Create an ordinary table and load the data

create  table  if not exists  tb_user(
uid string ,
name  string ,
age int ,
gender string ,
address string 
)
row format delimited fields  terminated by  " " ;
load  data local  inpath "file path on Linux"  into table  tb_user ;

2) Create the partitioned table

create  table  if not exists  tb_p_user(
uid string ,
name  string ,
age int ,
gender string ,
address string 
)
partitioned  by (addr string)
row format delimited fields  terminated by  " " ;

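3) Enable dynamic partitioning. Note that these two settings must be run again every time the client is restarted

set hive.exec.dynamic.partition=true ;
set hive.exec.dynamic.partition.mode=nonstrict;  -- nonstrict lets every partition column be dynamic, so data can be loaded from an ordinary table

4) Load the data dynamically

The ordinary table has 5 columns
The partitioned table has 5 regular columns and 1 partition column
When inserting, the number and types of the columns must match, and the last column is the partition column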

insert into tb_p_user partition(addr)
select uid , name , age , gender , address , address from  tb_user ;
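
To see what the insert above does with the extra address column, here is a small Python model, not Hive code. Each selected row carries the partition value as its last element, and rows are grouped into addr=value buckets by that element. The function dynamic_insert is made up for illustration, and the rows are a subset of user.txt.

```python
from collections import defaultdict

# A few (uid, name, age, gender, address) rows from user.txt
USER_ROWS = [
    ("u001", "ZSS", 23, "M", "beijing"),
    ("u002", "YMM", 33, "F", "nanjing"),
    ("u006", "CL", 23, "M", "dongjing"),
    ("u009", "YM", 23, "F", "nanjing"),
]


def dynamic_insert(select_rows):
    """Group rows by their last value, which plays the role of the partition column."""
    partitions = defaultdict(list)
    for row in select_rows:
        *columns, part_value = row
        partitions[f"addr={part_value}"].append(tuple(columns))
    return dict(partitions)


# Mirrors: select uid , name , age , gender , address , address from tb_user
selected = [row + (row[-1],) for row in USER_ROWS]

if __name__ == "__main__":
    for partition, rows in sorted(dynamic_insert(selected).items()):
        print(partition, rows)
```

One partition appears per distinct address, which is why no partition value has to be written in the partition(addr) clause.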
