Hive中的数据导入和导出、分区表和分桶表的使用与区别详解

最新推荐文章于 2024-01-16 19:12:32 发布

皮哥四月红

最新推荐文章于 2024-01-16 19:12:32 发布

阅读量2.6k

点赞数 3

分类专栏： Hive 文章标签：大数据 hive

本文链接：https://blog.csdn.net/weixin_43230682/article/details/108050133

版权

Hive 专栏收录该内容

21 篇文章 12 订阅

订阅专栏

一、Hive数据导入

1 直接向表中插入数据（强烈不推荐使用）

hive (myhive)> create table score3 like score;
hive (myhive)> insert into table score3 partition(month ='201807') values ('001','002','100');

2 通过load加载数据（必须掌握）

语法：

 hive> load data [local] inpath 'dataPath' [overwrite] into table student [partition (partcol1=val1,…)];

通过load方式加载数据

hive (myhive)> load data local inpath '/xsluo/install/hivedatas/score.csv' overwrite into table score partition(month='201806');

3 通过查询加载数据（必须掌握）

通过查询方式加载数据
语法；官网地址

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

例子

hive (myhive)> create table score5 like score;
hive (myhive)> insert overwrite table score5 partition(month = '201806') select s_id,c_id,s_score from score;

4 查询语句中创建表并加载数据（as select）

将查询的结果保存到一张表当中去

hive (myhive)> create table score6 as select * from score;

5 创建表时指定location

创建表，并指定在hdfs上的位置

hive (myhive)> create external table score7 (s_id string,c_id string,s_score int) row format delimited fields terminated by '\t' location '/myscore7';

上传数据到hdfs上，我们也可以直接在hive客户端下面通过dfs命令来进行操作hdfs的数据

hive (myhive)> dfs -mkdir -p /myscore7;
hive (myhive)> dfs -put /xsluo/install/hivedatas/score.csv /myscore7;

查询数据

hive (myhive)> select * from score7;

6 export导出与import 导入 hive表数据（内部表操作）

hive (myhive)> create table teacher2 like teacher;
-- 导出到hdfs路径
hive (myhive)> export table teacher to  '/xsluo/teacher';
hive (myhive)> import table teacher2 from '/xsluo/teacher';

二、Hive数据导出

1 insert 导出

表 -> 文件
官方文档
语法

INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
  SELECT ... FROM ...

将查询的结果导出到本地

insert overwrite local directory '/xsluo/install/hivedatas/stu' select * from stu;

将查询的结果格式化导出到本地

insert overwrite local directory '/xsluo/install/hivedatas/stu2' row format delimited fields terminated by ',' select * from stu;

将查询的结果导出到HDFS上==(没有local)==

insert overwrite directory '/xsluo/hivedatas/stu' row format delimited fields terminated by  ','  select * from stu;

2 Hive Shell 命令导出

基本语法：
- hive -e "sql语句" > file
- hive -f sql文件 > file
- 在linux命令行中，运行如下命令；导出myhive.stu表的数据到本地磁盘文件/xsluo/install/hivedatas/student1.txt

hive -e 'select * from myhive.stu;' > /xsluo/install/hivedatas/student1.txt

3 export导出到HDFS上

export table  myhive.stu to '/xsluo/install/hivedatas/stuexport';

三、Hive的分区表

1.1 为什么要分区？

如果hive当中所有的数据都存入到一个文件夹下面，那么在使用MR计算程序的时候，读取一整个目录下面的所有文件来进行计算，就会变得特别慢，因为数据量太大了

实际工作当中一般都是计算前一天的数据，所以我们只需要将前一天的数据挑出来放到一个文件夹下面即可，专门去计算前一天的数据。

这样就可以使用hive当中的分区表，通过分文件夹的形式，将每一天的数据都分成为一个文件夹，然后我们计算数据的时候，通过指定前一天的文件夹即可只计算前一天的数据。

在大数据中，最常用的一种思想就是分治，我们可以把大的文件切割划分成一个个的小的文件，这样每次操作一个小的文件就会很容易了，同样的道理，在hive当中也是支持这种思想的，就是我们可以把大的数据，按照每天，或者每小时进行切分成一个个的小的文件，这样去操作小的文件就会容易得多了

在文件系统上建立文件夹，把表的数据放在不同文件夹下面，加快查询速度。

创建分区表语法：

hive (myhive)> create table score(s_id string, c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

创建一个表带多个分区

hive (myhive)> create table score2 (s_id string,c_id string, s_score int) partitioned by (year string, month string, day string) row format delimited fields terminated by '\t';

加载数据到分区表当中去

hive (myhive)>load data local inpath '/xsluo/install/hivedatas/score.csv' into table score partition (month='201806');

加载数据到多分区表当中去

hive (myhive)> load data local inpath '/xsluo/install/hivedatas/score.csv' into table score2 partition(year='2018', month='06', day='01');

查看分区

hive (myhive)> show  partitions  score;

添加一个分区

hive (myhive)> alter table score add partition(month='201805');

同时添加多个分区

hive (myhive)> alter table score add partition(month='201804') partition(month = '201803');

注意：添加分区之后就可以在hdfs文件系统当中看到表下面多了一个文件夹

删除分区

hive (myhive)> alter table score drop partition(month = '201806');

1.2 外部分区表综合练习

需求描述：

现在有一个文件score.csv文件，里面有三个字段，分别是s_id string, c_id string, s_score int
字段都是使用 \t进行分割
存放在集群的这个目录下/scoredatas/day=20200808，这个文件每天都会生成，存放到对应的日期文件夹下面去
文件别人也需要公用，不能移动
请创建hive对应的表，并将数据加载到表中，进行数据统计分析，且删除表之后，数据不能删除

需求实现:

数据准备:

node03执行以下命令，将数据上传到hdfs上面去

将我们的score.csv上传到node03服务器的/xsluo/install/hivedatas目录下，然后将score.csv文件上传到HDFS的/scoredatas/day=20200808目录下

cd /xsluo/install/hivedatas/

hdfs dfs -mkdir -p /scoredatas/day=20200808

hdfs dfs -put score.csv /scoredatas/day=20200808/

1）创建外部分区表，并指定文件数据存放目录

hive (myhive)> create external table scoredatas(s_id string, c_id string, s_score int)

             > partitioned by (day string)

             > row format delimited

             > fields terminated by '\t'

             > location '/scoredatas';

2）导入数据到hive表

进行表的修复，说白了就是建立我们表与我们数据文件之间的一个关系映射。

此处，scoredatas表已经是一个外部表，而且已经指向了 hdfs 的 /scoredatas 目录，而要将 /scoredatas 下的数据导入 scoredatas 表，其实就是将 /scoredatas 下的一个天目录映射到 scoredatas 表的一个天分区，以下三种方式都可以：

方式一：msck repair table scoredatas;

方式二：load data inpath '/scoredatas/day=20200808' into table scoredatas partition(day='20200808');

方式三：alter table scoredatas add partition(day='20200808') location '/scoredatas/day=20200808';

三种方式效率上都差不多，其中load导入数据的方式最为常用，这里用msck方式演示一下：

hive (myhive)> msck repair table scoredatas;

修复成功之后即可看到数据已经全部加载到表当中去了。

四、Hive的分桶表

如下图所示：

1.1 分桶表原理

分桶是相对分区进行更细粒度的划分

Hive表或分区表可进一步的分桶
分桶将整个数据内容按照某列取hash值，对桶的个数取模的方式决定该条记录存放在哪个桶当中；具有相同hash值的数据进入到同一个文件中
比如按照name属性分为3个桶，就是对name属性值的hash值对3取摸，按照取模结果对数据分桶。
- 取模结果为==0==的数据记录存放到一个文件
- 取模结果为==1==的数据记录存放到一个文件
- 取模结果为==2==的数据记录存放到一个文件

1.2 作用

取样sampling更高效。没有分桶的话需要扫描整个数据集。
提升某些查询操作效率，例如map side join

1.3 案例演示：创建分桶表

在创建分桶表之前要执行的命令

set hive.enforce.bucketing=true; 开启对分桶表的支持
set mapreduce.job.reduces=4; 设置与桶相同的reduce个数（默认只有一个reduce）

进入hive客户端然后执行以下命令：

use myhive;
set hive.enforce.bucketing=true; 
set mapreduce.job.reduces=4;  

-- 创建分桶表
create table myhive.user_buckets_demo(id int, name string)
clustered by(id) 
into 4 buckets 
row format delimited fields terminated by '\t';

-- 创建普通表
create table user_demo(id int, name string)
row format delimited fields terminated by '\t';

准备数据文件 buckets.txt

#在linux当中执行以下命令
cd /xsluo/install/hivedatas/
vim user_bucket.txt

1   anzhulababy1
2   anzhulababy2
3   anzhulababy3
4   anzhulababy4
5   anzhulababy5
6   anzhulababy6
7   anzhulababy7
8   anzhulababy8
9   anzhulababy9
10  anzhulababy10

加载数据到普通表 user_demo 中

load data local inpath '/xsluo/install/hivedatas/user_bucket.txt' overwrite into table user_demo;

加载数据到桶表user_buckets_demo中

insert into table user_buckets_demo select * from user_demo;

hdfs上查看表的数据目录

抽样查询桶表的数据

官网地址
tablesample抽样语句语法：tablesample(bucket x out of y)
- x表示从第几个桶开始做数据采样
- y与进行采样的桶数的个数、每个采样桶的采样比例有关；

  select * from user_buckets_demo ;
  需要采样的总桶数 = 分桶数/y = 结果ret
  分两种情况
  
  情况一：ret>1
  需要采样的总桶数 = 分桶数/y = 4/2 = 2个
  即从2个桶进行数据的采样
  x = 1 先从第1个桶中取出数据
  x+y = 1+2 = 3 再从第3个桶中取出数据
  
  情况二：ret<1
  假设还是此表user_buckets_demo，分桶数是4
  x=1
  y=8
  ∴需要采样的总桶数 = 分桶数/y = 4/8 = 0.5
  ret<1，只能从1个桶进行数据的采样
  x = 1 从第1个桶中取出0.5一半的数据

五、分区表和分桶表的区别

Hive中的分区表是通过分文件夹的形式，把表的数据按照某个维度进行分目录存储在不同的分区（文件夹）下面，后期按照不同的目录查询数据，不需要进行全量扫描，提升查询效率，如下图所示：

而Hive中的分桶表则是相对分区进行更细粒度的划分，Hive表或分区表都可作进一步的分桶。分桶其实就是将整个Hive表或者分区表中的数据通过分文件的形式将数据存储在不同的桶（文件）下面，原理是按照某列取hash值，然后再通过对桶的个数取模来决定该条数据存放在哪个桶（文件）当中。对于一些大的（分区）表，没有分桶的话需要扫描整个数据集，就会变得特别慢，因为数据量太大了。在分桶之后就可以进一步提升某些查询操作效率，而且取样sampling也更高效，如下图所示：

皮哥四月红

关注

3
点赞
踩
20

收藏

觉得还不错? 一键收藏
0
评论
Hive中的数据导入和导出、分区表和分桶表的使用与区别详解

目录一、Hive数据导入二、Hive数据导出三、Hive的分区表1.1 为什么要分区？1.2 外部分区表综合练习四、Hive的分桶表1.1 分桶表原理1.2 作用1.3 案例演示：创建分桶表五、分区表和分桶表的区别一、Hive数据导入1 直接向表中插入数据（强烈不推荐使用）hive (myhive)> create table score3 like score;hive (myhive)> insert into table score3
复制链接

扫一扫