Small-file governance in Hive: three ways to merge small files

*星星之火*

Last modified 2022-03-23 18:09:46


Categories: data governance, hive. Tags: hdfs

First published 2022-03-23 15:08:43

Article link: https://blog.csdn.net/spark_dev/article/details/123686277


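This article covers two ways to handle large numbers of small files in Hive, the concatenate method and the insert overwrite method, with the advantages and drawbacks of each, and shows how to drop the date column with the insert overwrite select * trick.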


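Preface

A Hive partition often holds many small files: a single partition may contain 1,000 files of only 10 KB each, and a data warehouse can be full of such partitions.
Too many small files consume HDFS resources (every file costs NameNode metadata) and inflate the number of tasks that MR and Spark jobs launch.
To deal with this, the small files need to be merged.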
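
Before merging, it can help to confirm that a partition really is fragmented. A minimal sketch, assuming a partitioned table named tablename with a partition column dt (the placeholder names used in section 1): the first statement refreshes the partition statistics, and the second prints the partition metadata, where numFiles and totalSize appear among the partition parameters.

```sql
-- Refresh basic statistics (file count, total size) for one partition
analyze table tablename partition(dt=20201224) compute statistics;

-- Print partition metadata; look for numFiles and totalSize
-- in the "Partition Parameters" section of the output
describe formatted tablename partition(dt=20201224);
```

Running the same describe statement again after a merge shows whether the file count actually dropped.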

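1. The concatenate method

-- For a non-partitioned table
alter table tablename concatenate;
-- For a partitioned table
alter table tablename partition(dt=20201224) concatenate;

Advantage: easy to use.
Disadvantage: the concatenate command only supports the RCFILE and ORC file formats, and it may have to be run several times before the files are merged into one.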

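2. The insert overwrite method

insert overwrite table tableName partition(dt=2022031100)
select
  column1, column2
from
  tableName
where dt=2022031100;

Disadvantage: the select list has to be written out by hand. With select *, the result also carries the dt column, so it cannot be written into the target partition.

Advantage: works with all file formats, not only RCFILE and ORC.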
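
Whether the rewrite actually produces fewer files depends on how many tasks write the output and on Hive's merge settings. The following is a sketch, not the author's configuration: the property names are standard Hive options, while the values are illustrative assumptions that should be tuned for each cluster.

```sql
-- Merge small output files at the end of map-only jobs
set hive.merge.mapfiles=true;
-- Merge small output files at the end of map-reduce jobs
set hive.merge.mapredfiles=true;
-- Target size of each merged file, in bytes (assumed value, about 256 MB)
set hive.merge.size.per.task=256000000;
-- Run the extra merge step when the average output file is smaller
-- than this many bytes (assumed value, about 128 MB)
set hive.merge.smallfiles.avgsize=128000000;

-- Rewrite the partition onto itself so the merge step can compact it
insert overwrite table tableName partition(dt=2022031100)
select column1, column2
from tableName
where dt=2022031100;
```

On the Tez or Spark execution engines, the corresponding switches are hive.merge.tezfiles and hive.merge.sparkfiles.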

3. Using insert overwrite with select *

To drop one column from select *:
insert overwrite table tableA select `(name)?+.+` from test;

hive> set hive.cli.print.header=true;
hive> select * from test;
hook status=true,operation=QUERY
OK
name    friends children        address
songsong        ["bingbing","lili"]     {"xiao song":18,"xiaoxiao song":19}     {"street":"hui long guan","city":"beijing"}
yangyang        ["caicai","susu"]       {"xiao yang":18,"xiaoxiao yang":19}     {"street":"chao yang","city":"beijing"}
Time taken: 0.14 seconds, Fetched: 2 row(s)

Dropping the address column from select *:


hive> select `(address)?+.+` from test;
hook status=true,operation=QUERY
OK
name    friends children
songsong        ["bingbing","lili"]     {"xiao song":18,"xiaoxiao song":19}
yangyang        ["caicai","susu"]       {"xiao yang":18,"xiaoxiao yang":19}
Time taken: 0.144 seconds, Fetched: 2 row(s)

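The same approach removes the date column of a partitioned table.

Note: for this syntax to take effect, a setting must be applied first.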
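
The note above does not name the setting. According to the Hive documentation, a backquoted regular expression in a select list is only interpreted as a column pattern when hive.support.quoted.identifiers is set to none (Hive 0.13 and later). A sketch of the complete merge for the partition used in section 2, assuming the partition column is named dt:

```sql
-- Let backquoted names in the select list be read as Java regular expressions
set hive.support.quoted.identifiers=none;

-- Rewrite the partition onto itself.
-- `(dt)?+.+` matches every column name except exactly "dt",
-- so the partition column is left out of the select list.
insert overwrite table tableName partition(dt=2022031100)
select `(dt)?+.+`
from tableName
where dt=2022031100;
```

The setting is session-scoped when issued with set, so it has to be repeated in every session or script that relies on the regex column syntax.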