Hive's CONCATENATE statement merges small files. The syntax is:
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;
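To make the two forms of the syntax concrete, here is a minimal sketch that builds the statement text for a whole table and for a single partition. The helper name `concatenate_statement`, the table `logs`, and the partition column `dt` are hypothetical and used only for illustration. The sketch does no quoting or escaping beyond wrapping values in single quotes.

```python
def concatenate_statement(table, partition=None):
    """Build an ALTER TABLE ... CONCATENATE statement as a string.

    `partition` is an optional dict of partition column -> value.
    Values are assumed to be plain strings that need no escaping.
    """
    if not partition:
        return f"ALTER TABLE {table} CONCATENATE;"
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    return f"ALTER TABLE {table} PARTITION ({spec}) CONCATENATE;"


# Whole-table form, as used in the tests in this post.
print(concatenate_statement("t7"))
# Single-partition form on a hypothetical partitioned table.
print(concatenate_statement("logs", {"dt": "2022-03-08"}))
```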
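Transactional tables
On a transactional table, CONCATENATE enqueues a major compaction and waits for it to finish. The test below shows this:
- Create the table and insert data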
create table t7 (c1 int) stored as orc tblproperties('transactional'='true');
insert into t7 values(1);
insert into t7 values(2);
insert into t7 values(3);
- List the files
drwxr-xr-x - houzhizhen supergroup 0 2022-03-08 17:10 hdfs://localhost:9000/user/hive/warehouse/test.db/t7/delta_0000001_0000001_0000
drwxr-xr-x - houzhizhen supergroup 0 2022-03-08 17:11 hdfs://localhost:9000/user/hive/warehouse/test.db/t7/delta_0000002_0000002_0000
drwxr-xr-x - houzhizhen supergroup 0 2022-03-08 17:11 hdfs://localhost:9000/user/hive/warehouse/test.db/t7/delta_0000003_0000003_0000
- Run concatenate
The output shows that a compaction is enqueued.
hive> alter table t7 concatenate;
Compaction enqueued with id 803
.
Compaction with id 803 finished with status: succeeded
OK
Time taken: 14.086 seconds
- Files after compaction
Only one data file (bucket_00000) remains, alongside two ACID metadata files.
Found 3 items
-rw-r--r-- 1 houzhizhen supergroup 48 2022-03-08 17:12 hdfs://localhost:9000/user/hive/warehouse/test.db/t7/base_0000003/_metadata_acid
-rw-r--r-- 1 houzhizhen supergroup 1 2022-03-08 17:12 hdfs://localhost:9000/user/hive/warehouse/test.db/t7/base_0000003/_orc_acid_version
-rw-r--r-- 1 houzhizhen supergroup 616 2022-03-08 17:12 hdfs://localhost:9000/user/hive/warehouse/test.db/t7/base_0000003/bucket_00000
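The directory names in the two listings follow Hive's ACID layout: delta_<minWriteId>_<maxWriteId>_<statementId> for deltas and base_<writeId> for the compacted base. The following is a minimal sketch, assuming only that naming convention, which parses the delta names from the test and derives the base directory name a major compaction over them would be expected to produce. The function names are hypothetical, and the sketch ignores any extra suffix that other Hive versions may append to the base directory name.

```python
import re

# delta_<minWriteId>_<maxWriteId>[_<statementId>]
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)(?:_(\d+))?$")


def parse_delta(name):
    """Return (min_write_id, max_write_id, statement_id or None)."""
    match = DELTA_RE.match(name)
    if match is None:
        raise ValueError(f"not a delta directory name: {name!r}")
    min_id, max_id, stmt_id = match.groups()
    return int(min_id), int(max_id), None if stmt_id is None else int(stmt_id)


def expected_base(delta_names):
    """Name of the base directory covering all the given deltas."""
    highest = max(parse_delta(name)[1] for name in delta_names)
    # Zero-pad to 7 digits, matching the listing in this post.
    return f"base_{highest:07d}"


deltas = [
    "delta_0000001_0000001_0000",
    "delta_0000002_0000002_0000",
    "delta_0000003_0000003_0000",
]
print(expected_base(deltas))  # prints base_0000003
```

The result matches the base_0000003 directory in the listing above: the three single-write deltas are folded into one base whose ID is the highest write ID among them.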
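Non-transactional tables
On a non-transactional table, CONCATENATE submits a merge job to the execution engine and waits for it to finish. The test below shows this:
- Create the table and insert data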
create table t7 (c1 int) stored as orc;
insert into t7 values(1);
insert into t7 values(2);
insert into t7 values(3);
- List the files
-rw-r--r-- 1 houzhizhen supergroup 188 2022-03-08 17:14 hdfs://localhost:9000/user/hive/warehouse/test.db/t7/000000_0
-rw-r--r-- 1 houzhizhen supergroup 188 2022-03-08 17:14 hdfs://localhost:9000/user/hive/warehouse/test.db/t7/000000_0_copy_1
-rw-r--r-- 1 houzhizhen supergroup 188 2022-03-08 17:14 hdfs://localhost:9000/user/hive/warehouse/test.db/t7/000000_0_copy_2
- Run concatenate
The output shows that a job with a single File Merge vertex is launched.
hive> alter table t7 concatenate;
2022-03-08 17:15:06 Running Dag: dag_1646730822691_0001_4
2022-03-08 17:15:06 Running Dag: dag_1646730822691_0001_4
Status: Running (Executing on YARN cluster with App id application_1646730822691_0001)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
File Merge container INITIALIZING -1 0 0 -1 0 0
----------------------------------------------------------------------------------------------
VERTICES: 00/01 [>>--------------------------] 0% ELAPSED TIME: 0.00 s
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
File Merge ..... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 0.24 s
----------------------------------------------------------------------------------------------
Status: DAG finished successfully in 0.24 seconds
Query Execution Summary
----------------------------------------------------------------------------------------------
OPERATION DURATION
----------------------------------------------------------------------------------------------
Compile Query 0.07s
Prepare Plan 0.03s
Get Query Coordinator (AM) 0.00s
Submit Plan 0.05s
Start DAG 0.03s
Run DAG 0.24s
----------------------------------------------------------------------------------------------
Task Execution Summary
----------------------------------------------------------------------------------------------
VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
File Merge 0.00 0 0 0 0
----------------------------------------------------------------------------------------------
Loading data to table test.t7
Table test.t7 stats: [numFiles=1, numRows=3, totalSize=361, rawDataSize=12]
OK
Time taken: 0.672 seconds
- Files after the merge
Only one data file remains.
-rw-r--r-- 1 houzhizhen supergroup 361 2022-03-08 17:15 hdfs://localhost:9000/user/hive/warehouse/test.db/t7/000000_0
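To quantify the effect of the merge, the sketch below sums the file count and byte sizes from the non-transactional table's listings before and after concatenate. It assumes each line has the eight whitespace-separated fields seen in this post (permissions, replication, owner, group, size, date, time, path). The function name `summarize` is hypothetical.

```python
def summarize(listing):
    """Return (file_count, total_bytes) for `hdfs dfs -ls` style lines."""
    count = 0
    total = 0
    for line in listing.strip().splitlines():
        fields = line.split()
        # Skip directories and anything that is not an 8-field entry,
        # such as a "Found N items" header.
        if len(fields) != 8 or fields[0].startswith("d"):
            continue
        count += 1
        total += int(fields[4])  # the fifth field is the size in bytes
    return count, total


base = "hdfs://localhost:9000/user/hive/warehouse/test.db/t7"
before = f"""
-rw-r--r-- 1 houzhizhen supergroup 188 2022-03-08 17:14 {base}/000000_0
-rw-r--r-- 1 houzhizhen supergroup 188 2022-03-08 17:14 {base}/000000_0_copy_1
-rw-r--r-- 1 houzhizhen supergroup 188 2022-03-08 17:14 {base}/000000_0_copy_2
"""
after = f"""
-rw-r--r-- 1 houzhizhen supergroup 361 2022-03-08 17:15 {base}/000000_0
"""

print(summarize(before))  # prints (3, 564)
print(summarize(after))   # prints (1, 361)
```

In this test the three 188-byte files (564 bytes in total) become one 361-byte file. This agrees with the numFiles=1 and totalSize=361 statistics printed by Hive, and is presumably because per-file ORC overhead is paid once rather than three times.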