The Spark job commit mechanism - why concurrent Spark updates to an ORC table fail, and how to fix it

1 Symptom

When multiple Spark jobs concurrently update the same ORC table, some of the jobs may fail and exit because temporary files they expect no longer exist. A typical error log looks like this:

org.apache.spark.SparkException: Job aborted.
Caused by: java.io.FileNotFoundException: File hdfs://kxc-cluster/user/hive/warehouse/hstest_dev.db/test_test_test/_temporary/0/task_202309041054037826600124725546762_0176_m_000002/g=2 does not exist.
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:981)

2 Root cause

The root cause is that Spark does not support concurrent updates to the same non-partitioned ORC/Parquet table, or to the same partition of a partitioned ORC/Parquet table; it does not even support concurrently updating different partitions of a partitioned ORC/Parquet table in static-partition mode. The underlying reason lies in the algorithm that implements Spark's two-phase job commit mechanism, detailed in the later sections.

3 Solutions

  • Solution 1: for partitioned tables, prefer dynamic-partition mode over static-partition mode. For example, use insert overwrite table table1 partition (part_date) select client_id, 20230911 as part_date from table0 instead of insert overwrite table table1 partition (part_date=20230911) select client_id from table0. (In dynamic mode each job gets its own independent temporary directory, under a path such as .spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66, so jobs do not conflict.)
  • Solution 2: configure Spark to use the Hive SerDe instead of the Spark built-in data source writer, i.e. set spark.sql.hive.convertInsertingPartitionedTable=false and spark.sql.hive.convertMetastoreOrc=false. (Spark then uses the Hive SerDe commit algorithm; each job gets its own independent temporary directory, under a path such as .hive-staging_hive_2023-09-08_17-35-01_497_4555303478309834157-59, so jobs do not conflict.)
  • Solution 3: configure the FileOutputCommitter not to clean up temporary directories, i.e. set the Spark parameter spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true.

The limitations of each approach:

  • Solution 1 only applies to partitioned tables.
  • Solution 2 may limit Spark/Hive interoperability: whether Spark and Hive can correctly read and write each other's data depends on whether the Spark and Hive versions are compatible (and on how certain related parameters are configured).
  • Solution 3 requires asynchronous, manual cleanup of the temporary directories; otherwise, empty directories (not empty files) accumulate under the temporary path over time.

4 Technical background - overview

Spark jobs use a two-phase commit mechanism that commits tasks and the job separately. The details:

  • When a job starts, it first creates a temporary directory ${output.dir}/_temporary/${appAttemptId} as the temporary output directory for this run, where ${output.dir} is the table's root storage path, e.g. /user/hive/warehouse/test.db/tableA;
  • When one of the job's tasks starts running, it creates a further temporary directory ${output.dir}/_temporary/${appAttemptId}/_temporary/${taskAttemptId} as that task's temporary output directory;
  • When a task finishes, the framework checks whether the task needs to be committed (with speculative execution enabled, some task attempts do not need to be committed); if so, its output files are moved from ${output.dir}/_temporary/${appAttemptId}/_temporary/${taskAttemptId} to ${output.dir}/_temporary/${appAttemptId}/${taskId};
  • After all tasks finish, the job is committed: all output files under ${output.dir}/_temporary/${appAttemptId}/ are moved to the final directory ${output.dir};
  • After the job commit, unless spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true is explicitly configured, the temporary directory is cleaned up, i.e. ${output.dir}/_temporary is deleted;
  • When inserting into a partitioned table in dynamic-partition mode, an additional staging (temporary) directory is used: ${output.dir}/.spark-staging-${jobId}/_temporary, e.g. /user/hive/warehouse/test.db/tableA/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary; this staging directory is always deleted after the job commits;
  • When mapreduce.fileoutputcommitter.algorithm.version=2 is explicitly configured, the task commit step differs slightly: the files under ${output.dir}/_temporary/${appAttemptId}/_temporary/${taskAttemptId} are moved directly to ${output.dir};
  • It is precisely these task/job commit details that make Spark unable to support concurrent updates to the same non-partitioned ORC/Parquet table or to the same partition of a partitioned ORC/Parquet table, or concurrent updates to different partitions of a partitioned ORC/Parquet table in static-partition mode.
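The commit flow above can be sketched with plain local-filesystem operations. This is a simplified stand-in for the real protocol (HDFS renames, the exact attempt/task ID formats from the logs are not reproduced; all names here are illustrative):

```python
import shutil
import tempfile
from pathlib import Path

def run_job(output_dir: Path, job: str, app_attempt_id: str = "0") -> None:
    """Simplified FileOutputCommitter v1 lifecycle for a single job."""
    # Job setup: ${output.dir}/_temporary/${appAttemptId}
    job_attempt_dir = output_dir / "_temporary" / app_attempt_id
    # Task setup: .../_temporary/${taskAttemptId}
    task_attempt_dir = job_attempt_dir / "_temporary" / f"attempt_{job}_m_000000_0"
    task_attempt_dir.mkdir(parents=True, exist_ok=True)
    (task_attempt_dir / f"part-00000-{job}.orc").write_text("rows")
    # Task commit: rename the task-attempt dir to .../${taskId}
    task_attempt_dir.rename(job_attempt_dir / f"task_{job}_m_000000")
    # Job commit: move everything under the job-attempt dir to ${output.dir}
    for task_dir in [p for p in job_attempt_dir.iterdir() if p.name != "_temporary"]:
        for f in task_dir.iterdir():
            f.rename(output_dir / f.name)
    # Cleanup: delete the SHARED ${output.dir}/_temporary tree --
    # this is the step that breaks concurrently running jobs
    shutil.rmtree(output_dir / "_temporary")

table = Path(tempfile.mkdtemp()) / "tableA"
run_job(table, "job1")
run_job(table, "job2")  # fine sequentially; run concurrently, job2's files
                        # under _temporary/0 would be deleted by job1's cleanup
print(sorted(p.name for p in table.iterdir()))
```

Run sequentially, both jobs land their part files; the conflict only appears when two jobs overlap in time, because they share the same _temporary/0 prefix.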

5 Technical background - related source code and parameters

Related source code:

    - org.apache.spark.internal.io.HadoopMapReduceCommitProtocol
    - org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
    - org.apache.spark.internal.io.FileCommitProtocol
    - org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
    - org.apache.hadoop.mapreduce.OutputCommitter
    - org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
    - org.apache.spark.sql.execution.datasources.FileFormatWriter
    - org.apache.spark.sql.hive.execution.SaveAsHiveFile
    

    • Related parameters:
    spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: this parameter defaults to 1 on Hadoop 2.x and to 2 on Hadoop 3.x;
    spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped
    spark.sql.sources.outputCommitterClass
    spark.sql.sources.commitProtocolClass
    mapreduce.fileoutputcommitter.algorithm.version 
    mapreduce.fileoutputcommitter.cleanup.skipped
    mapreduce.fileoutputcommitter.cleanup-failures.ignored
    mapreduce.fileoutputcommitter.marksuccessfuljobs
    mapreduce.fileoutputcommitter.task.cleanup.enabled
    mapred.committer.job.setup.cleanup.needed/mapreduce.job.committer.setup.cleanup.needed 
    

6 Technical background - Spark concurrently inserting into a non-partitioned table

    • Temporary file generated while the job/tasks run: /user/hive/warehouse/test_liming.db/table1/_temporary/0/_temporary/attempt_202309080930006805722025796783378_0038_m_000000_158/part-00000-a1e1410f-6ca1-4d8b-92b6-78883c9e9a22-c000.zlib.orc
    • File generated after task commit: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/_temporary/0/task_202309080928591897349793317265177_0025_m_000000
    • File generated after job commit: /user/hive/warehouse/test_liming.db/table1/part-00000-8448b8b5-01b1-4348-8f91-5d3acd682f81-c000.zlib.orc
    • Screenshot during execution:

    • Key logs:
    Key log - a successful task:
    23/09/08 09:26:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
    23/09/08 09:26:29 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    23/09/08 09:26:29 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
    23/09/08 09:26:30 INFO HadoopShimsPre2_7: Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
    23/09/08 09:26:30 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_test_test1/.hive-staging_hive_2023-09-08_09-26-26_158_278404270035841685-3/-ext-10000/_temporary/0/_temporary/attempt_202309080926277463773081868267263_0002_m_000000_2/part-00000-6c45455c-0201-4ad8-9459-fa8b77f37d0e-c000 with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
    23/09/08 09:26:30 INFO WriterImpl: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_test_test1/.hive-staging_hive_2023-09-08_09-26-26_158_278404270035841685-3/-ext-10000/_temporary/0/_temporary/attempt_202309080926277463773081868267263_0002_m_000000_2/part-00000-6c45455c-0201-4ad8-9459-fa8b77f37d0e-c000 with stripeSize: 67108864 options: Compress: ZLIB buffer: 262144
    23/09/08 09:26:49 INFO FileOutputCommitter: Saved output of task 'attempt_202309080926277463773081868267263_0002_m_000000_2' to hdfs://nameservice1/user/hive/warehouse/test_test_test1/.hive-staging_hive_2023-09-08_09-26-26_158_278404270035841685-3/-ext-10000/_temporary/0/task_202309080926277463773081868267263_0002_m_000000
    23/09/08 09:26:49 INFO SparkHadoopMapRedUtil: attempt_202309080926277463773081868267263_0002_m_000000_2: Committed. Elapsed time: 13 ms.
    23/09/08 09:26:49 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 2541 bytes result sent to driver
    
    Key log - a failed task:
    23/09/08 10:22:02 WARN DataStreamer: DataStreamer Exception
    java.io.FileNotFoundException: File does not exist: /user/hive/warehouse/test_liming.db/table1/_temporary/0/_temporary/attempt_202309081021577566806638904497462_0003_m_000000_10/part-00000-211a80a3-2cce-4f25-8c10-bfa5ecbd421f-c000.zlib.orc (inode 21688384) Holder DFSClient_attempt_202309081021525253836824694806862_0001_m_000003_4_991233622_49 does not have any open files.
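The failure mode in the log above can be reproduced in miniature: because every job writes under the same ${output.dir}/_temporary/0 prefix, the first job's commit-time cleanup removes the second job's in-flight task directory. A local-filesystem sketch (directory names are illustrative):

```python
import shutil
import tempfile
from pathlib import Path

table = Path(tempfile.mkdtemp()) / "table1"

def task_attempt_dir(job: str) -> Path:
    # Every job shares the same ${output.dir}/_temporary/0 prefix
    return table / "_temporary" / "0" / "_temporary" / f"attempt_{job}_m_000000_0"

dir_a = task_attempt_dir("jobA")
dir_b = task_attempt_dir("jobB")
dir_a.mkdir(parents=True)
dir_b.mkdir(parents=True)
(dir_b / "part-00000.zlib.orc").write_text("jobB rows")  # jobB still writing

# jobA commits first and deletes the shared _temporary tree
shutil.rmtree(table / "_temporary")

# jobB's ORC writer / task commit now hits FileNotFoundException-style errors
print(dir_b.exists())  # False
```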
    

7 Technical background - Spark concurrently inserting into different partitions of a partitioned table in static-partition mode

    • File generated while the job/tasks run: /user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary/attempt_202309081055288611730301255924365_0005_m_000000_20/g=22/part-00000-88afa539-25ba-4b1d-bd6d-df445863dd8d.c000.zlib.orc
    • File generated after task commit: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/task_202309081054408080671360087873016_0001_m_000000
    • Final file generated after job commit: /user/hive/warehouse/test_liming.db/table1_pt/g=22/part-00000-0732dc56-ae0f-4c32-8347-012870ad7ab1.c000.zlib.orc
    • Screenshot during execution:

    • Key logs:
    Key log - a successful task:
    23/09/08 10:54:48 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
    23/09/08 10:54:48 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    23/09/08 10:54:48 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
    23/09/08 10:54:48 INFO CodeGenerator: Code generated in 25.249011 ms
    23/09/08 10:54:48 INFO CodeGenerator: Code generated in 14.669298 ms
    23/09/08 10:54:48 INFO CodeGenerator: Code generated in 37.39972 ms
    23/09/08 10:54:48 INFO HadoopShimsPre2_7: Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
    23/09/08 10:54:48 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary/attempt_202309081054408080671360087873016_0001_m_000000_1/g=20/part-00000-92811aeb-309c-4c23-acdd-b8286feadcd4.c000.zlib.orc with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
    23/09/08 10:54:48 INFO WriterImpl: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary/attempt_202309081054408080671360087873016_0001_m_000000_1/g=20/part-00000-92811aeb-309c-4c23-acdd-b8286feadcd4.c000.zlib.orc with stripeSize: 67108864 options: Compress: ZLIB buffer: 262144
    23/09/08 10:55:03 INFO FileOutputCommitter: Saved output of task 'attempt_202309081054408080671360087873016_0001_m_000000_1' to hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/task_202309081054408080671360087873016_0001_m_000000
    23/09/08 10:55:03 INFO SparkHadoopMapRedUtil: attempt_202309081054408080671360087873016_0001_m_000000_1: Committed. Elapsed time: 9 ms.
    23/09/08 10:55:03 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 3255 bytes result sent to driver
    
    Key log - job commit error:
    23/09/08 10:55:22 ERROR FileFormatWriter: Aborting job 966601b8-2679-4dc3-86a1-cebc34d9b8c9.
    java.io.FileNotFoundException: File hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0 does not exist.
    Key log - task commit error:
    23/09/08 10:55:43 WARN DataStreamer: DataStreamer Exception
    java.io.FileNotFoundException: File does not exist: /user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary/attempt_202309081055288611730301255924365_0005_m_000000_20/g=22/part-00000-88afa539-25ba-4b1d-bd6d-df445863dd8d.c000.zlib.orc (inode 21689816) Holder DFSClient_NONMAPREDUCE_2024885185_46 does not have any open files.
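The paths above show why static-partition mode does not help: the partition directory (g=22) only appears under the task-attempt directory, so two jobs targeting different partitions still share the table-level _temporary/0 tree. A path-construction sketch with illustrative job names:

```python
from pathlib import PurePosixPath

table = PurePosixPath("/user/hive/warehouse/test_liming.db/table1_pt")

def static_task_file(job: str, partition: str) -> PurePosixPath:
    # static-partition mode: the partition dir sits UNDER the shared _temporary/0
    return (table / "_temporary" / "0" / "_temporary"
            / f"attempt_{job}_m_000000_0" / partition / "part-00000.zlib.orc")

p1 = static_task_file("jobA", "g=21")
p2 = static_task_file("jobB", "g=22")

# longest common path prefix of the two write locations
common = []
for a, b in zip(p1.parts, p2.parts):
    if a != b:
        break
    common.append(a)
shared = str(PurePosixPath(*common))
print(shared)  # /user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary
```

Both jobs write under the same _temporary/0 subtree even though their final partitions differ, so one job's cleanup can delete the other's in-flight output.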
    

8 Technical background - Spark inserting into different partitions of a partitioned table in dynamic-partition mode

    • File generated while the job/tasks run: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121348303587356551291178_0001_m_000002_3/g=23/part-00002-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc
    • File generated after task commit: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/task_202309081121348303587356551291178_0001_m_000002
    • File generated after job commit: /user/hive/warehouse/test_liming.db/table1_pt/g=23/part-00002-2587b707-7675-4547-8ffb-63e2114d1c9b.c000.zlib.orc
    • Screenshot during execution:

    • Key logs:
    Key logs - all tasks and all jobs succeed:
    23/09/08 11:21:45 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
    23/09/08 11:21:45 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
    23/09/08 11:21:45 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    23/09/08 11:21:45 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    23/09/08 11:21:45 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
    23/09/08 11:21:45 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
    23/09/08 11:21:45 INFO CodeGenerator: Code generated in 44.80136 ms
    23/09/08 11:21:45 INFO CodeGenerator: Code generated in 16.168217 ms
    23/09/08 11:21:45 INFO CodeGenerator: Code generated in 53.060559 ms
    23/09/08 11:21:45 INFO HadoopShimsPre2_7: Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
    23/09/08 11:21:45 INFO HadoopShimsPre2_7: Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
    23/09/08 11:21:45 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121348303587356551291178_0001_m_000002_3/g=23/part-00002-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
    23/09/08 11:21:45 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121344786622643944446305_0001_m_000000_1/g=21/part-00000-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
    23/09/08 11:21:45 INFO WriterImpl: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121344786622643944446305_0001_m_000000_1/g=21/part-00000-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc with stripeSize: 67108864 options: Compress: ZLIB buffer: 262144
    23/09/08 11:21:45 INFO WriterImpl: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121348303587356551291178_0001_m_000002_3/g=23/part-00002-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc with stripeSize: 67108864 options: Compress: ZLIB buffer: 262144
    23/09/08 11:22:04 INFO FileOutputCommitter: Saved output of task 'attempt_202309081121344786622643944446305_0001_m_000000_1' to hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/task_202309081121344786622643944446305_0001_m_000000
    23/09/08 11:22:04 INFO SparkHadoopMapRedUtil: attempt_202309081121344786622643944446305_0001_m_000000_1: Committed. Elapsed time: 18 ms.
    23/09/08 11:22:04 INFO FileOutputCommitter: Saved output of task 'attempt_202309081121348303587356551291178_0001_m_000002_3' to hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/task_202309081121348303587356551291178_0001_m_000002
    23/09/08 11:22:04 INFO SparkHadoopMapRedUtil: attempt_202309081121348303587356551291178_0001_m_000002_3: Committed. Elapsed time: 9 ms.
    23/09/08 11:22:04 INFO Executor: Finished task 2.0 in stage 1.0 (TID 3). 3470 bytes result sent to driver
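By contrast, in dynamic-partition mode each job stages its output under its own .spark-staging-${jobId} directory, so concurrent jobs never touch the same temporary tree. A sketch mirroring the UUID-based staging names seen in the logs (names are illustrative):

```python
import uuid
from pathlib import PurePosixPath

table = PurePosixPath("/user/hive/warehouse/test_liming.db/table1_pt")

def dynamic_staging_dir(job_id: str) -> PurePosixPath:
    # dynamic-partition mode: per-job staging root, deleted after job commit
    return table / f".spark-staging-{job_id}" / "_temporary" / "0"

job_a = dynamic_staging_dir(str(uuid.uuid4()))
job_b = dynamic_staging_dir(str(uuid.uuid4()))

# the staging roots are disjoint, so one job's cleanup cannot delete
# another job's in-flight files
print(job_a.parts[6] != job_b.parts[6])  # True
```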
    

9 Technical background - multiple Spark jobs inserting into different partitions of a partitioned table, some in dynamic-partition and some in static-partition mode

    • In testing, no errors occur as long as the number of jobs inserting in static-partition mode does not exceed two (there can be any number of jobs inserting in dynamic-partition mode).
    • Screenshot during execution:

10 Technical background - configuring Spark to use the Hive SerDe instead of the Spark built-in data source writer

    • Configure Spark to use the Hive SerDe instead of the Spark built-in data source writer, i.e. set spark.sql.hive.convertInsertingPartitionedTable=false and spark.sql.hive.convertMetastoreOrc=false (these can be configured in kyuubi-defaults.conf or spark-defaults.conf, and can also be set at the user/session level). Then test non-partitioned tables, partitioned tables in static-partition mode, and partitioned tables in dynamic-partition mode separately.
    • File generated while the job/tasks run: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/.hive-staging_hive_2023-09-08_17-35-01_497_4555303478309834157-59/-ext-10000/_temporary/0/_temporary/attempt_20230908173501912656469073753420_0059_m_000000_59/part-00000-6d83cb93-228e-4717-bf77-83e36c10cbe8-c000
    • File generated after task commit: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/.hive-staging_hive_2023-09-08_17-34-58_485_4893366407663162793-58/-ext-10000/_temporary/0/task_202309081734587917020092949673358_0058_m_000000
    • File generated after job commit: /user/hive/warehouse/test_liming.db/table1_pt/g=23/part-00002-6efd7b7b-9a44-410a-b15d-1c5ee49a523f.c000
    • Key logs:
    Key logs - all jobs and tasks succeed:
    23/09/08 17:35:01 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
    23/09/08 17:35:01 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    23/09/08 17:35:01 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
    23/09/08 17:35:01 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/.hive-staging_hive_2023-09-08_17-35-01_497_4555303478309834157-59/-ext-10000/_temporary/0/_temporary/attempt_20230908173501912656469073753420_0059_m_000000_59/part-00000-6d83cb93-228e-4717-bf77-83e36c10cbe8-c000 with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
    
    23/09/08 17:35:02 INFO FileOutputCommitter: Saved output of task 'attempt_202309081734587917020092949673358_0058_m_000000_58' to hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/.hive-staging_hive_2023-09-08_17-34-58_485_4893366407663162793-58/-ext-10000/_temporary/0/task_202309081734587917020092949673358_0058_m_000000
    23/09/08 17:35:02 INFO SparkHadoopMapRedUtil: attempt_202309081734587917020092949673358_0058_m_000000_58: Committed. Elapsed time: 4 ms.
    23/09/08 17:35:02 INFO Executor: Finished task 0.0 in stage 58.0 (TID 58). 2498 bytes result sent to driver
    
    23/09/08 17:35:42 INFO FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.hive-staging_hive_2023-09-08_17-35-42_083_5954793858553566623-61
    23/09/08 17:35:42 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
    23/09/08 17:35:42 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    23/09/08 17:35:42 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
    23/09/08 17:49:00 INFO Hive: New loading path = hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.hive-staging_hive_2023-09-08_17-48-20_910_862844915956183505-137/-ext-10000/g=21 with partSpec {g=21}
    23/09/08 17:49:00 INFO Hive: New loading path = hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.hive-staging_hive_2023-09-08_17-48-20_910_862844915956183505-137/-ext-10000/g=22 with partSpec {g=22}
    23/09/08 17:49:00 INFO Hive: New loading path = hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.hive-staging_hive_2023-09-08_17-48-20_910_862844915956183505-137/-ext-10000/g=23 with partSpec {g=23}
    23/09/08 17:49:00 INFO TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=21/part-00000-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000' to trash at: hdfs://nameservice1/user/hive/.Trash/Current/user/hive/warehouse/test_liming.db/table1_pt/g=21/part-00000-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000
    23/09/08 17:49:00 INFO FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=21
    23/09/08 17:49:00 INFO TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=23/part-00002-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000' to trash at: hdfs://nameservice1/user/hive/.Trash/Current/user/hive/warehouse/test_liming.db/table1_pt/g=23/part-00002-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000
    23/09/08 17:49:00 INFO FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=23
    23/09/08 17:49:01 INFO TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=22/part-00001-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000' to trash at: hdfs://nameservice1/user/hive/.Trash/Current/user/hive/warehouse/test_liming.db/table1_pt/g=22/part-00001-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000
    23/09/08 17:49:01 INFO FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=22
    23/09/08 17:49:01 INFO Hive: Loaded 3 partitions
    
    • Screenshot during execution - non-partitioned table:

    • Screenshot during execution - static-partition mode:

    • Screenshot during execution - dynamic-partition mode:

11 Technical background - configuring Spark not to clean up temporary directories

    • Configure Spark not to clean up the temporary directories produced during job execution, i.e. set the Spark parameter spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true. Then test non-partitioned tables, partitioned tables in static-partition mode, and partitioned tables in dynamic-partition mode separately.
    • Note that with this setting, _temporary directories are left behind after jobs finish and need to be cleaned up manually and asynchronously.
    • Screenshot during execution - non-partitioned table:

    • Screenshot during execution - partitioned table, static partitions:

    • Screenshot during execution - partitioned table, dynamic partitions. During execution, /user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-7809b23e-e675-42f4-93fd-97e6467ed5e4/_temporary/0 is generated, but at the end of execution /user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-7809b23e-e675-42f4-93fd-97e6467ed5e4/ is cleaned up, leaving only:
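The leftover _temporary directories this setting accumulates can be swept asynchronously. A minimal local-filesystem sketch of such a sweeper (the warehouse path and layout are illustrative; on a real cluster you would do the equivalent with hdfs dfs -rm -r or the Hadoop FileSystem API, and only after verifying no job is still writing):

```python
import shutil
import tempfile
from pathlib import Path

def sweep_temporary_dirs(warehouse_root: Path) -> list[str]:
    """Remove leftover _temporary trees under every table directory."""
    removed = []
    # materialize the listing first so we never walk a tree we just deleted;
    # sorted() puts a parent _temporary before any nested one
    for tmp in sorted(warehouse_root.rglob("_temporary")):
        if tmp.is_dir():  # skips dirs nested inside an already-removed tree
            shutil.rmtree(tmp)
            removed.append(str(tmp))
    return removed

# demo on a throwaway local directory standing in for the warehouse
root = Path(tempfile.mkdtemp())
(root / "test.db" / "table1" / "_temporary" / "0").mkdir(parents=True)
(root / "test.db" / "table1" / "part-00000.orc").write_text("rows")

removed = sweep_temporary_dirs(root)
print(len(removed), (root / "test.db" / "table1" / "part-00000.orc").exists())
```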
