Dynamic partition insert can potentially be a resource hog: map tasks may create a large number of partition directories and files in a short time (the total file count can far exceed the number of distinct values in the partition columns). To protect yourself, Hive defines three parameters (a usage sketch follows the list):
- hive.exec.max.dynamic.partitions.pernode (default value 100) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than the threshold, a fatal error is raised from that mapper/reducer (through a counter) and the whole job is killed.
- hive.exec.max.dynamic.partitions (default value 1000) is the total number of dynamic partitions that can be created by one DML statement. If no single mapper or reducer exceeds the per-node limit but the total number of dynamic partitions does, an exception is raised at the end of the job, before the intermediate data are moved to the final destination.
- hive.exec.max.created.files (default value 100000) is the maximum total number of files that can be created by all mappers and reducers combined.
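Here is a minimal sketch of how these limits might be raised for a job that legitimately needs many partitions. The table and column names (`orders_part`, `orders_staging`, `dt`, `province`) are hypothetical, and the SET values are illustrative examples, not recommendations:

```sql
-- Hypothetical example: raise the dynamic-partition guard rails before an
-- insert that is expected to create many partitions.
SET hive.exec.dynamic.partition=true;              -- enable dynamic partitioning
SET hive.exec.dynamic.partition.mode=nonstrict;    -- allow all partition columns to be dynamic
SET hive.exec.max.dynamic.partitions.pernode=200;  -- per mapper/reducer (default 100)
SET hive.exec.max.dynamic.partitions=2000;         -- per DML statement (default 1000)
SET hive.exec.max.created.files=100000;            -- total files across all tasks (default 100000)

-- dt and province are filled in dynamically from the last two SELECT columns.
INSERT OVERWRITE TABLE orders_part PARTITION (dt, province)
SELECT order_id, amount, dt, province
FROM orders_staging;
```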

This article covered the resource-consumption risk Hive faces when inserting into dynamically partitioned tables, and how to bound it with the three parameters `hive.exec.max.dynamic.partitions.pernode`, `hive.exec.max.dynamic.partitions`, and `hive.exec.max.created.files`. When a job fails because too many dynamic partitions are created, adding `distribute by` on the partition columns routes all rows of the same partition to the same reducer, so each task creates far fewer partitions (see the sketch below). In practice, when the case is simple, such as data partitioned directly by date and province, the `distribute by` optimization is unnecessary.
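A minimal sketch of the `distribute by` fix, reusing the hypothetical table names from the example above: clustering rows by the partition columns means each reducer writes only its own partitions, instead of every mapper opening files for every partition it happens to see.

```sql
-- Hypothetical example: shuffle rows by the partition columns so that each
-- reducer receives, and therefore creates, only a small set of partitions.
INSERT OVERWRITE TABLE orders_part PARTITION (dt, province)
SELECT order_id, amount, dt, province
FROM orders_staging
DISTRIBUTE BY dt, province;
```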
