Elasticsearch 7.3: fixing high heap usage and frequent "Data too large" circuit breaker trips

三苦

Article link: https://blog.csdn.net/avenger19/article/details/120435993

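This problem is common on ES 7.3. I could not find any Chinese-language material about it. In severe cases it can even cause data nodes to drop out of the cluster.
There is a 2019 thread about it on the official ES forum:
CircuitBreakingException: [parent] Data too large IN ES 7.x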

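This circuit breaker is a self-protection mechanism that ES uses to avoid running into an OOM, so seeing it in the logs is not a serious problem in itself. If it trips too often, however, something is wrong: either the cluster has outgrown the workload, or the configuration is off.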
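To make the mechanism concrete, here is a minimal Java sketch of the kind of check a heap-based parent breaker performs: it compares current heap usage plus the estimated size of an incoming request against a limit. This is not Elasticsearch's implementation. The class and method names are hypothetical, and the 95% limit is an assumption chosen to match the "heap above 95%" figure in the commit message quoted further down.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Illustrative only: not the Elasticsearch circuit breaker code. */
final class ParentBreakerSketch {

    /** Returns true if accepting the request would push heap usage past the limit. */
    static boolean wouldTrip(long usedBytes, long maxBytes, long requestBytes, int limitPercent) {
        long limitBytes = maxBytes / 100 * limitPercent; // divide first to avoid overflow
        return usedBytes + requestBytes > limitBytes;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM reports no defined maximum
        if (max <= 0) {
            System.out.println("Max heap size is undefined on this JVM");
            return;
        }
        long request = 1024L * 1024L; // pretend a 1 MiB request arrives
        boolean trip = wouldTrip(heap.getUsed(), max, request, 95); // 95 is an assumed limit
        System.out.println("used=" + heap.getUsed() + " max=" + max + " wouldTrip=" + trip);
    }
}
```

Because a check like this looks at actual heap occupancy, garbage that has not yet been collected counts against the limit, which is why the GC settings matter for this error.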
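In the forum thread mentioned above, Henning Andersen's proposed fix is to change the GC configuration. He also states that the default in 7.3, with IHOP (InitiatingHeapOccupancyPercent) set to 75, was a "mistake".
We found Andersen's commit for this issue on GitHub:
Fix G1 GC default IHOP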

G1 GC were setup to use an InitiatingHeapOccupancyPercent of 75. This
could leave used memory at a very high level for an extended duration,
triggering the real memory circuit breaker even at low activity levels.
The value is a threshold for old generation usage relative to total heap
size and thus it should leave room for the new generation. Default in
G1 is to allow up to 60 percent for new generation and this could mean that the
threshold was effectively at 135% heap usage. GC would still kick in of course and
eventually enough mixed collections would take place such that adaptive adjustment
of IHOP kicks in.
The JVM has adaptive setting of the IHOP, but this does not kick in
until it has sampled a few collections. A newly started, relatively
quiet server with primarily new generation activity could thus
experience heap above 95% frequently for a duration.
The changes here are two-fold:

1. Use 30% default for IHOP (the JVM default of 45 could still mean
   105% heap usage threshold and did not fully ensure not to hit the
   circuit breaker with low activity)
2. Set G1ReservePercent=25. This is used by the adaptive IHOP mechanism,
   meaning old/mixed GC should kick in no later than at 75% heap. This
   ensures IHOP stays compatible with the real memory circuit breaker also
   after being adjusted by adaptive IHOP.

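According to his explanation, G1 GC by default allows the new generation to take up to 60% of the heap. IHOP is a threshold on old generation usage relative to the total heap, so with IHOP at 75 the effective threshold can be as high as 135% of the heap, which means the concurrent cycle may not start before the heap is nearly full. The JVM can adjust IHOP adaptively, but only after it has sampled enough collections, so in theory the adjustment eventually takes effect. (Based on what we observed, we are skeptical: the breaker was still tripping at high frequency on node processes that had been running for months.)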
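The arithmetic behind the 135% and 105% figures is simple addition, sketched below in Java. The class and method names are made up for illustration; the 60% new generation bound and the three IHOP values are the ones quoted in the commit message.

```java
/** Worked arithmetic for the IHOP headroom argument in the commit message. */
final class IhopHeadroom {

    /** Worst-case heap occupancy, in percent of total heap, before old-generation marking starts. */
    static int worstCaseHeapPercent(int ihopPercent, int maxNewGenPercent) {
        return ihopPercent + maxNewGenPercent;
    }

    public static void main(String[] args) {
        int maxNewGen = 60; // G1 default upper bound for the new generation, per the commit message
        for (int ihop : new int[] {75, 45, 30}) {
            System.out.println("IHOP=" + ihop + "% -> up to "
                    + worstCaseHeapPercent(ihop, maxNewGen) + "% of heap");
        }
    }
}
```

With IHOP at 75 the worst case is 135%, with the JVM default of 45 it is 105%, and only with 30 does it drop to 90%, which is below the 95% level the commit message mentions.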

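The settings from this commit stayed in place from 7.4 through 7.10. Starting with 7.11 they were removed from the default jvm.options (setting them explicitly is still supported), and an automatic tuning mechanism was added in the code instead: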
（elasticsearch/distribution/tools/launchers/src/main/java/org/elasticsearch/tools/launchers/JvmErgonomics.java）

    /**
     * Chooses additional JVM options for Elasticsearch.
     *
     * @param userDefinedJvmOptions A list of JVM options that have been defined by the user.
     * @return A list of additional JVM options to set.
     */
    static List<String> choose(final List<String> userDefinedJvmOptions) throws InterruptedException, IOException {
        final List<String> ergonomicChoices = new ArrayList<>();
        final Map<String, JvmOption> finalJvmOptions = JvmOption.findFinalOptions(userDefinedJvmOptions);
        final long heapSize = JvmOption.extractMaxHeapSize(finalJvmOptions);
        final long maxDirectMemorySize = JvmOption.extractMaxDirectMemorySize(finalJvmOptions);
        if (maxDirectMemorySize == 0) {
            ergonomicChoices.add("-XX:MaxDirectMemorySize=" + heapSize / 2);
        }

        final boolean tuneG1GCForSmallHeap = tuneG1GCForSmallHeap(heapSize);
        final boolean tuneG1GCHeapRegion = tuneG1GCHeapRegion(finalJvmOptions, tuneG1GCForSmallHeap);
        final boolean tuneG1GCInitiatingHeapOccupancyPercent = tuneG1GCInitiatingHeapOccupancyPercent(finalJvmOptions);
        final int tuneG1GCReservePercent = tuneG1GCReservePercent(finalJvmOptions, tuneG1GCForSmallHeap);

        if (tuneG1GCHeapRegion) {
            ergonomicChoices.add("-XX:G1HeapRegionSize=4m");
        }
        if (tuneG1GCInitiatingHeapOccupancyPercent) {
            ergonomicChoices.add("-XX:InitiatingHeapOccupancyPercent=30");
        }
        if (tuneG1GCReservePercent != 0) {
            ergonomicChoices.add("-XX:G1ReservePercent=" + tuneG1GCReservePercent);
        }

        return ergonomicChoices;
    }

    static int tuneG1GCReservePercent(final Map<String, JvmOption> finalJvmOptions, final boolean tuneG1GCForSmallHeap) {
        JvmOption g1GC = finalJvmOptions.get("UseG1GC");
        JvmOption g1GCReservePercent = finalJvmOptions.get("G1ReservePercent");
        if (g1GC.getMandatoryValue().equals("true")) {
            if (g1GCReservePercent.isCommandLineOrigin() == false && tuneG1GCForSmallHeap) {
                return 15;
            } else if (g1GCReservePercent.isCommandLineOrigin() == false && tuneG1GCForSmallHeap == false) {
                return 25;
            }
        }
        return 0;
    }
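The decision logic in the excerpt can be condensed into a standalone sketch. This is a simplification for illustration, not the Elasticsearch source: the class name and the string-prefix check for user-defined options are hypothetical, the small-heap decision is passed in as a parameter because the excerpt does not show its cutoff, and gating the IHOP choice on G1 and on the absence of a user setting is an assumption made by analogy with tuneG1GCReservePercent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Standalone simplification of the excerpt above; not the Elasticsearch source. */
final class G1ErgonomicsSketch {

    /** True if the user passed -XX:<name>=... explicitly (simplified string match). */
    static boolean userSet(List<String> userOptions, String name) {
        String prefix = "-XX:" + name + "=";
        for (String option : userOptions) {
            if (option.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Picks G1 defaults only where the user has not set a value. */
    static List<String> choose(List<String> userOptions, boolean useG1, boolean smallHeap) {
        List<String> choices = new ArrayList<>();
        if (!useG1) {
            return choices; // nothing to tune for other collectors
        }
        if (!userSet(userOptions, "InitiatingHeapOccupancyPercent")) {
            choices.add("-XX:InitiatingHeapOccupancyPercent=30"); // assumed gating, see lead-in
        }
        if (!userSet(userOptions, "G1ReservePercent")) {
            choices.add("-XX:G1ReservePercent=" + (smallHeap ? 15 : 25));
        }
        return choices;
    }

    public static void main(String[] args) {
        System.out.println(choose(Arrays.asList("-Xms4g", "-Xmx4g"), true, false));
        System.out.println(choose(Arrays.asList("-XX:G1ReservePercent=10"), true, false));
    }
}
```

The point to take away is that explicit user settings win: a default is only added when the corresponding option was not given on the command line or in jvm.options.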

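As we can see, the defaults still largely match what Andersen changed at the time (apart from the added case of a 15% reserve), so these settings can be considered reliable and battle-tested.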

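So the simplest and most direct fix is:

Upgrade to 7.4 or later

Alternatively, you can:

Stop using G1 GC and switch back to CMS

(If you install the rpm package of 7.3.1 currently published by Elastic, the node already starts with CMS GC by default. If you are curious, compare the jvm.options file on the 7.3 branch on GitHub with the /etc/elasticsearch/jvm.options file produced by an rpm install.)
Or follow Andersen's approach:

Set G1ReservePercent=25 and InitiatingHeapOccupancyPercent=30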
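As a sketch, the corresponding lines in jvm.options might look like the fragment below. It assumes the node is already meant to run on G1 and that no conflicting collector flags (such as the CMS ones) remain enabled in the same file; the file location depends on how ES was installed, for example /etc/elasticsearch/jvm.options for the rpm install mentioned above. The node has to be restarted for the change to take effect.

```
## G1 settings with the values from Andersen's commit
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
```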

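Of course, all of the fixes above rest on one premise: the capacity planning of the ES cluster is sound and the cluster can carry the planned workload. Otherwise none of this will help.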
