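Background:
Our Flink installation defaults to jobmanager.memory.process.size 2G and taskmanager.memory.process.size 4G. Our current jobs handle only a small amount of data and cluster resources are limited, so we need to reduce the resources allocated to each Flink job.
How the problem arose:
After changing the JobManager to 1G and the TaskManager to 2G, job submission failed immediately. The log shows: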
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.configuration.IllegalConfigurationException: Sum of configured Framework Heap Memory (128.000mb (134217728 bytes)), Framework Off-Heap Memory (128.000mb (134217728 bytes)), Task Off-Heap Memory (1024.000mb (1073741824 bytes)), Managed Memory (634.880mb (665719939 bytes)) and Network Memory (158.720mb (166429984 bytes)) exceed configured Total Flink Memory (1.550gb (1664299824 bytes)).
Cause:
The useful part of the log is: Sum of configured Framework Heap Memory, Framework Off-Heap Memory, Task Off-Heap Memory, Managed Memory and Network Memory exceed configured Total Flink Memory.
For reference, the Flink memory model exposes the following options for the overall memory size:
Component | TaskManager option | JobManager option |
Total Flink memory | taskmanager.memory.flink.size | jobmanager.memory.flink.size |
Total process memory | taskmanager.memory.process.size | jobmanager.memory.process.size |
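The analysis shows that the problem arises because only jobmanager.memory.process.size and taskmanager.memory.process.size were reduced (now at the 1G and 2G scale), while the other parameters kept their default values, which are sized for the 2G and 4G scale. They were not scaled down proportionally, so they no longer match these two parameters, and the values they configure add up to more than the total I set.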
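The mismatch can be reproduced with a little arithmetic. The following is a rough sketch, not Flink's actual implementation: it assumes a 256 MiB JVM metaspace and a JVM overhead of 10% of the process size (both are assumptions, although they agree with the 1.550gb Total Flink Memory in the log), and it ignores the min/max clamps on overhead and network memory. The function name `task_heap_remaining` is made up for illustration.

```python
def task_heap_remaining(process_mib, task_off_heap_mib=1024):
    """Return the MiB left over for task heap.

    A negative result means the configured components exceed
    Total Flink Memory, which is the condition reported in the exception.
    """
    # Assumed values, not confirmed by the source configuration:
    metaspace = 256                    # JVM metaspace, MiB
    jvm_overhead = 0.1 * process_mib   # jvm-overhead.fraction = 0.1

    total_flink = process_mib - metaspace - jvm_overhead

    framework_heap = 128               # from the error message
    framework_off_heap = 128           # from the error message
    managed = 0.4 * total_flink        # managed.fraction = 0.4
    network = 0.1 * total_flink        # network.memory.fraction = 0.1

    used = (framework_heap + framework_off_heap
            + task_off_heap_mib + managed + network)
    return total_flink - used


# Compare the original 4G TaskManager with the reduced 2G one,
# keeping task off-heap at its configured 1 GiB in both cases.
for process_mib in (4096, 2048):
    left = task_heap_remaining(process_mib)
    print(f"process size {process_mib} MiB -> task heap left {left:.1f} MiB")
```

Under these assumptions the 4G setup still leaves a few hundred MiB for task heap, while the 2G setup comes out several hundred MiB short. The fraction-based components shrink with the process size, but the fixed 1 GiB task off-heap does not.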
Solution:
Method 1: Edit flink-conf.yaml and comment out every setting that begins with jobmanager or taskmanager, except the two set above. For example:
jobmanager.memory.process.size: 1GB
# jobmanager.out-err-to-log: true
# jobmanager.web.403-redirect-url: https://192.168.31.155:28443/web/pages/error/403.html
# jobmanager.web.404-redirect-url: https://192.168.31.155:28443/web/pages/error/404.html
# jobmanager.web.415-redirect-url: https://192.168.31.155:28443/web/pages/error/415.html
# jobmanager.web.500-redirect-url: https://192.168.31.155:28443/web/pages/error/500.html
# jobmanager.web.access-control-allow-origin: *
# jobmanager.web.accesslog.enable: false
# jobmanager.web.allow-access-address: *
# jobmanager.web.backpressure.cleanup-interval: 600000
# jobmanager.web.backpressure.delay-between-samples: 50
# jobmanager.web.backpressure.num-samples: 100
# jobmanager.web.backpressure.refresh-interval: 60000
# jobmanager.web.checkpoints.disable: false
# jobmanager.web.checkpoints.history: 10
# jobmanager.web.expires-time: 0
# jobmanager.web.history: 5
# taskmanager.debug.memory.log-interval: 5000
# taskmanager.debug.memory.log: false
# taskmanager.initial-registration-pause: 500 ms
# taskmanager.max-registration-pause: 30 s
# taskmanager.maxRegistrationDuration: 5 min
# taskmanager.memory.jvm-overhead.fraction: 0.1
# taskmanager.memory.jvm-overhead.max: 10GB
# taskmanager.memory.managed.fraction: 0.4
taskmanager.memory.process.size: 2GB
# taskmanager.memory.segment-size: 32768
# taskmanager.memory.task.off-heap.size: 1GB
# taskmanager.network.detailed-metrics: false
# taskmanager.network.memory.buffers-per-channel: 2
# taskmanager.network.memory.floating-buffers-per-gate: 8
# taskmanager.network.memory.fraction: 0.1
# taskmanager.network.memory.max: 5GB
# taskmanager.network.memory.min: 64mb
# taskmanager.network.netty.client.connectTimeoutSec: 300
# taskmanager.network.netty.client.numThreads: -1
# taskmanager.network.netty.num-arenas: -1
# taskmanager.network.netty.sendReceiveBufferSize: 4096
# taskmanager.network.netty.server.backlog: 0
# taskmanager.network.netty.server.numThreads: -1
# taskmanager.network.netty.transport: nio
# taskmanager.network.numberOfBuffers: 2048
# taskmanager.network.request-backoff.initial: 100
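Method 2: In addition to those two parameters, change one more, taskmanager.memory.task.off-heap.size, and scale it down by the same proportion (1G by default in our configuration, changed to 512MB).
Halving the other parameters that are absolute sizes would also work (ratio-based parameters need no change). I only halved the task off-heap size, after which the job submitted and ran normally, so I did not try adjusting the others.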
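To see how much room Method 2 leaves, the byte counts from the exception can be reused. This is only a sketch based on the logged values; it assumes Managed Memory and Network Memory stay the same once the off-heap size changes, which should hold because they are fractions of Total Flink Memory. The variable names are illustrative.

```python
MIB = 1024 * 1024

# Values copied from the IllegalConfigurationException message.
total_flink_memory = 1664299824   # Total Flink Memory
other_components = (
    134217728      # Framework Heap Memory
    + 134217728    # Framework Off-Heap Memory
    + 665719939    # Managed Memory
    + 166429984    # Network Memory
)

# Largest task off-heap size that still fits inside Total Flink Memory.
max_task_off_heap = total_flink_memory - other_components

# Method 2 sets task off-heap to 512 MiB; the remainder goes to task heap.
new_task_off_heap = 512 * MIB
task_heap_left = max_task_off_heap - new_task_off_heap

print(f"largest task off-heap that fits: {max_task_off_heap / MIB:.1f} MiB")
print(f"task heap left at 512 MiB off-heap: {task_heap_left / MIB:.1f} MiB")
```

By this estimate 512MB fits, but only a few tens of MiB remain for task heap. For jobs that need more heap, it may be worth lowering the off-heap size further or reducing the other absolute-size parameters as well.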