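According to what I found online, this error is caused by insufficient resources on the Hadoop cluster, and in most cases specifically by a container's virtual memory usage exceeding its limit.
Looking into it, the Flink application does hold a large amount of cached data, while the taskmanager was configured with only 2 GB of memory. After the job has run for a while, an error like the following appears: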
Diagnostics: Container [pid=6386,containerID=container_1521277661809_0006_01_000001] is running beyond virtual memory limits. Current usage: 250.5 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1521277661809_0006_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 6386 6384 6386 6386 (bash) 0 0 108625920 331 /bin/bash -c /usr/local/jdk/bin/java -Xmx424m -Dlog.file=/usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.log -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner 1> /usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.out 2> /usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.err
|- 6401 6386 6386 6386 (java) 388 72 2287009792 63800 /usr/local/jdk/bin/java -Xmx424m -Dlog.file=/usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.log -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
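The numbers in the diagnostics line can be reproduced from the process-tree dump. The sketch below is only an illustration, under these assumptions: the 2.1 GB limit is the container's 1 GB of physical memory multiplied by `yarn.nodemanager.vmem-pmem-ratio` (whose default is 2.1), pages are 4 KiB, and MB/GB in the message are 1024-based. The class and method names are hypothetical.

```java
import java.util.Locale;

public class YarnVmemCheck {
    // Default value of yarn.nodemanager.vmem-pmem-ratio.
    static final double DEFAULT_VMEM_PMEM_RATIO = 2.1;
    // Assumption: the node uses 4 KiB memory pages.
    static final long PAGE_SIZE_BYTES = 4096L;

    // Virtual memory limit = physical memory granted to the container * ratio.
    static long vmemLimitBytes(long containerPmemBytes, double ratio) {
        return (long) (containerPmemBytes * ratio);
    }

    // Adds up one column of the process-tree dump.
    static long sum(long... values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    static boolean exceedsLimit(long usedBytes, long limitBytes) {
        return usedBytes > limitBytes;
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        long mib = 1024L * 1024;

        // VMEM_USAGE(BYTES) of the bash and java processes in the dump.
        long vmemUsed = sum(108625920L, 2287009792L);
        // RSSMEM_USAGE(PAGES) of the same two processes.
        long rssBytes = sum(331L, 63800L) * PAGE_SIZE_BYTES;
        // The container was granted 1 GB of physical memory.
        long limit = vmemLimitBytes(gib, DEFAULT_VMEM_PMEM_RATIO);

        System.out.printf(Locale.ROOT, "physical used: %.1f MB%n", (double) rssBytes / mib);
        System.out.printf(Locale.ROOT, "virtual used:  %.1f GB%n", (double) vmemUsed / gib);
        System.out.printf(Locale.ROOT, "virtual limit: %.1f GB%n", (double) limit / gib);
        System.out.println("over limit:    " + exceedsLimit(vmemUsed, limit));
    }
}
```

Under these assumptions the figures come out as 250.5 MB physical, 2.2 GB virtual and a 2.1 GB limit, matching the log. Physical usage is far below 1 GB; only the virtual memory check is what kills the container.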
Online sources suggest two fixes (1 and 2); a third option, RocksDB, is covered in 3:
1. Edit yarn-site.xml to turn off Hadoop's virtual memory check:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
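2. Increase the taskmanager memory
I went with option 2. The likely drawback is that the problem will come back as the data volume grows. I have not tried option 1; turning the check off could mask other problems, so I am not considering it for now.
3. Use RocksDB
Start with the class comment in RocksDBStateBackend.java: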
/**
* A State Backend that stores its state in {@code RocksDB}. This state backend can
* store very large state that exceeds memory and spills to disk.
*
* <p>All key/value state (including windows) is stored in the key/value index of RocksDB.
* For persistence against loss of machines, checkpoints take a snapshot of the
* RocksDB database, and persist that snapshot in a file system (by default) or
* another configurable state backend.
*
* <p>The behavior of the RocksDB instances can be parametrized by setting RocksDB Options
* using the methods {@link #setPredefinedOptions(PredefinedOptions)} and
* {@link #setRocksDBOptions(RocksDBOptionsFactory)}.
*/
public class RocksDBStateBackend extends AbstractStateBackend implements ConfigurableStateBackend {
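RocksDB keeps its data in off-heap (native) memory, so it is not bounded by the JVM heap size, at the cost of some performance. Native memory still counts toward the YARN container's memory usage, though.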
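A minimal sketch of switching a job to this backend, assuming the `flink-statebackend-rocksdb` dependency is on the classpath and a Flink version that still ships the `RocksDBStateBackend` class quoted above. The checkpoint URI, class name, job name and checkpoint interval are placeholders, not values from this setup.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBackendSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder checkpoint location; the second argument turns on
        // incremental checkpoints.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("hdfs:///flink/checkpoints", true);
        env.setStateBackend(backend);

        // Placeholder interval: take a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000L);

        // Stand-in pipeline so the sketch is a complete job.
        env.fromElements(1, 2, 3).print();

        env.execute("rocksdb-backend-sketch");
    }
}
```

With this backend the keyed state lives in RocksDB instead of on the JVM heap, so state that outgrows memory spills to local disk rather than inflating the heap.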