1. Preface
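2. Spotting the Problem
- Every time the IoTDB service had been running for three or four hours, it crashed.
- The server's memory usage showed that a chunk of memory was freed at the moment the IoTDB service crashed.
- I checked the log_datanode_all.log log but found nothing useful.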
2024-07-30 00:50:28,685 [pool-37-IoTDB-ClientRPC-Processor-1] ERROR o.a.t.s.TThreadPoolServer$WorkerProcess:258 - Thrift Error occurred during processing of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection timed out (Read failed)
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:178)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109)
at org.apache.iotdb.rpc.TElasticFramedTransport.readFrame(TElasticFramedTransport.java:122)
at org.apache.iotdb.rpc.TElasticFramedTransport.read(TElasticFramedTransport.java:117)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:463)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:361)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:244)
at org.apache.iotdb.db.protocol.thrift.ProcessorWithMetrics.process(ProcessorWithMetrics.java:50)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.net.SocketException: Connection timed out (Read failed)
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:176)
... 12 common frames omitted
2024-07-30 03:05:15,533 [nioEventLoopGroup-3-2] INFO i.m.broker.MQTTConnection:316 - Client didn't supply any credentials and MQTT anonymous mode is disabled. CId=CENSYS
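When a log like the one above is long, a small script can tally entries per level and confirm whether OutOfMemoryError is mentioned at all. The following is only a sketch: the header pattern is inferred from the two entries shown in this excerpt, and the function name summarize is made up for illustration.

```python
import re
from collections import Counter

# Header layout inferred from the excerpt above (an assumption):
# "2024-07-30 00:50:28,685 [thread] LEVEL logger:line - message"
HEADER = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \[[^\]]+\]\s+(\w+)\s+\S+ - "
)


def summarize(lines):
    """Count entries per log level and collect lines mentioning OutOfMemoryError."""
    levels = Counter()
    oom_lines = []
    for line in lines:
        match = HEADER.match(line)
        if match:
            levels[match.group(1)] += 1
        if "OutOfMemoryError" in line:
            oom_lines.append(line.rstrip("\n"))
    return levels, oom_lines


# Shortened lines taken from the excerpt above.
sample = [
    "2024-07-30 00:50:28,685 [pool-37-IoTDB-ClientRPC-Processor-1] ERROR "
    "o.a.t.s.TThreadPoolServer$WorkerProcess:258 - Thrift Error occurred "
    "during processing of message.",
    "at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109)",
    "2024-07-30 03:05:15,533 [nioEventLoopGroup-3-2] INFO "
    "i.m.broker.MQTTConnection:316 - Client didn't supply any credentials "
    "and MQTT anonymous mode is disabled. CId=CENSYS",
]
levels, oom_lines = summarize(sample)
print(dict(levels))    # {'ERROR': 1, 'INFO': 1}
print(len(oom_lines))  # 0
```

For the real file, passing an open file handle to summarize works the same way, since it only iterates over lines.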
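3. Analyzing the Problem
- Nothing much had turned up so far, so the next step was to add more logging and reproduce the problem.
- In datanode-env.sh, enable the IOTDB_JMX_OPTS setting so that a heap dump is written when an OOM occurs.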
# if you want to dump the heap memory while OOM happening, you can use the following command, remember to replace /tmp/heapdump.hprof with your own file path and the folder where this file is located needs to be created in advance
IOTDB_JMX_OPTS="$IOTDB_JMX_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/datanode_heapdump.hprof"
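The comment in the config notes that the dump folder has to exist in advance. Before waiting hours for the next crash, it may be worth checking the target first. The sketch below assumes a heap dump can grow to roughly the size of the heap, and dump_target_ok is a hypothetical helper, not part of IoTDB. Run it as the same user that runs the DataNode, otherwise the writability check says little.

```python
import os
import shutil


def dump_target_ok(dump_path, heap_bytes):
    """Return (ok, reason) for a planned -XX:HeapDumpPath target."""
    folder = os.path.dirname(os.path.abspath(dump_path))
    if not os.path.isdir(folder):
        return False, f"folder {folder} does not exist"
    if not os.access(folder, os.W_OK):
        return False, f"folder {folder} is not writable"
    free = shutil.disk_usage(folder).free
    # Assumption: the dump needs at most about as much space as the heap.
    if free < heap_bytes:
        return False, f"only {free} bytes free, heap is {heap_bytes} bytes"
    return True, "ok"


# Path from the setting above; 512 MiB matches the MEMORY_SIZE used in this post.
print(dump_target_ok("/tmp/datanode_heapdump.hprof", 512 * 1024 * 1024))
```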
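4. Reproducing the Problem
- After reproducing the problem, /tmp/datanode_heapdump.hprof contained useful information.
- There was no doubt: the crash was indeed caused by an OOM (out-of-memory) error.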
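5. Further Analysis
- The cause: in datanode-env.sh I had changed the default memory of 2G to 512M.
- As for why I set MEMORY_SIZE=512M, see: IoTDB Getting Started Tutorial, Troubleshooting Part ①: Insufficient Memory Prevents the DataNode Service from Starting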
# You can set DataNode memory size, example '2G' or '2048M'
MEMORY_SIZE=512M
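- I changed the DataNode memory size but did not change the other memory-related settings that depend on it, such as dn_thrift_max_frame_size.
- dn_thrift_max_frame_size defaults to 512M, which is as large as the entire 512M DataNode memory; add the memory that the rest of the program needs, and the DataNode's memory is easily exhausted.
- That is why the service crashed with an OOM.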
# thrift max frame size, 512MB by default
# Datatype: int
# dn_thrift_max_frame_size=536870912
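The mismatch is easy to see once both values are in bytes. This is a small arithmetic sketch: parse_size is a hypothetical helper that assumes binary units (1M = 1024 * 1024 bytes), and it compares the frame limit against MEMORY_SIZE as a whole, without modeling how the JVM divides that memory internally.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Convert a size such as '512M' or '2G' to bytes (binary units assumed)."""
    text = text.strip().upper()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


reduced_memory = parse_size("512M")  # MEMORY_SIZE after my change
default_memory = parse_size("2G")    # MEMORY_SIZE before my change
frame_limit = 536870912              # default dn_thrift_max_frame_size

print(frame_limit / reduced_memory)  # 1.0
print(frame_limit / default_memory)  # 0.25
```

With the default 2G, one maximum-size frame is a quarter of the DataNode memory; after the change to 512M, a single frame is allowed to be as large as all of it.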
6. Solving the Problem
- In iotdb-datanode.properties, set dn_thrift_max_frame_size=16777216 (16 MiB).
# thrift max frame size, 512MB by default
# Datatype: int
dn_thrift_max_frame_size=16777216
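7. Summary
- Do not change configuration parameters casually; most of them are interrelated.
- If you do change one, keep a record of the change.
- If a problem appears, you can then locate it by rolling the parameter changes back.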
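One way to keep such a record is to compare the active settings of the current properties file against a saved copy of the original. The sketch below is deliberately simple: it treats lines starting with # as comments and splits on the first =, which covers the snippets in this post but not every properties-file feature. The function names are made up for illustration.

```python
def active_settings(text):
    """Return key/value pairs from properties text, skipping blank and commented lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def changed_settings(baseline_text, current_text):
    """Map each key whose active value differs to (baseline, current); None means unset."""
    before = active_settings(baseline_text)
    after = active_settings(current_text)
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(set(before) | set(after))
        if before.get(key) != after.get(key)
    }


baseline = """# thrift max frame size, 512MB by default
# Datatype: int
# dn_thrift_max_frame_size=536870912
"""
current = """# thrift max frame size, 512MB by default
# Datatype: int
dn_thrift_max_frame_size=16777216
"""
print(changed_settings(baseline, current))
# {'dn_thrift_max_frame_size': (None, '16777216')}
```

A value of None on the baseline side means the key was commented out there, so the built-in default applied.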