现象
通过消息队列(RabbitMQ)将队列内的消息消费存入MongoDB,因为本来内容就不多所以基本应该是实时的,但是却不知道为什么会偶发消息消费推迟20~30分钟的情况,排查了RabbitMQ和MongoDB,网上没有找到相关问题的帖子,之后排查推测是 MongoDB 连接的问题
报错日志
2023-12-20 15:03:38.051 [org.springframework.amqp.rabbit.RabbitListenerEndpointContainer#3-1] WARN org.mongodb.driver.connection - Got socket exception on connection [connectionId{localValue:109, serverValue:1993}] to test-mongodb03:17017. All connections to test-mongodb03:17017 will be closed.
2023-12-20 15:03:38.051 [org.springframework.amqp.rabbit.RabbitListenerEndpointContainer#3-1] INFO org.mongodb.driver.connection - Closed connection [connectionId{localValue:109, serverValue:1993}] to test-mongodb03:17017 because there was a socket exception raised by this connection.
2023-12-20 15:03:38.052 [org.springframework.amqp.rabbit.RabbitListenerEndpointContainer#3-1] INFO org.mongodb.driver.cluster - No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=test-mongodb02:17017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[6, 0, 10]}, minWireVersion=0, maxWireVersion=17, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=39535116, setName='test', canonicalAddress=test-mongodb02:17017, hosts=[test-mongodb03:17017, test-mongodb02:17017, test-mongodb01:17017], passives=[], arbiters=[], primary='test-mongodb03:17017', tagSet=TagSet{[]}, electionId=null, setVersion=1, lastWriteDate=Wed Dec 20 15:03:29 GMT+08:00 2023, lastUpdateTimeNanos=5350705627747041}, ServerDescription{address=test-mongodb03:17017, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=test-mongodb01:17017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[6, 0, 10]}, minWireVersion=0, maxWireVersion=17, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=39518862, setName='test', canonicalAddress=test-mongodb01:17017, hosts=[test-mongodb03:17017, test-mongodb02:17017, test-mongodb01:17017], passives=[], arbiters=[], primary='test-mongodb03:17017', tagSet=TagSet{[]}, electionId=null, setVersion=1, lastWriteDate=Wed Dec 20 15:03:29 GMT+08:00 2023, lastUpdateTimeNanos=5350705837909970}]}. Waiting for 30000 ms before timing out
2023-12-20 15:03:38.091 [cluster-ClusterId{value='65797632227a450fe86e23d5', description='null'}-test-mongodb03:17017] INFO org.mongodb.driver.cluster - Discovered replica set primary test-mongodb03:17017
2023-12-20 15:03:38.091 [org.springframework.amqp.rabbit.RabbitListenerEndpointContainer#3-1] INFO org.mongodb.driver.connection - Closed connection [connectionId{localValue:110, serverValue:1992}] to test-mongodb03:17017 because there was a socket exception raised on another connection from this pool.
2023-12-20 15:03:38.091 [org.springframework.amqp.rabbit.RabbitListenerEndpointContainer#3-1] INFO org.mongodb.driver.connection - Closed connection [connectionId{localValue:111, serverValue:1994}] to test-mongodb03:17017 because there was a socket exception raised on another connection from this pool.
2023-12-20 15:03:38.362 [org.springframework.amqp.rabbit.RabbitListenerEndpointContainer#3-1] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:112, serverValue:2004}] to test-mongodb03:17017
2023-12-20 15:03:38.683 [org.springframework.amqp.rabbit.RabbitListenerEndpointContainer#3-1] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:113, serverValue:2005}] to test-mongodb03:17017
2023-12-20 15:03:40.218 [org.springframework.amqp.rabbit.RabbitListenerEndpointContainer#3-1] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:114, serverValue:2008}] to test-mongodb03:17017
解决方式
根据MongoDB官方文档
FAQ: MongoDB Diagnostics — MongoDB Manual
sysctl net.ipv4.tcp_keepalive_time
获得的值小于等于120(秒)(2分钟)会比较好,但是目前设置的是1200(秒)(20分钟)
也基本符合测试那边的问题,就是数据差不多延迟了二三十分钟(虽然官方文档中说大于5分钟的都无效,会被设置为5分钟)
修改语句
sudo sysctl -w net.ipv4.tcp_keepalive_time=120
要防止重启之后配置变回原来的还得改配置文件,这边没有权限后续就没有操作了,不过步骤写的很清楚,交给运维就行
另外根据官方文档
FAQ: MongoDB Diagnostics — MongoDB Manual
还需要修改 net.ipv4.tcp_retries2 配置,通过以下语句查询,小于等于5会比较好
sysctl net.ipv4.tcp_retries2
可以看到系统原来设置的是15,这边通过
sudo sysctl -w net.ipv4.tcp_retries2=5
进行修改,修改后可以看到值已经变成5了
要防止重启之后配置变回原来的还得改配置文件,跟上面一样是同一个文件,交给运维就行
经过修复之后,该异常已经一周并未出现,推测问题已经解决