Our service uses rsyslog to tail local Redis and memcached slow-log files and push them to an ELK cluster for analysis. Each log goes to a new file per day, and at 00:05 every night a cron job repoints a fixed-name symlink at the new day's file. Recently we noticed a large number of logs going missing, and the investigation uncovered a nasty pitfall. First, the current setup:
The rsyslog configuration, which watches the fixed-name files:
local6.* @r-log.cms.sina.cn:1514
main_queue(
queue.workerthreads="10" # threads to work on the queue
queue.dequeueBatchSize="1000" # max number of messages to process at once
queue.size="50000" # max queue size
)
module(load="imfile")
module(load="omkafka")
$PreserveFQDN on
############################ php curl log #####################################
input(type="imfile"
File="/data1/ms/log/php_common/php_slow_log"
Tag="tag1"
Facility="local5"
freshStartTail="on"
deleteStateOnFileDelete="off"
reopenOnTruncate="on" ## after log rotation truncates the file rsyslog would stop reading it; this option makes it reopen the file
)
local5.* @r-log.cms.sina.cn:1515
########################## redis log ########################################
$template redislog7215,"%$myhostname%`redis7215`%msg%"
ruleset(name="redislog7215") {
action(
broker=["10.13.88.190:9092","10.13.88.191:9092","10.13.88.192:9092","10.13.88.193:9092"]
type="omkafka"
topic="redis-log"
template="redislog7215"
partitions.auto="on"
)
}
input(type="imfile"
File="/data1/ms/log/front/redis7215.log"
Tag=""
ruleset="redislog7215"
freshStartTail="on"
reopenOnTruncate="on"
)
$template redislog7216,"%$myhostname%`redis7216`%msg%"
ruleset(name="redislog7216") {
action(
broker=["10.13.88.190:9092","10.13.88.191:9092","10.13.88.192:9092","10.13.88.193:9092"]
type="omkafka"
topic="redis-log"
template="redislog7216"
partitions.auto="on"
)
}
input(type="imfile"
File="/data1/ms/log/front/redis7216.log"
Tag=""
ruleset="redislog7216"
freshStartTail="on"
reopenOnTruncate="on"
)
$template redislog7242,"%$myhostname%`redis7242`%msg%"
ruleset(name="redislog7242") {
action(
broker=["10.13.88.190:9092","10.13.88.191:9092","10.13.88.192:9092","10.13.88.193:9092"]
type="omkafka"
topic="redis-log"
template="redislog7242"
partitions.auto="on"
)
}
input(type="imfile"
File="/data1/ms/log/front/redis7242.log"
Tag=""
ruleset="redislog7242"
freshStartTail="on"
reopenOnTruncate="on"
)
$template redislog7243,"%$myhostname%`redis7243`%msg%"
ruleset(name="redislog7243") {
action(
broker=["10.13.88.190:9092","10.13.88.191:9092","10.13.88.192:9092","10.13.88.193:9092"]
type="omkafka"
topic="redis-log"
template="redislog7243"
partitions.auto="on"
)
}
input(type="imfile"
File="/data1/ms/log/front/redis7243.log"
Tag=""
ruleset="redislog7243"
freshStartTail="on"
reopenOnTruncate="on"
)
#################### mc log ###############################
$template mc40016,"%$myhostname%`mc40016`%msg%"
ruleset(name="mc40016") {
action(
broker=["10.13.88.190:9092","10.13.88.191:9092","10.13.88.192:9092","10.13.88.193:9092"]
type="omkafka"
topic="cms-front-mc"
template="mc40016"
partitions.auto="on"
)
}
input(type="imfile"
File="/data1/ms/log/front/memcached_get_err_40016.log"
Tag=""
ruleset="mc40016"
freshStartTail="on"
reopenOnTruncate="on"
)
$template mc40023,"%$myhostname%`mc40023`%msg%"
ruleset(name="mc40023") {
action(
broker=["10.13.88.190:9092","10.13.88.191:9092","10.13.88.192:9092","10.13.88.193:9092"]
type="omkafka"
topic="cms-front-mc"
template="mc40023"
partitions.auto="on"
)
}
input(type="imfile"
File="/data1/ms/log/front/memcached_get_err_40023.log"
Tag=""
ruleset="mc40023"
freshStartTail="on"
reopenOnTruncate="on"
)
At five minutes past midnight a cron job switches the symlinks. The crontab entry and the script are as follows:
5 0 * * * root sh /usr/local/script/elkslowlog.sh &> /dev/null
#!/bin/bash
DATE=`date +%F`
DATE2=`date +%Y%m%d`
ln -sf /data1/ms/log/php_common/curl-$DATE /data1/ms/log/php_common/php_slow_log
ln -sf /data1/ms/log/front/redis7215.$DATE2.log /data1/ms/log/front/redis7215.log
ln -sf /data1/ms/log/front/redis7216.$DATE2.log /data1/ms/log/front/redis7216.log
ln -sf /data1/ms/log/front/redis7242.$DATE2.log /data1/ms/log/front/redis7242.log
ln -sf /data1/ms/log/front/redis7243.$DATE2.log /data1/ms/log/front/redis7243.log
ln -sf /data1/ms/log/front/memcached_get_err_40023.$DATE2.log /data1/ms/log/front/memcached_get_err_40023.log
ln -sf /data1/ms/log/front/memcached_get_err_40016.$DATE2.log /data1/ms/log/front/memcached_get_err_40016.log
Problem analysis: at first glance nothing looks wrong. But the curl php_slow_log, collected the same way, never lost anything, while the Redis and memcached logs that did lose entries share one trait: very low volume, sometimes only one line in half an hour. Chasing that difference led to the key insight: at 00:05, when the symlink is created, the day's log file may not exist yet, so what gets created is a dangling symlink. When the log file is later actually written, rsyslog does not detect that the symlink's target has appeared, so the entire day's logs are lost. A test confirmed exactly this behavior (details omitted here).
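The failure mode can be reproduced in a scratch directory (the paths below are illustrative, not the production ones): a symlink created before its target exists is dangling, and only starts resolving once the real file appears.

```shell
# Reproduce the 00:05 race in /tmp instead of the real log directory.
mkdir -p /tmp/symlink-demo && cd /tmp/symlink-demo
rm -f today.log "app.$(date +%Y%m%d).log"

# The cron job runs before the application has written today's file,
# so the symlink it creates points at nothing.
ln -sf "app.$(date +%Y%m%d).log" today.log
[ -e today.log ] && echo "target exists" || echo "dangling symlink"   # prints "dangling symlink"

# Later the application creates the real file; the link now resolves,
# but a reader that already gave up on the dangling link never notices.
echo "slow query 10ms" > "app.$(date +%Y%m%d).log"
[ -e today.log ] && echo "target exists now"                          # prints "target exists now"
```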
Solution: modify the script so that, before switching the symlink, it ensures the day's log file exists; if it does not, write an empty line with echo to create it (the empty line does not affect collection), and only then repoint the symlink. The modified script:
#!/bin/bash
DATE=`date +%F`
DATE2=`date +%Y%m%d`
echo >> /data1/ms/log/php_common/curl-$DATE && ln -sf /data1/ms/log/php_common/curl-$DATE /data1/ms/log/php_common/php_slow_log
echo >> /data1/ms/log/front/redis7215.$DATE2.log && ln -sf /data1/ms/log/front/redis7215.$DATE2.log /data1/ms/log/front/redis7215.log
echo >> /data1/ms/log/front/redis7216.$DATE2.log && ln -sf /data1/ms/log/front/redis7216.$DATE2.log /data1/ms/log/front/redis7216.log
echo >> /data1/ms/log/front/redis7242.$DATE2.log && ln -sf /data1/ms/log/front/redis7242.$DATE2.log /data1/ms/log/front/redis7242.log
echo >> /data1/ms/log/front/redis7243.$DATE2.log && ln -sf /data1/ms/log/front/redis7243.$DATE2.log /data1/ms/log/front/redis7243.log
echo >> /data1/ms/log/front/memcached_get_err_40023.$DATE2.log && ln -sf /data1/ms/log/front/memcached_get_err_40023.$DATE2.log /data1/ms/log/front/memcached_get_err_40023.log
echo >> /data1/ms/log/front/memcached_get_err_40016.$DATE2.log && ln -sf /data1/ms/log/front/memcached_get_err_40016.$DATE2.log /data1/ms/log/front/memcached_get_err_40016.log
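The repetitive lines above could also be collapsed into a loop; a sketch, using a scratch directory for BASE (in production it would be the real log directories) and a hypothetical `rotate` helper:

```shell
#!/bin/bash
# Loop-based sketch of the fixed rotation script. BASE points at a
# temporary directory here so the sketch is safe to run anywhere.
BASE=$(mktemp -d)
DATE2=$(date +%Y%m%d)

rotate() {
    local target=$1 link=$2
    # Create the dated file if the application has not written it yet
    # (an empty line does not affect collection), then repoint the symlink.
    echo >> "$target" && ln -sf "$target" "$link"
}

for port in 7215 7216 7242 7243; do
    rotate "$BASE/redis$port.$DATE2.log" "$BASE/redis$port.log"
done
for port in 40016 40023; do
    rotate "$BASE/memcached_get_err_$port.$DATE2.log" "$BASE/memcached_get_err_$port.log"
done
```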
With that, the problem was fully solved.
Last edited 2019-07-12 by 矢量比特.