Preface
As an ops engineer I have to watch the main site's 5xx and 4xx rates every day, as well as which API is causing problems. The old approach was to use nginx's own health module to fetch the current traffic, analyze it, and write the results into InfluxDB; Grafana read from InfluxDB to chart the 5xx rate, and every minute the 5xx count was also pushed into Open-Falcon for threshold alerting. But that setup was not real time, and when a problem did occur I still had to grep through the nginx logs by hand, which was tedious.
First-version architecture
All the API logs were shipped via rsyslog to a log server, where a single Logstash instance watched the log directory, filtered out the 5xx responses, did the GeoIP lookup, and wrote the results into Elasticsearch. This worked well enough in normal operation, but whenever a burst of 5xx or 4xx errors hit, the write latency into ES grew, and at exactly that moment the fast-alerting system that reads 5xx counts and API paths from ES stopped working.
input {
  file {
    path => "/data/logs/nginx/*/One/*.log"
    codec => json
    discover_interval => "10"
    close_older => "5"
    sincedb_path => "/data/logs/.sincedb/.sincedb"
  }
}
filter {
  if [clientip] != "-" {
    geoip {
      source => "clientip"
      target => "geoip"
      database => "/usr/local/logstash-6.5.4/config/GeoIP2-City.mmdb"
      # build a [lon, lat] array for Kibana's map visualizations
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}" ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}
output {
  # only error responses get indexed; status must be a string for this match
  if [status] in ["500","502","404","401","403","499","504"] {
    elasticsearch {
      hosts => ["172.16.8.166:9200"]
      index => "logstash-nginx-%{+YYYY.MM.dd}"
    }
    stdout {
      codec => rubydebug
    }
  }
}
Second-version architecture
The front end was switched to Filebeat reading the full log stream, roughly 25 GB of logs per hour. (The stock Filebeat build could not keep up; I eventually found an optimized build online whose author measured about 25 MB/s.) Behind Filebeat sits a Kafka cluster with an 8-partition topic, consumed by 2 Logstash nodes. The chart below shows about 8.5K QPS off-peak and roughly 13K at peak, which shows Filebeat can comfortably handle the current volume. If latency ever reappears, we only need to add Logstash and ES nodes, so scaling out is easy; for now the 2 Logstash consumers keep up completely.
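To see why an 8-partition topic comfortably feeds 2 Logstash nodes each running consumer_threads => 2, here is a minimal sketch (consumer names are illustrative) of how Kafka's range assignor spreads a topic's partitions across the members of one consumer group:

```python
# Sketch of Kafka range partition assignment for the setup above:
# 8 partitions, 4 consumers (2 logstash nodes x consumer_threads => 2).
# Consumer names are illustrative, not real member IDs.

def range_assign(partitions, consumers):
    """Mimic Kafka's range assignor: split partitions evenly,
    with the first consumers taking any remainder."""
    consumers = sorted(consumers)
    n, k = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, c in enumerate(consumers):
        count = n + (1 if i < k else 0)
        assignment[c] = partitions[start:start + count]
        start += count
    return assignment

parts = list(range(8))  # the nginxAccessLog topic has 8 partitions
consumers = ["logstash1-t0", "logstash1-t1", "logstash2-t0", "logstash2-t1"]
print(range_assign(parts, consumers))
# every consumer thread owns 2 partitions, so adding a third logstash
# node simply rebalances to smaller shares with no producer-side change
```

With 8 partitions the group can grow to at most 8 active consumers before extra threads sit idle, which is why adding Logstash nodes is the natural scaling knob here.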
Filebeat configuration. I had also tried having Filebeat write directly to multiple Logstash instances, but that performed even worse than the first version.
filebeat.registry_file: "/srv/registry"
filebeat.spool_size: 25000
filebeat.publish_async: true
filebeat.queue_size: 10000
filebeat.prospectors:
- input_type: log
  paths:
    - /data/logs/nginx/*/One/*.log
  scan_frequency: 1s
  tail_files: true
  idle_timeout: 5s
  # the nginx log lines are JSON; lift the parsed fields to the event top level
  json.keys_under_root: true
  json.overwrite_keys: true
  harvester_buffer_size: 409600
  enabled: true
output.kafka:
  hosts: ["in-prod-common-uploadmonitor-1:9092"]
  topic: "nginxAccessLog"
  enabled: true
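The json.keys_under_root / json.overwrite_keys pair controls where the decoded nginx JSON ends up in the event Filebeat ships to Kafka. A minimal Python sketch of the effect (the event shape and source path are illustrative, not Filebeat's exact internal structure):

```python
import json

def decode_event(line, keys_under_root=True, overwrite_keys=True):
    """Mimic filebeat's json.* options: with keys_under_root the parsed
    fields sit at the event top level instead of under a `json` key,
    and overwrite_keys lets them replace filebeat's own fields."""
    # illustrative base event; real filebeat events carry more metadata
    event = {"source": "/data/logs/nginx/app/One/access.log", "message": line}
    parsed = json.loads(line)
    if not keys_under_root:
        event["json"] = parsed   # fields stay nested under "json"
        return event
    for k, v in parsed.items():
        if overwrite_keys or k not in event:
            event[k] = v         # promoted to top level, possibly clobbering
    return event

line = '{"clientip": "1.2.3.4", "status": "502", "message": "GET /api"}'
evt = decode_event(line)
print(evt["status"], evt["message"])  # prints: 502 GET /api
```

With both options on, Logstash's kafka input sees fields like `clientip` and `status` at the top level, which is exactly what the geoip filter and the `[status]` conditional below rely on.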
Logstash configuration
input {
  kafka {
    bootstrap_servers => "in-prod-common-uploadmonitor-1:9092"
    client_id => "logstash_nginx_log_group1"
    group_id => "logstash_nginx_log_group"
    # 2 threads per node; with 2 nodes that is 4 consumers on 8 partitions
    consumer_threads => 2
    decorate_events => true
    topics => ["nginxAccessLog"]
    codec => "json"
  }
}
filter {
  if [clientip] != "-" {
    geoip {
      source => "clientip"
      target => "geoip"
      database => "/usr/local/logstash/config/GeoIP2-City.mmdb"
      # build a [lon, lat] array for Kibana's map visualizations
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}" ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}
output {
  # only error responses get indexed; status must be a string for this match
  if [status] in ["500","502","404","401","403","499"] {
    elasticsearch {
      hosts => ["172.16.8.166:9200"]
      index => "logstash-nginx-%{+YYYY.MM.dd}"
    }
    stdout {
      codec => rubydebug
    }
  }
}
Real-time 5xx/4xx geographic distribution and per-host API dashboards in Kibana
An alerting tool pulls the latest 5xx records from ES in real time and reports the affected APIs, giving second-level alerting and greatly reducing the ops workload.
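Such a poller can be sketched as below, assuming the field names produced by the Logstash pipeline above (`status`, `@timestamp`) plus an assumed `request.keyword` field for the API path and an illustrative threshold. The resulting query dict would be POSTed to `http://172.16.8.166:9200/logstash-nginx-*/_search` every few seconds:

```python
# Sketch of a second-level 5xx alert check against the ES index above.
# `request.keyword` and the threshold of 50 are assumptions for illustration.

def build_5xx_query(window="1m"):
    """ES query counting 5xx responses in the last `window`,
    bucketed by request path to pinpoint the failing API."""
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    # status is indexed as a string by the pipeline above
                    {"terms": {"status": ["500", "502", "504"]}},
                    {"range": {"@timestamp": {"gte": f"now-{window}"}}},
                ]
            }
        },
        "aggs": {
            "by_api": {"terms": {"field": "request.keyword", "size": 10}}
        },
    }

def should_alert(total_5xx, threshold=50):
    """Fire once the per-window 5xx count crosses the threshold."""
    return total_5xx >= threshold

q = build_5xx_query("1m")
print(q["query"]["bool"]["filter"][1])  # the sliding time-window filter
print(should_alert(120), should_alert(3))
```

Because the query only aggregates over the last minute, the poller stays cheap even on a large daily index, and the `by_api` buckets tell the on-call engineer which endpoint is failing without grepping logs.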