1. Packet loss when packetbeat captures packets from a network interface
When packetbeat captures and parses the packet stream from a network interface directly, it is easy to observe, by counting messages in the Kibana front end, that messages go missing once the interface traffic (the message concurrency) gets high. The loss is tied to the traffic rate: the higher the concurrency, the higher the drop rate. This appears to come from packetbeat's own pipeline mechanism: when the pipeline buffer is full, newly arriving events are dropped by default. In my experiments I have not found a way to avoid these drops completely; the most I could do was lower the message concurrency, which lowers the drop rate. This article, however, is mainly about the event loss that occurs when packetbeat parses a pcap file.
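For live capture, one knob worth trying is the size of the in-memory queue. A minimal sketch, not a verified fix (the 65536 figure is an illustrative guess, not a measured value): a larger queue absorbs longer traffic bursts before new events start getting dropped, though per the experiments above it cannot eliminate drops once the output falls behind for good.

# packetbeat.yml (sketch, not a verified fix): enlarge the memory queue
# so bursts are buffered instead of dropped; drops still occur if the
# output stays slower than the capture rate.
queue:
  mem:
    events: 65536          # reference default is 4096; size to the expected burst
    flush.min_events: 2048
    flush.timeout: 1s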
2. Event loss when packetbeat parses a pcap file
When packetbeat captures packets from a network interface, the process blocks and runs indefinitely; when packetbeat is pointed at a pcap file instead, it runs to completion and exits. The consequence is this: if the event-publishing buffer is configured with a non-zero batch size, packetbeat can finish parsing the whole pcap file while unpublished events are still sitting in the buffer. At that point packetbeat simply stops, the buffered events are discarded without ever being published, and the message counts in the Kibana front end come up short.
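For reference, pcap-file mode is selected with packetbeat's -I flag (capture.pcap is a placeholder file name; -e just sends the logs to stderr):

# Parse a capture file instead of sniffing a live interface.
# packetbeat exits as soon as the whole file has been processed.
packetbeat -e -c packetbeat.yml -I capture.pcap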
The fix is to adjust packetbeat's configuration so that the minimum number of events required before publishing is 0. The effect is that whenever the buffer holds an unpublished event, it is published immediately, so by the time packetbeat stops there are no unpublished events left in the buffer.
The relevant configuration:
# ================================== General ===================================

# The name of the shipper that publishes the network data. It can be used to group
# all the transactions sent by a single shipper in the web interface.
# If this option is not defined, the hostname is used.
#name:

# The tags of the shipper are included in their own field with each
# transaction published. Tags make it easy to group servers by different
# logical properties.
#tags: ["service-X", "web-tier"]

# Optional fields that you can specify to add additional information to the
# output. Fields can be scalar values, arrays, dictionaries, or any nested
# combination of these.
#fields:
#  env: staging

# If this option is set to true, the custom fields are stored as top-level
# fields in the output document instead of being grouped under a fields
# sub-dictionary. Default is false.
#fields_under_root: false
# Internal queue configuration for buffering events to be published.
queue:
  # Queue type by name (default 'mem')
  # The memory queue will present all available events (up to the outputs
  # bulk_max_size) to the output, the moment the output is ready to serve
  # another batch of events.
  mem:
    # Max number of events the queue can buffer.
    events: 4096

    # Hints the minimum number of events stored in the queue,
    # before providing a batch of events to the outputs.
    # The default value is set to 2048.
    # A value of 0 ensures events are immediately available
    # to be sent to the outputs.
    flush.min_events: 0

    # Maximum duration after which events are available to the outputs,
    # if the number of events stored in the queue is < `flush.min_events`.
    flush.timeout: 1s

  # The disk queue stores incoming events on disk until the output is
  # ready for them. This allows a higher event limit than the memory-only
  # queue and lets pending events persist through a restart.
  #disk:
    # The directory path to store the queue's data.
    #path: "${path.data}/diskqueue"

    # The maximum space the queue should occupy on disk. Depending on
    # input settings, events that exceed this limit are delayed or discarded.
    #max_size: 10GB

    # The maximum size of a single queue data file. Data in the queue is
    # stored in smaller segments that are deleted after all their events
    # have been processed.
    #segment_size: 1GB

    # The number of events to read from disk to memory while waiting for
    # the output to request them.
    #read_ahead: 512

    # The number of events to accept from inputs while waiting for them
    # to be written to disk. If event data arrives faster than it
    # can be written to disk, this setting prevents it from overflowing
    # main memory.
    #write_ahead: 2048

    # The duration to wait before retrying when the queue encounters a disk
    # write error.
    #retry_interval: 1s

    # The maximum length of time to wait before retrying on a disk write
    # error. If the queue encounters repeated errors, it will double the
    # length of its retry interval each time, up to this maximum.
    #max_retry_interval: 30s

# Sets the maximum number of CPUs that can be executing simultaneously. The
# default is the number of logical CPUs available in the system.
#max_procs:
But!!! This configuration has an obvious downside: there is no buffering mechanism at all. Every event is published the moment it appears, so when the data is being written to Elasticsearch, each individual event can trigger one packetbeat-to-ES interaction, and the corresponding CPU usage climbs.
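Note that the per-event interaction is the worst case. As the queue comments above say, the memory queue presents all available events, up to the output's bulk_max_size, whenever the output is ready for a batch. So during a fast pcap parse, batches still form; it is mainly when events trickle in slowly that each one ships alone. A sketch of the corresponding output knob (the host address and the value 50 are placeholders, untested here):

# Sketch: the Elasticsearch output groups whatever events are already
# available in the queue into one bulk request; bulk_max_size caps
# how many events go into a single request.
output.elasticsearch:
  hosts: ["localhost:9200"]   # placeholder address
  bulk_max_size: 50           # max events per bulk request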