Improving Kafka Producer Send Throughput via Parameter Tuning

1. Introduction

Recently, log files started backing up at a project site. The log-processing pipeline is roughly: read log file -> structure the data -> send to Kafka.

Take a 7.57 MB log file (about 58,400 log entries) as an example: the program needed 16.8 s to process it, i.e., only about 3,500 records per second. Investigation showed the bottleneck was the send-to-Kafka step. From experience, that rate is clearly nowhere near Kafka's limits, so the problem had to be in our own producer code. A quick look at Kafka's published benchmarks:

Single producer thread, 3x asynchronous replication
786,980 records/sec
(75.1 MB/sec)

Compared with the official numbers, there was no arguing: the problem was in our own program.

Based on experience, the first thing to try is client-side parameter tuning. The sections below introduce the important kafka producer configuration options, then present the tuning plan and the final results.

2. Important Kafka Producer Configuration Options

2.1 batch.size

The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition. This helps performance on both the client and the server. This configuration controls the default batch size in bytes.
No attempt will be made to batch records larger than this size.
Requests sent to brokers will contain multiple batches, one for each partition with data available to be sent.

A small batch size will make batching less common and may reduce throughput (a batch size of zero will disable batching entirely). A very large batch size may use memory a bit more wastefully as we will always allocate a buffer of the specified batch size in anticipation of additional records.

Type:	int
Default:	16384
Valid Values:	[0,...]
Importance:	medium

Kafka is not so naive as to send each record the moment it arrives; it buffers records and sends them in batches. This parameter is the upper limit, in bytes, on the size of each batch.
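To get a feel for the default, here is a back-of-the-envelope estimate of how many of this article's log records fit in one default-sized batch (the average record size is derived from the file statistics above, not measured directly):

```python
# Back-of-the-envelope estimate, using the file statistics from this article.
file_bytes = 7.57 * 1024 * 1024       # 7.57 MB log file
record_count = 58400                  # ~58,400 log entries in that file

avg_record_bytes = file_bytes / record_count               # ~136 bytes per record
records_per_default_batch = int(16384 / avg_record_bytes)  # batch.size default

print(round(avg_record_bytes))        # 136
print(records_per_default_batch)      # 120
```

So at the default, each batch holds only about 120 of these records; raising batch.size to 563840 (as in the tuning below) lets a batch hold a few thousand.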

2.2 linger.ms

The producer groups together any records that arrive in between request transmissions into a single batched request. Normally this occurs only under load when records arrive faster than they can be sent out. However in some circumstances the client may want to reduce the number of requests even under moderate load. This setting accomplishes this by adding a small amount of artificial delay: that is, rather than immediately sending out a record the producer will wait for up to the given delay to allow other records to be sent so that the sends can be batched together. This can be thought of as analogous to Nagle's algorithm in TCP. This setting gives the upper bound on the delay for batching: once we get batch.size worth of records for a partition it will be sent immediately regardless of this setting, however if we have fewer than this many bytes accumulated for this partition we will 'linger' for the specified time waiting for more records to show up. This setting defaults to 0 (i.e. no delay). Setting linger.ms=5, for example, would have the effect of reducing the number of requests sent but would add up to 5ms of latency to records sent in the absence of load.

Type:	long
Default:	0
Valid Values:	[0,...]
Importance:	medium

Kafka buffers records before sending; this setting is the maximum time a batch will wait in the buffer before being sent.
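The interaction between the two thresholds can be sketched as a simple flush rule (an illustrative model, not the client's actual code): a batch is sent when it reaches batch.size bytes, or when linger.ms has elapsed, whichever comes first.

```python
def should_flush(batch_bytes, elapsed_ms, batch_size=16384, linger_ms=0):
    """Toy model of the flush decision for one partition's batch:
    send as soon as the batch is full, or once linger.ms has expired."""
    return batch_bytes >= batch_size or elapsed_ms >= linger_ms

# Defaults (linger.ms=0): even a tiny batch is sent immediately.
print(should_flush(batch_bytes=200, elapsed_ms=0))                      # True
# linger.ms=30000: a small batch keeps waiting for more records...
print(should_flush(batch_bytes=200, elapsed_ms=5000, linger_ms=30000))  # False
# ...but a full batch is sent immediately regardless of linger.ms.
print(should_flush(batch_bytes=16384, elapsed_ms=0, linger_ms=30000))   # True
```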

2.3 acks

The number of acknowledgments the producer requires the leader to have received before considering a request complete. This controls the durability of records that are sent. The following settings are allowed:

acks=0 If set to zero then the producer will not wait for any acknowledgment from the server at all. The record will be immediately added to the socket buffer and considered sent. No guarantee can be made that the server has received the record in this case, and the retries configuration will not take effect (as the client won't generally know of any failures). The offset given back for each record will always be set to -1.
acks=1 This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers. In this case should the leader fail immediately after acknowledging the record but before the followers have replicated it then the record will be lost.
acks=all This means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee. This is equivalent to the acks=-1 setting.
Type:	string
Default:	1
Valid Values:	[all, -1, 0, 1]
Importance:	high

This parameter controls whether the producer waits for acknowledgment from the broker after sending data; the trade-off has a bit of a UDP vs. TCP flavor.

2.4 compression.type

Specify the final compression type for a given topic. This configuration accepts the standard compression codecs ('gzip', 'snappy', 'lz4', 'zstd'). It additionally accepts 'uncompressed' which is equivalent to no compression; and 'producer' which means retain the original compression codec set by the producer.

Type:	string
Default:	producer
Valid Values:	
Importance:	high
Update Mode:	cluster-wide

This parameter matters a lot! We already know Kafka buffers outgoing records, so can those batches be compressed before sending? They can, and this parameter selects the compression algorithm. Oddly, Kafka does not compress by default, which is a bit of a trap; the reason deserves a closer look when time permits.
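Log lines are highly repetitive, so they compress extremely well. A quick stdlib demonstration (the log line below is made up for illustration; real ratios depend on your data):

```python
import gzip

# A made-up but representative structured log line, repeated like real log traffic.
line = b"2019-06-01 12:00:00 INFO [worker-1] request handled status=200 latency_ms=13\n"
payload = line * 1000

compressed = gzip.compress(payload)
print(len(payload), len(compressed))   # repetitive logs compress dramatically
```

Smaller batches on the wire mean fewer bytes per request, which is exactly why compression was the biggest lever in this tuning exercise.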

3. How to Increase Producer Send Throughput

The tuning configuration up front:

batch_size=563840 -- default is 16384
linger_ms=30000 -- default is 0
acks=0 -- default is 1
compression_type="gzip" -- default is None

The reasoning behind the tuning:

  1. Moderately increase batch.size and linger.ms

    These two parameters work together; the goal is to buffer more data and reduce the number of client requests. Tune them to your actual workload, and keep the values within reason.

  2. Disable send acknowledgments

    Inspired by UDP. Suitable for scenarios where data completeness is not critical, such as logs, where losing a few records doesn't matter.

  3. Specify a compression algorithm for sends

    This is the big win of this round of tuning: have Kafka gzip-compress the data before sending it.
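Putting the three steps together as code, this is a sketch assuming the kafka-python client (whose snake_case constructor parameters match the names listed above; the broker address is hypothetical):

```python
# Sketch assuming the kafka-python client; parameter names match its
# KafkaProducer constructor. The broker address below is a placeholder.
TUNED_CONFIG = dict(
    batch_size=563840,        # default 16384: much larger batches
    linger_ms=30000,          # default 0: wait up to 30 s to fill a batch
    acks=0,                   # default 1: fire-and-forget, no broker ack
    compression_type="gzip",  # default None: gzip each batch before sending
)

def make_producer(bootstrap_servers="localhost:9092"):
    # Imported lazily so the config above can be inspected even
    # without kafka-python installed.
    from kafka import KafkaProducer
    return KafkaProducer(bootstrap_servers=bootstrap_servers, **TUNED_CONFIG)
```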

4. Before/After Comparison

Program     File size   Records/file   Files   Total time   Avg time/file
Original    7.57 MB     58,400         20      336 s        16.8 s
Tuned       7.57 MB     58,400         20      207 s        10.3 s

A very pleasing result: under identical conditions, the tuned program cut total processing time from 336 s to 207 s, a reduction of roughly 40%.
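Checking the arithmetic behind the table (note the "~40%" refers to the reduction in processing time; expressed as a throughput gain it is closer to 60%):

```python
records = 58400                  # log entries per file
before_s, after_s = 16.8, 10.3   # average processing time per file

before_rps = records / before_s           # ~3,500 records/sec before tuning
after_rps = records / after_s             # ~5,700 records/sec after tuning
time_saved_pct = (1 - after_s / before_s) * 100   # ~39% less time per file

print(round(before_rps), round(after_rps), round(time_saved_pct))
# 3476 5670 39
```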

5. Summary

After tuning, throughput rose from about 3,500 records/sec to about 5,700 records/sec. That is still a far cry from Kafka's published benchmark numbers (the hardware, Kafka cluster, and network I/O all differ), but cutting processing time by roughly 40% is still a very respectable gain.
