Kafka Study Notes (3)

weixin_33701251

Original link: https://my.oschina.net/dongtianxi/blog/715013

A Kafka client that publishes records to the Kafka cluster.

This is the Kafka client used to publish records to a Kafka cluster.

The producer is thread safe and sharing a single producer instance across threads will generally be faster than having multiple instances.

The producer is thread safe, and having several threads share one producer instance is usually faster than giving each thread its own.

Here is a simple example of using the producer to send records with strings containing sequential numbers as the key/value pairs.

Below is a simple example that uses the producer to send string records, each with a key and a value.

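import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");             // wait for the full commit of each record
props.put("retries", 0);              // never retry a failed request
props.put("batch.size", 16384);       // per-partition batch buffer, in bytes
props.put("linger.ms", 1);            // wait up to 1 ms for more records to batch
props.put("buffer.memory", 33554432); // total buffering memory, in bytes (32 MB)
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++) {
    // Key and value are both the string form of the sequence number.
    producer.send(new ProducerRecord<>("my-topic", Integer.toString(i), Integer.toString(i)));
}

// Release the buffer pool and the background I/O thread.
producer.close();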

The producer consists of a pool of buffer space that holds records that haven't yet been transmitted to the server as well as a background I/O thread that is responsible for turning these records into requests and transmitting them to the cluster. Failure to close the producer after use will leak these resources.

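A producer consists of two parts: (1) a buffer pool that holds records not yet sent, and (2) a background I/O thread that performs the actual sending. Always call close() once you are done with the producer; otherwise these resources will leak.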

The send() method is asynchronous. When called it adds the record to a buffer of pending record sends and immediately returns. This allows the producer to batch together individual records for efficiency.

The send() method is asynchronous: it adds the record to a buffer and returns immediately, which lets the producer send records in batches.
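
The buffer-plus-I/O-thread design described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Kafka's implementation: ToyProducer, Pending, and the in-memory log are hypothetical names invented here, and the "broker" is just a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Toy model only: ToyProducer and all of its members are hypothetical, not Kafka classes.
class ToyProducer {
    private static final class Pending {
        final String value;
        final CompletableFuture<Integer> offset = new CompletableFuture<>();
        Pending(String value) { this.value = value; }
    }

    private final BlockingQueue<Pending> buffer = new LinkedBlockingQueue<>();
    private final List<String> log = new ArrayList<>(); // stands in for the broker's log
    private final Thread ioThread = new Thread(this::drainLoop, "toy-io-thread");
    private volatile boolean closed = false;

    ToyProducer() {
        ioThread.start();
    }

    // Adds the record to the buffer and returns at once; the future completes
    // with the record's offset after the I/O thread has "transmitted" it.
    CompletableFuture<Integer> send(String value) {
        Pending pending = new Pending(value);
        buffer.add(pending);
        return pending.offset;
    }

    private void drainLoop() {
        List<Pending> batch = new ArrayList<>();
        try {
            while (!closed || !buffer.isEmpty()) {
                Pending first = buffer.poll(10, TimeUnit.MILLISECONDS);
                if (first == null) {
                    continue;
                }
                batch.clear();
                batch.add(first);
                buffer.drainTo(batch); // everything pending right now joins the same batch
                for (Pending p : batch) {
                    log.add(p.value);
                    p.offset.complete(log.size() - 1);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Waits for buffered records to be written, then stops the I/O thread.
    void close() {
        closed = true;
        try {
            ioThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling send("a") and then send("b") returns two futures right away; after close() they hold offsets 0 and 1. If close() is never called, the I/O thread keeps running, which is the kind of leak the text warns about.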

The acks config controls the criteria under which requests are considered complete. The "all" setting we have specified will result in blocking on the full commit of the record, the slowest but most durable setting.

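acks has three possible values. 0: the client does not wait for any response from the server. 1: the partition leader responds to the client once it has written the record to its own log, without waiting for the followers to finish replicating. all: the server responds only after the leader and all in-sync followers have written the record to their logs. Of the three, all is the slowest but the most durable.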
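
As a small configuration sketch, the three acks levels can be set like this. AcksSettings and its methods are hypothetical helpers written for illustration; only the "acks" and "bootstrap.servers" keys and the values "0", "1", and "all" come from the producer configuration itself.

```java
import java.util.Properties;

// Hypothetical helper class; only the property keys and values are Kafka's.
class AcksSettings {
    // Builds producer properties with the given acks level: "0", "1", or "all".
    static Properties withAcks(String acks) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address from the example above
        props.put("acks", acks);
        return props;
    }

    // Summarizes what the producer waits for at each level, per the description above.
    static String waitsFor(String acks) {
        switch (acks) {
            case "0": return "nothing";
            case "1": return "leader write";
            case "all": return "leader and in-sync follower writes";
            default: throw new IllegalArgumentException("unknown acks value: " + acks);
        }
    }
}
```

The trade-off runs in one direction: each step from "0" to "1" to "all" waits for more of the cluster, so it adds latency and durability together.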

If the request fails, the producer can automatically retry, though since we have specified retries as 0 it won't. Enabling retries also opens up the possibility of duplicates (see the documentation on message delivery semantics for details).

If a request fails, the producer can retry automatically, but since we set retries = 0 no retry will happen. If retries are enabled, duplicate records may appear.
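
Why retries can produce duplicates is easiest to see when the write succeeds but the acknowledgement is lost. The following is a toy simulation under that assumption; FlakyBroker and sendWithRetries are hypothetical names, not Kafka APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation; FlakyBroker and sendWithRetries are hypothetical, not Kafka APIs.
class RetryDuplicates {
    static final class FlakyBroker {
        final List<String> log = new ArrayList<>();
        private int acksToDrop;

        FlakyBroker(int acksToDrop) {
            this.acksToDrop = acksToDrop;
        }

        // Always appends the record, then reports whether the ack reached the client.
        boolean append(String record) {
            log.add(record);
            if (acksToDrop > 0) {
                acksToDrop--;
                return false; // the write happened, but the client never hears about it
            }
            return true;
        }
    }

    // Sends once, then retries up to `retries` more times while no ack arrives.
    static boolean sendWithRetries(FlakyBroker broker, String record, int retries) {
        for (int attempt = 0; attempt <= retries; attempt++) {
            if (broker.append(record)) {
                return true;
            }
        }
        return false;
    }
}
```

With one dropped ack and retries = 3, the send eventually succeeds but the log holds the record twice. With retries = 0, the log holds it once and the client sees a failure, even though the record was written.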

The producer maintains buffers of unsent records for each partition. These buffers are of a size specified by the batch.size config. Making this larger can result in more batching, but requires more memory (since we will generally have one of these buffers for each active partition).

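The producer keeps a buffer of unsent records for each partition, with the size set by batch.size. Increasing it allows larger batches but needs more memory, since there is generally one buffer per active partition.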
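
The memory cost mentioned above is simple arithmetic. As a rough estimate only, assuming exactly one batch.size buffer per active partition (BatchMemory, its methods, and the partition count of 50 are made up for illustration):

```java
// Hypothetical helper for a back-of-the-envelope estimate; not a Kafka API.
class BatchMemory {
    // Bytes held in batches if every active partition has one full-size buffer.
    static long batchBytes(int batchSize, int activePartitions) {
        return (long) batchSize * activePartitions;
    }

    // How many full-size batches fit into the total buffer.memory budget.
    static long maxBatches(long bufferMemory, int batchSize) {
        return bufferMemory / batchSize;
    }
}
```

With the values from the example, batchBytes(16384, 50) is 819200 bytes, and maxBatches(33554432, 16384) is 2048, so a 32 MB buffer has room for 2048 batches of 16 KB.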

By default a buffer is available to send immediately even if there is additional unused space in the buffer. However if you want to reduce the number of requests you can set linger.ms to something greater than 0. This will instruct the producer to wait up to that number of milliseconds before sending a request in hope that more records will arrive to fill up the same batch. This is analogous to Nagle's algorithm in TCP. For example, in the code snippet above, likely all 100 records would be sent in a single request since we set our linger time to 1 millisecond. However this setting would add 1 millisecond of latency to our request waiting for more records to arrive if we didn't fill up the buffer. Note that records that arrive close together in time will generally batch together even with linger.ms=0 so under heavy load batching will occur regardless of the linger configuration; however setting this to something larger than 0 can lead to fewer, more efficient requests when not under maximal load at the cost of a small amount of latency.

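By default a buffer can be sent immediately even if it still has unused space. To reduce the number of requests, set linger.ms > 0: the producer then waits up to that many milliseconds before sending, in the hope of building a larger batch. In the snippet above, with linger.ms = 1, all 100 records will probably go out in one batch. If no more records arrive within that 1 millisecond, however, the wait only adds latency and gains nothing. Note that when many records arrive within a short time, batching still happens even with linger.ms = 0.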
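
The effect of linger.ms on the number of requests can be sketched with a simplified model: a batch opens when its first record arrives and closes after lingerMs milliseconds or when it is full. LingerModel and countRequests are hypothetical, and the real producer's timing is more involved than this.

```java
// Simplified model of lingering; LingerModel is hypothetical, not Kafka code.
class LingerModel {
    // Counts requests for records arriving at the given times (ms, ascending order).
    static int countRequests(long[] arrivalsMs, long lingerMs, int maxRecordsPerBatch) {
        if (maxRecordsPerBatch < 1) {
            throw new IllegalArgumentException("maxRecordsPerBatch must be at least 1");
        }
        int requests = 0;
        int i = 0;
        while (i < arrivalsMs.length) {
            long deadline = arrivalsMs[i] + lingerMs; // batch closes this long after it opens
            int size = 0;
            while (i < arrivalsMs.length && size < maxRecordsPerBatch && arrivalsMs[i] <= deadline) {
                i++;
                size++;
            }
            requests++;
        }
        return requests;
    }
}
```

In this model, records arriving at 0, 1, 2, and 3 ms take four requests with a linger of 0 and two requests with a linger of 1. Records that all arrive at the same instant share one request even with a linger of 0, which matches the remark that batching still happens under load.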

The buffer.memory controls the total amount of memory available to the producer for buffering. If records are sent faster than they can be transmitted to the server then this buffer space will be exhausted. When the buffer space is exhausted additional send calls will block. The threshold for time to block is determined by max.block.ms after which it throws a TimeoutException.

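buffer.memory controls the total memory available to the producer for buffering. If records are written to the buffer faster than they can be sent out for a sustained period, the buffer will be exhausted and further send calls will block. After blocking for max.block.ms, send throws a TimeoutException.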
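
The block-then-time-out behavior can be modeled with a bounded queue from the JDK. This is an analogy, not Kafka's buffer pool: BoundedBuffer, BufferTimeoutException, and transmitOne are hypothetical names, and capacity is counted in records here rather than bytes.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Analogy only: a bounded queue standing in for the producer's buffer memory.
class BoundedBuffer {
    // Hypothetical stand-in for the TimeoutException that the producer throws.
    static final class BufferTimeoutException extends RuntimeException {
        BufferTimeoutException(String message) {
            super(message);
        }
    }

    private final BlockingQueue<String> slots;
    private final long maxBlockMs;

    BoundedBuffer(int capacity, long maxBlockMs) {
        this.slots = new ArrayBlockingQueue<>(capacity);
        this.maxBlockMs = maxBlockMs;
    }

    // Blocks for up to maxBlockMs waiting for free space, then gives up.
    void send(String record) {
        try {
            if (!slots.offer(record, maxBlockMs, TimeUnit.MILLISECONDS)) {
                throw new BufferTimeoutException("no buffer space within " + maxBlockMs + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new BufferTimeoutException("interrupted while waiting for buffer space");
        }
    }

    // Simulates the I/O thread transmitting one record, which frees its space.
    String transmitOne() {
        return slots.poll();
    }
}
```

With a capacity of 2, a third send blocks for maxBlockMs and then fails. Once transmitOne() frees a slot, the same send goes through.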

The key.serializer and value.serializer instruct how to turn the key and value objects the user provides with their ProducerRecord into bytes. You can use the included ByteArraySerializer or StringSerializer for simple string or byte types.

key.serializer and value.serializer convert a record's key and value, respectively, into byte arrays. Kafka ships with a set of simple serializer classes.
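
To make "turning keys and values into bytes" concrete, here is roughly what a string serializer does, written with the JDK only. StringBytes is a hypothetical class, not Kafka's StringSerializer, and UTF-8 is assumed as the encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of string (de)serialization; not Kafka's own classes.
class StringBytes {
    // Encodes a string as UTF-8 bytes; null stays null.
    static byte[] serialize(String value) {
        return value == null ? null : value.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes back into a string; null stays null.
    static String deserialize(byte[] bytes) {
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }
}
```

In the example at the top, keys and values are strings such as "42", which this encoding turns into two bytes. Whatever serializer the producer uses, the consumer needs the matching deserializer to read the records back.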

Reposted from: https://my.oschina.net/dongtianxi/blog/715013