Notes on the Kafka official docs -- KafkaProducer

The producer is thread safe and sharing a single producer instance across threads will generally be faster than having multiple instances.

The producer is thread safe, so when calling it from multiple threads, sharing a single producer instance is all you need.

Here is a simple example of using the producer to send records with strings containing sequential numbers as the key/value pairs.


Here is a simple producer example:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++)
    producer.send(new ProducerRecord<String, String>("my-topic", Integer.toString(i), Integer.toString(i)));

producer.close();

The producer consists of a pool of buffer space that holds records that haven't yet been transmitted to the server as well as a background I/O thread that is responsible for turning these records into requests and transmitting them to the cluster. Failure to close the producer after use will leak these resources.

The producer maintains a pool of buffer space holding all records that have not yet been sent to the server, plus a background I/O thread that assembles these records into requests and transmits them to the cluster. If the producer is not closed properly after use, these resources will leak.

The send() method is asynchronous. When called it adds the record to a buffer of pending record sends and immediately returns. This allows the producer to batch together individual records for efficiency.

The send() method is asynchronous: each call appends the record to a buffer of pending sends and returns immediately. This lets the producer batch individual records together for efficiency.
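To make the asynchronous contract concrete, here is a minimal sketch using the Kafka client API: send() returns a Future&lt;RecordMetadata&gt; and can also take a Callback invoked on completion. The broker address and topic name are placeholder assumptions; this needs a running broker and the kafka-clients dependency to execute.

```java
import java.util.Properties;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class AsyncSendSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // send() returns at once; the Future completes when the request does.
            Future<RecordMetadata> future =
                producer.send(new ProducerRecord<>("my-topic", "key", "value"));

            // Alternatively, pass a Callback to be run when the send completes.
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                (RecordMetadata metadata, Exception e) -> {
                    if (e != null) {
                        e.printStackTrace();                          // the send failed
                    } else {
                        System.out.println("offset=" + metadata.offset());
                    }
                });

            future.get(); // block only if you need a synchronous result
        }
    }
}
```

Calling future.get() immediately after every send would defeat the batching described above; the callback form keeps sends fully asynchronous.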

The acks config controls the criteria under which requests are considered complete. The "all" setting we have specified will result in blocking on the full commit of the record, the slowest but most durable setting.

The acks config controls the criteria under which a request is considered complete. When "all" is specified, the request does not complete until the record has been fully committed; this is the slowest but most durable setting.

If the request fails, the producer can automatically retry, though since we have specified retries as 0 it won't. Enabling retries also opens up the possibility of duplicates (see the documentation on message delivery semantics for details).

If a request fails, the producer can retry automatically, unless, as here, retries is set to 0. Enabling retries also opens the possibility of duplicates, since the same record may be sent more than once. (Since 0.11 the producer supports idempotence, which can be enabled to prevent such duplicates.)

The producer maintains buffers of unsent records for each partition. These buffers are of a size specified by the batch.size config. Making this larger can result in more batching, but requires more memory (since we will generally have one of these buffers for each active partition).

The producer maintains a buffer of unsent records for each partition, sized by the batch.size config. Making this larger allows more batching, but also requires more memory, since there is generally one such buffer per active partition.

By default a buffer is available to send immediately even if there is additional unused space in the buffer. However if you want to reduce the number of requests you can set linger.ms to something greater than 0. This will instruct the producer to wait up to that number of milliseconds before sending a request in hope that more records will arrive to fill up the same batch. This is analogous to Nagle's algorithm in TCP. For example, in the code snippet above, likely all 100 records would be sent in a single request since we set our linger time to 1 millisecond. However this setting would add 1 millisecond of latency to our request waiting for more records to arrive if we didn't fill up the buffer.

Note that records that arrive close together in time will generally batch together even with linger.ms=0 so under heavy load batching will occur regardless of the linger configuration; however setting this to something larger than 0 can lead to fewer, more efficient requests when not under maximal load at the cost of a small amount of latency.

By default a buffer is eligible to be sent immediately, even if it still has unused space. If you want to reduce the number of requests, however, you can set linger.ms to a value greater than 0. This tells the producer to wait up to that many milliseconds before sending a request, in the hope that more records arrive to fill the same batch. This is analogous to Nagle's algorithm in TCP.

In the code above, for example, all 100 records would likely be sent in a single request, since we set the linger time to 1 millisecond. But if the buffer did not fill up, this setting would add 1 millisecond of latency while waiting for more records to arrive.

Note that records arriving close together in time will generally batch together even with linger.ms=0, so under heavy load batching happens regardless of the linger configuration. Setting it above 0, however, can yield fewer and more efficient requests when the load is below its maximum, at the cost of a small amount of latency.


The buffer.memory controls the total amount of memory available to the producer for buffering. If records are sent faster than they can be transmitted to the server then this buffer space will be exhausted. When the buffer space is exhausted additional send calls will block. The threshold for time to block is determined by max.block.ms after which it throws a TimeoutException.

buffer.memory controls the total amount of memory available to the producer for buffering (across all the per-partition buffers described above). If records are sent faster than they can be transmitted to the server, this buffer space will be exhausted, and further send calls will block. The blocking time limit is determined by max.block.ms, after which a TimeoutException is thrown.
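For illustration, the two settings could be tuned together as below. The values are arbitrary examples for a hypothetical workload, not recommendations; this is a config fragment extending the props object from the first example.

```java
// Hypothetical tuning: total buffer pool size and how long send() may block.
props.put("buffer.memory", 67108864); // 64 MB total buffer pool for the producer
props.put("max.block.ms", 5000);      // send() throws TimeoutException after blocking 5 s
```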

The key.serializer and value.serializer instruct how to turn the key and value objects the user provides with their ProducerRecord into bytes. You can use the included ByteArraySerializer or StringSerializer for simple string or byte types.

key.serializer and value.serializer specify how to turn the key and value objects of the user's ProducerRecord into bytes. For simple string or byte types, you can use the included StringSerializer or ByteArraySerializer.
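As a standalone sketch (not the actual Kafka classes), StringSerializer essentially UTF-8 encodes the string, and the matching deserializer reverses it:

```java
import java.nio.charset.StandardCharsets;

public class StringSerdeSketch {
    // What StringSerializer effectively does with its default (UTF-8) encoding.
    static byte[] serialize(String data) {
        return data == null ? null : data.getBytes(StandardCharsets.UTF_8);
    }

    // What the matching StringDeserializer effectively does.
    static String deserialize(byte[] data) {
        return data == null ? null : new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = serialize("my-key");
        System.out.println(deserialize(bytes)); // prints "my-key"
    }
}
```

A custom Serializer for your own record types follows the same shape: object in, byte[] out, with null mapped to null.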

From Kafka 0.11, the KafkaProducer supports two additional modes: the idempotent producer and the transactional producer. The idempotent producer strengthens Kafka's delivery semantics from at least once to exactly once delivery. In particular producer retries will no longer introduce duplicates. The transactional producer allows an application to send messages to multiple partitions (and topics!) atomically.

Starting with Kafka 0.11, KafkaProducer supports two additional modes: the idempotent producer and the transactional producer. The idempotent producer strengthens Kafka's delivery semantics from at-least-once to exactly-once; in particular, producer retries no longer introduce duplicates (idempotence guarantees each record is accepted only once). The transactional producer allows an application to send messages to multiple partitions, and even multiple topics, atomically.

To enable idempotence, the enable.idempotence configuration must be set to true. If set, the retries config will be defaulted to Integer.MAX_VALUE, the max.inflight.requests.per.connection config will be defaulted to 1, and acks config will be defaulted to all. There are no API changes for the idempotent producer, so existing applications will not need to be modified to take advantage of this feature.

To enable idempotence, the enable.idempotence config must be set to true. When set, retries defaults to Integer.MAX_VALUE, max.inflight.requests.per.connection defaults to 1, and acks defaults to all. The idempotent producer introduces no API changes, so existing applications need no modification to take advantage of this feature.
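Enabling it is a single setting; per the text above, the dependent configs then default appropriately and need no explicit values. This is a config fragment extending the earlier props object:

```java
// Enable the idempotent producer; retries, acks and the in-flight
// request limit are then defaulted automatically as described above.
props.put("enable.idempotence", "true");
```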

To take advantage of the idempotent producer, it is imperative to avoid application level re-sends since these cannot be de-duplicated. As such, if an application enables idempotence, it is recommended to leave the retries config unset, as it will be defaulted to Integer.MAX_VALUE. Additionally, if a send(ProducerRecord) returns an error even with infinite retries (for instance if the message expires in the buffer before being sent), then it is recommended to shut down the producer and check the contents of the last produced message to ensure that it is not duplicated. Finally, the producer can only guarantee idempotence for messages sent within a single session.

To benefit from the idempotent producer, it is essential to avoid application-level re-sends, since those cannot be de-duplicated. Likewise, if an application enables idempotence, it is recommended to leave the retries config unset, so that it defaults to Integer.MAX_VALUE. Additionally, if a send(ProducerRecord) returns an error even with infinite retries (for instance because the message expired in the buffer before being sent), it is recommended to shut down the producer and check the contents of the last produced message to make sure it was not duplicated. Finally, the producer can only guarantee idempotence for messages sent within a single session.

To use the transactional producer and the attendant APIs, you must set the transactional.id configuration property. If the transactional.id is set, idempotence is automatically enabled along with the producer configs which idempotence depends on. Further, topics which are included in transactions should be configured for durability. In particular, the replication.factor should be at least 3, and the min.insync.replicas for these topics should be set to 2. Finally, in order for transactional guarantees to be realized from end-to-end, the consumers must be configured to read only committed messages as well.

To use the transactional producer and its accompanying APIs, you must set the transactional.id configuration property. When transactional.id is set, idempotence is enabled automatically, along with the producer configs that idempotence depends on. Furthermore, topics included in transactions should be configured for durability: in particular, replication.factor should be at least 3, and min.insync.replicas for these topics should be set to 2. Finally, for transactional guarantees to hold end-to-end, consumers must also be configured to read only committed messages.
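The settings described above might be sketched as follows. The topic-level values follow the text; the consumer property name (isolation.level) is from the Kafka consumer configs, and consumerProps is a hypothetical Properties object on the consumer side.

```java
// Producer side: enable transactions (idempotence comes along automatically).
props.put("transactional.id", "my-transactional-id");

// Topic side (set when creating the topic, e.g. via AdminClient or kafka-topics.sh):
//   replication.factor >= 3, min.insync.replicas = 2

// Consumer side: read only committed messages for end-to-end guarantees.
consumerProps.put("isolation.level", "read_committed");
```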

The purpose of the transactional.id is to enable transaction recovery across multiple sessions of a single producer instance. It would typically be derived from the shard identifier in a partitioned, stateful, application. As such, it should be unique to each producer instance running within a partitioned application.

The purpose of transactional.id is to enable transaction recovery across multiple sessions of a single producer instance. It would typically be derived from the shard identifier of a partitioned, stateful application.
Accordingly, it must be unique for each producer instance running within such a partitioned application.

All the new transactional APIs are blocking and will throw exceptions on failure. The example below illustrates how the new APIs are meant to be used. It is similar to the example above, except that all 100 messages are part of a single transaction.

All the new transactional APIs are blocking and throw exceptions on failure.

The example below shows how the new APIs are meant to be used. It is similar to the example above, except that all 100 messages are part of a single transaction.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("transactional.id", "my-transactional-id");
Producer<String, String> producer = new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());

producer.initTransactions();

try {
    producer.beginTransaction();
    for (int i = 0; i < 100; i++)
        producer.send(new ProducerRecord<>("my-topic", Integer.toString(i), Integer.toString(i)));
    producer.commitTransaction();
} catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
    // We can't recover from these exceptions, so our only option is to close the producer and exit.
    producer.close();
} catch (KafkaException e) {
    // For all other exceptions, just abort the transaction and try again.
    producer.abortTransaction();
}
producer.close();

     