Kafka 0.11 Study Notes
一、Producer
Core parameters
1. Common parameters
props.put("bootstrap.servers", "192.168.3.106:9092");
/**
* acks = 0: 表示produce请求立即返回,不需要等待leader的任何确认。
* 这种方案有最高的吞吐率,但是不保证消息是否真的发送成功。
*
* acks = 1: 表示leader副本必须应答此produce请求并写入消息到本地日志,之后produce请求被认为成功. 如果leader挂掉有数据丢失的风险
* acks = -1或者all: 表示分区leader必须等待消息被成功写入到所有的ISR副本(同步副本)中才认为produce请求成功。
* 这种方案提供最高的消息持久性保证,但是理论上吞吐率也是最差的。
*/
props.put("acks","all");
props.put("min.insync.replicas", 2); //配合acks=all使用 此配置指定了成功写入的副本应答的最小数 默认值为1
//重试的次数
props.put("retries", 3);
//配合retries使用,为1保证提交的数据顺序性
props.put("max.in.flight.requests.per.connection",1);
//批处理数据的大小,每次写入多少数据到topic
props.put("batch.size", 16384);
//可以延长多久发送数据
props.put("linger.ms", 1);
props.put("max.request.size", 1048576); //单条消息默认最大长度1M
//缓冲区的大小
props.put("buffer.memory", 33554432);
//指定key和value的序列化器
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
//添加自定义分区函数
//props.put("partitioner.class","com.kaikeba.partitioner.MyPartitioner");
Sample code with a send callback
//imports assumed for this example: JSONObject is Alibaba fastjson; JdbcUtils and Register are project-local helper classes not shown here
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Properties;
import com.alibaba.fastjson.JSONObject;
import lombok.extern.slf4j.Slf4j;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

@Slf4j //the log calls below assume lombok's @Slf4j (or an equivalent slf4j Logger field)
public class KafkaProduct {
    public static void main(String[] args) {
        Connection conn = null;
        PreparedStatement prst = null;
        ResultSet rs = null;
        Properties prop = new Properties();
        prop.put("bootstrap.servers", "docker01:9092,docker02:9092,docker03:9092");
        prop.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prop.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prop.put("acks", "all");
        //prop.put("min.insync.replicas", 1); //topic/broker-level config used together with acks=all: minimum number of in-sync replicas that must acknowledge a write
        prop.put("retries", 3);
        //prop.put("max.in.flight.requests.per.connection", 1); //set to 1 to keep batches in order when retries happen
        prop.put("batch.size", 32768); //batch size limit, 32KB
        prop.put("linger.ms", 5); //maximum wait time before a batch is sent
        prop.put("max.request.size", 1048576); //maximum size of a single request, default 1MB
        prop.put("buffer.memory", 33554432); //producer buffer size, default 32MB
        Producer<String, String> producer = new KafkaProducer<>(prop);
        try {
            //create the MySQL connection
            conn = JdbcUtils.getMysqlConn("jdbc:mysql://localhost:3306/inceptor?useSSL=false", "root", "root");
            prst = conn.prepareStatement("SELECT * FROM t01_std_dwd_jbxx_djrkxx_info");
            rs = prst.executeQuery();
            long start_time = System.currentTimeMillis();
            while (rs.next()) {
                //convert the row into a JSON string
                String register = JSONObject.toJSONString(new Register(rs));
                //send with a callback
                producer.send(new ProducerRecord<>("register1", register), new Callback() {
                    @Override
                    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
                        if (e == null) {
                            log.info("message sent successfully");
                        } else {
                            log.error("failed to send message: {}", register);
                        }
                    }
                });
            }
            long end_time = System.currentTimeMillis();
            log.info("elapsed: {} ms", end_time - start_time);
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            //close resources in reverse order of creation, guarding against nulls
            try {
                if (rs != null) rs.close();
                if (prst != null) prst.close();
                if (conn != null) conn.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
            producer.close();
        }
    }
}
Idempotence and transactions (overview)
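Kafka 0.11 introduced the idempotent producer and transactions. A minimal sketch of how they are enabled (broker addresses and topic name are reused from the example above; the transactional.id value is an assumption):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties prop = new Properties();
        prop.put("bootstrap.servers", "docker01:9092,docker02:9092,docker03:9092");
        prop.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prop.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        //idempotence: the broker de-duplicates retried batches, so retries no longer create duplicates
        prop.put("enable.idempotence", "true");
        //transactions: several sends become one atomic unit; the id must be unique per producer instance (assumed value)
        prop.put("transactional.id", "demo-tx-1");

        Producer<String, String> producer = new KafkaProducer<>(prop);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("register1", "key", "value-1"));
            producer.send(new ProducerRecord<>("register1", "key", "value-2"));
            producer.commitTransaction();
        } catch (Exception e) {
            //a production implementation should treat ProducerFencedException separately (close instead of abort)
            producer.abortTransaction();
        } finally {
            producer.close();
        }
    }
}

Consumers that should only see committed transactional messages need isolation.level=read_committed.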
二、Broker
- For detailed parameters, see the official documentation
Terminology: ISR, HW, LEO, epoch
ISR
The multi-replica mechanism alone can provide Kafka's high availability, but does it also guarantee that no data is lost?
No. If the leader crashes before its latest data has been replicated to the followers, then even after a follower is elected as the new leader, that data is already gone.
ISR means in-sync replicas: the set of follower partitions that are keeping up with the leader partition. Only followers in the ISR list may be elected as the new leader when the leader crashes, because membership in the ISR means their data is in sync with the leader.
To guarantee that data written to Kafka is never lost, the ISR must contain at least one follower, and a message written to the leader partition must be replicated to every follower partition in the ISR before it is considered committed; only then does Kafka promise that the message will not be lost.
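As a hedged sketch of the settings that usually go together for this guarantee (topic name, partition count and broker address are illustrative assumptions): min.insync.replicas at the topic level, acks=all on the producer, and unclean.leader.election.enable=false so that a replica outside the ISR can never become leader.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "node01:9092"); //assumed broker address
        AdminClient admin = AdminClient.create(props);

        //3 partitions, replication factor 3 (assumed sizing)
        NewTopic topic = new NewTopic("register1", 3, (short) 3);
        Map<String, String> configs = new HashMap<>();
        //with acks=all on the producer, at least 2 in-sync replicas must acknowledge every write
        configs.put("min.insync.replicas", "2");
        //never allow a replica that has fallen out of the ISR to become leader
        configs.put("unclean.leader.election.enable", "false");
        topic.configs(configs);

        admin.createTopics(Collections.singletonList(topic)).all().get();
        admin.close();
    }
}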
HW&LEO原理
LEO
last end offset,日志末端偏移量,标识当前日志文件中下一条待写入的消息的offset。举一个例子,若LEO=10,那么表示在该副本日志上已经保存了10条消息,位移范围是[0,9]。
HW
Highwatermark,俗称高水位,它标识了一个特定的消息偏移量(offset),消费者只能拉取到这个offset之前的消息。任何一个副本对象的HW值一定不大于其LEO值。
小于或等于HW值的所有消息被认为是“已提交的”或“已备份的”。HW它的作用主要是用来判断副本的备份进度.
下图表示一个日志文件,这个日志文件中只有9条消息,第一条消息的offset(LogStartOffset)为0,最有一条消息的offset为8,offset为9的消息使用虚线表示的,代表下一条待写入的消息。日志文件的 HW 为6,表示消费者只能拉取offset在 0 到 5 之间的消息,offset为6的消息对消费者而言是不可见的。
leader持有的HW即为分区的HW,同时leader所在broker还保存了所有follower副本的leo
(1)关系:leader的leo >= follower的leo >= leader保存的follower的leo >= leader的hw >= follower的hw
(2)原理:上面关系反应出各个值的更新逻辑的先后
Data consistency on leader failover (potential data loss)
Two scenarios worth knowing (see other blog posts for the details):
- The leader crashes and a follower that has not yet synced the latest data is elected as the new leader; the other followers then sync to the new leader's LEO and truncate anything beyond it.
- The leader crashes, and before the new leader finishes syncing, new data is written to it; two replicas can then end up with the same LEO but different contents. The leader-epoch mechanism was introduced to handle this and is worth reading up on.
Why Kafka is fast
- Sequential writes
- Zero-copy
- Segment files named by their base offset, plus sparse indexes
- Heavy use of the OS page cache instead of the JVM heap, which avoids GC pressure
File storage mechanism
三、Consumer
- For detailed parameters, see the official documentation
Core parameters
1. Common parameters
props.put("bootstrap.servers", "192.168.3.106:9092");
//props.put("client.id", "consumer-root1-2");
props.put("group.id", "root6");
//props.put("auto.offset.reset", "earliest"); //earliest如果没有历史偏移量则从头开始消费 latest没有历史偏移量则从最新一条开始消费
props.put("enable.auto.commit", "true"); //后台自动提交
props.put("auto.commit.interval.ms", "1000"); //提交频率
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
Consumer high-level API
1. Auto-commit example
//todo: Kafka consumer example (offsets committed automatically)
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaConsumerStudy {
    public static void main(String[] args) {
        //prepare configuration
        Properties props = new Properties();
        //kafka cluster addresses
        props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
        //consumer group id
        props.put("group.id", "consumer-test");
        //commit offsets automatically
        props.put("enable.auto.commit", "true");
        //auto-commit interval
        props.put("auto.commit.interval.ms", "1000");
        //default is latest
        //earliest: if a committed offset exists for the partition, resume from it; otherwise start from the beginning
        //latest: if a committed offset exists for the partition, resume from it; otherwise consume only newly produced data
        //none: resume from committed offsets only if every partition has one; otherwise throw an exception
        props.put("auto.offset.reset","earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
        //subscribe to the topics to consume
        consumer.subscribe(Arrays.asList("test"));
        while (true) {
            //keep polling for data
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                //partition this record came from
                int partition = record.partition();
                //record key
                String key = record.key();
                //record offset
                long offset = record.offset();
                //record value
                String value = record.value();
                System.out.println("partition:"+partition+"\t key:"+key+"\toffset:"+offset+"\tvalue:"+value);
            }
        }
    }
}
2. Manual-commit example
//todo: Kafka consumer example (offsets committed manually)
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaConsumerControllerOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
        props.put("group.id", "controllerOffset");
        //turn off auto-commit and commit offsets manually instead
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
        //subscribe to the topics to consume
        consumer.subscribe(Arrays.asList("test"));
        //commit offsets manually once this many records have been buffered
        final int minBatchSize = 20;
        //buffer a batch of records before processing them
        List<ConsumerRecord<String, String>> buffer = new ArrayList<ConsumerRecord<String, String>>();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                buffer.add(record);
            }
            if (buffer.size() >= minBatchSize) {
                //insertIntoDb(buffer); process the batch here
                System.out.println("records in buffer: " + buffer.size());
                System.out.println("finished processing this batch...");
                //synchronous commit; there is also an asynchronous commit (see the sketch after this example)
                consumer.commitSync();
                buffer.clear();
            }
        }
    }
}
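The comment in the example above mentions asynchronous commits. A minimal sketch of commitAsync with a callback, used in place of commitSync inside the same poll loop (error handling simplified):

//additional imports: java.util.Map, org.apache.kafka.common.TopicPartition,
//org.apache.kafka.clients.consumer.OffsetAndMetadata, org.apache.kafka.clients.consumer.OffsetCommitCallback
consumer.commitAsync(new OffsetCommitCallback() {
    @Override
    public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception exception) {
        if (exception != null) {
            //commitAsync does not retry on failure; log it and let a later commit (or a final commitSync on shutdown) cover the gap
            System.err.println("offset commit failed for " + offsets + ": " + exception);
        }
    }
});

A common pattern is commitAsync inside the loop for throughput plus a single commitSync in a finally block when the consumer shuts down.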
Consumer low-level API (official example code)
package com.bigdata.kafkademo.consumer;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.api.PartitionOffsetRequestInfo;
import kafka.cluster.BrokerEndPoint;
import kafka.common.ErrorMapping;
import kafka.common.TopicAndPartition;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndOffset;
public class SimpleExample {
private List<String> m_replicaBrokers = new ArrayList<>();
public SimpleExample() {
m_replicaBrokers = new ArrayList<>();
}
public static void main(String args[]) {
SimpleExample example = new SimpleExample();
// maximum number of messages to read
long maxReads = Long.parseLong("3");
// topic to subscribe to
String topic = "test1";
// partition to read from
int partition = Integer.parseInt("0");
// broker seed addresses
List<String> seeds = new ArrayList<>();
seeds.add("192.168.9.102");
seeds.add("192.168.9.103");
seeds.add("192.168.9.104");
// port
int port = Integer.parseInt("9092");
try {
example.run(maxReads, topic, partition, seeds, port);
} catch (Exception e) {
System.out.println("Oops:" + e);
e.printStackTrace();
}
}
public void run(long a_maxReads, String a_topic, int a_partition, List<String> a_seedBrokers, int a_port) throws Exception {
// look up the metadata for the given topic partition
PartitionMetadata metadata = findLeader(a_seedBrokers, a_port, a_topic, a_partition);
if (metadata == null) {
System.out.println("Can't find metadata for Topic and Partition. Exiting");
return;
}
if (metadata.leader() == null) {
System.out.println("Can't find Leader for Topic and Partition. Exiting");
return;
}
String leadBroker = metadata.leader().host();
String clientName = "Client_" + a_topic + "_" + a_partition;
SimpleConsumer consumer = new SimpleConsumer(leadBroker, a_port, 100000, 64 * 1024, clientName);
long readOffset = getLastOffset(consumer, a_topic, a_partition, kafka.api.OffsetRequest.EarliestTime(), clientName);
int numErrors = 0;
while (a_maxReads > 0) {
if (consumer == null) {
consumer = new SimpleConsumer(leadBroker, a_port, 100000, 64 * 1024, clientName);
}
FetchRequest req = new FetchRequestBuilder().clientId(clientName).addFetch(a_topic, a_partition, readOffset, 100000).build();
FetchResponse fetchResponse = consumer.fetch(req);
if (fetchResponse.hasError()) {
numErrors++;
// Something went wrong!
short code = fetchResponse.errorCode(a_topic, a_partition);
System.out.println("Error fetching data from the Broker:" + leadBroker + " Reason: " + code);
if (numErrors > 5)
break;
if (code == ErrorMapping.OffsetOutOfRangeCode()) {
// We asked for an invalid offset. For simple case ask for
// the last element to reset
readOffset = getLastOffset(consumer, a_topic, a_partition, kafka.api.OffsetRequest.LatestTime(), clientName);
continue;
}
consumer.close();
consumer = null;
leadBroker = findNewLeader(leadBroker, a_topic, a_partition, a_port);
continue;
}
numErrors = 0;
long numRead = 0;
for (MessageAndOffset messageAndOffset : fetchResponse.messageSet(a_topic, a_partition)) {
long currentOffset = messageAndOffset.offset();
if (currentOffset < readOffset) {
System.out.println("Found an old offset: " + currentOffset + " Expecting: " + readOffset);
continue;
}
readOffset = messageAndOffset.nextOffset();
ByteBuffer payload = messageAndOffset.message().payload();
byte[] bytes = new byte[payload.limit()];
payload.get(bytes);
System.out.println(String.valueOf(messageAndOffset.offset()) + ": " + new String(bytes, "UTF-8"));
numRead++;
a_maxReads--;
}
if (numRead == 0) {
try {
Thread.sleep(1000);
} catch (InterruptedException ie) {
}
}
}
if (consumer != null)
consumer.close();
}
public static long getLastOffset(SimpleConsumer consumer, String topic, int partition, long whichTime, String clientName) {
TopicAndPartition topicAndPartition = new TopicAndPartition(topic, partition);
Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo = new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
requestInfo.put(topicAndPartition, new PartitionOffsetRequestInfo(whichTime, 1));
kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(requestInfo, kafka.api.OffsetRequest.CurrentVersion(), clientName);
OffsetResponse response = consumer.getOffsetsBefore(request);
if (response.hasError()) {
System.out.println("Error fetching data Offset Data the Broker. Reason: " + response.errorCode(topic, partition));
return 0;
}
long[] offsets = response.offsets(topic, partition);
return offsets[0];
}
private String findNewLeader(String a_oldLeader, String a_topic, int a_partition, int a_port) throws Exception {
for (int i = 0; i < 3; i++) {
boolean goToSleep = false;
PartitionMetadata metadata = findLeader(m_replicaBrokers, a_port, a_topic, a_partition);
if (metadata == null) {
goToSleep = true;
} else if (metadata.leader() == null) {
goToSleep = true;
} else if (a_oldLeader.equalsIgnoreCase(metadata.leader().host()) && i == 0) {
// first time through if the leader hasn't changed give
// ZooKeeper a second to recover
// second time, assume the broker did recover before failover,
// or it was a non-Broker issue
//
goToSleep = true;
} else {
return metadata.leader().host();
}
if (goToSleep) {
Thread.sleep(1000);
}
}
System.out.println("Unable to find new leader after Broker failure. Exiting");
throw new Exception("Unable to find new leader after Broker failure. Exiting");
}
private PartitionMetadata findLeader(List<String> a_seedBrokers, int a_port, String a_topic, int a_partition) {
PartitionMetadata returnMetaData = null;
loop:
for (String seed : a_seedBrokers) {
SimpleConsumer consumer = null;
try {
consumer = new SimpleConsumer(seed, a_port, 100000, 64 * 1024, "leaderLookup");
List<String> topics = Collections.singletonList(a_topic);
TopicMetadataRequest req = new TopicMetadataRequest(topics);
kafka.javaapi.TopicMetadataResponse resp = consumer.send(req);
List<TopicMetadata> metaData = resp.topicsMetadata();
for (TopicMetadata item : metaData) {
for (PartitionMetadata part : item.partitionsMetadata()) {
if (part.partitionId() == a_partition) {
returnMetaData = part;
break loop;
}
}
}
} catch (Exception e) {
System.out.println("Error communicating with Broker [" + seed + "] to find Leader for [" + a_topic + ", " + a_partition + "] Reason: " + e);
} finally {
if (consumer != null)
consumer.close();
}
}
if (returnMetaData != null) {
m_replicaBrokers.clear();
for (BrokerEndPoint replica : returnMetaData.replicas()) {
m_replicaBrokers.add(replica.host());
}
}
return returnMetaData;
}
}
Consumer group rebalance
1. Three rebalance (partition assignment) strategies
Suppose the topic we consume has 12 partitions: p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11,
and the consumer group contains three consumers.
- Range strategy, RangeAssignor (the default)
The range strategy assigns contiguous ranges of partition numbers:
p0~3 consumer1
p4~7 consumer2
p8~11 consumer3
This is the default strategy.
- Round-robin strategy, RoundRobinAssignor
consumer1: 0,3,6,9
consumer2: 1,4,7,10
consumer3: 2,5,8,11
The two strategies above share a problem:
if consumer1 dies, p0-5 are assigned to consumer2 and p6-11 to consumer3,
so p6 and p7, which used to live on consumer2, get moved over to consumer3.
- Sticky strategy, StickyAssignor
The newest strategy: during a rebalance it tries to keep the partitions a consumer already owns assigned to that same consumer,
and only spreads the orphaned partitions evenly across the rest, preserving the original assignment as much as possible.
consumer1: 0-3
consumer2: 4-7
consumer3: 8-11
If consumer3 dies:
consumer1: 0-3, plus 8,9
consumer2: 4-7, plus 10,11
- Setting the assignment strategy
//configure the consumer partition assignment strategy:
properties.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
StickyAssignor.class.getName());
Multi-threaded consumption (one approach): an example using the old high-level consumer API (a sketch with the new KafkaConsumer API follows the code)
package cn.edu360.kafka;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;
public class ConsumerDemo {
private static final String topic = "xiaoniu";
private static final Integer threads = 2;
public static void main(String[] args) {
Properties props = new Properties();
props.put("zookeeper.connect", "node-1:2181,node-2:2181,node-3:2181");
props.put("group.id", "vvvvv");
//smallest: consume from the beginning of the topic; largest: only consume data produced after the consumer starts
//smallest corresponds to --from-beginning on the console consumer
props.put("auto.offset.reset", "smallest");
ConsumerConfig config = new ConsumerConfig(props);
ConsumerConnector consumer =Consumer.createJavaConsumerConnector(config);
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, threads);
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);
for(final KafkaStream<byte[], byte[]> kafkaStream : streams){
new Thread(new Runnable() {
public void run() {
for(MessageAndMetadata<byte[], byte[]> mm : kafkaStream){
String msg = new String(mm.message());
System.out.println(msg);
}
}
}).start();
}
}
}
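The example above uses the old high-level consumer API (kafka.consumer.*). With the new KafkaConsumer API the consumer instance is not thread-safe, so the usual equivalent is one consumer per thread, all in the same group; a minimal sketch (topic, group id and broker addresses are assumptions based on the example above):

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MultiThreadConsumerDemo {
    private static final String TOPIC = "xiaoniu"; //assumed topic
    private static final int THREADS = 2;          //one KafkaConsumer per thread

    public static void main(String[] args) {
        for (int i = 0; i < THREADS; i++) {
            new Thread(() -> {
                Properties props = new Properties();
                props.put("bootstrap.servers", "node-1:9092,node-2:9092,node-3:9092"); //assumed addresses
                props.put("group.id", "multi-thread-demo");
                props.put("enable.auto.commit", "true");
                props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                //KafkaConsumer is not thread-safe, so each thread owns its own instance;
                //the group coordinator balances the topic's partitions across the instances
                KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                consumer.subscribe(Arrays.asList(TOPIC));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(100);
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.println(Thread.currentThread().getName() + " -> " + record.value());
                    }
                }
            }).start();
        }
    }
}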
四、Load testing and performance tuning
五、Monitoring tool: Kafka Eagle
- Download the Kafka Eagle package
http://download.smartloli.org/
kafka-eagle-bin-1.2.3.tar.gz
- Extract it
- tar -zxvf kafka-eagle-bin-1.2.3.tar.gz -C /kkb/install
- Enter the kafka-eagle-bin-1.2.3 directory after extraction
- which contains kafka-eagle-web-1.2.3-bin.tar.gz
- Extract it as well: tar -zxvf kafka-eagle-web-1.2.3-bin.tar.gz
- Rename it: mv kafka-eagle-web-1.2.3 kafka-eagle-web
- Edit the configuration file
Go to the conf directory and edit system-config.properties
# fill in your kafka cluster information
kafka.eagle.zk.cluster.alias=cluster1
cluster1.zk.list=node01:2181,node02:2181,node03:2181
# port of the kafka eagle web UI
kafka.eagle.webui.port=8048
# kafka sasl authenticate
kafka.eagle.sasl.enable=false
kafka.eagle.sasl.protocol=SASL_PLAINTEXT
kafka.eagle.sasl.mechanism=PLAIN
kafka.eagle.sasl.client=/kkb/install/kafka-eagle-bin-1.2.3/kafka-eagle-web/conf/kafka_client_jaas.conf
# configuration for the ke database imported beforehand; MySQL is used here
kafka.eagle.driver=com.mysql.jdbc.Driver
kafka.eagle.url=jdbc:mysql://node03:3306/ke?useUnicode=true&characterEncoding=UTF-8&zeroDateTimeBehavior=convertToNull
kafka.eagle.username=root
kafka.eagle.password=123456
- Configure environment variables
vi /etc/profile
export KE_HOME=/kkb/install/kafka-eagle-bin-1.2.3/kafka-eagle-web
export PATH=$PATH:$KE_HOME/bin
- Start kafka-eagle
sh ke.sh start
- Once it starts successfully, open http://node01:8048/ke in a browser to access Kafka Eagle
username: admin
password: 123456