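Flume drawbacks:
1. Data is held in memory (when a memory channel is used), so it is easily lost.
2. Adding consumers is not easy.
3. Data cannot be retained for long. e.g. the chat-history retention feature in WeChat (a retention policy)
Drawbacks
Data cannot be kept long term, so a client that wants to fetch it later cannot do so.
Adding consumers: OCP (a development principle) [Open Closed Principle, a fundamental design principle in the Java world; put simply, open for extension, closed for modification]
Repeated consumption
----------- Memory visibility
Data visibility
Consumption records (history)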
Kafka
Message queue
@Subscribe (subscribe)
eventBus.post(data)
PUB/SUB
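- Point-to-point model (one-to-one; the consumer actively pulls data, and a message is removed once it has been received)
The point-to-point model is usually a pull-based or polling-based messaging model: the client requests messages from a queue instead of having messages pushed to it. Its defining feature is that a message sent to the queue is received and processed by one and only one receiver, even when several listeners are attached.
- Publish/subscribe model (one-to-many)
The publish/subscribe model is a push-based messaging model. It can have several kinds of subscribers: a temporary subscriber receives messages only while it is actively listening on the topic, whereas a durable subscriber receives all messages on the topic, even those published while it was unavailable or offline.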
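The difference between the two models can be illustrated with a small in-memory sketch. It is not how Kafka or any real broker is implemented; the class names `PointToPointQueue` and `Topic` and the message strings are made up for illustration.

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to exactly one receiver, which pulls it."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def pull(self):
        # The message is removed as soon as one receiver takes it.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Every subscriber registered at publish time gets its own copy."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


queue = PointToPointQueue()
queue.send("order-1")
first_pull = queue.pull()    # one receiver gets the message
second_pull = queue.pull()   # nothing is left for a second receiver

inbox_a, inbox_b = [], []
topic = Topic()
topic.subscribe(inbox_a.append)
topic.subscribe(inbox_b.append)
topic.publish("price-update")  # both inboxes receive a copy
```

The queue hands a message to whoever pulls first, while the topic fans each message out to all current subscribers.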
Kafka high throughput
Kafka sequential log writes
Kafka zero-copy
Kafka segmented logs
Kafka read-ahead and write-behind
Kafka partition directory naming rule: topic + "-" + partition number (e.g. first-0)
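Kafka acknowledgement mechanism (acks) when producing data
- acks=0: the producer sends the data and moves straight on to the next record without caring whether it reached Kafka. This is very efficient, but the risk of data loss is very high.
- acks=1: the producer waits for the leader's acknowledgement before sending the next record, and does not care whether the followers received it. This is somewhat slower but safer; however, if the leader goes down right after storing the data and before the followers have fetched it, the data is lost. The default in this Kafka version is 1, and it can be changed.
- acks=-1 (all): the producer waits for the acknowledgement of all in-sync replicas [leader + followers]. This is the safest setting, but performance is the worst.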
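The three settings can be summarized as "how many replicas must confirm before the producer considers the send complete". The sketch below only models that rule; `acks_to_wait_for` and `replicated_before_ack` are hypothetical helper names, not part of any Kafka client API.

```python
def acks_to_wait_for(acks, in_sync_replicas):
    """Return how many replica confirmations the producer waits for.

    acks is 0, 1, -1 or "all"; in_sync_replicas counts the leader plus
    the followers that are currently caught up.
    """
    if acks == 0:
        return 0                  # fire and forget
    if acks == 1:
        return 1                  # leader only
    if acks in (-1, "all"):
        return in_sync_replicas   # leader and every in-sync follower
    raise ValueError(f"unsupported acks value: {acks!r}")


def replicated_before_ack(acks, in_sync_replicas):
    """True if at least one follower holds the record when the ack is sent."""
    return acks_to_wait_for(acks, in_sync_replicas) >= 2
```

With three in-sync replicas, `replicated_before_ack(1, 3)` is false, which is exactly the window in which a leader crash loses data; with `"all"` it is true.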
Kafka command-line operations
- List all topics on the current server
[pp@hadoop101 kafka]$ bin/kafka-topics.sh --zookeeper hadoop101:2181 --list
__consumer_offsets
calllog
first
- Create a topic
[pp@hadoop101 kafka]$ bin/kafka-topics.sh --zookeeper hadoop101:2181 --create --replication-factor 3 --partitions 1 --topic testpp
Created topic "testpp".
- Delete a topic
[pp@hadoop101 kafka]$ bin/kafka-topics.sh --zookeeper hadoop101:2181 --delete --topic testpp
Topic testpp is marked for deletion.
Note: This will have no impact if delete.topic.enable is not set to true.
delete.topic.enable=true must be set in server.properties; otherwise the topic is only marked for deletion. Alternatively, restart directly.
- Send messages
[pp@hadoop101 kafka]$ bin/kafka-console-producer.sh --broker-list hadoop101:9092 --topic test
>hello wortld
- Consume messages
[pp@hadoop101 kafka]$ bin/kafka-console-consumer.sh --zookeeper hadoop101:2181 --from-beginning --topic test
Using the ConsoleConsumer with old consumer is deprecated and will be removed in a future major release. Consider using the new consumer by passing [bootstrap-server] instead of [zookeeper].
hello wortld
pp
--from-beginning: reads out all the data that already exists in the topic.
[pp@hadoop101 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop101:9092 --from-beginning --topic test
hello wortld
pp
pp
- Show the details of a topic
[atguigu@hadoop101 kafka]$ bin/kafka-topics.sh --zookeeper hadoop101:2181 --describe --topic test
Topic:test PartitionCount:1 ReplicationFactor:3 Configs:
Topic: test Partition: 0 Leader: 0 Replicas: 0,1,2 Isr: 0,1,
ISR: in-sync replicas, i.e. the replicas currently keeping up with the leader (very important)
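Double-ended queue (Deque)
Queue
Message format, source code: Message.scala
HW: High Watermark (think of a water level)
LEO: Log End Offset
Bucket theory (a bucket holds only as much water as its shortest stave allows)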
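HW, LEO and the bucket theory fit together: the high watermark is the "shortest stave", i.e. the smallest LEO among the in-sync replicas, and consumers can only read offsets below it. A minimal sketch of that rule follows; the broker names and offsets are made-up sample values.

```python
def high_watermark(log_end_offsets):
    """HW is the smallest LEO among the in-sync replicas."""
    return min(log_end_offsets.values())


# LEO = offset of the next message each replica would write.
leos = {"broker-0": 10, "broker-1": 8, "broker-2": 9}

hw = high_watermark(leos)
# Only offsets strictly below the HW are visible to consumers.
visible_offsets = list(range(hw))
```

Here the slowest replica has LEO 8, so offsets 8 and 9 on the leader stay invisible until that replica catches up.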
/**
 * A message. The format of an N byte message is the following:
 *
 * 1. 4 byte CRC32 of the message
 * 2. 1 byte "magic" identifier to allow format changes, value is 0 or 1
 * 3. 1 byte "attributes" identifier to allow annotations on the message independent of the version
 *    bit 0 ~ 2 : Compression codec.
 *      0 : no compression
 *      1 : gzip
 *      2 : snappy
 *      3 : lz4
 *    bit 3 : Timestamp type
 *      0 : create time
 *      1 : log append time
 *    bit 4 ~ 7 : reserved
 * 4. (Optional) 8 byte timestamp only if "magic" identifier is greater than 0
 * 5. 4 byte key length, containing length K
 * 6. K byte key
 * 7. 4 byte payload length, containing length V
 * 8. V byte payload
 *
 * Default constructor wraps an existing ByteBuffer with the Message object with no change to the contents.
 * @param buffer the byte buffer of this message.
 * @param wrapperMessageTimestamp the wrapper message timestamp, which is only defined when the message is an inner
 *                                message of a compressed message.
 * @param wrapperMessageTimestampType the wrapper message timestamp type, which is only defined when the message is an
 *                                    inner message of a compressed message.
 */
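The byte layout listed in the comment above can be tried out with Python's `struct` module. This is a sketch under assumptions, not Kafka's own code: it assumes big-endian fields, a CRC computed over every byte after the CRC field, and non-null keys and payloads. `encode_message` and `decode_message` are hypothetical helper names.

```python
import struct
import zlib

COMPRESSION_CODECS = {0: "none", 1: "gzip", 2: "snappy", 3: "lz4"}
TIMESTAMP_TYPES = {0: "create time", 1: "log append time"}


def encode_message(key, payload, timestamp=0, magic=1, attributes=0):
    """Build one message following fields 1-8 of the layout."""
    body = struct.pack(">bb", magic, attributes)           # fields 2 and 3
    if magic > 0:
        body += struct.pack(">q", timestamp)               # field 4
    body += struct.pack(">i", len(key)) + key              # fields 5 and 6
    body += struct.pack(">i", len(payload)) + payload      # fields 7 and 8
    # Assumption: the CRC covers everything that follows the CRC field.
    return struct.pack(">I", zlib.crc32(body)) + body      # field 1


def decode_message(buf):
    """Parse a buffer that holds exactly one message."""
    crc, magic, attributes = struct.unpack_from(">Ibb", buf, 0)
    pos = 6
    timestamp = None
    if magic > 0:
        (timestamp,) = struct.unpack_from(">q", buf, pos)
        pos += 8
    (key_len,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    key = buf[pos:pos + key_len]
    pos += key_len
    (payload_len,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    payload = buf[pos:pos + payload_len]
    return {
        "crc_ok": crc == zlib.crc32(buf[4:]),
        "magic": magic,
        "codec": COMPRESSION_CODECS[attributes & 0x07],          # bits 0-2
        "timestamp_type": TIMESTAMP_TYPES[(attributes >> 3) & 1],  # bit 3
        "timestamp": timestamp,
        "key": key,
        "payload": payload,
    }


raw = encode_message(b"k1", b"hello", timestamp=1551829594043,
                     attributes=0b1001)
decoded = decode_message(raw)
```

The attributes byte `0b1001` sets codec 1 (gzip) in bits 0 to 2 and timestamp type 1 (log append time) in bit 3; the payload itself is not actually compressed in this sketch.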
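Producing data
- Write path
The producer uses push mode to publish messages to the broker. Each message is appended to a partition, which is a sequential disk write (sequential disk writes can be faster than random memory writes, which underpins Kafka's throughput).
- Partition
Every message is sent to a topic, which is essentially a directory; a topic in turn consists of a number of partition logs.
The messages in each partition are ordered. Produced messages are continually appended to the partition log, and each message is assigned an offset that is unique within the partition.
- Why partition
  - It makes scaling across the cluster easy: each partition can be sized to fit the machine it lives on, and a topic can consist of multiple partitions, so the cluster as a whole can accommodate data of any size.
  - It improves concurrency, because reads and writes can be done per partition.
- Partitioning rules
  - If a partition is specified, it is used directly;
  - If no partition is specified but a key is, a partition is derived by hashing the key;
  - If neither partition nor key is specified, a partition is chosen by round-robin.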
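The three partitioning rules can be sketched as a small class. This is an illustration only: `SimplePartitioner` is a hypothetical name, and `zlib.crc32` stands in for whatever hash function the real producer uses.

```python
import itertools
import zlib


class SimplePartitioner:
    """Pick a partition following the three rules above."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()   # state for round-robin

    def choose(self, partition=None, key=None):
        if partition is not None:
            return partition                            # rule 1: explicit
        if key is not None:
            # rule 2: hash the key (crc32 is a stand-in hash here)
            return zlib.crc32(key) % self.num_partitions
        # rule 3: rotate through the partitions
        return next(self._counter) % self.num_partitions


partitioner = SimplePartitioner(num_partitions=3)
explicit = partitioner.choose(partition=2, key=b"ignored")
by_key_first = partitioner.choose(key=b"user-42")
by_key_again = partitioner.choose(key=b"user-42")
rotated = [partitioner.choose() for _ in range(4)]
```

Because the key hash is deterministic, records with the same key always land in the same partition, which is what preserves per-key ordering.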
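- Replication
A partition may have multiple replicas (corresponding to default.replication.factor=N in server.properties). Without replication, once a broker goes down, none of the data in the partitions it hosts can be consumed, and producers can no longer write to those partitions either. With replication, a leader has to be elected among the replicas of a partition; producers and consumers interact only with this leader, while the other replicas act as followers and copy data from the leader.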
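The effect of replication on availability can be shown with a toy model. It assumes the first surviving replica in the list takes over as leader, which is a simplification for illustration and not Kafka's actual election procedure; `PartitionReplicas` is a hypothetical name.

```python
class PartitionReplicas:
    """Toy model of one partition's replicas and its current leader."""

    def __init__(self, replicas):
        self.replicas = list(replicas)   # broker ids, e.g. [0, 1, 2]
        self.alive = set(replicas)
        self.leader = self.replicas[0]

    def broker_down(self, broker_id):
        self.alive.discard(broker_id)
        if broker_id == self.leader:
            survivors = [b for b in self.replicas if b in self.alive]
            # Assumption: the first surviving replica becomes the leader.
            self.leader = survivors[0] if survivors else None

    def can_serve(self):
        # Producers and consumers only ever talk to the leader.
        return self.leader is not None


replicated = PartitionReplicas([0, 1, 2])
replicated.broker_down(0)

unreplicated = PartitionReplicas([0])
unreplicated.broker_down(0)
```

After broker 0 fails, the replicated partition keeps serving from a new leader, while the unreplicated one becomes unavailable.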
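- Consumer group
Within one consumer group, multiple consumers cannot consume the same partition, but one consumer can consume multiple partitions.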
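The rule for consumer groups can be sketched as an assignment function in which every partition gets exactly one owner. The round-robin scheme below is a simplified stand-in for Kafka's real assignment strategies, and `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer of the group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


two_consumers = assign_partitions([0, 1, 2], ["c1", "c2"])
four_consumers = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"])
```

With two consumers one of them owns two partitions; with four consumers and only three partitions, one consumer is left idle, since a partition is never shared inside a group.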
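Kafka's metadata is stored in ZooKeeper.
Metadata is data about data; e.g. if a record "Zhang San" is stored, its metadata is the field information, field type and field length.
Next, go into ZooKeeper and take a look: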
[atguigu@hadoop101 bin]$ ./zkCli.sh
Connecting to localhost:2181
2019-03-06 12:56:32,660 [myid:] - INFO [main:Environment@100] - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2019-03-06 12:56:32,665 [myid:] - INFO [main:Environment@100] - Client environment:host.name=hadoop101
2019-03-06 12:56:32,665 [myid:] - INFO [main:Environment@100] - Client environment:java.version=1.8.0_144
2019-03-06 12:56:32,667 [myid:] - INFO [main:Environment@100] - Client environment:java.vendor=Oracle Corporation
2019-03-06 12:56:32,667 [myid:] - INFO [main:Environment@100] - Client environment:java.home=/opt/module/jdk1.8.0_144/jre
2019-03-06 12:56:32,668 [myid:] - INFO [main:Environment@100] - Client environment:java.class.path=/opt/module/zookeeper-3.4.10/bin/../build/classes:/opt/module/zookeeper-3.4.10/bin/../build/lib/*.jar:/opt/module/zookeeper-3.4.10/bin/../lib/slf4j-log4j12-1.6.1.jar:/opt/module/zookeeper-3.4.10/bin/../lib/slf4j-api-1.6.1.jar:/opt/module/zookeeper-3.4.10/bin/../lib/netty-3.10.5.Final.jar:/opt/module/zookeeper-3.4.10/bin/../lib/log4j-1.2.16.jar:/opt/module/zookeeper-3.4.10/bin/../lib/jline-0.9.94.jar:/opt/module/zookeeper-3.4.10/bin/../zookeeper-3.4.10.jar:/opt/module/zookeeper-3.4.10/bin/../src/java/lib/*.jar:/opt/module/zookeeper-3.4.10/bin/../conf:
2019-03-06 12:56:32,668 [myid:] - INFO [main:Environment@100] - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2019-03-06 12:56:32,668 [myid:] - INFO [main:Environment@100] - Client environment:java.io.tmpdir=/tmp
2019-03-06 12:56:32,668 [myid:] - INFO [main:Environment@100] - Client environment:java.compiler=<NA>
2019-03-06 12:56:32,668 [myid:] - INFO [main:Environment@100] - Client environment:os.name=Linux
2019-03-06 12:56:32,668 [myid:] - INFO [main:Environment@100] - Client environment:os.arch=amd64
2019-03-06 12:56:32,668 [myid:] - INFO [main:Environment@100] - Client environment:os.version=2.6.32-642.el6.x86_64
2019-03-06 12:56:32,668 [myid:] - INFO [main:Environment@100] - Client environment:user.name=atguigu
2019-03-06 12:56:32,668 [myid:] - INFO [main:Environment@100] - Client environment:user.home=/home/atguigu
2019-03-06 12:56:32,668 [myid:] - INFO [main:Environment@100] - Client environment:user.dir=/opt/module/zookeeper-3.4.10/bin
2019-03-06 12:56:32,669 [myid:] - INFO [main:ZooKeeper@438] - Initiating client connection, connectString=localhost:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@5c29bfd
Welcome to ZooKeeper!
2019-03-06 12:56:32,708 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@1032] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
JLine support is enabled
2019-03-06 12:56:32,781 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@876] - Socket connection established to localhost/127.0.0.1:2181, initiating session
2019-03-06 12:56:32,798 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@1299] - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x169504059320000, negotiated timeout = 30000
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2181(CONNECTED) 0]
Information stored under the ZooKeeper root:
[zk: localhost:2181(CONNECTED) 0] ls /
[cluster, controller, brokers, zookeeper, yarn-leader-election, hadoop-ha, admin, isr_change_notification, controller_epoch, spark, kafka-manager, rmstore, consumers, latest_producer_id_block, config, hbase, kylin]
Look inside cluster:
[zk: localhost:2181(CONNECTED) 4] ls /cluster
[id]
[zk: localhost:2181(CONNECTED) 5] ls /cluster/id
[]
[zk: localhost:2181(CONNECTED) 6]
A closer look:
[zk: localhost:2181(CONNECTED) 6] get /cluster/id
{"version":"1","id":"TpHZn56WSEKmjTnJyizGhg"}  (this is the cluster ID)
cZxid = 0x200000032
ctime = Wed Dec 12 16:46:24 CST 2018
mZxid = 0x200000032
mtime = Wed Dec 12 16:46:24 CST 2018
pZxid = 0x200000032
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 45
numChildren = 0
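controller: the controller; rebalancing and partition handling both depend on it.
[zk: localhost:2181(CONNECTED) 10] ls /controller
[]
[zk: localhost:2181(CONNECTED) 11] get /controller
{"version":1,"brokerid":1,"timestamp":"1551829594043"}  (shows which machine is currently the controller)
cZxid = 0x320000000f  (much like the master in Redis: the other machines follow its lead)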
ctime = Wed Mar 06 07:46:34 CST 2019
mZxid = 0x320000000f
mtime = Wed Mar 06 07:46:34 CST 2019
pZxid = 0x320000000f
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x269487804340006
dataLength = 54
numChildren = 0
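The value of the /controller znode is plain JSON, so it can be read programmatically. The sketch below parses the string copied from the output above and assumes that `timestamp` is in epoch milliseconds, which is consistent with the mtime shown (07:46:34 CST is 23:46:34 UTC on the previous day).

```python
import json
from datetime import datetime, timedelta, timezone

# Value copied from the `get /controller` output above.
controller_znode = '{"version":1,"brokerid":1,"timestamp":"1551829594043"}'

info = json.loads(controller_znode)
millis = int(info["timestamp"])   # assumption: epoch milliseconds

# Split seconds and milliseconds to avoid floating-point rounding.
elected_at = (datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
              + timedelta(milliseconds=millis % 1000))

print("controller broker id:", info["brokerid"])
print("elected at (UTC):", elected_at.isoformat())
```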
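Next, controller_epoch: it is incremented by one every time a controller election takes place.
[zk: localhost:2181(CONNECTED) 15] get /controller_epoch
44  (this value gives an indication of how stable the cluster currently is)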
cZxid = 0x200000037
ctime = Wed Dec 12 16:46:25 CST 2018
mZxid = 0x3200000010
mtime = Wed Mar 06 07:46:34 CST 2019
pZxid = 0x200000037
cversion = 0
dataVersion = 43
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 2
numChildren = 0
A cluster contains multiple servers, each of which is called a broker. All of this is metadata.
[zk: localhost:2181(CONNECTED) 17] ls /brokers
[ids, topics, seqid]
[zk: localhost:2181(CONNECTED) 18] ls /brokers/ids
[0, 1]
[zk: localhost:2181(CONNECTED) 20] get /brokers/ids/0
{"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT"},"endpoints":["PLAINTEXT://hadoop101:9092"],"jmx_port":-1,"host":"hadoop101","timestamp":"1551743231039","port":9092,"version":4}
cZxid = 0x300000504d
ctime = Tue Mar 05 07:47:11 CST 2019
mZxid = 0x300000504d
mtime = Tue Mar 05 07:47:11 CST 2019
pZxid = 0x300000504d
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x1694878042d13a5
dataLength = 188
numChildren = 0
[zk: localhost:2181(CONNECTED) 21] get /brokers/ids/1
{"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT"},"endpoints":["PLAINTEXT://hadoop102:9092"],"jmx_port":-1,"host":"hadoop102","timestamp":"1551743230027","port":9092,"version":4}
cZxid = 0x3000005030
ctime = Tue Mar 05 07:47:10 CST 2019
mZxid = 0x3000005030
mtime = Tue Mar 05 07:47:10 CST 2019
pZxid = 0x3000005030
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x269487804340006
dataLength = 188
numChildren = 0
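From version 0.9 on, the cluster still works with ZooKeeper, but to improve efficiency consumer offsets are no longer kept in ZooKeeper (the new consumer commits them to the internal __consumer_offsets topic).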
[zk: localhost:2181(CONNECTED) 23] ls /brokers/topics
[GMALL_STARTUP, test, calllog, ads_log, GMALL_ORDER, first, __consumer_offsets, GMALL_EVENT]
[zk: localhost:2181(CONNECTED) 24]
Going one level deeper:
[zk: localhost:2181(CONNECTED) 26] ls /brokers/topics/__consumer_offsets/partitions
[44, 45, 46, 47, 48, 49, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]
[zk: localhost:2181(CONNECTED) 27] ls /brokers/topics/first/partitions
[0, 1, 2]
This shows that all of the metadata is available in ZooKeeper.
Now look at consumers:
[zk: localhost:2181(CONNECTED) 30] ls /consumers
[console-consumer-34712, console-consumer-75673, spark, console-consumer-53820, console-consumer-74373, test-consumer-group]
[zk: localhost:2181(CONNECTED) 31] get /consumers
null
cZxid = 0x30000000c
ctime = Wed Dec 12 17:10:03 CST 2018
mZxid = 0x30000000c
mtime = Wed Dec 12 17:10:03 CST 2018
pZxid = 0x2d00000065
cversion = 22
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 6
Rebalancing
Consumption information in Kafka
Maintain offsets manually: save the offsets in a Redis set, which provides deduplication.
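The deduplication idea can be sketched without a running Redis server by using a Python set as a stand-in. With a real Redis set, the same check would typically rely on SADD, which reports whether the member was newly added. `OffsetStore`, `process` and the sample records are hypothetical.

```python
class OffsetStore:
    """In-memory stand-in for a Redis set of already-processed offsets."""

    def __init__(self):
        self._seen = set()

    def mark_if_new(self, topic, partition, offset):
        member = f"{topic}:{partition}:{offset}"
        if member in self._seen:
            return False          # already processed, skip it
        self._seen.add(member)
        return True


def process(records, store, handle):
    """Call handle(value) once per (topic, partition, offset)."""
    for topic, partition, offset, value in records:
        if store.mark_if_new(topic, partition, offset):
            handle(value)


records = [
    ("first", 0, 0, "a"),
    ("first", 0, 1, "b"),
    ("first", 0, 1, "b"),   # redelivered after a restart
    ("first", 1, 0, "c"),
]
handled = []
process(records, OffsetStore(), handled.append)
```

The redelivered record at partition 0, offset 1 is recognized and handled only once.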
Spark and Storm are both stream-processing frameworks, whereas Kafka Streams provides a stream-processing library built on top of Kafka.
Flume --> Kafka --> Flume