1. Direct DStream (No Receivers)
This new receiver-less "direct" approach was introduced in Spark 1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka's simple consumer API is used to read the defined ranges of offsets from Kafka (similar to reading files from a file system). Note that this feature was introduced in Spark 1.3 for the Scala and Java APIs, and in Spark 1.4 for the Python API.
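A minimal Scala sketch of the direct approach is shown below. It assumes the spark-streaming-kafka (0.8 connector) dependency is on the classpath, and reuses the broker hadoop000:9092 and the topic kafka_streaming_topic that are set up in the next section; the application name, batch interval, and word-count logic are just illustrative choices:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectKafkaWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // No receiver: the direct stream talks to the brokers itself,
    // so it takes a broker list rather than a Zookeeper address.
    val kafkaParams = Map("metadata.broker.list" -> "hadoop000:9092")
    val topics = Set("kafka_streaming_topic")

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Each record is a (key, value) pair; do a word count on the values.
    messages.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}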
1.1. Kafka
Start Zookeeper:
./zkServer.sh start
Start the Kafka broker:
./kafka-server-start.sh -daemon /home/hadoop/app/kafka_2.11-0.9.0.0/config/server.properties
Create a Kafka topic:
./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic kafka_streaming_topic
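To verify the topic was created, you can list the topics registered in Zookeeper:

./kafka-topics.sh --list --zookeeper localhost:2181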
Create a producer and a consumer to test connectivity (both must use the topic created above, kafka_streaming_topic):

./kafka-console-producer.sh --broker-list hadoop000:9092 --topic kafka_streaming_topic
./kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic kafka_streaming_topic
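Messages typed into the producer console should appear in the consumer console. Once that works, the direct stream sketch above can be run against the same topic with spark-submit; here the connector version (which should match your Spark build) and the jar path are placeholders:

spark-submit --master local[2] \
  --packages org.apache.spark:spark-streaming-kafka_2.11:1.6.3 \
  --class DirectKafkaWordCount \
  target/your-app.jar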