Integrating Spark Streaming with Kafka

1.Direct DStream (No Receivers)

This new receiver-less "direct" approach was introduced in Spark 1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka's simple consumer API is used to read the defined ranges of offsets from Kafka (similar to reading files from a file system). Note that this feature was introduced in Spark 1.3 for the Scala and Java APIs, and in Spark 1.4 for the Python API.
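
To make "the offset ranges to process in each batch" concrete, here is a minimal sketch (not part of the original post; the broker address and topic name follow the setup used later in this article) that prints, for every batch, the exact range the direct approach decided to read, using the 0-8 integration's HasOffsetRanges:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OffsetRangePeek {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("OffsetRangePeek").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val kafkaParams = Map[String, String]("metadata.broker.list" -> "hadoop000:9092")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("tp_kafka_streaming_topic"))

    // Every batch RDD produced by the direct stream knows exactly which offsets it covers.
    messages.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        println(s"${o.topic} partition ${o.partition}: offsets ${o.fromOffset} to ${o.untilOffset}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}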

1.1.Kafka

Start ZooKeeper:

./zkServer.sh start

Start Kafka:

./kafka-server-start.sh -daemon /home/hadoop/app/kafka_2.11-0.9.0.0/config/server.properties

Create a Kafka topic:

./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic tp_kafka_streaming_topic

Create a producer and a consumer to test connectivity:

kafka-console-producer.sh --broker-list hadoop000:9092 --topic tp_kafka_streaming_topic
kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic tp_kafka_streaming_topic --from-beginning 

If data flows from the producer to the consumer, the setup works.

1.2.Spark Streaming

Add the dependency to the pom:

   <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>

Watch out for Kafka version conflicts: a conflicting Kafka dependency will cause errors, so comment it out:

    <!--
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.11</artifactId>
      <version>0.9.0.0</version>
    </dependency>
    -->

Write the program:

For reference, these are the parameters of the createDirectStream overload that starts from explicit offsets (the program below uses the simpler overload that takes a set of topics):

ssc: StreamingContext,
kafkaParams: Map[String, String],
fromOffsets: Map[TopicAndPartition, Long],
messageHandler: MessageAndMetadata[K, V] => R

The complete program:
package com.taipark.spark

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Spark Streaming integrated with Kafka, approach 2: direct stream (no receivers)
  */
object KafkaDirectorWordCount {
  def main(args: Array[String]): Unit = {
    if(args.length != 2){
      System.err.println("Usage: KafkaDirectorWordCount <brokers> <topics>")
      System.exit(1)
    }
    val Array(brokers,topics) = args

    val sparkConf = new SparkConf().setAppName("KafkaDirectorWordCount")
      .setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf,Seconds(5))

    val kafkaParams = Map[String,String]("metadata.broker.list"-> brokers)
    val topicSet = topics.split(",").toSet
    val messages = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder](
      ssc,kafkaParams,topicSet
    )
    //the second element of the tuple is the message value
    messages.map(_._2).flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).print()

    ssc.start()
    ssc.awaitTermination()
  }

}
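
For completeness, a sketch of how the overload listed earlier (with fromOffsets and messageHandler) could be called. The starting offset below is hypothetical, and ssc / kafkaParams are reused from the program above; in practice the offsets would come from wherever you persisted them:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical starting point: partition 0 of the topic, beginning at offset 0.
val fromOffsets = Map(TopicAndPartition("tp_kafka_streaming_topic", 0) -> 0L)

// Decide what each record becomes; here, the same (key, value) pair the simpler overload yields.
val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

val messagesFromOffsets = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)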

Test locally, passing the broker list and topic as program arguments:

Enter some data on the producer side:

 OK

Do not hard-code these settings: comment out setAppName and setMaster in the code, then package the jar and upload it to the server.
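
After this change, the relevant lines of the program above reduce to the following (the same pattern the receiver example in section 2.2 uses):

    // appName and master are now supplied by spark-submit via --name and --master
    val sparkConf = new SparkConf()
    val ssc = new StreamingContext(sparkConf, Seconds(5))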

Run it with spark-submit:

spark-submit \
--class com.taipark.spark.KafkaDirectorWordCount \
--master local[2] \
--name KafkaDirectorWordCount \
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 \
/home/hadoop/tplib/sparktrain-1.0.jar \
hadoop000:9092 tp_kafka_streaming_topic

If you hit errors, see the troubleshooting reference: here

Check the Spark UI at hadoop000:4040:

OK~ 

2.Receiver DStream (deprecated after 0.10.0)

This approach uses a Receiver to receive the data. The Receiver is implemented using the Kafka high-level consumer API. As with all receivers, the data received from Kafka through a Receiver is stored in Spark executors, and then jobs launched by Spark Streaming process the data.

However, under the default configuration, this approach can lose data under failures (see receiver reliability). To ensure zero data loss, you have to additionally enable Write-Ahead Logs in Spark Streaming (introduced in Spark 1.2). This synchronously saves all the received Kafka data into write-ahead logs on a distributed file system (e.g. HDFS), so that all the data can be recovered on failure. See the Deploying section in the streaming programming guide for more details on Write-Ahead Logs.
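
A minimal sketch of the settings described above (the checkpoint path and NameNode port are illustrative, not from the original post): enable the receiver write-ahead log, give the context a checkpoint directory on HDFS, and use a serialized, non-replicated storage level for the receiver. These lines would be dropped into a program like the one in section 2.2 below:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("KafkaReceiverWithWAL")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // turn on the WAL
val ssc = new StreamingContext(sparkConf, Seconds(5))
ssc.checkpoint("hdfs://hadoop000:8020/spark/checkpoint")  // illustrative HDFS path

val topicMap = Map("tp_kafka_streaming_topic" -> 1)
val messages = KafkaUtils.createStream(
  ssc, "hadoop000:2181", "test", topicMap,
  StorageLevel.MEMORY_AND_DISK_SER)  // the WAL already persists the data, so no second in-memory replica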

2.1.Kafka

Start ZooKeeper:

./zkServer.sh start

Start Kafka:

./kafka-server-start.sh -daemon /home/hadoop/app/kafka_2.11-0.9.0.0/config/server.properties

2.2.Spark Streaming

 

Write the program:

package com.taipark.spark

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Spark Streaming integrated with Kafka, approach 1: receiver-based
  */
object KafkaReceiverWordCount {
  def main(args: Array[String]): Unit = {
    if(args.length != 4){
      System.err.println("Usage: KafkaReceiverWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum,group,topics,numThreads) = args

    val sparkConf = new SparkConf()//.setAppName("KafkaReceiverWordCount")
      //.setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf,Seconds(5))

    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap  // map each topic to its receiver thread count
    val messages = KafkaUtils.createStream(ssc,zkQuorum,group,topicMap)
    //the second element of the tuple is the message value
    messages.map(_._2).flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).print()

    ssc.start()
    ssc.awaitTermination()
  }

}

Package with Maven and upload the jar to the server.

Run it with spark-submit:

spark-submit \
--class com.taipark.spark.KafkaReceiverWordCount \
--master local[2] \
--name KafkaReceiverWordCount \
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 \
/home/hadoop/tplib/sparktrain-1.0.jar \
hadoop000:2181 test tp_kafka_streaming_topic 1

 
