Spark Streaming: integrating common data sources

Overview diagram

The data sources below are all processed with the same word-count example, as illustrated in the diagram. (Diagram omitted.)

Data abstraction

Every receiver-based data source is ultimately abstracted as a ReceiverInputDStream:

/**
 * Abstract class for defining any [[org.apache.spark.streaming.dstream.InputDStream]]
 * that has to start a receiver on worker nodes to receive external data.
 * Specific implementations of ReceiverInputDStream must
 * define [[getReceiver]] function that gets the receiver object of type
 * [[org.apache.spark.streaming.receiver.Receiver]] that will be sent
 * to the workers to receive data.
 * @param _ssc Streaming context that will execute this input stream
 * @tparam T Class type of the object of this stream
 */
abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)
  extends InputDStream[T](_ssc) {
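
As the scaladoc says, a concrete subclass only has to supply getReceiver(), which hands Spark the Receiver that will be shipped to a worker. A minimal, purely illustrative sketch (not from the original post) of what such a subclass can look like, pushing a fixed string once per second:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.receiver.Receiver

// Sketch only: the whole contract of a concrete ReceiverInputDStream is getReceiver()
class FixedStringInputDStream(ssc: StreamingContext, value: String)
  extends ReceiverInputDStream[String](ssc) {

  override def getReceiver(): Receiver[String] =
    new Receiver[String](StorageLevel.MEMORY_ONLY) {
      override def onStart(): Unit = {
        // Push the same value once per second until the receiver is stopped
        new Thread("fixed-string-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              store(value)
              Thread.sleep(1000)
            }
          }
        }.start()
      }
      override def onStop(): Unit = {}
    }
}

The built-in receiver-based sources (socket, Flume, and so on) are exactly such subclasses, so the rest of this post only uses the ready-made factory methods on StreamingContext.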

Socket data source

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

// Note: the counts only cover the data of each individual batch
object socket {
  def main(args: Array[String]): Unit = {
    // Receiver (non-direct) mode needs local[n] with n >= 2: one core for the receiver, one for processing
    val conf: SparkConf = new SparkConf().setAppName("socketWD").setMaster("local[2]")
    // Optionally set the interval at which received data is grouped into blocks
      //.set("spark.streaming.blockInterval", "50")
    // Create the StreamingContext from the conf with a 2-second batch interval
    val sc = new StreamingContext(conf, Seconds(2))
    // Specify the hostname and port to read from
    val socketTextStream: ReceiverInputDStream[String] = sc.socketTextStream("node01", 9999)
    // Process the data: split into words and count them per batch
    val result: DStream[(String, Int)] = socketTextStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    result.print()
    // Start the streaming computation
    sc.start()
    sc.awaitTermination()
  }
}

Input

(Screenshot of lines typed into the socket, e.g. with nc -lk 9999 on node01, omitted.)

Result

The word counts of the two batches are not added together; each batch is counted independently. (Output screenshots omitted.)
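
If the counts should accumulate across batches instead of being reset every batch, a stateful operator such as updateStateByKey can be used together with a checkpoint directory. A minimal sketch, assuming the same node01:9999 socket source; the checkpoint path ./ckpt is just an arbitrary choice for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object socketTotal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("socketTotalWD").setMaster("local[2]")
    val sc = new StreamingContext(conf, Seconds(2))
    // Stateful operators need a checkpoint directory to persist the running state
    sc.checkpoint("./ckpt")

    val words = sc.socketTextStream("node01", 9999).flatMap(_.split(" ")).map((_, 1))

    // Add the counts of the current batch to the accumulated count of each word
    val totals: DStream[(String, Int)] = words.updateStateByKey[Int](
      (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
    )

    totals.print()
    sc.start()
    sc.awaitTermination()
  }
}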

HDFS data source

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object hdfs {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("hdfsWD").setMaster("local[2]")
    val sc = new StreamingContext(conf, Seconds(2))
    // Monitor the HDFS directory; only files newly moved into it are picked up
    val textFS: DStream[String] = sc.textFileStream("hdfs://node01:8020/data")
    val result: DStream[(String, Int)] = textFS.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()

    // Start the streaming computation
    sc.start()
    sc.awaitTermination()
  }
}

Result

(Output screenshots omitted.)
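
For reference, textFileStream is a thin convenience wrapper around the more general fileStream, which lets the key, value and input-format types be chosen explicitly. A rough equivalent of the call above, shown only as a sketch (sc is the StreamingContext from the example, and the path is the same hdfs://node01:8020/data directory):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.dstream.DStream

// Read the directory as (LongWritable, Text) records and keep only the line text
val lines: DStream[String] = sc
  .fileStream[LongWritable, Text, TextInputFormat]("hdfs://node01:8020/data")
  .map(_._2.toString)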

Flume data source

Spark Streaming is rarely integrated with Flume directly. There are two integration approaches (see the sketch after the pom snippet below):

  • Push-based: Spark Streaming acts as the sink side, and Flume pushes data to it
  • Pull-based: Flume buffers the data in its sink, and Spark Streaming pulls it from there

Prerequisites

  • At least one Spark worker must run on the machine where the Flume agent is started

pom文件

<dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-flume_2.11</artifactId>
      <version>${spark.version}</version>
      //如果上面version有问题,可以直接指定如下的版本
      // <version>2.2.0</version>
 </dependency>
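
With the dependency in place, the two approaches look roughly as follows. This is only a sketch: node01 and port 8888 are placeholders, and the Flume agent itself still has to be configured with an avro sink (push-based) or with the SparkSink that ships with the integration (pull-based).

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object flume {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flumeWD").setMaster("local[2]")
    val sc = new StreamingContext(conf, Seconds(2))

    // Push-based: Flume's avro sink pushes events to this host:port,
    // where Spark Streaming runs a receiver acting as an Avro endpoint
    val flumeStream = FlumeUtils.createStream(sc, "node01", 8888)

    // Pull-based alternative: Flume buffers events in a SparkSink and Spark pulls them
    // val flumeStream = FlumeUtils.createPollingStream(sc, "node01", 8888)

    // A Flume event body is a byte buffer; turn it into text before counting words
    val result = flumeStream
      .map(event => new String(event.event.getBody.array()))
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    result.print()
    sc.start()
    sc.awaitTermination()
  }
}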

For more details, refer to the following article:
Spark Streaming + Flume integration

Official documentation

Spark Streaming + Flume integration guide

Custom data source

A custom data source requires extending Receiver and implementing the following two methods:

  • onStart(): used to start receiving data
// Typically start a thread that receives the data; the receiving and
// processing logic is implemented in a receive() method
 def onStart() {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      override def run() { 
      	receive() 
      }
    }.start()
  }
  • onStop(): used to stop receiving data
def onStop() {
    // There is nothing much to do, as the thread calling receive()
    // is designed to stop by itself once isStopped() returns true
  }

Example: a custom receiver that reads data sent over a socket

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.SparkConf
import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.receiver.Receiver

// Custom receiver that produces records of type String
object custom {
  def main(args: Array[String]): Unit = {
    // Receiver (non-direct) mode needs local[n] with n >= 2
    val conf: SparkConf = new SparkConf().setAppName("customerWD").setMaster("local[2]")
    // Create the StreamingContext with a 2-second batch interval
    val sc = new StreamingContext(conf, Seconds(2))
    // Register the custom receiver with the hostname and port to read from
    val socketTextStream: ReceiverInputDStream[String] = sc.receiverStream(new customReceiver("node01", 8888))
    // Process the data
    val result: DStream[(String, Int)] = socketTextStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    result.print()
    // Start the streaming computation
    sc.start()
    sc.awaitTermination()
  }

  class customReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {

    override def onStart(): Unit = {
      // Start the thread that receives data over a connection
      new Thread("Socket Receiver") {
        override def run() {
          receive()
        }
      }.start()
    }

    override def onStop(): Unit = {
      // There is nothing much to do, as the thread calling receive()
      // is designed to stop by itself once isStopped() returns true
    }

    /** Create a socket connection and receive data until the receiver is stopped */
    private def receive() {
      var socket: Socket = null
      var userInput: String = null
      try {
        // Connect to host:port
        logInfo("Connecting to " + host + ":" + port)
        socket = new Socket(host, port)
        logInfo("Connected to " + host + ":" + port)
        val reader = new BufferedReader(new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
        // Read line by line until the receiver is stopped or the stream ends
        userInput = reader.readLine()
        while (!isStopped() && userInput != null) {
          // store() hands the record over to Spark, which buffers it using the receiver's storage level
          store(userInput)
          userInput = reader.readLine()
        }
        reader.close()
        socket.close()
        logInfo("reader close")
        logInfo("socket close")
        // Restart in an attempt to connect again when the server is active again
        restart("trying re-connect")
      } catch {
        case e: java.net.ConnectException =>
          // Restart if we could not connect to the server
          restart("Error connecting to " + host + ":" + port, e)
        case t: Throwable =>
          // Restart if there is any other error
          restart("Error receiving data", t)
      }
    }
  }
}

Official documentation

Kafka data source

See the next blog post.
