Overview
Each of the data sources below is demonstrated with a word count job, as shown in the figure below:
Data abstraction
Whatever the data source, it is ultimately abstracted as a ReceiverInputDStream:
/**
 * Abstract class for defining any [[org.apache.spark.streaming.dstream.InputDStream]]
 * that has to start a receiver on worker nodes to receive external data.
 * Specific implementations of ReceiverInputDStream must
 * define [[getReceiver]] function that gets the receiver object of type
 * [[org.apache.spark.streaming.receiver.Receiver]] that will be sent
 * to the workers to receive data.
 * @param _ssc Streaming context that will execute this input stream
 * @tparam T Class type of the object of this stream
 */
abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)
  extends InputDStream[T](_ssc) {
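As the scaladoc says, a concrete implementation only has to supply getReceiver(). A minimal sketch (the class name MySocketInputDStream is hypothetical; it reuses the customReceiver defined near the end of this post):

// Hypothetical subclass: the only required piece is getReceiver(),
// which returns the Receiver that Spark ships to a worker node.
class MySocketInputDStream(ssc: StreamingContext, host: String, port: Int)
  extends ReceiverInputDStream[String](ssc) {
  override def getReceiver(): Receiver[String] = new customReceiver(host, port)
}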
Socket data source
Note: the counts only cover the data within each batch; they are not accumulated across batches.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object socket {
  def main(args: Array[String]): Unit = {
    // Receiver-based (non-direct) mode needs local[n] with n >= 2:
    // one thread for the receiver, at least one for processing
    val conf: SparkConf = new SparkConf().setAppName("socketWD").setMaster("local[2]")
    // Optionally tune how often received data is chunked into blocks
    //.set("spark.streaming.blockInterval", "50")
    // Batch interval of 2 seconds
    val sc = new StreamingContext(conf, Seconds(2))
    // Connect to the given hostname and port
    val socketTextStream: ReceiverInputDStream[String] = sc.socketTextStream("node01", 9999)
    // Word count within the batch
    val result: DStream[(String, Int)] = socketTextStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    // Start the streaming computation
    sc.start()
    sc.awaitTermination()
  }
}
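To try it out, start a text server on node01 first, e.g. with netcat (nc -lk 9999), and type words into it while the job runs.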
Input:
Result:
The word counts of the two batches are not added together; each batch is counted independently.
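If you do want totals to accumulate across batches, updateStateByKey can carry state from batch to batch. A minimal sketch, reusing sc and socketTextStream from the example above (the checkpoint path is just a placeholder; stateful operators require checkpointing):

// Stateful word count: fold each batch's counts into the running total
sc.checkpoint("./checkpoint") // placeholder path; required for stateful ops
val totals: DStream[(String, Int)] = socketTextStream
  .flatMap(_.split(" "))
  .map((_, 1))
  .updateStateByKey((newValues: Seq[Int], running: Option[Int]) =>
    Some(newValues.sum + running.getOrElse(0)))
totals.print()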
HDFS data source
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object hdfs {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("hdfsWD").setMaster("local[2]")
    val sc = new StreamingContext(conf, Seconds(2))
    // Monitor an HDFS directory for newly added files
    val textFS: DStream[String] = sc.textFileStream("hdfs://node01:8020/data")
    val result: DStream[(String, Int)] = textFS.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    // Start the streaming computation
    sc.start()
    sc.awaitTermination()
  }
}
Result:
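Note that textFileStream is a file-based source with no receiver: it only picks up files that appear in the monitored directory after the stream starts (typically moved in atomically), so pre-existing files are ignored and no receiver thread is needed.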
flume数据源
sparkstreaming很少直接整合flume,sparkstreaming整合flume有两种方式:
- sparkstreaming作为sink端
- flume将数据缓存到sink端后,sparkstreaming去拉取
前置要求:
- 至少一个spark的worker必须在启动flume的服务器上
pom file
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>${spark.version}</version>
    <!-- if the property above does not resolve, pin a concrete version instead: -->
    <!-- <version>2.2.0</version> -->
</dependency>
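A minimal sketch of both modes using the FlumeUtils API from this dependency (the flumeStreams helper, host node01, and port 8888 are placeholders of my own; ssc is a StreamingContext like the sc in the examples above):

import java.nio.charset.StandardCharsets
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.flume.FlumeUtils

def flumeStreams(ssc: StreamingContext): Unit = {
  // Push-based: Flume's avro sink sends events to this host:port, which must
  // be a machine running one of the Spark workers (see the prerequisite above)
  val pushed = FlumeUtils.createStream(ssc, "node01", 8888)
  // Pull-based: Flume buffers events in a SparkSink and Spark Streaming polls it
  val pulled = FlumeUtils.createPollingStream(ssc, "node01", 8888)
  // Both yield SparkFlumeEvent records; the payload is the Avro event body
  val lines = pushed.map(e => new String(e.event.getBody.array(), StandardCharsets.UTF_8))
  lines.print()
}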
For more details, see the following:
- a separate article on integrating Spark Streaming with Flume
- the official Flume integration guide
Custom data source
A custom data source extends Receiver and implements the following two methods:
- onStart(): called to start receiving data
// Typically starts a thread that receives the data; the actual
// receive-and-store loop lives in a separate receive() method
def onStart() {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    override def run() { receive() }
  }.start()
}
- onStop(): called to stop receiving data
def onStop() {
  // Nothing much to do here: the thread calling receive()
  // is designed to stop by itself once isStopped() returns true
}
Example: a custom receiver that reads data sent over a socket

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.SparkConf
// note: org.apache.spark.internal.Logging is private[spark]; outside the
// org.apache.spark package, use your own logger instead of logInfo
import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.receiver.Receiver

// Custom receiver producing records of type String
object custom {
  def main(args: Array[String]): Unit = {
    // Receiver-based (non-direct) mode needs local[n] with n >= 2
    val conf: SparkConf = new SparkConf().setAppName("customerWD").setMaster("local[2]")
    // Batch interval of 2 seconds
    val sc = new StreamingContext(conf, Seconds(2))
    // Plug in the custom receiver with the hostname and port to read from
    val socketTextStream: ReceiverInputDStream[String] = sc.receiverStream(new customReceiver("node01", 8888))
    // Word count within the batch
    val result: DStream[(String, Int)] = socketTextStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    // Start the streaming computation
    sc.start()
    sc.awaitTermination()
  }

  class customReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
    override def onStart(): Unit = {
      // Start the thread that receives data over a connection
      new Thread("Socket Receiver") {
        override def run() { receive() }
      }.start()
    }

    override def onStop(): Unit = {
      // Nothing much to do here: the thread calling receive()
      // stops by itself once isStopped() returns true
    }

    /** Create a socket connection and receive data until the receiver is stopped */
    private def receive(): Unit = {
      var socket: Socket = null
      var userInput: String = null
      try {
        logInfo("Connecting to " + host + ":" + port)
        socket = new Socket(host, port)
        logInfo("Connected to " + host + ":" + port)
        val reader = new BufferedReader(new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
        userInput = reader.readLine()
        // Keep reading lines until the receiver is stopped or the stream ends
        while (!isStopped() && userInput != null) {
          // store() hands the record to Spark, which buffers it in memory
          store(userInput)
          userInput = reader.readLine()
        }
        reader.close()
        socket.close()
        logInfo("Stopped receiving")
        // Restart in an attempt to connect again when the server is active again
        restart("Trying to reconnect")
      } catch {
        case e: java.net.ConnectException =>
          // restart if we could not connect to the server
          restart("Error connecting to " + host + ":" + port, e)
        case t: Throwable =>
          // restart on any other error
          restart("Error receiving data", t)
      }
    }
  }
}
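As with the built-in socket source, you can test this by running a netcat server on node01 (nc -lk 8888) and typing words into it. Note that restart() only schedules the restart: it returns immediately and asynchronously calls onStop() and then onStart(), so the receiver reconnects on its own when the server comes back.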
Kafka data source
Covered in the next post.