spark-streaming-[3]-Transform

Transform Operation

Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.

The transform operation (along with its variations like transformWith) allows arbitrary RDD-to-RDD functions to be applied on a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, the functionality of joining every batch in a data stream with another dataset is not directly exposed in the DStream API. However, you can easily use transform to do this. This enables very powerful possibilities. For example, one can do real-time data cleaning by joining the input data stream with precomputed spam information (maybe generated with Spark as well) and then filtering based on it.
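For the spam-filtering example mentioned above, a transform-based join could look roughly like the sketch below. This is a minimal sketch, not code from this post: clickstream, spamInfoRDD and the 0.5 score threshold are made-up names and values used only to illustrate the pattern.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Hypothetical data shapes: clickstream is a DStream of (user, url) pairs and
// spamInfoRDD a precomputed RDD of (user, spamScore) pairs.
def dropSpam(clickstream: DStream[(String, String)],
             spamInfoRDD: RDD[(String, Double)]): DStream[(String, String)] =
  clickstream.transform { rdd =>
    rdd.leftOuterJoin(spamInfoRDD)                                     // join each batch with the spam info
      .filter { case (_, (_, score)) => score.getOrElse(0.0) < 0.5 }   // keep users below the spam threshold
      .map { case (user, (url, _)) => (user, url) }                    // restore the original (user, url) shape
  }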


Below we simulate this in a streaming test: the Transform operation is used to filter out of each batch's word counts every result whose key is "hello".


Part 1: Simulating a spout

This spout listens on the specified port; once a client connects, it sends the client one line of data at the specified interval in milliseconds.
Run the simulator below first, then run the Transform program that follows.

The contents of TransfromData.txt are as follows:
hello
hell0
hello
hello
hjw
hjw
hjw
hjw
hello
hello
hello
Program arguments: ./srcFile/TransfromData.txt 9999 1000

package com.dt.spark.main.Streaming

import java.io.PrintWriter
import java.net.ServerSocket

import scala.io.Source

/**
  * Created by hjw on 17/5/1.
  */
object StreamingSimulation {
  /*
  Returns a random index in [0, length)
   */
  def index(length:Int) ={
    import java.util.Random
    val rdm = new Random()
    rdm.nextInt(length)
  }

  def main(args: Array[String]) {
    if (args.length != 3){
      System.err.println("Usage: <filename><port><millisecond>")
      System.exit(1)
    }

    val filename = args(0)
    val lines = Source.fromFile(filename).getLines().toList
    val fileRow = lines.length

    val listener = new ServerSocket(args(1).toInt)

    //Listen on the specified port and accept a connection when a client arrives
    while(true){
      val socket = listener.accept()
      new Thread(){
        override def run() = {
          println("Got client connect from: " + socket.getInetAddress)
          val out =  new PrintWriter(socket.getOutputStream,true)
          while(true){
            Thread.sleep(args(2).toLong)
            //Send a randomly chosen line to the client
            val content = lines(index(fileRow))
            println(content)
            out.write(content + '\n')
            out.flush()
          }
          socket.close()
        }
      }.start()
    }
  }
}
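Before wiring up Spark, the simulator can be sanity-checked with a throwaway client like the sketch below. It assumes the simulator is already running with the arguments above on localhost:9999; the object name SimulationSmokeTest is made up.

import java.net.Socket
import scala.io.Source

object SimulationSmokeTest {
  def main(args: Array[String]): Unit = {
    // Connect to the running simulator and print the first five lines it pushes.
    val socket = new Socket("localhost", 9999)
    Source.fromInputStream(socket.getInputStream).getLines().take(5).foreach(println)
    socket.close()
  }
}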

Part 2: Testing Transform

package com.dt.spark.main.Streaming.Transfrom

import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.streaming._

object Transform {
  Logger.getLogger("org").setLevel(Level.ERROR)
  def main(args: Array[String]): Unit ={
    // Create a local StreamingContext with two working threads and a batch interval of 1 second.
    // The master requires 2 cores to prevent a starvation scenario.

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    val ssc = new StreamingContext(conf, Seconds(1))


    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)

    // transform exposes each batch's RDD: drop the pairs whose key is "hello"
    val cleanedDStream = wordCounts.transform(rdd => {
      rdd.filter(a => !a._1.equals("hello"))
    })

    // Print the first ten elements of each RDD generated in this DStream to the console
    cleanedDStream.print()
    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}



Run the simulator first, then run Transform.
StreamingSimulation output:
Got client connect from: /127.0.0.1
hello
hjw
hello
hjw
hello
hello
hjw
hello
hello
hello
hello
hello
hjw
hjw
 
     
Transform filters out "hello" as intended; the output is as follows:
-------------------------------------------
Time: 1493646558000 ms
-------------------------------------------
-------------------------------------------
Time: 1493646559000 ms
-------------------------------------------
-------------------------------------------
Time: 1493646560000 ms
-------------------------------------------
(hjw,1)
-------------------------------------------
Time: 1493646561000 ms
-------------------------------------------
-------------------------------------------
Time: 1493646562000 ms
-------------------------------------------
(hjw,1)
-------------------------------------------
Time: 1493646563000 ms
-------------------------------------------

Part 3: Combining with updateStateByKey to count the words left after filtering out hello

Contents of TransformV2.txt:
hello world
hello hjw
hello hjw
hello hjw
hello world
hello test
hello test
Start StreamingSimulation so it begins sending random lines (program arguments: ./srcFile/TransformV2.txt 9999 1000),
then run TransformV2.
The TransformV2 code is as follows:
package com.dt.spark.main.Streaming.Transfrom

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by hjw on 17/9/19.
  * Filter out "hello" and count the occurrences of the remaining words.
  */
object TransformV2 {
  Logger.getLogger("org").setLevel(Level.ERROR)

  ///The update function: its return type is Some(Int), representing the latest state.
  ///It adds the values produced for a key in the current interval to the previous state to obtain the latest state.
  val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    val currentCount = values.sum
    val previousCount = state.getOrElse(0)
    Some(currentCount + previousCount)
  }

  ///The input is an iterator of triples: the key, the values produced for that key in the current interval,
  ///and the state at the previous point in time.
  ///newUpdateFunc must return an Iterator[(String, Int)].
  val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
    ///For each key, call updateFunc (with the values produced for that key in this interval and the previous state)
    ///to get the latest state, then map it to (key, latest state).
    iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
  }


  def main(args: Array[String]): Unit ={
    // Create a local StreamingContext with two working threads and a batch interval of 1 second.
    // The master requires 2 cores to prevent a starvation scenario.

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    val ssc = new StreamingContext(conf, Seconds(1))

    ssc.checkpoint(".")
    // Initial RDD input to updateStateByKey
    val initialRDD = ssc.sparkContext.parallelize(List(("world", 1)))

    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)

    // transform exposes each batch's RDD: drop the pairs whose key is "hello"
    val cleanedDStream = wordCounts.transform(rdd => {
      rdd.filter(a => !a._1.equals("hello"))
    })

    // Update the cumulative count using updateStateByKey
    // This will give a Dstream made of state (which is the cumulative count of the words)
    //Note the four parameters of updateStateByKey; the first is the state-update function
    val stateDstream = cleanedDStream.updateStateByKey[Int](newUpdateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)
    stateDstream.print()

    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
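To see what newUpdateFunc does in isolation, here is a small plain-Scala sketch that calls it with one hand-made batch. Spark is not involved, and the sample keys, counts and previous states are invented for illustration.

object UpdateFuncDemo {
  // Same shapes as in TransformV2: merge this batch's counts into the previous state.
  val updateFunc = (values: Seq[Int], state: Option[Int]) =>
    Some(values.sum + state.getOrElse(0))

  val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) =>
    iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))

  def main(args: Array[String]): Unit = {
    // One simulated batch: "world" already has state 1 (seeded by the initial RDD),
    // while "hjw" appears for the first time, so its previous state is None.
    val batch = Iterator(
      ("world", Seq(1, 1), Some(1)),
      ("hjw",   Seq(1),    None)
    )
    println(newUpdateFunc(batch).toList) // prints List((world,3), (hjw,1))
  }
}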

[Notes]
(1) A checkpoint directory must be set:
    ssc.checkpoint(".")
(2) The initial RDD only fixes the format of the state; keys are added to the state during the run based on the results.
For example, only "world" is seeded here, and the other keys are added as they appear in the counts:
val initialRDD = ssc.sparkContext.parallelize(List(("world", 0)))
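As a hedged variation on these two notes, the fragment below (reusing the ssc defined in TransformV2 above) shows a named checkpoint directory and an initial RDD with several seed keys. The directory name and seed values are made up; any key that is not seeded is still picked up once it appears in a batch.

// Hypothetical variation: a named checkpoint directory and multiple seed keys.
ssc.checkpoint("./checkpoint-transform-v2")
val initialRDD = ssc.sparkContext.parallelize(List(("world", 0), ("hjw", 0)))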