Structured Streaming (Part 2) -- Dataset and DataFrame API

Dataset and DataFrame API

Since Spark 2.0, DataFrames and Datasets can represent static, bounded data as well as streaming, unbounded data. Just like with static Datasets/DataFrames, you can use the common entry point SparkSession (see the Scala/Java/Python/R docs) to create streaming Datasets/DataFrames from streaming sources and apply the same operations to them as to static Datasets/DataFrames.
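
The snippets that follow assume an existing SparkSession called spark and the spark.implicits._ import. A minimal sketch for creating this entry point (the application name and master are illustrative) looks like this:

import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder
      .appName("StructuredStreamingExamples") // illustrative name
      .master("local[*]")                     // illustrative master
      .getOrCreate()

    // needed for the .as[...]/Encoder conversions used throughout the examples
    import spark.implicits._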

Input Sources
  • File source (fault-tolerant)
    Reads files written into a directory as a stream of data. Supported file formats are text, csv, json, orc and parquet. A csv example follows; a json variant is sketched right after it.

csv

// Read all csv files that are continuously written into a directory
    // columns: name,age,salary,sex,dept,deptNo
    val userSchema = new StructType()
      .add("name", "string")
      .add("age", "integer")
      .add("salary", "double")
      .add("sex", "boolean")
      .add("dept", "string")
      .add("deptNo", "integer")



    val csvDF = spark
      .readStream
      .option("sep", ";")
      .option("header","true") //去除表头
      .schema(userSchema)      // 指定csv文件的架构
      .csv("hdfs://train:9000/results/csv")


    // Start running the query that prints the output to the console
    val query = csvDF.writeStream
      .outputMode("Append")
      .format("console")
      .start()

    query.awaitTermination()
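
The other file formats are read the same way; a json variant of the reader above, assuming the files carry the same columns (the input path is illustrative), is sketched here:

    // Read all json files written into a directory, reusing the schema defined above
    val jsonDF = spark
      .readStream
      .schema(userSchema) // a schema must be provided (schema inference is off by default for streaming file sources)
      .json("hdfs://train:9000/results/json") // illustrative input directory
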
  • Socket source (not fault-tolerant)
import spark.implicits._

    // Create the input stream
    val lines = spark.readStream
      .format("socket")
      .option("host", "train")
      .option("port", 9999)
      .load()

    val words = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()


    // Start running the query that prints the running counts to the console
    val query = words.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  • Kafka source (fault-tolerant)
import spark.implicits._

    // Subscribe to one topic
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "train:9092")
      .option("subscribe", "topic01")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .as[(String, String)]


    // Start running the query that prints the output to the console
    val query = df.writeStream
      .outputMode("append")   // no aggregation here, so append mode is used
      .format("console")
      .start()

    query.awaitTermination()
Basic Operations

Most of the common DataFrame/Dataset operations are supported on streaming DataFrames/Datasets. The examples below apply untyped aggregations and SQL queries to a stream of device readings (DeviceData).

import org.apache.spark.sql.SparkSession
case class DeviceData(device: String, deviceType: String, signal: Double, time: String)
object KafkaSource {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("StructuredNetworkWordCount")
      .master("local[6]")
      .getOrCreate()

    import spark.implicits._
    // Create the input stream
    // Subscribe to one topic
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "train:9092")
      .option("subscribe", "topic01")
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String].map(line => {
        val ts = line.split(",")
        DeviceData(ts(0), ts(1), ts(2).toDouble, ts(3))
      }).toDF()
      .groupBy("deviceType","device")
      .mean("signal")




    // Start running the query that prints the running averages to the console
    val query = df.writeStream
      .outputMode("Complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
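
Besides aggregations, the usual untyped (DataFrame) and typed (Dataset) operations can be applied to a stream as well. The sketch below reuses the DeviceData stream built as above (the signal threshold is illustrative) and shows both styles:

    // A typed stream of DeviceData, built exactly like in the example above
    val deviceStream = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "train:9092")
      .option("subscribe", "topic01")
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String]
      .map(line => {
        val ts = line.split(",")
        DeviceData(ts(0), ts(1), ts(2).toDouble, ts(3))
      })

    // Untyped (DataFrame) selection and projection
    val strongSignalsDF = deviceStream.toDF()
      .select("device", "signal")
      .where("signal > 10") // illustrative threshold

    // Typed (Dataset) selection and projection
    val strongDevicesDS = deviceStream
      .filter(_.signal > 10) // illustrative threshold
      .map(_.device)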

import org.apache.spark.sql.SparkSession

object KafkaSource {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("StructuredNetworkWordCount")
      .master("local[6]")
      .getOrCreate()

    import spark.implicits._
    // Create the input stream
    // Subscribe to one topic
    spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "train:9092")
      .option("subscribe", "topic01")
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String].map(line => {
        val ts = line.split(",")
        DeviceData(ts(0), ts(1), ts(2).toDouble, ts(3))
      }).toDF().createOrReplaceTempView("t_device")

    val sql =
      """
        |select device, deviceType, avg(signal)
        |     from t_device
        |     group by deviceType, device
      """.stripMargin


    val results = spark.sql(sql)

    // Start running the query that prints the results to the console
    val query = results.writeStream
      .outputMode("Complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
Output Sinks
File sink (Append)

Stores the output to a directory.

val inputs = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "CentOS:9092")
      .option("subscribePattern", "topic.*")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") // csv cannot write binary columns


   // 3. Create the StreamingQuery object
    val query:StreamingQuery = inputs.writeStream
      .outputMode(OutputMode.Append())
      .format("csv")
      .option("sep", ",")
      .option("header", "true") // write a header line into each file
      .option("path", "hdfs://train:9000/structured/csv")
      .option("checkpointLocation", "hdfs://train:9000/structured-checkpoints")
      .start()

Note: the file sink can only be used in Append mode, and it supports exactly-once writes.
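
The file sink can additionally partition the written files by one or more output columns via partitionBy; a minimal sketch (partition column and paths are illustrative) follows:

    val partitionedQuery: StreamingQuery = inputs.writeStream
      .outputMode(OutputMode.Append())
      .format("parquet")
      .partitionBy("key") // one sub-directory per distinct key value (illustrative column)
      .option("path", "hdfs://train:9000/structured/parquet")
      .option("checkpointLocation", "hdfs://train:9000/structured-checkpoints-parquet")
      .start()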

Kafka sink (Append, Update, Complete)

Stores the output to one or more topics in Kafka. Here we describe the support for writing both streaming queries and batch queries to Kafka. Note that Kafka only supports at-least-once write semantics; consequently, when writing either streaming queries or batch queries to Kafka, some records may be duplicated. This can happen, for example, if Kafka needs to retry a message that was not acknowledged by a broker, even though that broker did receive and write the record. Because of these Kafka write semantics, Structured Streaming cannot prevent such duplicates from occurring. A DataFrame being written to Kafka should contain the following columns in its schema:

  • key (optional): string or binary
  • value (required): string or binary
  • topic (optional): string

If the "topic" configuration option is not specified, the topic column is required.

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
 // 3. Create the StreamingQuery object
    val query:StreamingQuery = results.writeStream
      .outputMode(OutputMode.Update())
      .format("kafka")
      .option("checkpointLocation", "hdfs://CentOS:9000/structured-checkpoints-kafka")
      .option("kafka.bootstrap.servers", "CentOS:9092")
      .option("topic", "topic01")//覆盖DF中topic字段
      .start()

The DataFrame being written must contain a string-typed value column. key is optional; if it is missing, it is treated as null. topic is also optional, provided that the topic is configured via the options; otherwise a topic column must be present in the schema.
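
For example, instead of setting the "topic" option, the target topic can travel with each record as a topic column; a minimal sketch (the constant topic name is illustrative, and results is assumed to already contain string key and value columns as above) follows:

    // Route records through a topic column instead of the "topic" option
    val withTopic = results.selectExpr("key", "value", "'topic01' AS topic") // illustrative topic

    val topicColumnQuery: StreamingQuery = withTopic.writeStream
      .outputMode(OutputMode.Update())
      .format("kafka")
      .option("checkpointLocation", "hdfs://CentOS:9000/structured-checkpoints-kafka-topic")
      .option("kafka.bootstrap.servers", "CentOS:9092")
      .start()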

Complete example

// sample input line: 1,zhangsan,1,4.5
    val inputs = spark.readStream
      .format("socket")
      .option("host", "CentOS")
      .option("port", "9999")
      .load()

    import org.apache.spark.sql.functions._
    val results = inputs.as[String].map(_.split(","))
      .map(ts => (ts(0).toInt, ts(1), ts(2).toInt * ts(3).toDouble))
      .toDF("id", "name", "cost")
      .groupBy("id", "name")
      .agg(sum("cost") as "cost" )
      .as[(Int,String,Double)]
      .map(t=>(t._1+":"+t._2,t._3+""))
      .toDF("key","value")



   // 3. Create the StreamingQuery object
    val query:StreamingQuery = results.writeStream
      .outputMode(OutputMode.Update())
      .format("kafka")
      .option("checkpointLocation", "hdfs://CentOS:9000/structured-checkpoints-kafka")
      .option("kafka.bootstrap.servers", "CentOS:9092")
      .option("topic", "topic01")//覆盖DF中topic字段
      .start()

    query.awaitTermination()
Console sink (for debugging)

Prints the output to the console/stdout every time there is a trigger. Both Append and Complete output modes are supported. Because the entire output is collected and stored in the driver's memory after every trigger, this sink should only be used for debugging with low data volumes.

// 3. Create the StreamingQuery object
    val query:StreamingQuery = results.writeStream
      .outputMode(OutputMode.Complete())
      .format("console")
      .option("numRows","2")
      .option("truncate","true")
      .start()
Memory sink (for debugging)

The output is stored as an in-memory table. Both Append and Complete output modes are supported. Because the entire output is collected and stored in the driver's memory, this sink should only be used for debugging with low data volumes, so use it with care.

val lines:DataFrame = spark.readStream
      .format("socket")
      .option("host", "CentOS")
      .option("port", 9999)
      .load()

    // 2. Continuous query; the logic matches the equivalent SQL
    val wordCounts:DataFrame = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()


    // 3. Create the StreamingQuery object
    val query:StreamingQuery = wordCounts.writeStream
      .outputMode(OutputMode.Complete())
      .format("memory")
      .queryName("t_word")
      .start()

    new Thread(){
      override def run(): Unit = {
        while(true){
          Thread.sleep(1000)
          spark.sql("select * from t_word").show()
        }
      }
    }.start()
Foreach|ForeachBatch sink

Runs arbitrary output logic on the records of the output. With the foreach and foreachBatch operations you can apply arbitrary operations and custom write logic to the output of a streaming query. Their use cases differ slightly: foreach allows custom write logic for every individual row, while foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch.
foreachBatch
foreachBatch(...) lets you specify a function that is executed on the output data of every micro-batch of the streaming query.

val lines:DataFrame = spark.readStream
      .format("socket")
      .option("host", "CentOS")
      .option("port", 9999)
      .load()

    // 2. Continuous query; the logic matches the equivalent SQL
    val wordCounts:DataFrame = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()


    // 3. Create the StreamingQuery object
    val query:StreamingQuery = wordCounts.writeStream
      .outputMode(OutputMode.Complete())
      .foreachBatch((ds,batchID)=>{
          ds.write
            .mode(SaveMode.Overwrite).format("json")
            .save("hdfs://CentOS:9000/results/structured-json")
      })
      .start()
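
foreachBatch also makes it easy to reuse existing batch writers and to send one micro-batch to several sinks. A minimal sketch (output paths are illustrative) that caches the micro-batch so it is not recomputed once per sink:

    val multiSinkQuery: StreamingQuery = wordCounts.writeStream
      .outputMode(OutputMode.Complete())
      .foreachBatch((ds, batchId) => {
        ds.persist() // avoid recomputing the micro-batch for every sink
        ds.write.mode(SaveMode.Overwrite).format("json")
          .save("hdfs://CentOS:9000/results/structured-json")    // illustrative path
        ds.write.mode(SaveMode.Overwrite).format("parquet")
          .save("hdfs://CentOS:9000/results/structured-parquet") // illustrative path
        ds.unpersist()
      })
      .start()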

Foreach

class WordCountWriter extends ForeachWriter[Row]{
  override def open(partitionId: Long, epochId: Long): Boolean = {
   // println("打开了链接")
    true //执行process
  }

  override def process(value: Row): Unit = {
    val Row(word, count) = value
    println(word,count)
  }

  override def close(errorOrNull: Throwable): Unit = {
   // println("释放资源")
  }
}
// 3. Create the StreamingQuery object
    val query:StreamingQuery = wordCounts.writeStream
      .outputMode(OutputMode.Complete())
      .foreach(new WordCountWriter)
      .start()
Writing data to Redis
import org.apache.spark.sql.{DataFrame, ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}
import redis.clients.jedis.{Jedis, JedisPool}

object StructedStreamForEach {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder
      .appName("StructuredNetworkWordCount")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    spark.sparkContext.setLogLevel("ERROR")

    val lines: DataFrame = spark.readStream
      .format("socket")
      .option("host", "train")
      .option("port", 9999)
      .load()

    // 2. Continuous query; the logic matches the equivalent SQL
    val wordCounts: DataFrame = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()


    // 3. Create the StreamingQuery object
    val query: StreamingQuery = wordCounts.writeStream
      .outputMode(OutputMode.Complete())
      .foreach(new WordCountWriter)
      .start()


    query.awaitTermination()
  }

}
class WordCountWriter extends ForeachWriter[Row]{


  lazy val jedisPool:JedisPool=createJedisPool()
  var jedis:Jedis=null


  def createJedisPool(): JedisPool = {
    new JedisPool("train",6379)
  }

  override def open(partitionId: Long, epochId: Long): Boolean = {
    jedis=jedisPool.getResource
    true // returning true means process() will be called for this partition
  }

  override def process(value: Row): Unit = {
    val Row(word, count) = value
    println(word,count)
    jedis.set(word.toString,count.toString)
  }

  override def close(errorOrNull: Throwable): Unit = {
    jedis.close()
  }
}