Big Data - SparkStreaming (Part 5)
Integrating SparkStreaming with SparkSQL
Add the following dependency to pom.xml:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.3</version>
</dependency>
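Note: this assumes the spark-streaming_2.11 dependency used in the earlier parts of this series is already declared in the pom; spark-sql_2.11 only adds the SparkSession and DataFrame APIs.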
Code
package com.kaikeba.streaming

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

/**
 * Integrating SparkStreaming with SparkSQL
 */
object SocketWordCountForeachRDDDataFrame {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    // todo: 1. Create the SparkConf object
    val sparkConf: SparkConf = new SparkConf().setAppName("NetworkWordCountForeachRDDDataFrame").setMaster("local[2]")

    // todo: 2. Create the StreamingContext object
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // todo: 3. Receive socket data
    val socketTextStream: ReceiverInputDStream[String] = ssc.socketTextStream("node01", 9999)

    // todo: 4. Process the data
    val words: DStream[String] = socketTextStream.flatMap(_.split(" "))

    // todo: 5. Process each RDD of the DStream, converting the RDD into a DataFrame
    words.foreachRDD(rdd => {
      // Obtain a SparkSession; converting an RDD into a DataFrame requires a SparkSession object
      val sparkSession: SparkSession = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
      import sparkSession.implicits._
      val dataFrame: DataFrame = rdd.toDF("word")

      // Register the DataFrame as a temporary view
      dataFrame.createOrReplaceTempView("words")

      // Count the occurrences of each word
      val result: DataFrame = sparkSession.sql("select word, count(*) as count from words group by word")

      // Show the result
      result.show()
    })

    // todo: 6. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
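To test the example, first start a socket server on node01 (for example with nc -lk 9999), then run the program and type a few words: every 2-second batch is converted into a DataFrame and its word counts are printed. Note that SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate() inside foreachRDD returns one shared session rather than building a new one per micro-batch, which is why it is safe to call it on every batch.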
SparkStreaming Fault Tolerance
- Review of the SparkStreaming execution flow
- Executor failure
Failed tasks and receivers are restarted automatically on other executors; no configuration is required.
- Driver failure
Use the checkpoint mechanism to recover a failed Driver: the Driver's metadata (configuration, the DStream operation graph, and incomplete batches) is periodically written to HDFS, so a restarted Driver can resume from it.
Step 1: Configure the Driver program to restart automatically
Standalone:
Add the following two parameters to spark-submit:
--deploy-mode cluster
--supervise   # restart the Driver automatically after a failure
Example:
spark-submit \
--master spark://node01:7077 \
--deploy-mode cluster \
--supervise \
--class com.kaikeba.streaming.Demo \
--executor-memory 1g \
--total-executor-cores 2 \
original-sparkStreamingStudy-1.0-SNAPSHOT.jar
Yarn:
Add the following parameter to spark-submit:
--deploy-mode cluster
Also set yarn.resourcemanager.am.max-attempts in the YARN configuration (yarn-site.xml); it defaults to 2 and bounds how many times the ApplicationMaster, and with it the Driver, is retried. For example:
<property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>4</value>
    <description>
        The maximum number of application master execution attempts.
    </description>
</property>
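On the Spark side there is also the spark.yarn.maxAppAttempts property, which can lower the number of attempts for a single application; it must not exceed the YARN-wide maximum configured above.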
Example:
spark-submit \
--master yarn \
--deploy-mode cluster \
--class com.kaikeba.streaming.Demo \
--executor-memory 1g \
--num-executors 2 \
original-sparkStreamingStudy-1.0-SNAPSHOT.jar
Step 2: Set the HDFS checkpoint directory
streamingContext.checkpoint(hdfsDirectory)
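For example, with a hypothetical NameNode at node01:8020:
streamingContext.checkpoint("hdfs://node01:8020/spark-checkpoints/socketWordCount")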
Step 3: Code implementation
// Function to create and set up a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)   // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...

// Start the context
context.start()
context.awaitTermination()
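To make the recovery pattern concrete, here is a minimal runnable sketch that combines steps 2 and 3 into a complete word-count driver. The host node01, port 9999, and the HDFS checkpoint path are assumptions carried over from the examples above; adjust them for your cluster.

package com.kaikeba.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableSocketWordCount {

  // Assumption: a checkpoint directory on HDFS; replace with a path valid in your cluster
  val checkpointDirectory = "hdfs://node01:8020/spark-checkpoints/socketWordCount"

  // Builds a brand-new StreamingContext; only invoked when no checkpoint data exists yet
  def functionToCreateContext(): StreamingContext = {
    val sparkConf = new SparkConf().setAppName("RecoverableSocketWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint(checkpointDirectory)
    val lines = ssc.socketTextStream("node01", 9999)
    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On the first run this calls functionToCreateContext; after a Driver failure
    // the context (DStream graph, incomplete batches) is rebuilt from the checkpoint instead
    val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
    context.start()
    context.awaitTermination()
  }
}

Submitted with --deploy-mode cluster plus --supervise (Standalone) or the yarn.resourcemanager.am.max-attempts setting (YARN), a restarted Driver resumes from the checkpoint instead of starting from scratch.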