The updateStateByKey operator
Requirement: count the cumulative number of occurrences of each word so far (the previous state has to be kept across batches).
UpdateStateByKey Operation
The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. To use this, you will have to do two steps.
- Define the state - The state can be an arbitrary data type.
- Define the state update function - Specify with a function how to update the state using the previous state and the new values from an input stream.
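For the cumulative word count, a minimal update function could look like the following. This definition does not appear in the notes below, which only reference `updateFunction`; the `(Seq[Int], Option[Int]) => Option[Int]` shape is the standard signature expected by updateStateByKey.

// Add this batch's counts for a word to the previously accumulated count (if any).
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(newValues.sum + runningCount.getOrElse(0))
}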
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount"),
  Seconds(5))

// A stateful operator (a function that carries state) requires checkpointing to be enabled.
// In production, point this at an HDFS directory for fault tolerance.
ssc.checkpoint(".")

val result = ssc.textFileStream("D:\\leedsoft\\test")
  .flatMap(_.split(" "))
  .map((_, 1))

result.updateStateByKey(updateFunction _).print()

ssc.start()
ssc.awaitTermination()
Requirement: write the word-count results to MySQL.
create table wordcount(
word varchar(50) default null,
wordcount int(10) default null
);
The results are written to MySQL with SQL of the form:
"insert into wordcount(word, wordcount) values('" + record._1 + "', " + record._2 + ")"
1) The naive approach below is problematic: the connection is created on the driver but used on the workers, so it would have to be serialized and shipped to the executors, and connection objects are rarely transferable across machines (and costly to serialize even when it works).

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record)  // executed at the worker
  }
}
2) The recommended approach: foreachPartition plus a connection pool.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // One connection shared by all records of this partition,
    // taken from a static, lazily initialized pool of connections.
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}
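The ConnectionPool object above is not defined in the notes, and connection.send(record) is a placeholder from the Spark docs for the actual write (for example, the INSERT statement shown earlier). A minimal, assumed sketch of such a pool, backed by JDBC connections (URL and credentials are placeholders):

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// JVM-wide, lazily populated pool of JDBC connections.
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val conn = pool.poll()
    if (conn != null) conn
    else DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "password")
  }

  def returnConnection(conn: Connection): Unit = pool.offer(conn)
}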
Remaining issues:
1) Existing rows are never updated; every record is inserted as a new row.
Possible improvements (see the sketch after this list):
a) Before inserting, check whether the word already exists: update it if it does, insert it otherwise.
b) In practice, use HBase/Redis instead of MySQL for this kind of upsert-heavy workload.
2) A new connection is created for every partition of every RDD; switch to a connection pool.
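A minimal sketch of improvement a), assuming the wordcount table has a unique or primary key on word so that MySQL's INSERT ... ON DUPLICATE KEY UPDATE can be used; the key, the DStream name wordCounts (a DStream[(String, Int)] of word/count pairs), and the ConnectionPool from the earlier sketch are assumptions, not part of the original notes:

wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = ConnectionPool.getConnection()
    // Upsert instead of blind insert: overwrite the count when the word already exists.
    val stmt = connection.prepareStatement(
      "insert into wordcount(word, wordcount) values (?, ?) " +
        "on duplicate key update wordcount = values(wordcount)")
    partitionOfRecords.foreach { case (word, count) =>
      stmt.setString(1, word)
      stmt.setInt(2, count)
      stmt.executeUpdate()
    }
    stmt.close()
    ConnectionPool.returnConnection(connection)
  }
}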
Window Operations
window: periodically process the data of a given time range.
window length: the duration of the window.
sliding interval: the interval at which the window operation is performed.
Both parameters must be multiples of the batch interval, because one batch is one time unit.
"Every so often, compute over some range of data": e.g. every 10 seconds, compute the word count of the previous 10 minutes
==> every sliding interval, compute over the previous window length.
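For example, "every 10 seconds compute the word count of the previous 10 minutes" maps directly onto reduceByKeyAndWindow. A minimal sketch, assuming pairs is a DStream[(String, Int)] of (word, 1) pairs like result above:

import org.apache.spark.streaming.{Minutes, Seconds}

// window length = 10 minutes, sliding interval = 10 seconds;
// both are multiples of the 5-second batch interval used earlier.
val windowedWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,
  Minutes(10),
  Seconds(10))
windowedWordCounts.print()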
Blacklist filtering
Access log ==> DStream
20180808,zs
20180808,ls
20180808,ww
==> (zs: 20180808,zs) (ls: 20180808,ls) (ww: 20180808,ww)
Blacklist ==> RDD
zs
ls
==> (zs: true) (ls: true)
Expected output ==> 20180808,ww
leftOuterJoin:
(zs: [<20180808,zs>, <true>])  x  (on the blacklist, dropped)
(ls: [<20180808,ls>, <true>])  x  (on the blacklist, dropped)
(ww: [<20180808,ww>, <false>]) ==> kept; map back to the original line via tuple._2._1
/**
 * Build the blacklist RDD.
 */
val blacklistRDD = scc.sparkContext.parallelize(List("zs", "ls")).map((_, true))

val dStream = scc.socketTextStream("centos1", 8080).map(x => (x.split(",")(1), x))

// A DStream is a sequence of per-batch RDDs.
// transform and foreachRDD both apply a function to each of those RDDs;
// the difference is that transform is a DStream transformation, while foreachRDD is an output operation.
val res = dStream.transform(rdd => {
  // Records in an RDD are tuples: when first read in, each line (or other delimited unit) becomes
  // a single value; xxByKey operations require (k, v) pairs, i.e. a pair RDD is an RDD of (k, v).
  rdd.leftOuterJoin(blacklistRDD)
    .filter(x => !x._2._2.getOrElse(false))
    .map(_._2._1)
})
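To actually emit the filtered lines, the stream still needs an output operation and the context must be started; a minimal continuation, assuming scc is the StreamingContext used above:

res.print()  // only non-blacklisted lines survive, e.g. 20180808,ww
scc.start()
scc.awaitTermination()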
Integrating with Spark SQL: foreachRDD can hand each batch RDD to a SparkSession and run the word count as SQL.

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

val words: DStream[String] = scc.socketTextStream("centos1", 8080).flatMap(_.split(" "))

words.foreachRDD { rdd =>
  // Get the singleton instance of SparkSession
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  // Convert RDD[String] to DataFrame
  val wordsDataFrame = rdd.toDF("word")

  // Create a temporary view
  wordsDataFrame.createOrReplaceTempView("words")

  // Do word count on the DataFrame using SQL and print it
  val wordCountsDataFrame =
    spark.sql("select word, count(*) as total from words group by word")
  wordCountsDataFrame.show()
}