The updateStateByKey operator
Requirement: count the cumulative number of occurrences of each word so far (the previous state has to be kept across batches).
UpdateStateByKey Operation
The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. To use this, you will have to do two steps.
- Define the state - The state can be an arbitrary data type.
- Define the state update function - Specify with a function how to update the state using the previous state and the new values from an input stream.
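For the cumulative word count, a minimal update function could look like the following. This definition does not appear in the notes below, which only reference `updateFunction`; the `(Seq[Int], Option[Int]) => Option[Int]` shape is the standard signature expected by updateStateByKey.

// Add this batch's counts for a word to the previously accumulated count (if any).
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(newValues.sum + runningCount.getOrElse(0))
}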
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount"),
  Seconds(5))

// A stateful operator (a function that carries state) requires checkpointing to be enabled.
// In production, point this at an HDFS directory for fault tolerance.
ssc.checkpoint(".")

val result = ssc.textFileStream("D:\\leedsoft\\test")
  .flatMap(_.split(" "))
  .map((_, 1))

result.updateStateByKey(updateFunction _).print()

ssc.start()
ssc.awaitTermination()
Requirement: write the word-count results to MySQL.
create table wordcount(
word varchar(50) default null,
wordcount int(10) default null
);
The results are written to MySQL with SQL of the form:
"insert into wordcount(word, wordcount) values('" + record._1 + "', " + record._2 + ")"
1) The naive approach below is problematic: the connection is created on the driver but used on the workers, so it would have to be serialized and shipped to the executors, and connection objects are rarely transferable across machines (and costly to serialize even when it works).

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record)  // executed at the worker
  }
}
2) The recommended approach: foreachPartition plus a connection pool.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // One connection shared by all records of this partition,
    // taken from a static, lazily initialized pool of connections.
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}
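The ConnectionPool object above is not defined in the notes, and connection.send(record) is a placeholder from the Spark docs for the actual write (for example, the INSERT statement shown earlier). A minimal, assumed sketch of such a pool, backed by JDBC connections (URL and credentials are placeholders):

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// JVM-wide, lazily populated pool of JDBC connections.
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val conn = pool.poll()
    if (conn != null) conn
    else DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "password")
  }

  def returnConnection(conn: Connection): Unit = pool.offer(conn)
}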
Remaining issues:
1) Existing rows are never updated; every record is inserted as a new row.
Possible improvements (see the sketch after this list):
a) Before inserting, check whether the word already exists: update it if it does, insert it otherwise.
b) In practice, use HBase/Redis instead of MySQL for this kind of upsert-heavy workload.
2) A new connection is created for every partition of every RDD; switch to a connection pool.
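A minimal sketch of improvement a), assuming the wordcount table has a unique or primary key on word so that MySQL's INSERT ... ON DUPLICATE KEY UPDATE can be used; the key, the DStream name wordCounts (a DStream[(String, Int)] of word/count pairs), and the ConnectionPool from the earlier sketch are assumptions, not part of the original notes:

wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = ConnectionPool.getConnection()
    // Upsert instead of blind insert: overwrite the count when the word already exists.
    val stmt = connection.prepareStatement(
      "insert into wordcount(word, wordcount) values (?, ?) " +
        "on duplicate key update wordcount = values(wordcount)")
    partitionOfRecords.foreach { case (word, count) =>
      stmt.setString(1, word)
      stmt.setInt(2, count)
      stmt.executeUpdate()
    }
    stmt.close()
    ConnectionPool.returnConnection(connection)
  }
}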
Window Operations
window: periodically process the data of a given time range.
window length: the duration of the window.
sliding interval: the interval at which the window operation is performed.
Both parameters must be multiples of the batch interval, because one batch is one time unit.
"Every so often, compute over some range of data": e.g. every 10 seconds, compute the word count of the previous 10 minutes
==> every sliding interval, compute over the previous window length.
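For example, "every 10 seconds compute the word count of the previous 10 minutes" maps directly onto reduceByKeyAndWindow. A minimal sketch, assuming pairs is a DStream[(String, Int)] of (word, 1) pairs like result above:

import org.apache.spark.streaming.{Minutes, Seconds}

// window length = 10 minutes, sliding interval = 10 seconds;
// both are multiples of the 5-second batch interval used earlier.
val windowedWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,
  Minutes(10),
  Seconds(10))
windowedWordCounts.print()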
Blacklist filtering
Access log ==> DStream
20180808,zs
20180808,ls
20180808,ww
==> (zs: 20180808,zs) (ls: 20180808,ls) (ww: 20180808,ww)
Blacklist ==> RDD
zs
ls
==> (zs: true) (ls: true)
Expected output ==> 20180808,ww
leftOuterJoin:
(zs: [<20180808,zs>, <true>])  x  (on the blacklist, dropped)
(ls: [<20180808,ls>, <true>])  x  (on the blacklist, dropped)
(ww: [<20180808,ww>, <false>]) ==> kept; map back to the original line via tuple._2._1
/**
 * Build the blacklist RDD.
 */
val blacklistRDD = scc.sparkContext.parallelize(List("zs", "ls")).map((_, true))

val dStream = scc.socketTextStream("centos1", 8080).map(x => (x.split(",")(1), x))

// A DStream is a sequence of per-batch RDDs.
// transform and foreachRDD both apply a function to each of those RDDs;
// the difference is that transform is a DStream transformation, while foreachRDD is an output operation.
val res = dStream.transform(rdd => {
  // Records in an RDD are tuples: when first read in, each line (or other delimited unit) becomes
  // a single value; xxByKey operations require (k, v) pairs, i.e. a pair RDD is an RDD of (k, v).
  rdd.leftOuterJoin(blacklistRDD)
    .filter(x => !x._2._2.getOrElse(false))
    .map(_._2._1)
})
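To actually emit the filtered lines, the stream still needs an output operation and the context must be started; a minimal continuation, assuming scc is the StreamingContext used above:

res.print()  // only non-blacklisted lines survive, e.g. 20180808,ww
scc.start()
scc.awaitTermination()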
Integrating with Spark SQL: foreachRDD can hand each batch RDD to a SparkSession and run the word count as SQL.

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

val words: DStream[String] = scc.socketTextStream("centos1", 8080).flatMap(_.split(" "))

words.foreachRDD { rdd =>
  // Get the singleton instance of SparkSession
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  // Convert RDD[String] to DataFrame
  val wordsDataFrame = rdd.toDF("word")

  // Create a temporary view
  wordsDataFrame.createOrReplaceTempView("words")

  // Do word count on the DataFrame using SQL and print it
  val wordCountsDataFrame =
    spark.sql("select word, count(*) as total from words group by word")
  wordCountsDataFrame.show()
}