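I DStream Output
Output operations specify what to do with the data produced by the transformations on a stream (for example, pushing results to an external database or printing them to the screen). As with lazy evaluation in RDDs, if neither a DStream nor any DStream derived from it has an output operation applied, none of those DStreams is evaluated. If no output operation is registered in the StreamingContext, the context will not start at all.
The output operations are:
- print(): prints the first 10 elements of every batch of the DStream on the driver node running the streaming application. It is intended for development and debugging. In the Python API the equivalent operation is called pprint().
- saveAsTextFiles(prefix, [suffix]): saves the DStream's contents as text files. The file name for each batch is built from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
- saveAsObjectFiles(prefix, [suffix]): saves the stream's contents as SequenceFiles of serialized Java objects. The file name for each batch is built from prefix and suffix: "prefix-TIME_IN_MS[.suffix]". Not currently available in the Python API.
- saveAsHadoopFiles(prefix, [suffix]): saves the stream's contents as Hadoop files. The file name for each batch is built from prefix and suffix: "prefix-TIME_IN_MS[.suffix]". Not currently available in the Python API.
- foreachRDD(func): the most general output operation. It applies the function func to every RDD generated from the stream. func should push the data in each RDD to an external system, for example saving the RDD to files or writing it to a database over the network. Like transform(), foreachRDD() gives access to arbitrary RDDs, so every action available in Spark can be reused inside it. A common use case is writing data to an external database such as MySQL.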
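The "prefix-TIME_IN_MS[.suffix]" naming pattern used by the saveAs* operations can be illustrated with a small plain-Scala helper. This is only a sketch of the pattern quoted above; `BatchFileName` and its parameters are hypothetical names and not part of the Spark API.

```scala
object BatchFileName {
  // Builds "prefix-TIME_IN_MS[.suffix]" for one batch.
  // `suffix` is optional, matching the [suffix] argument of the saveAs* operations.
  def of(prefix: String, suffix: Option[String], batchTimeMs: Long): String = {
    val base = s"$prefix-$batchTimeMs"
    suffix.filter(_.nonEmpty).map(s => s"$base.$s").getOrElse(base)
  }
}
```

For example, a batch at time 1000 ms saved with prefix "out" and suffix "txt" would be named "out-1000.txt".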
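Notes:
- Do not create the connection on the driver: it would have to be serialized and shipped to the executors, and connection objects are generally not serializable.
- Creating the connection inside foreach opens one connection for every record in every RDD, which costs more than it gains.
- Use foreachPartition instead and create (or obtain from a pool) one connection per partition.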
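The cost difference between the last two notes can be seen in a plain-Scala simulation that needs no Spark. The sketch below models partitions as nested sequences and counts how many connections each strategy opens; `FakeConnection` and `ConnectionCostDemo` are hypothetical stand-ins, not real JDBC or Spark classes.

```scala
import java.util.concurrent.atomic.AtomicInteger

object ConnectionCostDemo {
  // Hypothetical stand-in for an expensive resource such as a JDBC connection.
  final class FakeConnection(opened: AtomicInteger) {
    opened.incrementAndGet() // constructing the object models opening a connection
    def write(record: String): Unit = ()
    def close(): Unit = ()
  }

  // One connection per record: what happens when the connection is created inside foreach.
  def perRecord(partitions: Seq[Seq[String]]): Int = {
    val opened = new AtomicInteger(0)
    partitions.foreach { part =>
      part.foreach { record =>
        val conn = new FakeConnection(opened)
        try conn.write(record) finally conn.close()
      }
    }
    opened.get()
  }

  // One connection per partition: the foreachPartition pattern.
  def perPartition(partitions: Seq[Seq[String]]): Int = {
    val opened = new AtomicInteger(0)
    partitions.foreach { part =>
      val conn = new FakeConnection(opened)
      try part.foreach(conn.write) finally conn.close()
    }
    opened.get()
  }
}
```

With two partitions holding five records in total, the first strategy opens five connections and the second opens two. In Spark the second shape corresponds to calling rdd.foreachPartition inside foreachRDD.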
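II Graceful Shutdown
A streaming job is expected to run 24/7, but it sometimes has to be stopped deliberately, for example to deploy new code. Because the application is distributed, killing its processes one by one is not practical, so configuring a graceful shutdown is essential.
The approach here uses an external file system to tell the running application when to shut down.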
1 MonitorStop
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.{StreamingContext, StreamingContextState}
class MonitorStop(ssc: StreamingContext) extends Runnable {
  override def run(): Unit = {
    // Watch HDFS for the marker file that requests a shutdown
    val fs: FileSystem = FileSystem.get(new URI("hdfs://hadoop101:8020"), new Configuration(), "hike")
    while (true) {
      try
        Thread.sleep(5000)
      catch {
        case e: InterruptedException =>
          e.printStackTrace()
      }
      // Get the current state of the StreamingContext
      val state: StreamingContextState = ssc.getState
      // Check whether the marker path exists
      val bool: Boolean = fs.exists(new Path("hdfs://hadoop101:8020/stopSpark"))
      if (bool) {
        // Only stop the context while it is active
        if (state == StreamingContextState.ACTIVE) {
          ssc.stop(stopSparkContext = true, stopGracefully = true)
          System.exit(0)
        }
      }
    }
  }
}
2 SparkTest
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkTest {
  def createSSC(): StreamingContext = {
    val update: (Seq[Int], Option[Int]) => Some[Int] = (values: Seq[Int], status: Option[Int]) => {
      // Sum the values of the current batch
      val sum: Int = values.sum
      // Read the previous state, defaulting to 0
      val lastState: Int = status.getOrElse(0)
      Some(sum + lastState)
    }
    val sparkConf: SparkConf = new SparkConf().setMaster("local[4]").setAppName("SparkTest")
    // Enable graceful shutdown when the JVM shuts down
    sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint("./ck")
    val line: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop101", 9999)
    val word: DStream[String] = line.flatMap(_.split(" "))
    val wordAndOne: DStream[(String, Int)] = word.map((_, 1))
    val wordAndCount: DStream[(String, Int)] = wordAndOne.updateStateByKey(update)
    wordAndCount.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc: StreamingContext = StreamingContext.getActiveOrCreate("./ck", () => createSSC())
    new Thread(new MonitorStop(ssc)).start()
    ssc.start()
    ssc.awaitTermination()
  }
}
III Spark Streaming Case Study
1 Environment setup
(1) pom file
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.alibaba/druid -->
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>druid</artifactId>
        <version>1.1.10</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.27</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-core</artifactId>
        <version>2.10.1</version>
    </dependency>
</dependencies>
(2) Utility classes
PropertiesUtil
import java.io.InputStreamReader
import java.util.Properties
object PropertiesUtil {
  def load(propertiesName: String): Properties = {
    val prop = new Properties()
    prop.load(new InputStreamReader(Thread.currentThread().getContextClassLoader.getResourceAsStream(propertiesName), "UTF-8"))
    prop
  }
}
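2 Real-time data generation module
This project analyzes user ad-click behavior in real time. A program continuously generates data and writes it to Kafka; the data is then consumed from Kafka and analyzed according to each requirement.
Format of the simulated data:
Timestamp | Area | City | User ID | Ad ID |
---|---|---|---|---|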
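As a sketch of how one such record can be turned into a typed value, the parser below splits a line on single spaces (the separator the generator uses) and rejects malformed input. `AdsLogParser` and `AdClick` are hypothetical names chosen for this illustration.

```scala
import scala.util.Try

object AdsLogParser {
  // Mirrors the five fields of one mock record.
  final case class AdClick(timestamp: Long, area: String, city: String, userId: String, adId: String)

  // Returns None when the line does not have exactly five fields
  // or when the timestamp is not a valid Long.
  def parse(line: String): Option[AdClick] =
    line.trim.split(" ") match {
      case Array(ts, area, city, user, ad) =>
        Try(ts.toLong).toOption.map(t => AdClick(t, area, city, user, ad))
      case _ => None
    }
}
```

Returning an Option keeps a single malformed record from failing a whole batch; callers can drop the None values with flatMap.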
(1)config.properties
# JDBC settings
jdbc.datasource.size=10
jdbc.url=jdbc:mysql://hadoop101:3306/spark2020?useUnicode=true&characterEncoding=utf8&rewriteBatchedStatements=true
jdbc.user=root
jdbc.password=000000
# Kafka settings
kafka.broker.list=hadoop101:9092,hadoop102:9092,hadoop103:9092
(2)CityInfo
/**
 * City information record
 *
 * @param city_id   city id
 * @param city_name city name
 * @param area      region the city belongs to
 */
case class CityInfo(city_id: Long,
                    city_name: String,
                    area: String)
(3)RandomOptions
import scala.collection.mutable.ListBuffer
import scala.util.Random
case class RanOpt[T](value: T, weight: Int)

object RandomOptions {
  def apply[T](opts: RanOpt[T]*): RandomOptions[T] = {
    val randomOptions = new RandomOptions[T]()
    for (opt <- opts) {
      randomOptions.totalWeight += opt.weight
      for (i <- 1 to opt.weight) {
        randomOptions.optsBuffer += opt.value
      }
    }
    randomOptions
  }
}

class RandomOptions[T](opts: RanOpt[T]*) {
  var totalWeight = 0
  var optsBuffer = new ListBuffer[T]

  def getRandomOpt: T = {
    val randomNum: Int = new Random().nextInt(totalWeight)
    optsBuffer(randomNum)
  }
}
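RandomOptions implements weighted selection by storing each value weight times in a buffer and drawing one index uniformly, so a value with weight w is chosen with probability w / totalWeight. The sketch below shows the same idea in a compact, immutable form; `WeightedPick` is a hypothetical helper written for this illustration, not part of the project.

```scala
import scala.util.Random

object WeightedPick {
  // Expand each (value, weight) pair into `weight` copies, as RandomOptions.apply does.
  def expand[T](options: Seq[(T, Int)]): Vector[T] =
    options.flatMap { case (value, weight) => Vector.fill(weight)(value) }.toVector

  // Draw one element uniformly from the expanded pool.
  def pick[T](pool: Vector[T], random: Random): T =
    pool(random.nextInt(pool.length))
}
```

Passing the Random instance in makes the draw reproducible in tests when a fixed seed is used.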
(4)MockerRealTime
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object MockerRealTime {
  /**
   * Generates the mock data.
   *
   * Format: timestamp area city userid adid
   * (a point in time, a region, a city, a user, an ad)
   */
  def generateMockData(): Array[String] = {
    val array: ArrayBuffer[String] = ArrayBuffer[String]()
    val CityRandomOpt = RandomOptions(RanOpt(CityInfo(1, "北京", "华北"), 30),
      RanOpt(CityInfo(2, "上海", "华东"), 30),
      RanOpt(CityInfo(3, "广州", "华南"), 10),
      RanOpt(CityInfo(4, "深圳", "华南"), 20),
      RanOpt(CityInfo(5, "天津", "华北"), 10))
    val random = new Random()
    // Simulated real-time records:
    // timestamp area city userid adid
    for (i <- 0 to 50) {
      val timestamp: Long = System.currentTimeMillis()
      val cityInfo: CityInfo = CityRandomOpt.getRandomOpt
      val city: String = cityInfo.city_name
      val area: String = cityInfo.area
      val adid: Int = 1 + random.nextInt(6)
      val userid: Int = 1 + random.nextInt(6)
      // Assemble one space-separated record
      array += s"$timestamp $area $city $userid $adid"
    }
    array.toArray
  }

  def createKafkaProducer(broker: String): KafkaProducer[String, String] = {
    // Create the configuration object
    val prop = new Properties()
    // Add the producer settings
    prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, broker)
    prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    // Create the Kafka producer from the configuration
    new KafkaProducer[String, String](prop)
  }

  def main(args: Array[String]): Unit = {
    // Read the Kafka settings from config.properties
    val config: Properties = PropertiesUtil.load("config.properties")
    val broker: String = config.getProperty("kafka.broker.list")
    // Must match the topic that RealTimeApp consumes
    val topic = "ads_log"
    // Create the Kafka producer
    val kafkaProducer: KafkaProducer[String, String] = createKafkaProducer(broker)
    while (true) {
      // Generate random records and send them to the Kafka cluster
      for (line <- generateMockData()) {
        kafkaProducer.send(new ProducerRecord[String, String](topic, line))
        println(line)
      }
      Thread.sleep(2000)
    }
  }
}
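3 Steps for generating the mock data
(1) Start the cluster
Start ZooKeeper and Kafka.
(2) Create the topic ads_log in Kafka
>bin/kafka-topics.sh --zookeeper hadoop202:2181 --list
>bin/kafka-topics.sh --zookeeper hadoop202:2181 --create --topic ads_log --partitions 3 --replication-factor 2
(3) Continuously produce data to the topic
Create the Maven module spark-realtime, add Scala support, and add the dependencies.
Implement the classes listed above.
(4) Run MockerRealTime and confirm that data is arriving in Kafka
Note: when testing, adjust the Kafka settings used by MockerRealTime and RealTimeApp to match your environment.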
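4 Requirement 1: ad blacklist
Implement a real-time dynamic blacklist: block any user who clicks a given ad more than 100 times in one day.
Note: the blacklist is stored in MySQL.
(1) Approach
1) After reading the data from Kafka, check each record against the blacklist stored in MySQL;
2) for records that pass the check, add the clicks to the user's click count for that ad and store it in MySQL;
3) after writing to MySQL, check the stored count; if it exceeds 100 for the day, add the user to the blacklist.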
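The three steps can be traced with an in-memory sketch that replaces MySQL with immutable maps and sets. This is only a simulation of the logic, assuming a click is represented as a (date, userId, adId) triple; `BlacklistSim`, `State` and `processBatch` are hypothetical names.

```scala
object BlacklistSim {
  type Key = (String, String, String) // (date, userId, adId)

  // counts stands in for the user_ad_count table, blacklist for black_list.
  final case class State(counts: Map[Key, Long], blacklist: Set[String])

  // Process one micro-batch of clicks and return the updated state.
  def processBatch(state: State, batch: Seq[Key], threshold: Long = 100L): State = {
    // Step 1: drop clicks from users already on the blacklist.
    val allowed = batch.filterNot { case (_, user, _) => state.blacklist.contains(user) }
    // Step 2: count clicks per key within the batch and add them to the running totals.
    val batchCounts = allowed.groupBy(identity).map { case (k, v) => k -> v.size.toLong }
    val counts = batchCounts.foldLeft(state.counts) { case (acc, (k, n)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + n)
    }
    // Step 3: blacklist every user whose updated total exceeds the threshold.
    val newlyBlocked = batchCounts.keys.collect {
      case k @ (_, user, _) if counts(k) > threshold => user
    }
    State(counts, state.blacklist ++ newlyBlocked)
  }
}
```

Because a blacklisted user's clicks are filtered out in step 1, that user's stored count stops growing once the user is blocked.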
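(2) Create the MySQL tables
Create the database spark2020 (the name used in jdbc.url).
1) Table holding the blacklisted users
CREATE TABLE black_list (userid CHAR(1) PRIMARY KEY);
2) Table holding each user's daily click count for each ad
CREATE TABLE user_ad_count (
    dt VARCHAR(255),
    userid CHAR(1),
    adid CHAR(1),
    count BIGINT,
    PRIMARY KEY (dt, userid, adid)
);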
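(3) Environment setup
The real-time requirements are implemented with Spark Streaming. In production the data source is Kafka in the vast majority of cases, so start by creating a utility class that lets Spark Streaming read from Kafka.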
MyKafkaUtil
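import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object MyKafkaUtil {
  // 1. Load the configuration
  private val properties: Properties = PropertiesUtil.load("config.properties")
  // 2. Broker addresses used for the initial connection to the cluster
  val broker_list: String = properties.getProperty("kafka.broker.list")
  // 3. Kafka consumer settings
  val kafkaParam: Map[String, Object] = Map(
    "bootstrap.servers" -> broker_list,
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    // Consumer group
    "group.id" -> "commerce-consumer-group",
    // Applies when there is no initial offset or the current offset no longer exists on the server;
    // "latest" resets the offset to the most recent one
    "auto.offset.reset" -> "latest",
    // true: offsets are committed automatically in the background, so records can be lost
    //       if the job fails after an offset is committed but before the records are processed
    // false: offsets have to be maintained manually
    "enable.auto.commit" -> (true: java.lang.Boolean)
  )

  // Creates the DStream and returns the incoming data.
  // LocationStrategies: decide how topic partitions are scheduled onto executors
  // LocationStrategies.PreferConsistent: distribute partitions evenly across the available executors
  // ConsumerStrategies: decide how Kafka consumers are created and configured on the driver and executors
  // ConsumerStrategies.Subscribe: subscribe to a fixed collection of topics
  def getKafkaStream(topic: String, ssc: StreamingContext): InputDStream[ConsumerRecord[String, String]] = {
    val dStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String](Array(topic), kafkaParam))
    dStream
  }
}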
JdbcUtil
import java.sql.{Connection, PreparedStatement, ResultSet}
import java.util.Properties
import javax.sql.DataSource
import com.alibaba.druid.pool.DruidDataSourceFactory

object JdbcUtil {
  // Connection pool, initialized once
  var dataSource: DataSource = init()

  // Build the connection pool from config.properties
  def init(): DataSource = {
    val properties = new Properties()
    val config: Properties = PropertiesUtil.load("config.properties")
    properties.setProperty("driverClassName", "com.mysql.jdbc.Driver")
    properties.setProperty("url", config.getProperty("jdbc.url"))
    properties.setProperty("username", config.getProperty("jdbc.user"))
    properties.setProperty("password", config.getProperty("jdbc.password"))
    properties.setProperty("maxActive", config.getProperty("jdbc.datasource.size"))
    DruidDataSourceFactory.createDataSource(properties)
  }

  // Borrow a MySQL connection from the pool
  def getConnection: Connection = {
    dataSource.getConnection
  }

  // Execute one SQL statement (single-row insert or update)
  def executeUpdate(connection: Connection, sql: String, params: Array[Any]): Int = {
    var rtn = 0
    var pstmt: PreparedStatement = null
    try {
      connection.setAutoCommit(false)
      pstmt = connection.prepareStatement(sql)
      if (params != null && params.length > 0) {
        for (i <- params.indices) {
          pstmt.setObject(i + 1, params(i))
        }
      }
      rtn = pstmt.executeUpdate()
      connection.commit()
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      // Close the statement even when execution fails
      if (pstmt != null) pstmt.close()
    }
    rtn
  }

  // Execute one SQL statement for a batch of parameter sets
  def executeBatchUpdate(connection: Connection, sql: String, paramsList: Iterable[Array[Any]]): Array[Int] = {
    var rtn: Array[Int] = null
    var pstmt: PreparedStatement = null
    try {
      connection.setAutoCommit(false)
      pstmt = connection.prepareStatement(sql)
      for (params <- paramsList) {
        if (params != null && params.length > 0) {
          for (i <- params.indices) {
            pstmt.setObject(i + 1, params(i))
          }
          pstmt.addBatch()
        }
      }
      rtn = pstmt.executeBatch()
      connection.commit()
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (pstmt != null) pstmt.close()
    }
    rtn
  }

  // Check whether a matching row exists
  def isExist(connection: Connection, sql: String, params: Array[Any]): Boolean = {
    var flag: Boolean = false
    var pstmt: PreparedStatement = null
    try {
      pstmt = connection.prepareStatement(sql)
      for (i <- params.indices) {
        pstmt.setObject(i + 1, params(i))
      }
      flag = pstmt.executeQuery().next()
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (pstmt != null) pstmt.close()
    }
    flag
  }

  // Read a single Long value from MySQL (first column of the last row returned)
  def getDataFromMysql(connection: Connection, sql: String, params: Array[Any]): Long = {
    var result: Long = 0L
    var pstmt: PreparedStatement = null
    try {
      pstmt = connection.prepareStatement(sql)
      for (i <- params.indices) {
        pstmt.setObject(i + 1, params(i))
      }
      val resultSet: ResultSet = pstmt.executeQuery()
      while (resultSet.next()) {
        result = resultSet.getLong(1)
      }
      resultSet.close()
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (pstmt != null) pstmt.close()
    }
    result
  }

  // Entry point for manually testing the methods above
  def main(args: Array[String]): Unit = {
  }
}
(4) Implementation
Ads_log
case class Ads_log(timestamp: Long,
                   area: String,
                   city: String,
                   userid: String,
                   adid: String)
BlackListHandler
import java.sql.Connection
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.streaming.dstream.DStream

object BlackListHandler {
  // Date formatter
  private val sdf = new SimpleDateFormat("yyyy-MM-dd")

  def addBlackList(filterAdsLogDStream: DStream[Ads_log]): Unit = {
    // Count, within the current batch, each user's clicks on each ad per day
    // 1. Reshape the data: ads_log => ((date, user, adid), 1)
    val dateUserAdToOne: DStream[((String, String, String), Long)] = filterAdsLogDStream.map(adsLog => {
      // a. Convert the timestamp to a date string
      val date: String = sdf.format(new Date(adsLog.timestamp))
      // b. Return the key/value pair
      ((date, adsLog.userid, adsLog.adid), 1L)
    })
    // 2. Sum the clicks: ((date, user, adid), 1) => ((date, user, adid), count)
    val dateUserAdToCount: DStream[((String, String, String), Long)] = dateUserAdToOne.reduceByKey(_ + _)
    dateUserAdToCount.foreachRDD(rdd => {
      rdd.foreachPartition(iter => {
        // One pooled connection per partition
        val connection: Connection = JdbcUtil.getConnection
        iter.foreach { case ((dt, user, ad), count) =>
          // Add this batch's count to the stored daily total
          JdbcUtil.executeUpdate(connection,
            """
              |INSERT INTO user_ad_count (dt,userid,adid,count)
              |VALUES (?,?,?,?)
              |ON DUPLICATE KEY
              |UPDATE count=count+?
            """.stripMargin, Array(dt, user, ad, count, count))
          // Read the updated total back
          val ct: Long = JdbcUtil.getDataFromMysql(connection, "select count from user_ad_count where dt=? and userid=? and adid =?", Array(dt, user, ad))
          // Blacklist users with more than 100 clicks on one ad in a day
          if (ct > 100) {
            JdbcUtil.executeUpdate(connection, "INSERT INTO black_list (userid) VALUES (?) ON DUPLICATE KEY update userid=?", Array(user, user))
          }
        }
        connection.close()
      })
    })
  }

  // Drop records whose user is already on the blacklist
  def filterByBlackList(adsLogDStream: DStream[Ads_log]): DStream[Ads_log] = {
    adsLogDStream.transform(rdd => {
      rdd.filter(adsLog => {
        val connection: Connection = JdbcUtil.getConnection
        val bool: Boolean = JdbcUtil.isExist(connection, "select * from black_list where userid=?", Array(adsLog.userid))
        connection.close()
        !bool
      })
    })
  }
}
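One caveat about the shared SimpleDateFormat above: SimpleDateFormat is not thread-safe, and tasks on the same executor run as threads in one JVM. A possible alternative, sketched here under the assumption that only the "yyyy-MM-dd" day string is needed, is the immutable and thread-safe java.time.format.DateTimeFormatter. `DayFormatter` is a hypothetical helper name.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object DayFormatter {
  // DateTimeFormatter is immutable, so one shared instance is safe across threads.
  private val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")

  // Converts an epoch-millisecond timestamp to a day string in the given time zone.
  def toDay(timestampMs: Long, zone: ZoneId = ZoneId.systemDefault()): String =
    formatter.format(Instant.ofEpochMilli(timestampMs).atZone(zone))
}
```

Passing the zone explicitly also makes the day boundary independent of the machine's default time zone, which matters when the daily count is keyed by date.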
RealTimeApp
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}

object RealTimeApp {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("RealTimeApp").setMaster("local[*]")
    // 2. Create the StreamingContext
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    // 3. Read the data
    val kafkaDStream: InputDStream[ConsumerRecord[String, String]] = MyKafkaUtil.getKafkaStream("ads_log", ssc)
    // 4. Convert each Kafka record into a case class instance
    val adsLogDStream: DStream[Ads_log] = kafkaDStream.map(record => {
      val value: String = record.value()
      val arr: Array[String] = value.split(" ")
      Ads_log(arr(0).toLong, arr(1), arr(2), arr(3), arr(4))
    })
    // 5. Requirement 1: filter the current data against the blacklist in MySQL
    val filterAdsLogDStream: DStream[Ads_log] = BlackListHandler.filterByBlackList(adsLogDStream)
    // 6. Requirement 1: add users who meet the condition to the blacklist
    BlackListHandler.addBlackList(filterAdsLogDStream)
    // Print the per-batch record count for testing
    filterAdsLogDStream.cache()
    filterAdsLogDStream.count().print()
    // Start the job
    ssc.start()
    ssc.awaitTermination()
  }
}