Flink project: Scala edition

ManJustDoneIt

Category: Flink_study_project

Article link: https://blog.csdn.net/weixin_41628546/article/details/106450895

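This article walks through the stages of a hands-on Flink project in Scala: user behavior analysis, network traffic analysis, login failure detection, order timeout detection, and market analysis. It highlights the key points of each stage, such as data processing, window functions, Bloom filters, and CEP (complex event processing), and collects notes and a summary of mistakes. It is aimed at both Flink beginners and more advanced learners.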

Summary automatically generated by CSDN.

Learning source: 尚硅谷 (Atguigu)

stage00 tips

pom.xml and data

Download link: https://pan.baidu.com/s/1ASPKIqxT4cM63Q0swZuqrg (password: d58l)

Common functions

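/** Template
  source

  transform

    map --- parse each record into a bean (case class)
    assignAscendingTimestamps --- assigns event-time timestamps; only valid when timestamps arrive in ascending order (out-of-order data needs a bounded-out-of-orderness extractor instead)
    filter --- filter records
    keyBy --- partition the stream by key, e.g. by id
    timeWindow --- time window: tumbling with one argument (size), sliding with two (size, slide)
    aggregate --- aggregate(incremental count aggregator, window function)
      [here the window function turns the aggregated input into (itemId, end of each window, aggregated count)]
    keyBy --- re-key by window, e.g. by windowEnd
    process --- low-level API process function
      uses timers based on timestamps to emit output
    trigger --- trigger [decides when the window computation fires]

  print // sink

  */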
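
The shape of the template can be tried out without a cluster. The following is a minimal sketch that mimics the same steps (map, filter, keyBy, window, aggregate, keyBy by window end, top N) on an in-memory list with plain Scala collections. It is not Flink code, it only covers tumbling windows with non-negative timestamps, and the names `Event`, `ItemCount`, `windowEnd` and `topPerWindow` are made up for the illustration.

```scala
object TemplateSketch {
  // Hypothetical beans; timestamps are in milliseconds.
  final case class Event(itemId: Long, behavior: String, timestamp: Long)
  final case class ItemCount(itemId: Long, windowEnd: Long, count: Long)

  // End of the tumbling window [start, start + size) that contains ts (ts >= 0 assumed).
  def windowEnd(ts: Long, size: Long): Long = ts - (ts % size) + size

  def topPerWindow(lines: Seq[String], size: Long, n: Int): Map[Long, Seq[ItemCount]] =
    lines
      .map(_.split(","))                                       // map: split the raw line
      .map(f => Event(f(0).toLong, f(1), f(2).toLong))         // map: build the bean
      .filter(_.behavior == "pv")                              // filter
      .groupBy(e => (e.itemId, windowEnd(e.timestamp, size)))  // keyBy + window
      .toSeq
      .map { case ((id, end), es) => ItemCount(id, end, es.size.toLong) } // aggregate
      .groupBy(_.windowEnd)                                    // keyBy window end
      .map { case (end, cs) =>
        end -> cs.sortBy(_.count)(Ordering.Long.reverse).take(n) // top N per window
      }

  def main(args: Array[String]): Unit = {
    val lines = Seq("1,pv,1000", "1,pv,2000", "2,pv,3000", "2,buy,4000", "3,pv,61000")
    topPerWindow(lines, 60000L, 2).toSeq.sortBy(_._1).foreach(println)
  }
}
```

In a real job the grouping by window end is what the second keyBy does, and the sort runs inside a timer once the window's results have all arrived.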

attention

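Notes: 1. When keyBy is called with a string (a field name), the key becomes a Tuple and the matching TypeInformation has to be supplied.

2. When a timer copies State into a ListBuffer, a Java-to-Scala collection conversion (an implicit one in this project) is needed; see com.analysis.HotItemAnalysis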

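scala mistake summary [the summary matters most!]

Jar (dependency) versions
There was no window()-style call before process, so the timers defined in process did not fire [needs further verification]
Watch out for the various implicit-conversion issues
For keyBy and similar keys, prefer a field of the preceding case class; otherwise the key is a Tuple and TypeInformation has to specify the corresponding type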
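
The conversion issue mentioned above can be reproduced outside Flink. The sketch below is an illustration rather than code from the project: a `java.util.ArrayList` stands in for the `java.lang.Iterable` that `ListState.get()` returns, and `toBuffer` is a hypothetical helper. It uses `scala.collection.JavaConverters`, which is available in Scala 2.11 to 2.13 (in 2.13 it still works but is marked deprecated in favour of `scala.jdk.CollectionConverters`).

```scala
import java.util.{ArrayList => JArrayList}
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer

object JavaIterableSketch {
  // Copies a java.lang.Iterable into a Scala ListBuffer via an explicit .asScala.
  def toBuffer[T](javaItems: java.lang.Iterable[T]): ListBuffer[T] = {
    val buffer = ListBuffer.empty[T]
    for (item <- javaItems.asScala) buffer += item // += appends to the buffer
    buffer
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for state contents; a plain Java list is used instead of Flink state.
    val javaList = new JArrayList[String]()
    javaList.add("a")
    javaList.add("b")
    println(toBuffer(javaList))
  }
}
```

Writing `.asScala` explicitly makes the conversion visible at the call site, which tends to be easier to debug than a wildcard import of implicit conversions.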

stage01 user behavior analysis

1.totalcount

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// Skeleton only: the element type and everything in 【】 are placeholders to fill in.
val inStream: DataStream[ ] = env.addSource(new 【your own RichParallelSourceFunction】)
  .assignAscendingTimestamps(_.timestamp)

val stream = inStream
  .filter(data => { 【condition】 })
  .map(data => { 【build a tuple whose first field is the key】 })
  .keyBy(_._1)
  .timeWindow(Time.hours(1) /*, Time.seconds(10)*/)
  .process(new 【your own ProcessWindowFunction】)
  .print

//println(getClass.getName)
env.execute(getClass.getName)

2.HotItemAnalysis


import java.util.Properties

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

/**
  * UserBehavior: userId: Long, itemId: Long, categoryId: Long, behavior: String, timestamp: Long
  * ItemViewCount: itemId: Long, windowEnd: Long, count: Long
  */

case class UserBehavior(userId:Long , itemId:Long , categoryId : Long , behavior : String , timestamp : Long)

case class ItemViewCount(itemId :Long , windowEnd:Long , count :Long)

object HotItemAnalysis {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // The path is relative to the top-level module; the IDE's "Copy Path" action is the
    // easiest way to get it right.
    // Kafka alternative: uncomment the lines below and use them in place of readTextFile.
    //val properties = new Properties()
    //properties.setProperty("bootstrap.servers","localhost:9092")
    //properties.setProperty("group.id" , "consumer-group")
    //properties.setProperty("key.deserializer","org.apache.kafka.common.serialization.StringDeserializer")
    //properties.setProperty("value.deserializer","org.apache.kafka.common.serialization.StringDeserializer")
    //properties.setProperty("auto.offset.reset","latest")
    //val inputStream = env.addSource(
    //  new FlinkKafkaConsumer[String]("hotitems", new SimpleStringSchema() , properties)
    //)
    val inputStream =
      env.readTextFile("./userBehaviorAnalysis/src/main/resources/UserBehavior.csv")

    val dataStream = inputStream.map(data => {
      val fields = data.split(",")
      UserBehavior(fields(0).toLong, fields(1).toLong, fields(2).toLong,
        fields(3), fields(4).toLong)
    })
      // assign timestamps and watermarks so the stream carries event time;
      // the timestamp column looks like seconds, so convert it to milliseconds
      .assignAscendingTimestamps(_.timestamp * 1000)
      // keep only page views: "pv" is the behavior that best represents popularity
      .filter(_.behavior == "pv")
      .keyBy(_.itemId)
      // sliding window: one hour long, evaluated every five minutes, to track item popularity
      .timeWindow(Time.hours(1), Time.minutes(5))
      // count per item, then attach the item id and the window end to the count
      .aggregate(new countAgg(), new WindowResultFunction())
      // the stream now carries ItemViewCount(key, window.getEnd, countRes)
      .keyBy(_.windowEnd)
      .process(new TopNHotItems(3))
      .print()

    env.execute("my hot items job")
  }
}

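// Incremental count: IN = UserBehavior, ACC = Long, OUT = Long
class countAgg() extends AggregateFunction[UserBehavior, Long, Long] {
  // the count starts at zero
  override def createAccumulator(): Long = 0L

  // every record adds one
  override def add(in: UserBehavior, acc: Long): Long = acc + 1

  override def getResult(acc: Long): Long = acc

  // merging two partial counts adds them together
  override def merge(acc: Long, acc1: Long): Long = acc + acc1
}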

/**
  * Type parameters: IN, OUT, KEY, W (window).
  * Called for every key and window; tags each count with the end time of its window.
  */
class WindowResultFunction() extends WindowFunction[Long, ItemViewCount, Long, TimeWindow] {
  override def apply(key: Long, window: TimeWindow, aggregateRes: Iterable[Long],
                     out: Collector[ItemViewCount]): Unit = {
    val countRes = aggregateRes.iterator.next()
    // no need to pass the whole window along; its end timestamp is enough
    out.collect(ItemViewCount(key, window.getEnd, countRes))
  }
}

/**
  * Gathers the ItemViewCount records of every item that share a window (the stream is keyed by
  * windowEnd), then sorts and emits them in one go inside the timer.
  * @param topSize number of top items to emit per window
  */
class TopNHotItems(topSize: Int) extends KeyedProcessFunction[Long, ItemViewCount, String] {
  private var itemState: ListState[ItemViewCount] = _

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    // obtain the handle to the keyed list state from the runtime context
    itemState = getRuntimeContext.getListState(
      new ListStateDescriptor[ItemViewCount]("item_state", classOf[ItemViewCount])
    )
  }

  // 1. buffer the record
  // 2. register a timer at windowEnd + 1
  override def processElement(i: ItemViewCount, context: KeyedProcessFunction[Long, ItemViewCount, String]#Context,
                              collector: Collector[String]): Unit = {
    // buffer the record
    itemState.add(i)
    // register an event-time timer; onTimer is called back once the watermark passes it
    context.timerService().registerEventTimeTimer(i.windowEnd + 1)
  }

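  /**
    * processElement buffers the records first; onTimer then handles them all in one go.
    */
  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, ItemViewCount, String]#OnTimerContext,
                       out: Collector[String]): Unit = {
    // copy the records (itemId, windowEnd, count) out of ListState into a local buffer,
    // so that itemState can be cleared to free the state
    val allItems: ListBuffer[ItemViewCount] = ListBuffer()

    /**
      * Without a Java-to-Scala conversion the loop below fails to compile:
      *   Error:(109, 30) value foreach is not a member of Iterable[com.analysis.ItemViewCount]
      *   for(item <- itemState.get()){
      * itemState.get() returns a java.lang.Iterable, so the implicit conversion imported below is
      * needed; with it, elements can be added and removed conveniently with += and -=.
      * JavaConversions is deprecated since Scala 2.12 and removed in 2.13, where JavaConverters
      * with an explicit .asScala takes its place.
      */
    import scala.collection.JavaConversions._
    for (item <- itemState.get()) {
      allItems += item // += appends the element to the ListBuffer
    }

    itemState.clear()
    println("allItems.size = " + allItems.size)

    // sort descending by count; take(n) keeps the first n, much like take in Spark
    val sortItems = allItems.sortBy(_.count)(Ordering.Long.reverse).take(topSize)

    // build the output; subtract the 1 ms that was added when the timer was registered
    val res = new StringBuilder
    res.append("======================\n")
      .append("time: " + (timestamp - 1) + "\n")

    // list the items; indices is the range of valid indexes
    for (i <- sortItems.indices) {
      res.append("No." + (i + 1) + "\titem id: " + sortItems(i).itemId +
        "\thot: " + sortItems(i).count + "\n")
    }

    out.collect(res.toString())
  }
}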
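
With `timeWindow(Time.hours(1), Time.minutes(5))` every record falls into several overlapping windows, which is why the same item reaches `TopNHotItems` under many different `windowEnd` keys. The sketch below illustrates the assignment arithmetic in plain Scala, assuming a zero window offset and non-negative timestamps; `windowsFor` is a hypothetical helper written for this note, not a Flink API.

```scala
object SlidingWindowSketch {
  // Returns the [start, end) windows of the given size and slide (both in ms) that contain ts.
  // Assumes ts >= 0 and a zero window offset.
  def windowsFor(ts: Long, size: Long, slide: Long): Seq[(Long, Long)] = {
    val lastStart = ts - (ts % slide) // latest window start that is <= ts
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > ts - size) // keep windows whose end is still after ts
      .map(start => (start, start + size))
      .toList
  }

  def main(args: Array[String]): Unit = {
    val hour = 60 * 60 * 1000L
    val fiveMinutes = 5 * 60 * 1000L
    val windows = windowsFor(1511658000L * 1000, hour, fiveMinutes)
    println(windows.size)
    println(windows.head)
  }
}
```

For a one-hour window sliding every five minutes this gives 12 windows per record, so each `pv` event contributes to 12 separate counts.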

stage02 network total flow analysis

1.PageView(pv)

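Notes: 1. Time windows are left-closed and right-open: [start, end)

2. For the Scala transformation functions, read the source first and check the parameters, then write the code

3. Note in particular that the Java and Scala APIs are easy to mix up; avoid writing both in the same project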
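
The boundary behaviour can be checked with a few lines of arithmetic. This is a small sketch, assuming hourly tumbling windows aligned to the epoch with zero offset and working in seconds to match the raw timestamps; `windowStart` is a hypothetical helper, not a Flink API. The two timestamps are the first and last hour marks of the data set used in this stage.

```scala
object TumblingWindowSketch {
  // Start of the tumbling window of the given size that contains ts (ts >= 0, zero offset).
  def windowStart(ts: Long, size: Long): Long = ts - (ts % size)

  def main(args: Array[String]): Unit = {
    val hour = 3600L // seconds, to match the raw timestamps in the data
    val first = 1511658000L
    val last = 1511690400L
    // Distinct windows touched by the hour marks from first to last, both included.
    val starts = (first to last by hour).map(windowStart(_, hour)).distinct
    println(starts.size)
    // A record one second before the hour mark stays in the previous window.
    println(windowStart(last - 1, hour) == windowStart(last, hour))
  }
}
```

Because the end of a window is exclusive, a record stamped exactly on the hour opens a new window instead of closing the previous one, so the nine hours between the two timestamps produce ten windows.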


import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class UserBehavior(userId: Long, itemId: Long, categoryId: Long, behavior: String, timestamp: Long)

/**
  * Is each result just one hour of data? Yes, the count is taken hour by hour.
  * The boundary is strict: a record even one second past the end goes to the next window.
  * If every record stamped 1511690400 is removed, only 9 windows remain,
  * because 1511690400 - 1511658000 is exactly 9 hours and windows are left-closed, right-open,
  * e.g. [0, 3600): a record stamped exactly at the end opens a new window.
  */
object PageView {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val stream = env.readTextFile(
      "./networkTotalFlowAnalysis/src/main/resources/Userbehavior.csv")
      .map(data => {
        val fields = data.split(",")
        UserBehavior(fields(0).toLong, fields(1).toLong, fields(2).toLong,
          fields(3), fields(4).toLong)
      })
      .filter(_.behavior == "pv")
      // the timestamp column is in seconds; event time needs milliseconds
      .assignAscendingTimestamps(_.timestamp * 1000)
      .map(data => ("pv", 1))
      // _._1 is the first tuple field, i.e. "pv"
      .keyBy(_._1)
      .timeWindow(Time.hours(1))
      // sum takes a 0-based field position; position 1 is the count
      .sum(1)
      .print()

    env.execute("statistic pv job")

  }
}

2.HotPageOnLog

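Notes: 1. filter can use a regular expression to filter records; a precompiled pattern seems to speed this up

2. Popularity ranking follows roughly the same idea as before: keyBy itemId, apply the window, then sort and count

3. Approach: first keyBy itemId, then aggregate(accumulator, window function), where the accumulator's output is the window function's input;

then keyBy the window and apply process(custom KeyedProcessFunction);

finally the timer copies the buffered data out so that State can be cleared, sorts it (e.g. records.sortBy(_.hot)(Ordering.Long.reverse).take(top5)) and emits the result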
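
The regex note can be illustrated on its own. Below is a minimal sketch, assuming that "precompiled" means compiling the pattern once and reusing it for every record; the pattern itself (dropping URLs that end in a static-resource suffix) and the name `isPage` are made up for the example and are not taken from the project.

```scala
import java.util.regex.Pattern

object RegexFilterSketch {
  // Compiled once and reused for every record, instead of recompiling inside the filter.
  // The pattern is a made-up example: it matches URLs ending in a static-resource suffix.
  private val staticResource: Pattern = Pattern.compile("^.*\\.(css|js|png|ico)$")

  // True when the URL does not look like a static resource.
  def isPage(url: String): Boolean = !staticResource.matcher(url).matches()

  def main(args: Array[String]): Unit = {
    val urls = Seq("/index.html", "/style/site.css", "/images/logo.png", "/blog/tags/flink")
    urls.filter(isPage).foreach(println)
  }
}
```

The same predicate could be passed to a stream's `filter`; calling `String.matches` inside the lambda instead would recompile the expression for every record.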


import java.text.SimpleDateFormat

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

case class ApacheEvent(ip:String , userId :String , eventTime : Long , method : String ,
                       url:String )

case class UrlViewCount(url:String , windowEnd :Long , count :Long)

object HotPageOnLog {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.setParallelism(1)

    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val stream = env.readTextFile("./networkTotalFlowAnalysis/src/main/resources/apache.log")
      .map(data => {
        val fields = data.split(" ")
        val sdf = new SimpleDateFormat("dd/MM/yyyy:HH:mm:ss")
        val timestamp = sdf.parse(fields(3)).getTime
        ApacheEvent(fields(0
