Flink Training: A Real-Time Taxi Ride Computation Project

Latest recommended article published on 2022-10-14 18:34:35

xiaopeigen


Article link: https://blog.csdn.net/xiaopeigen/article/details/108323975


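This article introduces an open-source Flink project that pairs business scenarios with worked uses of the Flink API. Its focus is an intuitive introduction to Flink's state and time management APIs; with these fundamentals in hand, you can understand and apply Flink more effectively.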

GitHub address: http://training.ververica.com/trainingData (cloning the project locally for practice is recommended). The project covers:

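- How to set up an environment for developing Flink programs
- How to implement a streaming data processing pipeline
- How Flink manages state
- How to use event time to compute accurate analytics consistently
- How to build event-driven applications on continuous streams
- How Flink provides fault-tolerant, stateful stream processing with exactly-once semantics
- Implementations of various operators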

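I. Project Overview

The taxi dataset contains information about taxi rides in New York City. Each ride is represented by two events: a ride-start event and a ride-end event. A related dataset holds the taxi fare data. The project implements the following exercises: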

1. Cleanse the stream by removing ride records whose start or end coordinates fall outside the New York City area.
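The cleansing rule in item 1 comes down to a bounding-box test on both endpoints of a ride. The following is a minimal plain-Java sketch with no Flink dependency. The class name NycFilterSketch, its method names, and the bounding-box coordinates are assumptions made for illustration and are not taken from this article; in a Flink job the same predicate would sit inside a filter function such as the RideCleansing.NYCFilter used in the pipeline for item 2.

```java
// Plain-Java sketch of the NYC cleansing predicate (no Flink dependency).
final class NycFilterSketch {
    // Assumed bounding box for the New York City area (illustrative values).
    static final float LON_WEST = -74.05f;
    static final float LON_EAST = -73.70f;
    static final float LAT_SOUTH = 40.50f;
    static final float LAT_NORTH = 41.00f;

    /** True if the point lies inside the assumed bounding box. */
    static boolean isInNYC(float lon, float lat) {
        return lon >= LON_WEST && lon <= LON_EAST
            && lat >= LAT_SOUTH && lat <= LAT_NORTH;
    }

    /** A ride is kept only if both its start and its end are inside the box. */
    static boolean keepRide(float startLon, float startLat, float endLon, float endLat) {
        return isInNYC(startLon, startLat) && isInNYC(endLon, endLat);
    }

    public static void main(String[] args) {
        // Both endpoints inside the box: the ride is kept.
        System.out.println(keepRide(-73.98f, 40.75f, -73.95f, 40.65f));
        // Start inside, end far to the west: the ride is dropped.
        System.out.println(keepRide(-73.98f, 40.75f, -75.00f, 40.75f));
    }
}
```

Requiring both endpoints to be inside the box means a ride is dropped as soon as either its start or its end lies outside the area.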

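2. Compute the popular taxi pick-up and drop-off areas.

The pipeline filters out rides that are not in New York, maps each record to (cell id, event type), keys the stream by (cell id, event type), applies a sliding time window, counts the events in each window, drops results whose count is below the popularity threshold (20), and maps what remains to the final output.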

DataStream<Tuple5<Float, Float, Long, Boolean, Integer>> popularSpots = rides
                // remove all rides which are not within NYC
                .filter(new RideCleansing.NYCFilter())
                // match ride to grid cell and event type (start or end)
                .map(new GridCellMatcher())
                // partition by cell id and event type
                .<KeyedStream<Tuple2<Integer, Boolean>, Tuple2<Integer, Boolean>>>keyBy(0, 1)
                // build sliding window
                .timeWindow(Time.minutes(15), Time.minutes(5))
                // count ride events in window
                .apply(new RideCounter())
                // filter by popularity threshold
                .filter((Tuple4<Integer, Long, Boolean, Integer> count) -> (count.f3 >= popThreshold))
                // map grid cell to coordinates
                .map(new GridToCoordinates());
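Because the window is 15 minutes long and slides every 5 minutes, each ride event is counted in three overlapping windows. The plain-Java sketch below (no Flink dependency) shows how an event timestamp maps to sliding-window start times, assuming windows aligned to the epoch with no offset. The class and method names are hypothetical, and this is an illustration of the idea rather than Flink's own window assigner.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch: which sliding windows does an event fall into?
final class SlidingWindowSketch {
    /**
     * Returns the start timestamps (ms) of every window of length sizeMs,
     * advancing every slideMs, that contains timestampMs.
     * Each window covers the half-open interval [start, start + sizeMs).
     */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the event.
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk backwards while the window still reaches the event.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // An event at minute 17, with 15-minute windows sliding every 5 minutes.
        for (long start : windowStarts(17 * minute, 15 * minute, 5 * minute)) {
            long end = start + 15 * minute;
            System.out.println("[" + start / minute + ", " + end / minute + ")");
        }
        // prints [15, 30), [10, 25) and [5, 20), one per line
    }
}
```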

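3. Event time, watermarks, windows, and related operations.

4. Event-driven applications, implemented with KeyedProcessFunction.

5. State-based computation: join the TaxiRide and TaxiFare records that belong to the same ride.

6. Checkpoints and savepoints.

7. Uses of broadcast state, and more.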
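To make the state-based join in item 5 more concrete, here is a plain-Java simulation with no Flink dependency. In a Flink job, one common way to write this is a RichCoFlatMapFunction over two connected streams keyed by rideId, holding a ValueState for the ride and another for the fare. In this sketch two HashMaps stand in for that keyed state, strings stand in for the TaxiRide and TaxiFare records, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of joining rides and fares by rideId via buffered state.
final class RideFareJoinSketch {
    // Stand-ins for keyed state: at most one pending ride and one pending fare per rideId.
    private final Map<Long, String> pendingRides = new HashMap<>();
    private final Map<Long, String> pendingFares = new HashMap<>();
    private final List<String> joined = new ArrayList<>();

    /** Handles a ride: emit the pair if its fare is already buffered, else buffer the ride. */
    void onRide(long rideId, String ride) {
        String fare = pendingFares.remove(rideId);
        if (fare != null) {
            joined.add(ride + "|" + fare);
        } else {
            pendingRides.put(rideId, ride);
        }
    }

    /** Handles a fare: emit the pair if its ride is already buffered, else buffer the fare. */
    void onFare(long rideId, String fare) {
        String ride = pendingRides.remove(rideId);
        if (ride != null) {
            joined.add(ride + "|" + fare);
        } else {
            pendingFares.put(rideId, fare);
        }
    }

    List<String> results() {
        return joined;
    }

    public static void main(String[] args) {
        RideFareJoinSketch join = new RideFareJoinSketch();
        join.onRide(1L, "ride-1"); // buffered, no fare yet
        join.onFare(2L, "fare-2"); // buffered, no ride yet
        join.onFare(1L, "fare-1"); // matches ride-1
        join.onRide(2L, "ride-2"); // matches fare-2
        System.out.println(join.results()); // prints [ride-1|fare-1, ride-2|fare-2]
    }
}
```

Whichever record arrives first is buffered and then cleared once its partner shows up, so the state per ride stays small regardless of arrival order.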

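II. Data Preparation

Most of the flink-training examples use a historical dataset provided by the New York City Taxi & Limousine Commission as their practice data source. The most commonly used event type, the taxi ride, is defined as follows: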

rideId         : Long      // a unique id for each ride
taxiId         : Long      // a unique id for each taxi
driverId       : Long      // a unique id for each driver
isStart        : Boolean   // TRUE for ride start events, FALSE for ride end events
startTime      : DateTime  // the start time of a ride
endTime        : DateTime  // the end time of a ride,
                           //   "1970-01-01 00:00:00" for start events
startLon       : Float     // the longitude of the ride start location
startLat       : Float     // the latitude of the ride start location
endLon         : Float     // the longitude of the ride end location
endLat         : Float     // the latitude of the ride end location
passengerCnt   : Short     // number of passengers on the ride
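The field list above maps naturally onto a small Java value class. The sketch below is not the project's own TaxiRide class: the name TaxiRideSketch and the eventTime() helper are illustrative assumptions, and java.time.LocalDateTime stands in for the DateTime type so that the example needs only the standard library.

```java
import java.time.LocalDateTime;

// Illustrative value class mirroring the taxi ride event fields listed above.
final class TaxiRideSketch {
    final long rideId;
    final long taxiId;
    final long driverId;
    final boolean isStart;          // true for start events, false for end events
    final LocalDateTime startTime;
    final LocalDateTime endTime;    // 1970-01-01 00:00:00 for start events
    final float startLon;
    final float startLat;
    final float endLon;
    final float endLat;
    final short passengerCnt;

    TaxiRideSketch(long rideId, long taxiId, long driverId, boolean isStart,
                   LocalDateTime startTime, LocalDateTime endTime,
                   float startLon, float startLat, float endLon, float endLat,
                   short passengerCnt) {
        this.rideId = rideId;
        this.taxiId = taxiId;
        this.driverId = driverId;
        this.isStart = isStart;
        this.startTime = startTime;
        this.endTime = endTime;
        this.startLon = startLon;
        this.startLat = startLat;
        this.endLon = endLon;
        this.endLat = endLat;
        this.passengerCnt = passengerCnt;
    }

    /** Event time of the record: start time for start events, end time for end events. */
    LocalDateTime eventTime() {
        return isStart ? startTime : endTime;
    }

    public static void main(String[] args) {
        LocalDateTime epoch = LocalDateTime.of(1970, 1, 1, 0, 0, 0);
        LocalDateTime start = LocalDateTime.of(2013, 1, 1, 0, 0, 0);
        // Made-up values for a start event, used only to exercise the class.
        TaxiRideSketch ride = new TaxiRideSketch(1L, 10L, 100L, true, start, epoch,
                -73.98f, 40.75f, 0f, 0f, (short) 2);
        System.out.println(ride.eventTime()); // the start time, since this is a start event
    }
}
```

Deriving one event time per record from the isStart flag is what lets start and end events share a single event-time stream.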

Download the dataset:

wget http://training.data-artisans.com/trainingData/nycTaxiRides.gz

Turn the data file into a Flink stream source:

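import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// TaxiRide and TaxiRideSource come from the training project; the two values below are examples only
final int maxDelay = 60;       // events are out of order by at most 60 seconds
final int servingSpeed = 600;  // 10 minutes of event time are served per second

// get an ExecutionEnvironment
StreamExecutionEnvironment env =
  StreamExecutionEnvironment.getExecutionEnvironment();
// configure event-time processing
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

// get the taxi ride data stream
DataStream<TaxiRide> rides = env.addSource(
  new TaxiRideSource("/path/to/nycTaxiRides.gz", maxDelay, servingSpeed));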

For a detailed analysis of the project code, see the following blog post:

Flink Basics and Flink Training: The Taxi Ride Project (康健's column, CSDN blog)

https://blog.csdn.net/healthsun/article/details/103786991?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-16.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-16.nonecase
