A Flink streaming program has the following structure:
- Create the Flink execution environment.
- Read one or more input streams from data sources.
- Implement the business logic with stream transformation operators.
- Write the results to one or more external systems (optional).
- Execute the program.
1. Environment (creating the Flink execution environment)
1.1 getExecutionEnvironment()
Call the static getExecutionEnvironment() method to obtain an execution environment. It creates a local environment when the program runs inside an IDE or as a standalone application, and returns the cluster's execution environment when the job is submitted to a cluster.
1. Batch processing:
val env: ExecutionEnvironment =
ExecutionEnvironment.getExecutionEnvironment
2. Stream processing:
val env: StreamExecutionEnvironment =
StreamExecutionEnvironment.getExecutionEnvironment
1.2 createLocalEnvironment(): create a local execution environment
val localEnv = StreamExecutionEnvironment
.createLocalEnvironment()
1.3 createRemoteEnvironment(): create a remote execution environment
val remoteEnv = StreamExecutionEnvironment
.createRemoteEnvironment(
"host", // hostname of the JobManager
1234, // port of the JobManager process (must be a valid port number)
"path/to/jarFile.jar" // JAR file to ship to the JobManager
)
2. Source (reading input streams)
2.1 Reading from a collection
val stream = env
.fromCollection(List(
SensorReading("sensor_1", 1547718199, 35.80018327300259),
SensorReading("sensor_6", 1547718199, 15.402984393403084),
SensorReading("sensor_7", 1547718199, 6.720945201171228),
SensorReading("sensor_10", 1547718199, 38.101067604893444)
))
2.2 Reading from a file
import org.apache.flink.streaming.api.scala._
/**
* @Author jaffe
* @Date 2020/06/09 10:57
*/
object SourceFromFile {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
val stream = env
.readTextFile("F:\\ide\\moven\\flink0608\\src\\main\\resources\\test")
.map(r => {
val arr = r.split(",")
SensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
})
stream.print()
env.execute()
}
}
case class SensorReading(id: String,
timestamp: Long,
temperature: Double)
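The parsing step inside the `map` above can be tried on its own, without Flink. The sketch below factors it into a hypothetical `parseLine` helper (not part of the original code) so the split-and-convert logic is easy to test:

```scala
// Hypothetical helper mirroring the map step in SourceFromFile:
// split a CSV line on commas and build a SensorReading.
case class SensorReading(id: String,
                         timestamp: Long,
                         temperature: Double)

def parseLine(line: String): SensorReading = {
  val arr = line.split(",")
  SensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
}

val reading = parseLine("sensor_1,1547718199,35.8")
println(reading)
```

Note that `toLong`/`toDouble` throw on malformed input; a production job would typically guard this with a `Try` or a side-output for bad records.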
2.3 Reading from Kafka
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{
FlinkKafkaConsumer011, FlinkKafkaProducer011}
/**
* @Author jaffe
* @Date 2020/06/10 00:20
*/
object KafkaExample {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
val props = new Properties()
props.put("bootstrap.servers", "hadoop103:9092")
props.put("group.id", "consumer-group")
props.put(
"key.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer"
)
props.put(
"value.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer"
)
props.put("auto.offset.reset", "latest")
val stream = env
.addSource(
new FlinkKafkaConsumer011[String](
"test", // topic
new SimpleStringSchema(),
props
)
)
stream.print()
env.execute()
}
}
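The consumer configuration above uses only the JDK `Properties` class, so it can be checked in isolation. A minimal sketch, assuming the same broker address and group id as the example (the `kafkaProps` helper name is mine, not Flink's or Kafka's):

```scala
import java.util.Properties

// Build the Kafka consumer configuration used by the example above.
// Note the class is StringDeserializer, not "StringDeserialization".
def kafkaProps(): Properties = {
  val props = new Properties()
  props.put("bootstrap.servers", "hadoop103:9092")
  props.put("group.id", "consumer-group")
  props.put(
    "key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer"
  )
  props.put(
    "value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer"
  )
  props.put("auto.offset.reset", "latest")
  props
}

println(kafkaProps().getProperty("group.id"))
```

Misspelling the deserializer class name is a common pitfall: the consumer fails only at runtime, when Kafka tries to load the class reflectively.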
2.4 Custom source
import java.util.Calendar
import org.apache.flink.streaming.api.functions.source.{
RichParallelSourceFunction, SourceFunction}
import scala.util.Random
/**
* @Author jaffe
* @Date 2020/06/09 10:57
*/
// (sensor ID, timestamp, temperature)
case class SensorReading(id: String,
timestamp: Long,
temperature: Double)
// Continuously produces temperature readings, creating a data stream.
// A custom data source implements `RichParallelSourceFunction`.
// The event type produced by this source is `SensorReading`.
class SensorSource extends RichParallelSourceFunction[SensorReading]{
// Whether the source is still running; `true` means running
var running = true
// The `run` method continuously emits `SensorReading` records,
// using the `SourceContext` to send them downstream.
override def run(sourceContext: SourceFunction.SourceContext[SensorReading]): Unit = {
// Initialize the random number generator used to produce readings
val rand = new Random
// Initialize 10 (sensor ID, temperature) tuples.
// `(1 to 10)` iterates from 1 to 10.
var curFTemp = (1 to 10).map(
// generate a reading with Gaussian noise
i => ("sensor_"+ i, 65 + (rand.nextGaussian() * 20))
)
// infinite loop producing the data stream
while (running) {
// update the temperatures
curFTemp = curFTemp.map(t => (t._1,t._2 + (rand.nextGaussian() * 0.5)))
// current timestamp in milliseconds
val curTime = Calendar.getInstance.getTimeInMillis
// Call `collect` on the `SourceContext` to emit a record.
// Flink operators generally emit records downstream via `collect`.
curFTemp.foreach(t => sourceContext.collect(SensorReading(t._1,curTime,t._2)))
// emit a batch of readings every 100 ms
Thread.sleep(100)
}
}
// stop the infinite loop when the job is cancelled
override def cancel(): Unit = running = false
}
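The random-walk update inside `run` is plain Scala and can be exercised without Flink. A sketch under the same constants as the source (the `step` helper name is mine); each update adds Gaussian noise while leaving the sensor IDs untouched:

```scala
import scala.util.Random

// One update step of the sensor simulation, factored out of SensorSource.run:
// every (id, temperature) pair drifts by Gaussian noise scaled by 0.5.
def step(readings: IndexedSeq[(String, Double)],
         rand: Random): IndexedSeq[(String, Double)] =
  readings.map { case (id, t) => (id, t + rand.nextGaussian() * 0.5) }

// Same initialization as SensorSource: 10 sensors around 65 degrees.
val rand = new Random(42)  // fixed seed for reproducibility
val init = (1 to 10).map(i => ("sensor_" + i, 65 + rand.nextGaussian() * 20))
val next = step(init, rand)
```

Because the noise scale is 0.5 per tick and ticks are 100 ms apart, each sensor drifts slowly, which makes the stream useful for windowing demos later.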
Running the custom source:
import com.jaffe.day02.SensorSource
import org.apache.flink.streaming.api.scala._
/**
* @Author jaffe
* @Date 2020/06/09 10:57
*/
object SourceFromCustomDataSource {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
val stream = env
// 添加数据源
.addSource(new SensorSource)
stream.print()
env.execute()
}
}
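The run/cancel contract that `SensorSource` implements can be mimicked in plain Scala, without Flink. The hypothetical `CountSource` below (my own name, not a Flink class) loops until a volatile flag is flipped, exactly like the `running` flag in `SensorSource.cancel`:

```scala
// Minimal mimic of the SourceFunction run/cancel pattern:
// run emits increasing integers until cancel flips the flag.
class CountSource {
  @volatile var running = true

  def run(collect: Int => Unit): Unit = {
    var i = 0
    while (running) {
      i += 1
      collect(i)
    }
  }

  def cancel(): Unit = running = false
}

val buf = scala.collection.mutable.ListBuffer[Int]()
val src = new CountSource
// Cancel from inside the collect callback once 5 values arrive.
src.run { n =>
  buf += n
  if (n == 5) src.cancel()
}
```

The `@volatile` matters: in a real job, `cancel()` is called from a different thread than `run()`, so the flag write must be visible across threads for the loop to terminate.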
3. Transform (transformation operators)
3.1 Basic transformation operators
3.1.1 Map
Map passes each input event through a user-defined mapper and emits exactly one output event; the output type may differ from the input type.
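Before the Flink version, the one-in-one-out contract of map can be seen on an ordinary Scala collection (reusing the `SensorReading` case class from earlier): each input produces exactly one output, and the element type changes from `SensorReading` to `String`.

```scala
// Map semantics on a plain Scala List: one output per input,
// with a different output type (SensorReading -> String).
case class SensorReading(id: String,
                         timestamp: Long,
                         temperature: Double)

val readings = List(
  SensorReading("sensor_1", 1547718199L, 35.8),
  SensorReading("sensor_6", 1547718199L, 15.4)
)
val ids = readings.map(_.id)
println(ids)
```

Flink's `DataStream.map` behaves the same way element-wise, except that it runs continuously over an unbounded stream.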
import com.jaffe.day02.{
SensorReading, SensorSource}
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.streaming.api.scala._
/**
* @Author jaffe
* @Date 2020/06/09 10:55
*/
object MapExample {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
val stream = env.