Preface
In the earlier post Flink笔记01——入门篇 (Flink Notes 01: Getting Started), we covered Flink's commonly used APIs, as shown in the figure below:
In this post, 南国 walks through Flink's DataStream API.
The DataStream Programming Model
The DataStream programming model consists of four parts: Environment, DataSource, Transformation, and Sink.
Building the execution context (Environment) is straightforward and was covered in the earlier post; it mainly comes down to creating a streaming environment for DataStream (real-time) computation or a batch environment for DataSet (batch) computation.
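For reference, here is a minimal sketch of creating both kinds of environments (the class names come from the Flink Scala API; the enclosing object is just a placeholder for illustration):

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object EnvironmentDemo {
  def main(args: Array[String]): Unit = {
    // Streaming context: entry point for DataStream (real-time) programs
    val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
    // Batch context: entry point for DataSet (batch) programs
    val batchEnv = ExecutionEnvironment.getExecutionEnvironment
  }
}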
DataStream Data Sources
File-based source
This can be split into reading from the local file system and reading from HDFS; in practice the data source is usually kept in HDFS.
For completeness, here is a simple demo that reads a local file:
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object FileSourceLocal {
  def main(args: Array[String]): Unit = {
    // 1. Initialize the Flink streaming environment
    val streamEnv: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // 2. Import the implicit conversions required by the Scala API
    import org.apache.flink.streaming.api.scala._
    // 3. Read the data
    val stream = streamEnv.readTextFile("/wordcount.txt")
    // A DataStream is analogous to a DStream in Spark Streaming
    // 4. Transform and process the data
    val result = stream.flatMap(_.split(" "))
      .map((_, 1))
      .keyBy(0) // grouping operator; 0 and 1 are tuple field indexes: 0 is the word, 1 is the count
      .sum(1)   // aggregate: running sum of the counts
    // 5. Print the result
    result.print("result")
    // 6. Start the streaming job
    streamEnv.execute("wordcount")
  }
}
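For example, if /wordcount.txt contained the line "hello flink hello", the job would keep a running count per word and print something roughly like the following (Flink prefixes each record with the sink name, plus the subtask index when parallelism is greater than 1):

result> (hello,1)
result> (flink,1)
result> (hello,2)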
To read data from HDFS, first add the Hadoop-related dependencies to the project's build file:
<!-- Hadoop dependencies -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
demo:
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object FileSourceHdfs {
  def main(args: Array[String]): Unit = {
    // 1. Initialize the Flink streaming environment
    val streamEnv: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // 2. Import the implicit conversions required by the Scala API
    import org.apache.flink.streaming.api.scala._
    // 3. Read the data from HDFS
    val stream = streamEnv.readTextFile("hdfs://hadoop101:9000/wc.txt")
    // A DataStream is analogous to a DStream in Spark Streaming
    // 4. Transform and process the data
    val result = stream.flatMap(_.split(" "))
      .map((_, 1))
      .keyBy(0) // grouping operator; 0 and 1 are tuple field indexes: 0 is the word, 1 is the count
      .sum(1)   // aggregate: running sum of the counts
    // 5. Print the result
    result.print("result")
    // 6. Start the streaming job
    streamEnv.execute("wordcount")
  }
}
As you can see, the code for file-based sources is almost identical; only the path passed to streamEnv.readTextFile("path") changes.
Collection-based source
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object CollectionSource {
  def main(args: Array[String]): Unit = {
    // Initialize the Flink streaming execution environment
    val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
    // Import the implicit conversions required by the Scala API
    import org.apache.flink.streaming.api.scala._
    // Build a DataStream from an in-memory collection
    val dataStream = streamEnv.fromCollection(Array(
      new StationLog("001", "186", "189", "busy", 1577071519462L, 0),
      new StationLog("002", "186", "188", "busy", 1577071520462L, 0),
      new StationLog("003", "183", "188", "busy", 1577071521462L, 0),
      new StationLog("004", "186", "188", "success", 1577071522462L, 32)
    ))
    dataStream.print()
    streamEnv.execute()
  }
}
In short, you hand-build a collection in code and use it as the DataStream source, which is convenient for testing.
Kafka-based source
First add the Kafka connector dependency:
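StationLog is not defined in this snippet; judging from the constructor arguments it is a simple case class (or POJO) describing a call record. A minimal sketch, with field names that are my assumption rather than the original definition, could look like:

case class StationLog(sid: String,      // base station id (assumed)
                      callOut: String,  // calling number (assumed)
                      callIn: String,   // called number (assumed)
                      callType: String, // e.g. "busy" or "success"
                      callTime: Long,   // timestamp in milliseconds
                      duration: Long)   // call duration (assumed)

Scala still allows new on a case class, so the fromCollection example above compiles either way.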
<!-- Kafka connector dependency -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.11</artifactId>
<version>1.9.1</version>
</dependency>
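Note that the connector artifact has to match your build: the _2.11 suffix of flink-connector-kafka_2.11 refers to the Scala version, and 1.9.1 corresponds to the Flink version used in these notes.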
1. Reading String data from Kafka
package com.flink.primary.DataSource
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
/**
 * Reads plain String data from Kafka.
 * Connecting Flink to Kafka is simpler than in Spark Streaming (which distinguishes Receiver and Direct modes, etc.).
 * @author xjh 2020.4.5
 */
object kafka_Source_String {
def main(args: Array[String]): Unit = {
// 1. Initialize the Flink streaming environment
val streamEnv: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// 2. Import the implicit conversions required by the Scala API
import org.apache.flink.streaming.api.scala._
// Kafka connection properties
val properties = new Properties()
properties.setProperty("bootstrap.servers", "m1:9092,m2:9092,m3:9093")
properties.setProperty("groupid", "Flink_project")
properties.setProperty("key.deserializer", classOf[StringDeserializer].getName)
properties.setProperty("value.deserializer", classOf[StringDeserializer].getName)
properties.setProperty("auto.offset.reset", "latest")
// Set the Kafka source; the data in the topic here is plain String
val stream = streamEnv.addSource(new FlinkKafkaConsumer[String]("t_test", new SimpleStringSchema(), properties))
stream.print()
streamEnv.execute()
}
}
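To try this out, you could push a few test messages into the topic with Kafka's console producer; the broker address and topic name below simply reuse the ones assumed in the code above (depending on your Kafka version the flag is --broker-list or --bootstrap-server):

kafka-console-producer.sh --broker-list m1:9092 --topic t_test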
2. Reading key/value data from Kafka
package com.flink.primary.DataSource
import java.util.Properties
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, KafkaDeserializationSchema}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.flink.streaming.api.scala._
/**
 * Reads key/value data from Kafka.
 * @author xjh 2020.4.6
 */
object kafka_Source_keyValue {
def main(args: Array[String]): Unit = {
val environment = StreamExecutionEnvironment.getExecutionEnvironment
// Kafka connection properties
val properties = new Properties()
properties.setProperty("bootstrap.servers", &#