Creating the Execution Environment
- Flink supports both batch processing and stream processing, and the two create their execution environments through different APIs. A batch env is created as follows:
val env = ExecutionEnvironment.getExecutionEnvironment
A stream-processing env is created as follows:
val env = StreamExecutionEnvironment.getExecutionEnvironment
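Both environments are lazy: operators only build a dataflow graph, and nothing runs until `execute()` is called. A minimal streaming job might look like this (a sketch; the object and job names are illustrative):

```scala
import org.apache.flink.streaming.api.scala._

object EnvDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // optional: one task, for deterministic local output

    env.fromElements(1, 2, 3)
      .map(_ * 2)
      .print()

    // Nothing runs until execute() is called
    env.execute("envDemo")
  }
}
```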
Adding a Source
- Collection-based source
val stream = env.fromCollection(List(
  TemperatureRecord("d1", 25.5, 100),
  TemperatureRecord("d2", 25.5, 100),
  TemperatureRecord("d3", 25.5, 100),
  TemperatureRecord("d4", 25.5, 100),
  TemperatureRecord("d5", 25.5, 100),
  TemperatureRecord("d6", 25.5, 100),
  TemperatureRecord("d7", 25.5, 100)
))
- File-based source
val stream = env.readTextFile("""/source.txt""")
- Socket-based source
val dataStream = env.socketTextStream("192.168.1.101", 7777)
- Custom sources
Kafka source
val properties = new Properties()
properties.setProperty("bootstrap.servers", "192.168.1.101:9092")
// zookeeper.connect is only needed by the legacy 0.8 consumer; the 0.11 consumer ignores it
properties.setProperty("zookeeper.connect", "192.168.1.101:2181")
properties.setProperty("group.id", "kafkaStreamTest")
val kafka11 = new FlinkKafkaConsumer011[String]("kafkaStreamTest", new SimpleStringSchema(), properties)
val stream = env.addSource(kafka11)
Custom test source
class TestSourceFunction extends SourceFunction[String] {
  // @volatile because cancel() is called from a different thread than run()
  @volatile var running = true

  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    var i = 0
    while (running) {
      i += 1
      // Emit one record per second, e.g. "1 v_3"
      sourceContext.collect(i + " v_" + Random.nextInt(5))
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    running = false
  }
}
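The custom source is registered the same way as the built-in ones, via `addSource` (a minimal sketch assuming the `TestSourceFunction` class above):

```scala
import org.apache.flink.streaming.api.scala._

object TestSourceDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Wire in the custom SourceFunction defined above
    val stream = env.addSource(new TestSourceFunction)
    stream.print()
    env.execute("testSourceDemo")
  }
}
```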
Transform
Transforms in Flink are comparable to transformations in Spark: both are APIs for converting one stream into another, though the details differ.
| Transformation | Type change | Description |
| --- | --- | --- |
| map | DataStream → DataStream | Takes one element and produces one element |
| flatMap | DataStream → DataStream | Takes one element and produces zero, one, or more elements |
| filter | DataStream → DataStream | Keeps only the elements that satisfy a predicate |
| keyBy | DataStream → KeyedStream | Partitions the stream by key, similar to GROUP BY in a database |
| reduce | KeyedStream → DataStream | For a type T, takes two arguments of type T and returns a T; the return value becomes the first argument of the next call, and the second argument is the next element from the stream |
| fold | KeyedStream → DataStream | Deprecated. Like reduce, but with two type parameters, so the input and output types may differ |
| Aggregations | KeyedStream → DataStream | Aggregates a KeyedStream: sum, min, max, minBy, maxBy |
| union | DataStream* → DataStream | Unions multiple data streams of the same type |
| connect | DataStream, DataStream → ConnectedStreams | Connects two data streams whose element types may differ |
| split | DataStream → SplitStream | Splits a stream into two or more streams according to some criterion |
| select | SplitStream → DataStream | Selects one or more streams from a split stream |
| window | KeyedStream → WindowedStream | Defines a window on an already-partitioned KeyedStream |
| timeWindow | KeyedStream → WindowedStream | Defines a time window on an already-partitioned KeyedStream |
| countWindow | KeyedStream → WindowedStream | Defines a count window on an already-partitioned KeyedStream |
| windowAll | DataStream → AllWindowedStream | Defines a window on a plain DataStream |
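To make the table concrete, here is a minimal sketch chaining several of these operators; the `Temp` case class and its fields are illustrative:

```scala
import org.apache.flink.streaming.api.scala._

case class Temp(id: String, value: Double)

object TransformDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val keyed: KeyedStream[Temp, String] = env
      .fromElements(Temp("d1", 25.5), Temp("d1", 26.0), Temp("d2", 24.0))
      .filter(_.value > 0) // DataStream -> DataStream
      .keyBy(_.id)         // DataStream -> KeyedStream

    // reduce: the previous result is the first argument,
    // the next element from the stream is the second
    keyed
      .reduce((acc, cur) => Temp(acc.id, math.max(acc.value, cur.value)))
      .print()

    env.execute("transformDemo")
  }
}
```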
The Window API will be covered in a later update.
All of the operations above also accept a custom function class, such as MapFunction or RichMapFunction.
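Compared with a plain MapFunction, the Rich variants add lifecycle hooks (open/close) and access to the runtime context. A sketch, with illustrative names:

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

object RichMapDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.fromElements("a", "b", "c")
      .map(new RichMapFunction[String, String] {
        var prefix: String = _

        override def open(parameters: Configuration): Unit = {
          // Called once per parallel instance before any elements arrive --
          // a good place to open connections or load resources
          prefix = s"subtask-${getRuntimeContext.getIndexOfThisSubtask}: "
        }

        override def map(value: String): String = prefix + value

        override def close(): Unit = {
          // Release resources here
        }
      })
      .print()

    env.execute("richMapDemo")
  }
}
```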
Sink
A Flink sink is comparable to an action in Spark: it is mainly used for outputting data.
Common sinks
- Kafka sink
First, add the dependency:
<dependency>
    <groupId>org.apache.flink</groupId>
    <!-- must match the producer class below: FlinkKafkaProducer011 lives in the 0.11 connector -->
    <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
    <!-- use the version matching your Flink distribution -->
    <version>1.7.2</version>
</dependency>
Main program
val env = StreamExecutionEnvironment.getExecutionEnvironment
// TestObjectSourceFunction is a custom test source
val stream = env.addSource(new TestObjectSourceFunction)
val producer = new FlinkKafkaProducer011[String]("localhost:9092", "test", new SimpleStringSchema())
stream.map(_.id)
.addSink(producer)
env.execute("kafkaSinkTest")
- Elasticsearch sink
First, add the dependency:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-elasticsearch6_2.11</artifactId>
<version>1.7.2</version>
</dependency>
Main program
val env = StreamExecutionEnvironment.getExecutionEnvironment
// TestObjectSourceFunction is a custom test source
val stream = env.addSource(new TestObjectSourceFunction)
val httpHosts = new util.ArrayList[HttpHost]
httpHosts.add(new HttpHost("localhost", 9200))
val esSink = new ElasticsearchSink.Builder[ApplyInfo](httpHosts, new ElasticsearchSinkFunction[ApplyInfo] {
  override def process(item: ApplyInfo, ctx: RuntimeContext, requestIndexer: RequestIndexer): Unit = {
    val esSource = new util.HashMap[String, String]()
    esSource.put("id", item.id)
    esSource.put("areaCode", item.areaCode)
    val req = Requests.indexRequest("applyInfo").`type`("applyInfo").source(esSource)
    requestIndexer.add(req)
  }
}).build()
stream.addSink(esSink)
env.execute("esSinkTest")
- Custom sink
Implement your own SinkFunction. Extending RichSinkFunction is usually preferable, since it adds lifecycle hooks (open/close) and access to the runtime context.
val env = StreamExecutionEnvironment.getExecutionEnvironment
// TestObjectSourceFunction is a custom test source
val stream = env.addSource(new TestObjectSourceFunction)
stream.addSink(new RichSinkFunction[ApplyInfo] {
  var out: OutputStream = _

  override def open(parameters: Configuration): Unit = {
    // Open the output file once per parallel instance
    out = new FileOutputStream("/tmp/customSink.txt")
  }

  override def invoke(value: ApplyInfo, context: SinkFunction.Context[_]): Unit = {
    out.write((value.id + "," + value.areaCode + "\r\n").getBytes(StandardCharsets.UTF_8))
  }

  override def close(): Unit = {
    out.close()
  }
})
env.execute("customSinkTest")