Flink Side Output Streams (SideOutput) in 5 Minutes

This article explains in detail how to split a data stream in Apache Flink using side outputs (SideOutput). A worked example separates the events that contain a particular string ("side") from the rest of the stream, covering how to start Kafka, define the data bean, write the producer utility classes, and implement the main program.


Code versions

Flink: 1.10.0, Scala: 2.12.6

Side Outputs (SideOutput)

This article covers side outputs (SideOutput). Most DataStream API operators have a single output: one stream of a single kind of data, all flowing to the same place.

When you need to handle different streams, there is the split operator, which can divide one stream into several streams, but all of those streams share the same data type. The side outputs feature of ProcessFunction can also produce multiple streams, and those streams' data types may differ. A side output is defined by an OutputTag[X] object, where X is the data type of the output stream. A process function can emit an event to one or more side outputs through its Context object.

To use side outputs, you first need to define an OutputTag that identifies the side output stream.

In Scala this takes the following form:

val outputTag = OutputTag[String]("side-output")

Note how the OutputTag is typed according to the type of elements the side output stream contains. You can emit data to a side output from the following functions; this article walks through the ProcessFunction case (a minimal sketch follows the list):

  • ProcessFunction

  • CoProcessFunction

  • ProcessWindowFunction

  • ProcessAllWindowFunction
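Before the full case study, here is a minimal, self-contained sketch of the pattern, modeled on the official documentation's example (the element values and names are illustrative only): define the tag, emit to it from processElement via ctx.output, and fetch the side stream with getSideOutput.

import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object SideOutputSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // The tag is typed by the element type of the side output stream.
    val outputTag = OutputTag[String]("side-output")

    val processed = env.fromElements(1, 2, 3).process(
      new ProcessFunction[Int, Int] {
        override def processElement(value: Int,
                                    ctx: ProcessFunction[Int, Int]#Context,
                                    out: Collector[Int]): Unit = {
          out.collect(value)                          // regular output
          ctx.output(outputTag, "sideout-" + value)   // side output
        }
      })

    // getSideOutput is called on the result of process(), not on the original stream.
    val sideStream: DataStream[String] = processed.getSideOutput(outputTag)
    sideStream.print()
    env.execute("side-output-sketch")
  }
}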

Example

The example below separates out the records that contain a particular string. Two utility classes feed records with different contents into Kafka, and a side output (SideOutput) then splits the combined stream into two distinct outputs.

The data looks like this:

Regular output: {"id":3,"name":"Johngo3","age":13,"sex":1,"email":"Johngo3@flink.com","time":1590067813271}
Side output: {"id":3,"name":"Johngo_side3","age":13,"sex":1,"email":"Johngo_side3@flink.com","time":1590067813271}

As you can see, we want to pull out the records whose name carries the "side" marker.

Let's go through it step by step.

1. Start Kafka

Adapt this step to your own environment; here I start the Kafka installed on my machine.

Start ZooKeeper and Kafka:

nohup bin/zookeeper-server-start.sh config/zookeeper.properties &
nohup bin/kafka-server-start.sh config/server.properties &

Create the topic, named person_t:

$ kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic person_t

Test-consume the data:

$ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic person_t --from-beginning

2. Define the bean

First define a POJO class (Person_t.scala):

package com.tech.bean

import scala.beans.BeanProperty

class Person_t() {
  @BeanProperty var id: Int = 0
  @BeanProperty var name: String = _
  @BeanProperty var age: Int = 0
  @BeanProperty var sex: Int = 2
  @BeanProperty var email: String = _
  @BeanProperty var time: Long = 0L

  // implement toString() for readable console output
  override def toString: String = {
    "id:" + this.id + "," +
      "name:" + this.name + "," +
      "age:" + this.age + "," +
      "sex:" + this.sex + "," +
      "email:" + this.email + "," +
      "time:" + this.time
  }
}
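A quick aside on the bean: the producers below serialize Person_t with Gson, while the Flink job parses it with fastjson; the @BeanProperty getters/setters are what let both libraries handle the class. A small round-trip sketch (assuming both libraries are on the classpath, as the case study requires):

import com.alibaba.fastjson.JSON
import com.google.gson.Gson
import com.tech.bean.Person_t

object JsonRoundTrip {
  def main(args: Array[String]): Unit = {
    val p = new Person_t()
    p.setId(1)
    p.setName("Johngo1")
    // Gson serializes the underlying fields via reflection...
    val json = new Gson().toJson(p)
    // ...and fastjson rebuilds the bean through the @BeanProperty setters.
    val back = JSON.parseObject(json, classOf[Person_t])
    println(back.toString)
  }
}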

3. Write the utility classes

The utility classes write data into Kafka. We write two of them, ProduceToKafkaUtil1 and ProduceToKafkaUtil2, so that two different data sources write into the same topic.

ProduceToKafkaUtil1.scala

package com.tech.util

import java.util.Properties

import com.google.gson.Gson
import com.tech.bean.Person_t
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

/**
  * Create the topic:
  * kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic person_t
  *
  * Consume the data:
  * kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic person_t --from-beginning
  *
  * Adds a time field.
  */
object ProduceToKafkaUtil1 {
  final val broker_list: String = "localhost:9092"
  final val topic = "person_t"

  def produceMessageToKafka(): Unit = {
    val writeProps = new Properties()
    writeProps.setProperty("bootstrap.servers", broker_list)
    writeProps.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    writeProps.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](writeProps)

    // NOTE: the loop's upper bound was lost in the original listing; 100 is an assumed value.
    for (i <- 1 to 100) {
      val curDate = System.currentTimeMillis()
      val person: Person_t = new Person_t()
      person.setId(i)
      person.setName("Johngo" + i)
      person.setAge(10 + i)
      person.setSex(i % 2)
      person.setEmail("Johngo" + i + "@flink.com")
      person.setTime(curDate)
      val record = new ProducerRecord[String, String](topic, null, null, new Gson().toJson(person))
      producer.send(record)
      println("SendMessageToKafka: " + new Gson().toJson(person))
      Thread.sleep(2000)
    }
    producer.flush()
  }

  def main(args: Array[String]): Unit = {
    this.produceMessageToKafka()
  }
}

ProduceToKafkaUtil2.scala

package com.tech.util

import java.util.Properties

import com.google.gson.Gson
import com.tech.bean.Person_t
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

/**
  * Create the topic:
  * kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic person_t
  *
  * Consume the data:
  * kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic person_t --from-beginning
  *
  * Adds a time field.
  */
object ProduceToKafkaUtil2 {
  final val broker_list: String = "localhost:9092"
  final val topic = "person_t"

  def produceMessageToKafka(): Unit = {
    val writeProps = new Properties()
    writeProps.setProperty("bootstrap.servers", broker_list)
    writeProps.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    writeProps.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](writeProps)

    // NOTE: the loop's upper bound was lost in the original listing; 100 is an assumed value.
    for (i <- 1 to 100) {
      val curDate = System.currentTimeMillis()
      val person: Person_t = new Person_t()
      person.setId(i)
      person.setName("Johngo_side" + i)
      person.setAge(10 + i)
      person.setSex(i % 2)
      person.setEmail("Johngo_side" + i + "@flink.com")
      person.setTime(curDate)
      val record = new ProducerRecord[String, String](topic, null, null, new Gson().toJson(person))
      producer.send(record)
      println("SendMessageToKafka: " + new Gson().toJson(person))
      Thread.sleep(2000)
    }
    producer.flush()
  }

  def main(args: Array[String]): Unit = {
    this.produceMessageToKafka()
  }
}

With the Kafka-writing utility classes in place, we can implement the side output.

4. Side output example

While the example below runs, you can watch the job in the web UI at http://localhost:8081/#/job/running

Just remember to add the following pom dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-runtime-web_2.12</artifactId>
    <version>1.10.0</version>
    <scope>compile</scope>
</dependency>

Define an OutputTag that identifies the side output stream:

val outputTag = new OutputTag[String]("person_t_side-output")

Implement the main program:

sideOutputPerson_t.scala

package com.tech.sideoutput

import com.alibaba.fastjson.JSON
import com.tech.bean.Person_t
import com.tech.util.KafkaSourceUtil
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object sideOutputPerson_t {
  def main(args: Array[String]): Unit = {
    // Web UI: http://localhost:8081/#/job/running
    val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())

    val ksu = new KafkaSourceUtil("person_t", "test-consumer-group")
    val dstream = env.addSource(ksu.getSouceInfo())

    // First define an OutputTag that identifies the side output stream
    val outputTag = new OutputTag[String]("person_t_side-output")

    val mainDataStream = dstream
      .map(line => JSON.parseObject(line, classOf[Person_t]))

    val sideOutput = mainDataStream.process(new ProcessFunction[Person_t, String] {
      override def processElement(value: Person_t,
                                  ctx: ProcessFunction[Person_t, String]#Context,
                                  out: Collector[String]): Unit = {
        if (!value.getName.contains("_side")) {
          out.collect(value.toString)
        } else {
          // emit to the side output
          ctx.output(outputTag, "sideOutput-> name with _side marker: " + value.getName)
        }
      }
    })

    val sideOutputStream: DataStream[String] = sideOutput.getSideOutput(outputTag)

    // handle the side output stream
    sideOutputStream.print("side output")
    // handle the regular records
    sideOutput.print("regular output")

    env.execute("sideOutputPerson_t")
  }
}
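One caveat: the main program imports com.tech.util.KafkaSourceUtil, which the original post never shows. A minimal sketch consistent with how it is used here (a constructor taking the topic and consumer group, and a getSouceInfo() method returning a Kafka source), assuming the universal flink-connector-kafka dependency, could look like this:

package com.tech.util

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

// Hypothetical reconstruction of the helper used by the main program.
class KafkaSourceUtil(topic: String, groupId: String) {
  // Method name kept as in the main program's call site.
  def getSouceInfo(): FlinkKafkaConsumer[String] = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("group.id", groupId)
    // Read records as plain JSON strings; the main program parses them with fastjson.
    new FlinkKafkaConsumer[String](topic, new SimpleStringSchema(), props)
  }
}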

5. Run the programs

Start the two Kafka-writing utility classes:

ProduceToKafkaUtil1 starts writing (without the side marker; watch the output):


log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
SendMessageToKafka: {"id":1,"name":"Johngo1","age":11,"sex":1,"email":"Johngo1@flink.com","time":1590068784050}
SendMessageToKafka: {"id":2,"name":"Johngo2","age":12,"sex":0,"email":"Johngo2@flink.com","time":1590068786475}
SendMessageToKafka: {"id":3,"name":"Johngo3","age":13,"sex":1,"email":"Johngo3@flink.com","time":1590068788477}
SendMessageToKafka: {"id":4,"name":"Johngo4","age":14,"sex":0,"email":"Johngo4@flink.com","time":1590068790481}
SendMessageToKafka: {"id":5,"name":"Johngo5","age":15,"sex":1,"email":"Johngo5@flink.com","time":1590068792483}
SendMessageToKafka: {"id":6,"name":"Johngo6","age":16,"sex":0,"email":"Johngo6@flink.com","time":1590068794489}
SendMessageToKafka: {"id":7,"name":"Johngo7","age":17,"sex":1,"email":"Johngo7@flink.com","time":1590068796492}
SendMessageToKafka: {"id":8,"name":"Johngo8","age":18,"sex":0,"email":"Johngo8@flink.com","time":1590068798494}
SendMessageToKafka: {"id":9,"name":"Johngo9","age":19,"sex":1,"email":"Johngo9@flink.com","time":1590068800494}

ProduceToKafkaUtil2 starts writing (with the side marker; watch the output):


log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
SendMessageToKafka: {"id":1,"name":"Johngo_side1","age":11,"sex":1,"email":"Johngo_side1@flink.com","time":1590068787210}
SendMessageToKafka: {"id":2,"name":"Johngo_side2","age":12,"sex":0,"email":"Johngo_side2@flink.com","time":1590068789521}
SendMessageToKafka: {"id":3,"name":"Johngo_side3","age":13,"sex":1,"email":"Johngo_side3@flink.com","time":1590068791526}
SendMessageToKafka: {"id":4,"name":"Johngo_side4","age":14,"sex":0,"email":"Johngo_side4@flink.com","time":1590068793528}
SendMessageToKafka: {"id":5,"name":"Johngo_side5","age":15,"sex":1,"email":"Johngo_side5@flink.com","time":1590068795531}
SendMessageToKafka: {"id":6,"name":"Johngo_side6","age":16,"sex":0,"email":"Johngo_side6@flink.com","time":1590068797535}
SendMessageToKafka: {"id":7,"name":"Johngo_side7","age":17,"sex":1,"email":"Johngo_side7@flink.com","time":1590068799538}
SendMessageToKafka: {"id":8,"name":"Johngo_side8","age":18,"sex":0,"email":"Johngo_side8@flink.com","time":1590068801542}

Finally, start the main program (sideOutputPerson_t.scala) and look at the result:


log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
regular output:7> id:1,name:Johngo1,age:11,sex:1,email:Johngo1@flink.com,time:1590069009644
regular output:7> id:2,name:Johngo2,age:12,sex:0,email:Johngo2@flink.com,time:1590069012246
side output:7> sideOutput-> name with _side marker: Johngo_side1
regular output:7> id:3,name:Johngo3,age:13,sex:1,email:Johngo3@flink.com,time:1590069014250
side output:7> sideOutput-> name with _side marker: Johngo_side2
regular output:7> id:4,name:Johngo4,age:14,sex:0,email:Johngo4@flink.com,time:1590069016255
side output:7> sideOutput-> name with _side marker: Johngo_side3
regular output:7> id:5,name:Johngo5,age:15,sex:1,email:Johngo5@flink.com,time:1590069018257
side output:7> sideOutput-> name with _side marker: Johngo_side4
regular output:7> id:6,name:Johngo6,age:16,sex:0,email:Johngo6@flink.com,time:1590069020263
side output:7> sideOutput-> name with _side marker: Johngo_side5
regular output:7> id:7,name:Johngo7,age:17,sex:1,email:Johngo7@flink.com,time:1590069022266

Sure enough, the records with the "side" marker are printed from the side output stream.

If this matches your own business, you can route the different kinds of records to different sinks; a sketch follows below.
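As one possibility, continuing the case study above, each stream could get its own Kafka sink. This is a sketch assuming the flink-connector-kafka FlinkKafkaProducer; the topic names person_regular and person_side are hypothetical:

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer

val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092")

// Regular records go to one (hypothetical) topic...
sideOutput.addSink(
  new FlinkKafkaProducer[String]("person_regular", new SimpleStringSchema(), props))
// ...and the side output records go to another.
sideOutputStream.addSink(
  new FlinkKafkaProducer[String]("person_side", new SimpleStringSchema(), props))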

6. References

From the official Flink 1.10.0 documentation:

https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/side_output.html

Author: Johngo

Feel free to contact the author with any questions. Thanks, everyone!

