Flink: read data from Kafka, clean it, and write it back to Kafka

POM dependencies

    <properties>
      <flink_version>1.7.2</flink_version>
    </properties>

    <!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>fastjson</artifactId>
      <version>1.2.62</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-scala_2.11</artifactId>
      <version>${flink_version}</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-scala_2.11</artifactId>
      <version>${flink_version}</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-core -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-core</artifactId>
      <version>${flink_version}</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-clients_2.11</artifactId>
      <version>${flink_version}</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-2-uber -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-shaded-hadoop-2-uber</artifactId>
      <version>2.4.1-9.0</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-kafka_2.11</artifactId>
      <version>${flink_version}</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka -->
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>2.0.0</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka-0.11 -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
      <version>1.7.2</version>
    </dependency>

1. Straight-line (procedural) version

  • Read: register a Kafka consumer as the Flink data source
  • Transform: clean and flatten the raw records
  • Write: register a Kafka producer as the Flink sink
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.producer.ProducerConfig

object FlinkReadWriteKafka_event_attendees_raw {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Consumer config (note: deserializers, not serializers, on the consumer side)
    val prop = new Properties()
    prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "xym")
    prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    // Read: the Kafka consumer is the Flink source
    val ds = env.addSource(
      new FlinkKafkaConsumer[String](
        "event_attendees_raw",
        new SimpleStringSchema(),
        prop
      )
    )

    // Transform: explode the yes/maybe/invited/no columns into (event, user, status) rows
    val dataStream = ds.map(x => {
      val info = x.split(",", -1)
      Array(
        (info(0), info(1).split(" "), "yes"),
        (info(0), info(2).split(" "), "maybe"),
        (info(0), info(3).split(" "), "invited"),
        (info(0), info(4).split(" "), "no")
      )
    }).flatMap(x => x)
      .flatMap(x => x._2.map(y => (x._1, y, x._3)))
      .filter(_._2 != "")
      .map(_.productIterator.mkString(","))

    // Producer config
    val prop2 = new Properties()
    prop2.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop2.setProperty(ProducerConfig.RETRIES_CONFIG, "0")
    prop2.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    prop2.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    // Write: the Kafka producer is the Flink sink; the (topic, schema, properties)
    // overload is used here so that prop2 is actually applied
    dataStream.addSink(new FlinkKafkaProducer[String](
      "event_attendees_ff",
      new SimpleStringSchema(),
      prop2
    ))

    env.execute("event_attendees_ff")
  }
}
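
To see what the cleaning step actually produces, here is a minimal local sketch of the same split/flatten logic applied to one raw record. The event and user ids are made up for illustration; real records come from the event_attendees_raw topic.

object TransformDemo {
  def main(args: Array[String]): Unit = {
    // Raw format: event_id, "yes" ids, "maybe" ids, "invited" ids, "no" ids (ids are space-separated)
    val line = "e1,u1 u2,u3,,u4 u5"
    val info = line.split(",", -1)
    val rows = Array(
      (info(0), info(1).split(" "), "yes"),
      (info(0), info(2).split(" "), "maybe"),
      (info(0), info(3).split(" "), "invited"),
      (info(0), info(4).split(" "), "no")
    ).flatMap(x => x._2.map(y => (x._1, y, x._3)))
      .filter(_._2 != "")
      .map(_.productIterator.mkString(","))
    // Prints: e1,u1,yes  e1,u2,yes  e1,u3,maybe  e1,u4,no  e1,u5,no
    rows.foreach(println)
  }
}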

2. OOP version

Reference article: Spark Streaming processes Kafka topic data and writes it back to Kafka (2)

The idea is the same as in that article; in fact, reading from and writing to Kafka is even simpler with Flink than with Spark Streaming, so the overall design is not repeated here. See the previous article for details.

2.1. Abstract the read, write, and transform steps as traits

import java.util.Properties

import org.apache.flink.streaming.api.scala.DataStream

// Abstract the three stages so sources, sinks, and transforms can be swapped independently
trait Read[T] {
  def read(prop: Properties, tableName: String): DataStream[T]
}
trait Write[T] {
  def write(localhost: String, tableName: String, dataStream: DataStream[T]): Unit
}
trait Transform[T, V] {
  def transform(in: DataStream[T]): DataStream[V]
}
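
The point of the generics is that the executor only depends on these traits, so a different target platform can be plugged in without touching anything else. A hypothetical sketch of an alternative Write implementation (writeAsText and the reuse of tableName as a file path are illustrative only, not part of the original design):

import org.apache.flink.streaming.api.scala.DataStream

// Hypothetical alternative sink: same Write contract, different target platform
class TextFileSink extends Write[String] {
  override def write(localhost: String, tableName: String, dataStream: DataStream[String]): Unit = {
    // Here "tableName" is reused as an output path, purely for illustration
    dataStream.writeAsText(tableName)
  }
}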

2.2. Developer side: implementations for reading from and writing to a given data platform

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

// Kafka implementation of Write: the topic name plays the role of the "table"
class KafkaSink[T] extends Write[String] {
  override def write(localhost: String, tableName: String, dataStream: DataStream[String]): Unit = {
    dataStream.addSink(new FlinkKafkaProducer[String](
      localhost,
      tableName,
      new SimpleStringSchema()
    ))
  }
}
object KafkaSink {
  def apply[T](): KafkaSink[T] = new KafkaSink()
}

// Kafka implementation of Read: builds a FlinkKafkaConsumer source on the given environment
class KafkaSource(env: StreamExecutionEnvironment) extends Read[String] {
  override def read(prop: Properties, tableName: String): DataStream[String] = {
    env.addSource(
      new FlinkKafkaConsumer[String](
        tableName,
        new SimpleStringSchema(),
        prop
      )
    )
  }
}
object KafkaSource {
  def apply(env: StreamExecutionEnvironment): KafkaSource = new KafkaSource(env)
}
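
As a quick sanity check, the sink can be exercised on its own with a handful of in-memory records before wiring up the full pipeline. A minimal sketch, assuming the broker single:9092 used throughout this post and a hypothetical topic sink_test:

import org.apache.flink.streaming.api.scala._

object KafkaSinkSmokeTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // A few hand-written records in the same raw format as event_attendees_raw
    val ds: DataStream[String] = env.fromElements("e1,u1 u2,u3,,u4 u5", "e2,,u6,u7,")
    KafkaSink[String]().write("single:9092", "sink_test", ds)
    env.execute("kafka-sink-smoke-test")
  }
}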

2.3. User side: the trait each user implements for their own data

import org.apache.flink.streaming.api.scala._

// Cleaning logic for event_attendees_raw: explode the yes/maybe/invited/no columns
// into (event, user, status) rows and drop empty user ids
trait FlinkTransform extends Transform[String, String] {
  override def transform(in: DataStream[String]): DataStream[String] = {
    in.map(x => {
      val info = x.split(",", -1)
      Array(
        (info(0), info(1).split(" "), "yes"),
        (info(0), info(2).split(" "), "maybe"),
        (info(0), info(3).split(" "), "invited"),
        (info(0), info(4).split(" "), "no")
      )
    }).flatMap(x => x)
      .flatMap(x => x._2.map(y => (x._1, y, x._3)))
      .filter(_._2 != "")
      .map(_.productIterator.mkString(","))
  }
}
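
Since each dataset only needs its own Transform implementation, adding another topic does not touch the source, sink, or executor. A hypothetical second transform (the rule here, dropping empty lines, is illustrative only):

import org.apache.flink.streaming.api.scala.DataStream

// Hypothetical transform for another dataset: drop empty records, pass the rest through
trait PassThroughTransform extends Transform[String, String] {
  override def transform(in: DataStream[String]): DataStream[String] =
    in.filter(_.nonEmpty)
}

It would be mixed in exactly as in 2.5, e.g. (new KTKExcutor(prop, localhost) with PassThroughTransform).worker("some_in_topic", "some_out_topic").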

2.4. Executor that mixes in the trait via a self-type

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// The self-type "tran: FlinkTransform =>" means an instance can only be created
// together with a FlinkTransform implementation (see 2.5)
class KTKExcutor(readConf: Properties, writelocalhost: String) {
  tran: FlinkTransform =>

  def worker(intopic: String, outputtopic: String) = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val kr = new KafkaSource(env).read(readConf, intopic)
    val ds = tran.transform(kr)
    KafkaSink().write(writelocalhost, outputtopic, ds)
    env.execute()
  }
}

2.5. Dynamically mix in the user's implementation and run

import java.util.Properties

import org.apache.kafka.clients.consumer.ConsumerConfig

object EAtest {
  def main(args: Array[String]): Unit = {
    val prop = new Properties()
    prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "xym")
    prop.setProperty(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "1000")
    prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    val localhost = "single:9092"
    // The anonymous mix-in satisfies KTKExcutor's self-type with this dataset's transform
    (new KTKExcutor(prop, localhost) with FlinkTransform)
      .worker("event_attendees_raw", "event_attendees_kk")
  }
}
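
To confirm the cleaned records actually landed in event_attendees_kk, a minimal sketch that reads the output topic back and prints it (the group id "verify" is hypothetical):

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.kafka.clients.consumer.ConsumerConfig

object VerifyOutput {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val prop = new Properties()
    prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "verify")
    prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    // Print the cleaned (event, user, status) CSV rows to stdout
    env.addSource(new FlinkKafkaConsumer[String]("event_attendees_kk", new SimpleStringSchema(), prop))
      .print()
    env.execute("verify-output")
  }
}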