Reading and writing Kafka topics with Flink, with data cleansing


DataStream program structure

The DataStream API is the user-facing API that Flink provides for stream and batch processing. It wraps the underlying streaming execution model so that users can write jobs without dealing with it directly.
A typical program goes through the following steps (a minimal skeleton follows the list):

  1. Obtain an execution environment (ExecutionEnvironment / StreamExecutionEnvironment).
  2. Load or create the initial data (Source).
  3. Specify transformations on the data (Transformation).
  4. Specify where to put the results (Sink).
  5. Trigger program execution; this is required for streaming jobs, but not for batch jobs.
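
A minimal sketch of that skeleton is shown below. The socket source, print sink, host and port are just placeholders I made up to keep the example self-contained; any source/sink combination follows the same shape.

import org.apache.flink.streaming.api.scala._

object SkeletonJob {
  def main(args: Array[String]): Unit = {
    // 1. Execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // 2. Source: read lines from a socket (placeholder source)
    val lines: DataStream[String] = env.socketTextStream("localhost", 9999)
    // 3. Transformation: trivial cleanup
    val cleaned = lines.map(_.trim).filter(_.nonEmpty)
    // 4. Sink: print to stdout (placeholder sink)
    cleaned.print()
    // 5. Trigger execution (required for streaming jobs)
    env.execute("skeleton-job")
  }
}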

Maven dependencies (the artifacts below use the Scala 2.11 binary suffix, so the project must build with Scala 2.11)

<flink.version>1.7.2</flink.version>
<kafka.version>2.0.0</kafka.version>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-core -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-core</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-2-uber -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-shaded-hadoop-2-uber</artifactId>
            <version>2.4.1-9.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>${kafka.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.62</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.11</artifactId>
            <version>${kafka.version}</version>
        </dependency>

Simple straight-through version

  1. Read: register a Kafka consumer as the Flink data source.
  2. Write: send the results to a Kafka producer sink.

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.producer.ProducerConfig

object FlinkReadWriteKafka {
  def main(args: Array[String]): Unit = {
    // Obtain the Flink streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Kafka consumer properties
    val prop = new Properties()
    // Kafka broker address
    prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    // Consumer group id
    prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "md")
    // Key/value deserializers (the consumer needs Deserializer classes, not Serializers)
    prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    // Start from the earliest offset if no committed offset exists
    prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")


    val ds = env.addSource(
      new FlinkKafkaConsumer[String](
        "event_attendees",
        // Schema for serializing/deserializing records as plain strings
        new SimpleStringSchema(),
        prop
      ).setStartFromEarliest() // reset the consumer to the earliest offset
    )

    // Transformation: each input line is "event,yes,maybe,invited,no", where each
    // attendee list is space-separated; explode it into (event, user, status) rows
    val dataStream = ds.map(x => {
      val info = x.split(",", -1)
      Array(
        (info(0), info(1).split(" "), "yes"),
        (info(0), info(2).split(" "), "maybe"),
        (info(0), info(3).split(" "), "invited"),
        (info(0), info(4).split(" "), "no")
      )
    }).flatMap(x => x).flatMap(x => x._2.map(y => (x._1, y, x._3))).filter(_._2 != "")
      .map(_.productIterator.mkString(","))

    // Kafka producer properties
    val prop2 = new Properties()
    prop2.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop2.setProperty(ProducerConfig.RETRIES_CONFIG, "0")
    prop2.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    prop2.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    // Sink: write the cleaned rows back to Kafka
    dataStream.addSink(new FlinkKafkaProducer[String](
      "single:9092",
      "event_attendees_ff",
      new SimpleStringSchema()))
    // Trigger the streaming job
    env.execute("event_attendees_xf")
  }
}
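
To make the cleaning step easier to follow, here is a small standalone check of the same parsing logic on a single made-up record (no Flink involved). The assumed input layout is event_id,yes_list,maybe_list,invited_list,no_list with space-separated attendee lists.

object TransformCheck {
  def main(args: Array[String]): Unit = {
    // Made-up sample record: event 123, "yes" users 111 and 222, "maybe" user 333,
    // an empty "invited" list, and "no" users 444 and 555
    val sample = "123,111 222,333,,444 555"
    val info = sample.split(",", -1)
    val rows = Array(
      (info(0), info(1).split(" "), "yes"),
      (info(0), info(2).split(" "), "maybe"),
      (info(0), info(3).split(" "), "invited"),
      (info(0), info(4).split(" "), "no")
    ).flatMap(x => x._2.map(y => (x._1, y, x._3)))
      .filter(_._2 != "")
      .map(_.productIterator.mkString(","))
    // Prints: 123,111,yes  123,222,yes  123,333,maybe  123,444,no  123,555,no
    rows.foreach(println)
  }
}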

Refactored object-oriented version

Abstract traits: read, write, and transform

import java.util.Properties
import org.apache.flink.streaming.api.scala.DataStream

// Reads a table/topic into a DataStream
trait Read[T] {
  def read(prop: Properties, tableName: String): DataStream[T]
}
// Writes a DataStream to the given broker address and table/topic
trait Write[T] {
  def write(localhost: String, tableName: String, dataStream: DataStream[T]): Unit
}
// Transforms one DataStream into another
trait Transform[T, V] {
  def transform(in: DataStream[T]): DataStream[V]
}

Kafka implementations of the read and write traits

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

class KafkaSource(env: StreamExecutionEnvironment) extends Read[String] {
  override def read(prop: Properties, tableName: String): DataStream[String] = {
    env.addSource(
      new FlinkKafkaConsumer[String](
        tableName,
        new SimpleStringSchema(),
        prop
      )
    )
  }
}

object KafkaSource {
  def apply(env: StreamExecutionEnvironment): KafkaSource = new KafkaSource(env)
}
class KafkaSink extends Write[String] {
  override def write(localhost: String, tableName: String, dataStream: DataStream[String]): Unit = {
    dataStream.addSink(new FlinkKafkaProducer[String](
      localhost,
      tableName,
      new SimpleStringSchema()
    ))
  }
}

object KafkaSink {
  def apply(): KafkaSink = new KafkaSink()
}
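
Kafka is only one possible implementation of these traits. As a purely hypothetical illustration of the abstraction, a file-based source could satisfy the same Read[String] contract; here the tableName parameter is reused as a file path and the Kafka properties are ignored.

import java.util.Properties
import org.apache.flink.streaming.api.scala._

// Hypothetical alternative source: reads a text file instead of Kafka,
// treating tableName as a file path and ignoring the Kafka properties
class FileSource(env: StreamExecutionEnvironment) extends Read[String] {
  override def read(prop: Properties, tableName: String): DataStream[String] =
    env.readTextFile(tableName)
}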

Implementing the transform trait for the business logic

// Cleans raw "event,yes,maybe,invited,no" CSV lines into "event,user,status" lines
trait FlikTransform extends Transform[String, String] {
  override def transform(in: DataStream[String]): DataStream[String] = {
    in.map(x => {
      val info = x.split(",", -1)
      Array(
        (info(0), info(1).split(" "), "yes"),
        (info(0), info(2).split(" "), "maybe"),
        (info(0), info(3).split(" "), "invited"),
        (info(0), info(4).split(" "), "no")
      )
    }).flatMap(x => x).flatMap(x => x._2.map(y => (x._1, y, x._3))).filter(_._2 != "")
      .map(_.productIterator.mkString(","))
  }
}

Executor that requires the transform trait to be mixed in

import java.util.Properties
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// The self-type annotation below means a KTExcutor can only be instantiated
// together with a FlikTransform implementation mixed in at construction time
class KTExcutor(readConf: Properties, writelocalhost: String) {
  tran: FlikTransform =>
  def worker(intopic: String, outputtopic: String) = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val kr = new KafkaSource(env).read(readConf, intopic)
    val ds = tran.transform(kr)
    KafkaSink().write(writelocalhost, outputtopic, ds)
    env.execute()
  }
}

Mixing in the trait at the call site

import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig

object EAtest {
  def main(args: Array[String]): Unit = {
    val prop = new Properties()
    prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "md")
    prop.setProperty(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "1000")
    prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    val localhost = "single:9092"
    (new KTExcutor(prop, localhost) with FlikTransform)
      .worker("event_attendees", "attendees_AA")
  }
}
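
To spot-check the cleaned output, one option is a plain Kafka consumer that prints a few poll batches from the output topic. The sketch below assumes the same broker address and the attendees_AA topic written above; the group id "verify" and the number of batches are arbitrary.

import java.time.Duration
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import scala.collection.JavaConverters._

object VerifyOutput {
  def main(args: Array[String]): Unit = {
    val prop = new Properties()
    prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "verify")
    prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    val consumer = new KafkaConsumer[String, String](prop)
    consumer.subscribe(Collections.singletonList("attendees_AA"))
    // Print a few poll batches of the cleaned "event,user,status" lines, then exit
    for (_ <- 1 to 5) {
      consumer.poll(Duration.ofSeconds(1)).asScala.foreach(r => println(r.value()))
    }
    consumer.close()
  }
}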