Flink Data Sinks: Writing to Kafka (Key Points)

Data Sinks

Data sinks consume DataStreams and forward them to files, sockets, external systems, or print them. Flink comes with a variety of built-in output formats that are encapsulated behind operations on DataStreams (a short sketch of a few of these follows the list):

  • writeAsText() / TextOutputFormat - Writes elements line-wise as strings. The strings are obtained by calling the toString() method of each element.
  • writeAsCsv(…) / CsvOutputFormat - Writes tuples as comma-separated value files. Row and field delimiters are configurable. The value for each field comes from the toString() method of the objects.
  • print() / printToErr() - Prints the toString() value of each element on the standard out / standard error stream. Optionally, a prefix (msg) can be provided which is prepended to the output. This can help to distinguish between different calls to print. If the parallelism is greater than 1, the output will also be prepended with the identifier of the task which produced the output.
  • writeUsingOutputFormat() / FileOutputFormat - Method and base class for custom file outputs. Supports custom object-to-bytes conversion.
  • writeToSocket - Writes elements to a socket according to a SerializationSchema.
  • addSink - Invokes a custom sink function. Flink comes bundled with connectors to other systems (such as Apache Kafka) that are implemented as sink functions.
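
A minimal sketch showing a few of these methods together on a word-count style tuple stream; the file paths and the socket endpoint below are placeholders, not values taken from the examples later in this post:

package com.baizhi.jsy.sink
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
object SinkMethodsSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Build a small (word, count) stream to feed the sinks below
    val counts = env.fromElements("this is a demo", "this is flink")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    counts.writeAsText("file:///tmp/wordcount-text")   // one toString() per line
    counts.writeAsCsv("file:///tmp/wordcount-csv")     // tuple fields as comma-separated values
    counts.map(_.toString).writeToSocket("Centos", 9999, new SimpleStringSchema()) // serialized via a SerializationSchema
    counts.print("debug")   // stdout, prefixed with "debug" and the producing subtask id
    env.execute("sink methods sketch")
  }
}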

Note that the write*() methods on DataStream are mainly intended for debugging purposes. They do not participate in Flink's checkpointing, which means these functions usually have at-least-once semantics. How data is flushed to the target system depends on the implementation of the OutputFormat, so not all elements sent to the OutputFormat are immediately visible in the target system, and in case of failure those records may be lost.

For reliable, exactly-once delivery of a stream into a file system, use flink-connector-filesystem. Likewise, custom implementations passed to .addSink(…) can participate in Flink's checkpointing and thus in its exactly-once semantics.

The example below uses one of these debug-oriented write*() methods, writeUsingOutputFormat(), to write the word counts to a local file:

package com.baizhi.jsy.sink
import org.apache.flink.api.java.io.TextOutputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.scala._
object FlinkWordCountFileSink {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream (source)
    val text = env.socketTextStream("Centos",9999)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Write the results to a local file
    counts.writeUsingOutputFormat(new TextOutputFormat[(String, Int)](new Path("file:///D:/桌面文件/flink-result")))
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}

The output file is created automatically on the desktop.


Note: if the sink is switched to HDFS, you have to generate a fairly large amount of data before anything shows up, because the HDFS write path uses a large client-side buffer. The file sinks shown above cannot participate in Flink's checkpointing; in production, flink-connector-filesystem is normally used to write to external file systems.

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-filesystem_2.11</artifactId>
    <version>1.10.0</version>
</dependency>

**Automatically creating a directory and storing the files**

package com.baizhi.jsy.sink
import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.scala._
object FlinkWordCountBucketingSink {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4) // parallelism 4 (each subtask writes its own part files)
    //2. Create the DataStream (source)
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")

    val bucketingSink = StreamingFileSink.forRowFormat(new Path("hdfs://Centos:9000/bucketing-result"),
      new SimpleStringEncoder[(String,Int)]("UTF-8")).build()
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Write the results to the destination
    counts.addSink(bucketingSink)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}

Bucketing by date format (new API)

package com.baizhi.jsy.sink
import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner
import org.apache.flink.streaming.api.scala._
object FlinkWordCountBucketingSink {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4) // parallelism 4 (each subtask writes its own part files)
    //2. Create the DataStream (source)
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    val bucketingSink = StreamingFileSink.forRowFormat(new Path("hdfs://Centos:9000/bucketing-result"),
      new SimpleStringEncoder[(String,Int)]("UTF-8"))
      .withBucketAssigner(new DateTimeBucketAssigner[(String, Int)]("yyyy-MM-dd"))// dynamically derives the write (bucket) path
      .build()
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Write the results to the destination
    counts.addSink(bucketingSink)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
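
With this bucket assigner, output should land in date-named subdirectories under bucketing-result (for example .../bucketing-result/2020-02-14/), and inside each bucket every parallel subtask writes its own part file; the exact part-file names follow the StreamingFileSink defaults.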

The legacy approach (BucketingSink):

package com.baizhi.jsy.sink

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}

object FlinkWordCountBucketingSink1 {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4) // parallelism 4 (bucketing with parallel writers)
    //2. Create the DataStream (source)
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    val bucketingSink = new BucketingSink[(String,Int)]("hdfs://Centos:9000/bucket-result")
    bucketingSink.setBucketer(new DateTimeBucketer[(String, Int)]("yyyy-MM-dd"))
    bucketingSink.setBatchSize(1024)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Write the results to the destination
    counts.addSink(bucketingSink)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}

print() / printToErr()

Prints the toString() value of each element on the standard out / standard error stream. Optionally, a prefix (msg) can be provided which is prepended to the output. This can help to distinguish between different calls to print. If the parallelism is greater than 1, the output will also be prepended with the identifier of the task which produced the output.

package com.baizhi.jsy.sink
import org.apache.flink.streaming.api.scala._
object FlinkWordCountPrint{
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    //2. Create the DataStream (source)
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the results to standard out
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}

Setting the sink parallelism

package com.baizhi.jsy.sink
import org.apache.flink.streaming.api.scala._
object FlinkWordCountPrint{
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    //2. Create the DataStream (source)
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print with a prefix and a sink parallelism of 4
    counts.print("测试").setParallelism(4)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
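
With a prefix and a sink parallelism greater than 1, each printed line carries the prefix and the index of the subtask that produced it, so the console output looks roughly like this (illustrative):

测试:3> (flink,1)
测试:1> (is,2)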

RedisSink

Reference: https://bahir.apache.org/docs/flink/current/flink-streaming-redis/

<dependency>
     <groupId>org.apache.bahir</groupId>
     <artifactId>flink-connector-redis_2.11</artifactId>
     <version>1.0</version>
</dependency>

Start Redis:

[root@localhost redis-4.0.10]# ./src/redis-server
[root@localhost ~]# ps -aux|grep redis
[root@localhost redis-4.0.10]# ./src/redis-cli -h 192.168.17.19 -p 6379

package com.baizhi.jsy.sink
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
object FlinkWordCountRedisSink{
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    //2. Create the DataStream (source)
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    val flinkJedisConf = new FlinkJedisPoolConfig.Builder().setHost("192.168.17.19").setPort(6379).build()

    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Write the results to Redis
    counts.addSink(new RedisSink(flinkJedisConf,new UserDefinedRedisMapper()))
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
package com.baizhi.jsy.sink
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
class UserDefinedRedisMapper extends RedisMapper[(String,Int)]{
  // Use HSET so that all word counts land in a single Redis hash named "WordCount"
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET,"WordCount")
  }
  // The word becomes the hash field
  override def getKeyFromData(t: (String, Int)): String = {
    t._1
  }
  // The count (rendered as a string) becomes the field value
  override def getValueFromData(t: (String, Int)): String = {
    t._2+""
  }
}
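
To verify the result, the hash can be inspected from redis-cli (WordCount is the hash name configured in the mapper above):

192.168.17.19:6379> HGETALL WordCount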


Writing data to Kafka
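
The Kafka producer used below comes from Flink's Kafka connector. A likely Maven dependency, assuming the same Flink 1.10.0 / Scala 2.11 combination as the filesystem connector above:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.10.0</version>
</dependency>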

Start Kafka, then attach a console consumer to observe the output:

[root@Centos kafka_2.11-2.2.0]# ./bin/kafka-console-consumer.sh --bootstrap-server Centos:9092 --topic topic01 --group g1 --property print.key=true --property print.value=true --property key.separator=,

New approach 1 (KafkaSerializationSchema)

package com.baizhi.jsy.sink
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.Semantic
import org.apache.kafka.clients.producer.ProducerConfig
object FlinkWordCountKafkaSink {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)
    //2. Create the DataStream (source)
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    val props = new Properties()
    props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "Centos:9092")
    props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG,"100")
    props.setProperty(ProducerConfig.LINGER_MS_CONFIG,"500")
    //Semantic.EXACTLY_ONCE: enables Kafka's idempotent/transactional write support
    //Semantic.AT_LEAST_ONCE: enables the Kafka retries mechanism
    val kafkaSink = new FlinkKafkaProducer[(String,Int)]("defult_topic",new UserDefinedKafkaSerializationSchema,props,Semantic.AT_LEAST_ONCE)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Write the results to Kafka
    counts.addSink(kafkaSink)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
package com.baizhi.jsy.sink
import java.lang
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema
import org.apache.kafka.clients.producer.ProducerRecord
class UserDefinedKafkaSerializationSchema extends KafkaSerializationSchema[(String,Int)]{
  // Every record is sent to "topic01", with the word as the record key and the count as the record value
  override def serialize(t: (String, Int), aLong: lang.Long): ProducerRecord[Array[Byte], Array[Byte]] = {
    new ProducerRecord("topic01",t._1.getBytes(),t._2.toString.getBytes())
  }
}
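
The producer above runs with Semantic.AT_LEAST_ONCE. Switching to Semantic.EXACTLY_ONCE makes the producer write through Kafka transactions, and the producer's transaction timeout must not exceed what the brokers allow (the broker default transaction.max.timeout.ms is 15 minutes, lower than Flink's default producer setting). A minimal sketch, reusing props, counts, and the serialization schema from the example above; the timeout value is only illustrative:

    // Sketch: exactly-once requires a transaction timeout the broker will accept
    props.setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, "60000") // illustrative value
    val exactlyOnceSink = new FlinkKafkaProducer[(String,Int)]("defult_topic",
      new UserDefinedKafkaSerializationSchema, props, Semantic.EXACTLY_ONCE)
    counts.addSink(exactlyOnceSink)

Consumers that should only see committed records also need isolation.level=read_committed.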


Old approach 2 (KeyedSerializationSchema):

Running this creates the defult_topic topic (because getTargetTopic in the schema below returns null, records fall back to the default topic passed to the producer).

package com.baizhi.jsy.sink
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.Semantic
import org.apache.kafka.clients.producer.ProducerConfig
object FlinkWordCountKafkaSink {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)
    //2. Create the DataStream (source)
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    val props = new Properties()
    props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "Centos:9092")
    props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG,"100")
    props.setProperty(ProducerConfig.LINGER_MS_CONFIG,"500")
    //Semantic.EXACTLY_ONCE: enables Kafka's idempotent/transactional write support
    //Semantic.AT_LEAST_ONCE: enables the Kafka retries mechanism
    val kafkaSink = new FlinkKafkaProducer[(String,Int)]("defult_topic",new UserDefinedKeyedSerializationSchema,props,Semantic.AT_LEAST_ONCE)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Write the results to Kafka
    counts.addSink(kafkaSink)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
package com.baizhi.jsy.sink
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema
class UserDefinedKeyedSerializationSchema extends KeyedSerializationSchema[(String,Int)]{
  override def serializeKey(element: (String, Int)): Array[Byte] = {
    element._1.getBytes()
  }
  override def serializeValue(element: (String, Int)): Array[Byte] = {
    element._2.toString.getBytes()
  }
  //The target topic can be overridden per record; returning null writes the record to the default topic
  override def getTargetTopic(element: (String, Int)): String = {
    null
  }
}