Flink: read data from Kafka, clean it, and write it back to Kafka
pom dependencies
<properties>
    <flink_version>1.7.2</flink_version>
</properties>

<dependencies>
    <!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.62</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.11</artifactId>
        <version>${flink_version}</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.11</artifactId>
        <version>${flink_version}</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-core -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-core</artifactId>
        <version>${flink_version}</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>${flink_version}</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-2-uber -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-shaded-hadoop-2-uber</artifactId>
        <version>2.4.1-9.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka -->
    <!-- Universal Kafka connector; the code below uses this one (FlinkKafkaConsumer / FlinkKafkaProducer without a version suffix) -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka_2.11</artifactId>
        <version>${flink_version}</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.11</artifactId>
        <version>2.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka-0.11 -->
    <!-- Alternative connector pinned to the Kafka 0.11 client; not required if the universal connector above is used -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
        <version>${flink_version}</version>
    </dependency>
</dependencies>
1. Procedural version (everything in one main method)
- Read: register a Kafka consumer as the Flink source
- Transform: clean and flatten the records
- Write: register a Kafka producer as the Flink sink
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.producer.ProducerConfig

object FlinkReadWriteKafka_event_attendees_raw {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Consumer configuration for the source topic
    val prop = new Properties()
    prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "xym")
    prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    // Read: a Kafka consumer is the Flink source
    val ds = env.addSource(
      new FlinkKafkaConsumer[String](
        "event_attendees_raw",
        new SimpleStringSchema(),
        prop
      )
    )

    // Transform: split each "event,yes,maybe,invited,no" line into (event, user, status) rows
    val dataStream = ds.map(x => {
      val info = x.split(",", -1)
      Array(
        (info(0), info(1).split(" "), "yes"),
        (info(0), info(2).split(" "), "maybe"),
        (info(0), info(3).split(" "), "invited"),
        (info(0), info(4).split(" "), "no")
      )
    }).flatMap(x => x).flatMap(x => x._2.map(y => (x._1, y, x._3))).filter(_._2 != "")
      .map(_.productIterator.mkString(","))

    // Producer configuration for the sink topic; no key/value serializers are set here
    // because the Flink producer serializes records itself via SimpleStringSchema
    val prop2 = new Properties()
    prop2.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop2.setProperty(ProducerConfig.RETRIES_CONFIG, "0")

    // Write: a Kafka producer is the Flink sink
    dataStream.addSink(new FlinkKafkaProducer[String](
      "event_attendees_ff",
      new SimpleStringSchema(),
      prop2
    ))

    env.execute("event_attendees_ff")
  }
}
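To make the cleaning step concrete, here is a stand-alone sketch of what it does to a single line. The input format is inferred from the splitting logic above (event id followed by the space-separated yes/maybe/invited/no user-id lists); the sample line and its values are made up for illustration only:

object TransformDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical input line: event id, then the yes / maybe / invited / no user-id lists
    val line = "123,111 222,333,444 555,"
    val info = line.split(",", -1)
    val rows = Array(
      (info(0), info(1).split(" "), "yes"),
      (info(0), info(2).split(" "), "maybe"),
      (info(0), info(3).split(" "), "invited"),
      (info(0), info(4).split(" "), "no")
    ).flatMap(x => x._2.map(y => (x._1, y, x._3)))
      .filter(_._2 != "")
      .map(_.productIterator.mkString(","))

    rows.foreach(println)
    // Output:
    // 123,111,yes
    // 123,222,yes
    // 123,333,maybe
    // 123,444,invited
    // 123,555,invited
    // (the empty "no" list is dropped by the filter on empty user ids)
  }
}

Each output record is one (event, user, status) triple rendered as a CSV line, which is exactly what gets written to the sink topic.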
2. OOP version
Reference article: sparkStreaming对kafka topic数据进行处理后再重新写入kafka (2)
Same idea as in that article; in fact, reading from and writing to Kafka is even simpler with Flink than with Spark Streaming. The overall approach is unchanged, so I won't repeat it here: abstract read, write, and transform into traits, provide Kafka implementations of the source and sink, let the user supply the transform, and have an executor mix it in at instantiation time.
2.1 Abstract traits for read, write, and transform
import java.util.Properties
import org.apache.flink.streaming.api.scala.DataStream

// Read abstraction: produce a DataStream[T] from some source
trait Read[T] {
  def read(prop: Properties, tableName: String): DataStream[T]
}

// Write abstraction: deliver a DataStream[T] to some target
trait Write[T] {
  def write(localhost: String, tableName: String, dataStream: DataStream[T]): Unit
}

// Transform abstraction: turn a DataStream[T] into a DataStream[V]
trait Transform[T, V] {
  def transform(in: DataStream[T]): DataStream[V]
}
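The point of splitting read, write, and transform into separate traits is that each can be swapped independently. As a minimal, purely hypothetical sketch (not part of the original design), a console sink that satisfies the same Write[String] contract could be swapped in for the Kafka sink without touching the source or the transform:

import org.apache.flink.streaming.api.scala.DataStream

// Hypothetical alternative sink: ignores the host/topic arguments and prints every record to stdout
class ConsoleSink extends Write[String] {
  override def write(localhost: String, tableName: String, dataStream: DataStream[String]): Unit = {
    dataStream.print()
  }
}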
2.2 Developer side: implement reading from and writing to a concrete data platform
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

// Kafka implementation of Write: a Kafka producer acts as the Flink sink
class KafkaSink extends Write[String] {
  override def write(localhost: String, tableName: String, dataStream: DataStream[String]): Unit = {
    dataStream.addSink(new FlinkKafkaProducer[String](
      localhost,
      tableName,
      new SimpleStringSchema()
    ))
  }
}

object KafkaSink {
  def apply(): KafkaSink = new KafkaSink()
}

// Kafka implementation of Read: a Kafka consumer acts as the Flink source
class KafkaSource(env: StreamExecutionEnvironment) extends Read[String] {
  override def read(prop: Properties, tableName: String): DataStream[String] = {
    env.addSource(
      new FlinkKafkaConsumer[String](
        tableName,
        new SimpleStringSchema(),
        prop
      )
    )
  }
}

object KafkaSource {
  def apply(env: StreamExecutionEnvironment): KafkaSource = new KafkaSource(env)
}
2.3 User side: a trait implementing the transform for this particular data
import org.apache.flink.streaming.api.scala._

// User-supplied transform: flatten one event_attendees_raw line into (event, user, status) rows
trait FlinkTransform extends Transform[String, String] {
  override def transform(in: DataStream[String]): DataStream[String] = {
    in.map(x => {
      val info = x.split(",", -1)
      Array(
        (info(0), info(1).split(" "), "yes"),
        (info(0), info(2).split(" "), "maybe"),
        (info(0), info(3).split(" "), "invited"),
        (info(0), info(4).split(" "), "no")
      )
    }).flatMap(x => x).flatMap(x => x._2.map(y => (x._1, y, x._3))).filter(_._2 != "")
      .map(_.productIterator.mkString(","))
  }
}
2.4 Executor that requires the transform trait via a self-type
import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// Executor; the self-type `tran: FlinkTransform =>` requires the transform trait to be mixed in
class KTKExcutor(readConf: Properties, writelocalhost: String) {
  tran: FlinkTransform =>

  def worker(intopic: String, outputtopic: String) = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val kr = new KafkaSource(env).read(readConf, intopic)
    val ds = tran.transform(kr)
    KafkaSink().write(writelocalhost, outputtopic, ds)
    env.execute()
  }
}
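The line tran: FlinkTransform => is a self-type annotation: KTKExcutor does not extend FlinkTransform, but it can only be instantiated together with it, which is what makes the tran.transform(kr) call legal. A minimal stand-alone sketch of the same mechanism, with made-up names, looks like this:

trait Greeter {
  def greet(name: String): String
}

class Runner {
  // Self-type: a Runner can only be created if a Greeter is mixed in as well
  self: Greeter =>

  def run(name: String): Unit = println(greet(name))
}

object SelfTypeDemo {
  def main(args: Array[String]): Unit = {
    // `new Runner` alone would not compile; the trait must be supplied at instantiation:
    val r = new Runner with Greeter {
      override def greet(name: String): String = s"hello, $name"
    }
    r.run("flink") // prints: hello, flink
  }
}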
2.5 Mix the user's trait in dynamically and run
import java.util.Properties

import org.apache.kafka.clients.consumer.ConsumerConfig

object EAtest {
  def main(args: Array[String]): Unit = {
    val prop = new Properties()
    prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "xym")
    prop.setProperty(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "1000")
    prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    val localhost = "single:9092"

    // Mix the user's transform trait into the executor at instantiation time
    (new KTKExcutor(prop, localhost) with FlinkTransform)
      .worker("event_attendees_raw", "event_attendees_kk")
  }
}