Flink DataStream API: DataSource

Author: 逆水行舟如何

Article link: https://blog.csdn.net/weixin_43823423/article/details/89926876

Flink API abstraction levels

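1. Overview

A source is the data input of a program. You add one to your program by calling StreamExecutionEnvironment.addSource(sourceFunction).

Flink ships with many ready-made source implementations, and you can also write your own.

To write a non-parallel source (parallelism 1), implement the SourceFunction interface. To write a parallel source, implement the ParallelSourceFunction interface or extend RichParallelSourceFunction.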

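2. Categories

1) File-based

readTextFile(path)

Reads a text file according to the TextInputFormat rules, line by line, and returns each line as a string.

2) Socket-based

socketTextStream

Reads data from a socket. Elements are separated by a delimiter.

3) Collection-based

fromCollection(Collection)

Creates a data stream from a Java collection. All elements in the collection must be of the same type.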
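The three built-in source kinds listed above can be sketched in Scala as follows. This is a minimal sketch that assumes the Flink 1.6.x Scala API. The object name BuiltInSourcesSketch, the file path, the host, and the port are all hypothetical placeholders, not values taken from the article.

```scala
import org.apache.flink.streaming.api.scala._

// Hypothetical object name, used only for illustration.
object BuiltInSourcesSketch {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 1) File-based: each line of the file becomes one String element.
    //    The path is a placeholder (assumption).
    val fileLines: DataStream[String] = env.readTextFile("/tmp/input.txt")

    // 2) Socket-based: records are split on the given delimiter.
    //    Host and port are placeholders (assumption).
    val socketLines: DataStream[String] = env.socketTextStream("localhost", 9000, '\n')

    // 3) Collection-based: all elements share one type (Int here).
    val numbers: DataStream[Int] = env.fromCollection(List(1, 2, 3))

    fileLines.print()
    socketLines.print()
    numbers.print()

    env.execute("BuiltInSourcesSketch")
  }
}
```

To actually run the job, the placeholder file has to exist and something has to be listening on the placeholder port. Otherwise, drop the sources you do not need.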

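4) Custom input

addSource lets you read data from third-party data sources.

Flink also provides a set of built-in connectors, and each connector supplies the corresponding source support (for example, Kafka).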

Built-in connectors

Apache Kafka (source/sink)
Apache Cassandra (sink)
Elasticsearch (sink)
Hadoop FileSystem (sink)
RabbitMQ (source/sink)
Apache ActiveMQ (source/sink)
Redis (sink)

The following example uses Kafka as the source.

Goal: read the input from Kafka and print it.

package qyl.study.streaming

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
// Only needed for the optional RocksDB state backend below
// (requires the flink-statebackend-rocksdb dependency):
// import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

/**
  * Created by qyl on 2019/03/23.
  */
object StreamingKafkaSourceScala {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Implicit TypeInformation (this import is required, otherwise compilation fails)
    import org.apache.flink.api.scala._

    // Checkpoint configuration
    env.enableCheckpointing(5000) // trigger a checkpoint every 5000 ms
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500) // at least 500 ms between checkpoints
    env.getCheckpointConfig.setCheckpointTimeout(60000) // abort a checkpoint after 60000 ms
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
    // Keep externalized checkpoints when the job is cancelled
    env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)

    // Optional: set the state backend
    //env.setStateBackend(new RocksDBStateBackend("hdfs://hadoop100:9000/flink/checkpoints",true))

    val topic = "t1"
    val prop = new Properties()
    prop.setProperty("bootstrap.servers", "hadoop110:9092")
    prop.setProperty("group.id", "con1")

    // Create the consumer that reads records from Kafka as strings
    val myConsumer = new FlinkKafkaConsumer011[String](topic, new SimpleStringSchema(), prop)

    // Use addSource to turn the Kafka input into a DataStream
    val text = env.addSource(myConsumer)

    // Print the input records
    text.print()

    env.execute("StreamingKafkaSourceScala")
  }

}

Dependencies:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
    <version>1.6.1</version>
</dependency>

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.11.0.3</version>
</dependency>

3. Source fault-tolerance guarantees

Source	Guarantee	Notes
Kafka	exactly once	Kafka 0.10 or later recommended
Collections	exactly once
Files	exactly once
Sockets	at most once

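4. Custom sources

1. Two cases:

1. A custom source with parallelism 1

Implement SourceFunction.
Fault-tolerance guarantees usually do not need to be implemented.
Handle the cancel method properly (it is called when the application is cancelled).

2. A custom parallel source

Implement ParallelSourceFunction,
or extend RichParallelSourceFunction.
Note: a source that extends RichParallelSourceFunction runs in parallel and may hold resources that need to be acquired in open and released in close.

2. Implementation

Requirements:

Create a custom source with parallelism 1.
Emit increasing numbers starting from 1.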
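Implementation 1:

package qyl.study.streaming.custormSource

import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext

/**
  * Non-parallel source that emits 1, 2, 3, ... once per second.
  *
  * Created by qyl on 2019/03/23.
  */
class MyNoParallelSourceScala extends SourceFunction[Long] {

  var count = 1L
  // cancel() is called from a different thread than run(), hence @volatile
  @volatile var isRunning = true

  // Main loop of the source: emit the next number, then wait one second
  override def run(ctx: SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  // Called when the job is cancelled; makes the loop in run() exit
  override def cancel(): Unit = {
    isRunning = false
  }
}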
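A job that consumes this source could look like the sketch below. It assumes the MyNoParallelSourceScala class above is on the classpath and targets the Flink 1.6.x Scala API. The object name StreamingDemoWithMyNoParallelSourceScala and the map step are hypothetical additions for illustration only.

```scala
import org.apache.flink.streaming.api.scala._
import qyl.study.streaming.custormSource.MyNoParallelSourceScala

// Hypothetical driver object, used only for illustration.
object StreamingDemoWithMyNoParallelSourceScala {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // A plain SourceFunction is non-parallel, so this source runs as a single instance.
    val numbers: DataStream[Long] = env.addSource(new MyNoParallelSourceScala)

    // Illustrative transformation: tag each number before printing it.
    numbers
      .map(n => s"received: $n")
      .print()

    env.execute("StreamingDemoWithMyNoParallelSourceScala")
  }
}
```

Because the source never finishes on its own, the job keeps running until it is cancelled, at which point Flink calls cancel() on the source.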

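Implementation 2 (the parallel variant, implementing ParallelSourceFunction):

package qyl.study.streaming.custormSource

import org.apache.flink.streaming.api.functions.source.ParallelSourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext

/**
  * Parallel source: every parallel instance emits 1, 2, 3, ... once per second.
  *
  * Created by qyl on 2019/03/23.
  */
class MyParallelSourceScala extends ParallelSourceFunction[Long] {

  var count = 1L
  // cancel() is called from a different thread than run(), hence @volatile
  @volatile var isRunning = true

  // Main loop of the source: emit the next number, then wait one second
  override def run(ctx: SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  // Called when the job is cancelled; makes the loop in run() exit
  override def cancel(): Unit = {
    isRunning = false
  }
}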
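The article mentions RichParallelSourceFunction and its open/close hooks but does not show one. The following is a minimal sketch of that variant, assuming the Flink 1.6.x API in which open takes a Configuration. The names MyRichParallelSourceScala and StreamingDemoWithMyRichParallelSourceScala are hypothetical, and the println calls only stand in for real resource handling.

```scala
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
import org.apache.flink.streaming.api.scala._

// Hypothetical class name, used only for illustration.
class MyRichParallelSourceScala extends RichParallelSourceFunction[Long] {

  private var count = 1L
  // cancel() is called from a different thread than run(), hence @volatile
  @volatile private var isRunning = true

  // Called once per parallel instance before run(); acquire resources here.
  override def open(parameters: Configuration): Unit = {
    println(s"open, subtask ${getRuntimeContext.getIndexOfThisSubtask}")
  }

  // Each parallel instance emits its own sequence 1, 2, 3, ...
  override def run(ctx: SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }

  // Called once per parallel instance on shutdown; release resources here.
  override def close(): Unit = {
    println("close")
  }
}

// Hypothetical driver object.
object StreamingDemoWithMyRichParallelSourceScala {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Run two parallel instances of the source; each number is emitted once per instance.
    val numbers: DataStream[Long] =
      env.addSource(new MyRichParallelSourceScala).setParallelism(2)

    numbers.print()

    env.execute("StreamingDemoWithMyRichParallelSourceScala")
  }
}
```

The same setParallelism(2) call works with MyParallelSourceScala. Setting a parallelism greater than 1 on a source that only implements SourceFunction is rejected by Flink.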
