flink-5 Flink DataSources

File-based sources

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object txt {
  def main(args: Array[String]): Unit = {
    //build the stream processing environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //read data from a text file
    val sourceStream: DataStream[String] = env.readTextFile("F:\\BigData\\data\\test.txt")

    //import the implicit conversions
    import org.apache.flink.api.scala._

    //process the data
    val result: DataStream[(String, Int)] = sourceStream
      .flatMap(x => x.split(" ")) //split each line on spaces
      .map(x => (x, 1))   //map each word to a count of 1
      .keyBy(0)           //group by the word at tuple index 0
      .sum(1)             //sum the counts at tuple index 1 per word

    //write the results as text; note: the target directory must not already exist
    result.writeAsText("F:\\BigData\\data\\result")
    //start the job
    env.execute("FlinkStream")
  }
}

Result
A series of result files is generated under the target directory (one file per parallel subtask of the sink).
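Note that readTextFile reads the file once and then the source finishes. If the path should be monitored for changes instead, the lower-level readFile variant can poll it. A minimal sketch, assuming the same dependencies as above (the object name txtContinuous and the 1000 ms poll interval are illustrative; in PROCESS_CONTINUOUSLY mode a modified file is re-processed in full):

import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object txtContinuous {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._

    val path = "F:\\BigData\\data\\test.txt"
    //re-scan the path every 1000 ms and emit its contents when it changes
    val stream: DataStream[String] = env.readFile(
      new TextInputFormat(new Path(path)),
      path,
      FileProcessingMode.PROCESS_CONTINUOUSLY,
      1000L)

    stream.print()
    env.execute()
  }
}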

Socket-based sources

Just replace the file-reading call in the example above with the following:

//specify the server host and port
val socketDS: DataStream[String] = env.socketTextStream("node01", 9999)
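To try this out, start a plain TCP server on node01 before submitting the job (assuming netcat is installed):

nc -lk 9999

Each line typed into that terminal then arrives as one element of the stream.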

Collection-based sources

fromCollection(Collection)
Creates a data stream from a collection; all elements in the collection must be of the same type.

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object collection {
  def main(args: Array[String]): Unit = {
    val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //import the implicit conversions
    import org.apache.flink.api.scala._

    //prepare the data source: an array
    val array = Array("hello world", "world spark", "flink test", "spark hive", "test")
    val fromArray: DataStream[String] = environment.fromCollection(array)

    //  val value: DataStream[String] = environment.fromElements("hello world")
    val resultDataStream: DataStream[(String, Int)] = fromArray
      .flatMap(x => x.split(" "))
      .map(x => (x, 1))
      .keyBy(0)
      .sum(1)

    //print the result
    resultDataStream.print()

    //start the job
    environment.execute()
  }
}
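The commented-out line above hints at the closely related fromElements source, which builds a stream directly from the values passed to it. A minimal sketch (the object name elements is illustrative):

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object elements {
  def main(args: Array[String]): Unit = {
    val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._

    //fromElements takes the elements directly instead of a collection
    val words: DataStream[String] = environment.fromElements("hello world", "flink test")
    words.print()
    environment.execute()
  }
}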


Custom sources

Both approaches below define a single custom data source; they differ only in the parallelism they support.

Custom single-parallelism source

  • Add the custom source function with addSource
  • Implement the run() method (and cancel()) in the custom source function
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object mySource {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    //add the custom source
    val getSource: DataStream[Long] = env.addSource(new MySourceFunc)
    val result: DataStream[Long] = getSource.filter(x => x > 2)
    result.print()
    env.execute()
  }
}

//a single-parallelism source: emits an incrementing number once per second
class MySourceFunc extends SourceFunction[Long] {
  private var number = 1L
  private var isRunning = true

  override def cancel(): Unit = {
    isRunning = false
  }

  override def run(sourceContext: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      number += 1
      sourceContext.collect(number)
      Thread.sleep(1000)
    }
  }
}
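Note that a plain SourceFunction is inherently non-parallel: requesting more than one instance of it fails when the job graph is built. For example:

//throws an IllegalArgumentException at job construction time,
//because a plain SourceFunction can only run with parallelism 1
val getSource: DataStream[Long] = env.addSource(new MySourceFunc).setParallelism(2)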


Custom multi-parallelism source

import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

/**
  * A multi-parallelism source
  */
object MyMultipartSourceRun {

  def main(args: Array[String]): Unit = {
    //build the stream processing environment
    val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    //add the source, running two parallel instances
    val getSource: DataStream[Long] = environment.addSource(new MultipartSource).setParallelism(2)
    //keep only the even numbers
    val resultStream: DataStream[Long] = getSource.filter(x => x % 2 == 0)
    resultStream.setParallelism(2).print()
    environment.execute()
  }
}

//extend ParallelSourceFunction to define a source that can run with parallelism > 1
class MultipartSource extends ParallelSourceFunction[Long] {
  private var number = 1L
  private var isRunning = true

  override def run(sourceContext: SourceFunction.SourceContext[Long]): Unit = {
    //check the cancel flag so the source actually stops when the job is cancelled
    while (isRunning) {
      number += 1
      sourceContext.collect(number)
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}
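If each parallel instance needs per-instance setup or its own subtask index, RichParallelSourceFunction can be extended instead; it adds the rich-function lifecycle (open/close) and access to the runtime context. A minimal sketch, assuming the same dependencies as above (the class name MyRichSource is illustrative):

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

//each parallel instance emits its own subtask index once per second
class MyRichSource extends RichParallelSourceFunction[Long] {
  private var isRunning = true
  private var subtask = 0

  override def open(parameters: Configuration): Unit = {
    //called once per parallel instance before run()
    subtask = getRuntimeContext.getIndexOfThisSubtask
  }

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(subtask.toLong)
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}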
