Flink DataSources
File-based source
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object txt {
  def main(args: Array[String]): Unit = {
    // build the stream-processing environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // read data from a text file
    val sourceStream: DataStream[String] = env.readTextFile("F:\\BigData\\data\\test.txt")
    // import the implicit conversions
    import org.apache.flink.api.scala._
    // process the data
    val result: DataStream[(String, Int)] = sourceStream
      .flatMap(x => x.split(" ")) // split each line on spaces
      .map(x => (x, 1))           // count each word as 1
      .keyBy(0)                   // group by the word at tuple index 0
      .sum(1)                     // sum the counts at tuple index 1 per word
    // write the results; note: the target directory must not already exist
    result.writeAsText("F:\\BigData\\data\\result")
    // start the job
    env.execute("FlinkStream")
  }
}
Result
A series of result files is generated under the target directory, one per parallel subtask.
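If a single output file is preferred, one option is to force the sink down to a parallelism of 1; this is standard DataStream API behavior, not specific to this example. A minimal sketch, reusing the result stream from above:

// a single sink subtask writes everything, producing one file instead of several
result.writeAsText("F:\\BigData\\data\\result").setParallelism(1)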
Socket-based source
Just replace the file-reading call above with the following:
// specify the server host and port
val socketDS: DataStream[String] = env.socketTextStream("node01", 9999)
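For reference, a minimal end-to-end sketch of the socket variant; the host node01 and port 9999 are placeholders, and a line source such as nc -lk 9999 running on that host is assumed:

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object socket {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    // read lines from the given host and port
    val socketDS: DataStream[String] = env.socketTextStream("node01", 9999)
    val result: DataStream[(String, Int)] = socketDS
      .flatMap(x => x.split(" "))
      .map(x => (x, 1))
      .keyBy(0)
      .sum(1)
    result.print()
    env.execute("FlinkSocketStream")
  }
}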
Collection-based source
fromCollection(Collection)
Creates a data stream from a collection; all elements of the collection must be of the same type.
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object collection {
  def main(args: Array[String]): Unit = {
    val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // import the implicit conversions
    import org.apache.flink.api.scala._
    // prepare the data source: an array of strings
    val array = Array("hello world", "world spark", "flink test", "spark hive", "test")
    val fromArray: DataStream[String] = environment.fromCollection(array)
    // val value: DataStream[String] = environment.fromElements("hello world")
    val resultDataStream: DataStream[(String, Int)] = fromArray
      .flatMap(x => x.split(" "))
      .map(x => (x, 1))
      .keyBy(0)
      .sum(1)
    // print the result
    resultDataStream.print()
    // start the job
    environment.execute()
  }
}
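The commented-out line in the example hints at a related helper: fromElements builds a stream directly from individual values of the same type, without wrapping them in a collection first. A minimal sketch:

// build a stream from literal elements
val fromElems: DataStream[String] = environment.fromElements("hello world", "flink test")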
Custom sources
Both approaches below define a single data source; they differ only in the parallelism they support.
Custom single-parallelism source
- Add the custom source function via addSource
- Implement the run() method (and cancel()) in the custom source function
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object mySource {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    // attach the custom source
    val getSource: DataStream[Long] = env.addSource(new MySourceFunc)
    // keep only values greater than 2
    val result: DataStream[Long] = getSource.filter(x => x > 2)
    result.print()
    env.execute()
  }
}
class MySourceFunc extends SourceFunction[Long] {
  private var number = 1L
  private var isRunning = true

  override def cancel(): Unit = {
    isRunning = false
  }

  override def run(sourceContext: SourceFunction.SourceContext[Long]): Unit = {
    // emit an increasing number once per second until cancelled
    while (isRunning) {
      number += 1
      sourceContext.collect(number)
      Thread.sleep(1000)
    }
  }
}
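Note that a plain SourceFunction is inherently non-parallel: asking for a higher parallelism on it is rejected when the job is built. A minimal sketch of the failure, reusing env and MySourceFunc from above:

// fails with an IllegalArgumentException: a non-parallel source must keep parallelism 1
val bad: DataStream[Long] = env.addSource(new MySourceFunc).setParallelism(2)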
Custom multi-parallelism source
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
/**
 * A source that can run with parallelism greater than 1
 */
object MyMultipartSourceRun {
  def main(args: Array[String]): Unit = {
    // build the stream-processing environment
    val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    // attach the source with a parallelism of 2
    val getSource: DataStream[Long] = environment.addSource(new MultipartSource).setParallelism(2)
    // keep only even values
    val resultStream: DataStream[Long] = getSource.filter(x => x % 2 == 0)
    resultStream.setParallelism(2).print()
    environment.execute()
  }
}
// extend ParallelSourceFunction to define a source that supports parallelism > 1
class MultipartSource extends ParallelSourceFunction[Long] {
  private var number = 1L
  private var isRunning = true

  override def run(sourceContext: SourceFunction.SourceContext[Long]): Unit = {
    // loop on the isRunning flag so that cancel() actually stops the source
    while (isRunning) {
      number += 1
      sourceContext.collect(number)
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}
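With a parallelism of 2, each source subtask runs its own copy of run() with its own counter, so every value is emitted once per subtask; the instances do not share any state.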