Apache Flink: Streaming DataStream API (Chapter 3)

Streaming (DataStream API)

DataSource

A data source is where a program reads its input from. Users attach a source to a program via env.addSource(SourceFunction). Flink ships with many predefined SourceFunction implementations, but users can also write their own, either by implementing the SourceFunction interface (non-parallel sources) or the ParallelSourceFunction interface (parallel sources); sources that additionally need lifecycle or state management can extend RichParallelSourceFunction (a sketch appears at the end of the UserDefinedSource section below).

File-based

  • readTextFile(path) - Reads text files, i.e. files that conform to the TextInputFormat specification, line by line and returns them as strings.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream - the source
val text:DataStream[String] = env.readTextFile("hdfs://CentOS:9000/demo/words")

//3. Apply the DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the results to the console
counts.print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")
  • readFile(fileInputFormat, path) - Reads files in the path (once) according to the specified file input format.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream - the source
//   (the path argument passed to readFile overrides the format's own path, hence the null)
var inputFormat:FileInputFormat[String]=new TextInputFormat(null)
val text:DataStream[String] =
env.readFile(inputFormat,"hdfs://CentOS:9000/demo/words")

//3. Apply the DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the results to the console
counts.print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")

  • readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo) - This is the method the two calls above use internally. It reads files from the path according to the given fileInputFormat. Depending on the provided watchType, the source may periodically monitor the path for new data every interval milliseconds (FileProcessingMode.PROCESS_CONTINUOUSLY), or process the data currently in the path once and exit (FileProcessingMode.PROCESS_ONCE). Using the pathFilter, the user can further exclude files from being processed.

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream - the source
var inputFormat:FileInputFormat[String]=new TextInputFormat(null)
val text:DataStream[String] = env.readFile(inputFormat,
 "hdfs://CentOS:9000/demo/words",FileProcessingMode.PROCESS_CONTINUOUSLY,1000)

//3. Apply the DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the results to the console
counts.print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")

This method watches the files in the monitored directory, and whenever a file changes the system re-reads it, which can lead to records being processed more than once. As a rule, do not modify the contents of an existing file; upload a new file instead.
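
The pathFilter mentioned above can also be attached to the input format itself via setFilesFilter. A minimal sketch, where the ".tmp" exclusion rule is just an illustrative assumption:

import org.apache.flink.api.common.io.FilePathFilter
import org.apache.flink.core.fs.Path

//Hypothetical rule: skip files that are still being written (".tmp" suffix)
val filter = new FilePathFilter {
 //returning true means "ignore this path"
 override def filterPath(filePath: Path): Boolean =
  filePath.getName.endsWith(".tmp")
}
inputFormat.setFilesFilter(filter)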

Socket-based

  • socketTextStream - Reads from a socket. Elements can be separated by a delimiter.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream - the source
//   (host, port, line delimiter, max number of connection retries)
val text = env.socketTextStream("CentOS", 9999,'\n',3)

//3. Apply the DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the results to the console
counts.print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")
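
Before submitting the job, start a simple text server on the CentOS host to feed this source, e.g. nc -lk 9999; each line typed into it becomes an element of the stream.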

Collection-based

Reads the data of a collection.

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream - the source
val text = env.fromCollection(List("this is a demo","hello word"))

//3. Apply the DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the results to the console
counts.print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")
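
Besides fromCollection, StreamExecutionEnvironment offers other collection-style helpers for quick experiments; a short sketch:

//create a stream from individual elements
val elems = env.fromElements("this is a demo","hello word")

//create a stream of the numbers 1 to 10
val nums = env.generateSequence(1, 10)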

UserDefinedSource

Users can also define their own input sources.

  • SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.util.Random

class UserDefinedNonParallelSourceFunction extends SourceFunction[String]{

 @volatile //keep changes to the flag visible across threads
 var isRunning:Boolean=true

 val lines:Array[String]=Array("this is a demo","hello world","ni hao ma")

 //Flink invokes run() to produce data; emit records via sourceContext.collect
 override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
  while(isRunning){
   Thread.sleep(100)
   //send a random line downstream
   sourceContext.collect(lines(new Random().nextInt(lines.size)))
  }
 }

 //called on job cancellation - stop the loop and release resources
 override def cancel(): Unit = {
  isRunning=false
 }
}
  • ParallelSourceFunction
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}
import scala.util.Random

class UserDefinedParallelSourceFunction extends ParallelSourceFunction[String]{

 @volatile //keep changes to the flag visible across threads
 var isRunning:Boolean=true

 val lines:Array[String]=Array("this is a demo","hello world","ni hao ma")

 //each parallel instance of the source runs its own copy of run()
 override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
  while(isRunning){
   Thread.sleep(100)
   //send a random line downstream
   sourceContext.collect(lines(new Random().nextInt(lines.size)))
  }
 }

 //called on job cancellation - stop the loop and release resources
 override def cancel(): Unit = {
  isRunning=false
 }
}
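  • RichParallelSourceFunction
    As mentioned in the introduction, extend RichParallelSourceFunction when a parallel source also needs lifecycle hooks (open/close). A minimal sketch following the same pattern as above; the resource comments are illustrative assumptions:

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import scala.util.Random

class UserDefinedRichParallelSourceFunction extends RichParallelSourceFunction[String]{

 @volatile //keep changes to the flag visible across threads
 var isRunning:Boolean=true

 val lines:Array[String]=Array("this is a demo","hello world","ni hao ma")

 //called once per parallel instance before run() - acquire resources here
 override def open(parameters: Configuration): Unit = {
  //e.g. open a database connection (hypothetical)
 }

 //called once per parallel instance after the source stops - release resources here
 override def close(): Unit = {
  //e.g. close the connection (hypothetical)
 }

 override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
  while(isRunning){
   Thread.sleep(100)
   sourceContext.collect(lines(new Random().nextInt(lines.size)))
  }
 }

 override def cancel(): Unit = {
  isRunning=false
 }
}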

Test

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

env.setParallelism(4)

//2. Create the DataStream - the source
//   (or new UserDefinedNonParallelSourceFunction for the non-parallel variant)
val text = env.addSource[String](new UserDefinedParallelSourceFunction)

//3. Apply the DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the results to the console
counts.print()

println(env.getExecutionPlan) //print the execution plan

//5. Execute the streaming job
env.execute("Window Stream WordCount")
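
The JSON printed by getExecutionPlan can be pasted into Flink's plan visualizer (https://flink.apache.org/visualizer/) to render the job graph.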

Kafka Integration

  • Add the Maven dependency
<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-connector-kafka_2.11</artifactId>
	<version>1.10.0</version>
</dependency>
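Note that the _2.11 suffix of the artifact is the Scala binary version and must match the Scala version the project is built with, and the connector version should track the Flink version (1.10.0 here).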
  • SimpleStringSchema
    SimpleStringSchema deserializes only the value of each Kafka record; the key, topic, partition, and offset are ignored.
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream - the source
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")

val text = env.addSource(new FlinkKafkaConsumer[String]("topic01", new SimpleStringSchema(), props))

//3. Apply the DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the results to the console
counts.print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")
  • KafkaDeserializationSchema
    KafkaDeserializationSchema deserializes the full Kafka ConsumerRecord, giving access to the key, topic, partition, and offset in addition to the value.
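
A minimal sketch of such a schema; the tuple layout (key, value, partition, offset) and the class name are illustrative choices, not a fixed API:

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema
import org.apache.kafka.clients.consumer.ConsumerRecord

class UserDefinedKafkaDeserializationSchema
 extends KafkaDeserializationSchema[(String,String,Int,Long)]{

 //a continuously consumed topic never signals end-of-stream
 override def isEndOfStream(nextElement: (String,String,Int,Long)): Boolean = false

 //expose key, value, partition and offset of every record
 override def deserialize(record: ConsumerRecord[Array[Byte],Array[Byte]]): (String,String,Int,Long) = {
  val key = if(record.key()==null) "" else new String(record.key())
  (key, new String(record.value()), record.partition(), record.offset())
 }

 //tell Flink the type produced by this schema
 override def getProducedType: TypeInformation[(String,String,Int,Long)] =
  createTypeInformation[(String,String,Int,Long)]
}

It plugs into the consumer the same way as SimpleStringSchema: env.addSource(new FlinkKafkaConsumer("topic01", new UserDefinedKafkaDeserializationSchema(), props)).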