Flink core programming concepts + data source programming + parallelism
1. Building on the earlier basics
Here are some other ways of specifying keys:
package com.ruozedata.flink.ByKey

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object SpecifyingKeysApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("hadoop001", 9527)

    text.flatMap(_.split(","))
      .map(x => WC(x, 1))
      .keyBy(_.word)
      .sum("count")
      .print()

    env.execute(getClass.getCanonicalName)
  }

  case class WC(word: String, count: Int)
}
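For comparison, the same word count can be keyed by tuple position instead of by a case class field. This is a minimal sketch that reuses the text stream from the example above; the positional keyBy/sum variants belong to the older Flink Scala API and are deprecated in recent releases:

// keying by tuple position rather than by the WC field name
text.flatMap(_.split(","))
  .map(x => (x, 1))
  .keyBy(0)   // key on the first tuple field (the word)
  .sum(1)     // sum the second tuple field (the count)
  .print()

The case class version above is usually preferred because the key is referenced by name rather than by position.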
With the case class in place, let's write our own transformation functions:
package com.ruozedata.flink.ByKey

import com.ruozedata.flink.bean.Domain.Access
import org.apache.flink.api.common.functions.{FilterFunction, RichMapFunction}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._

object SpecifyingTransformationFunctionsApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.readTextFile("data/access.log")

    val stream = text.map(new RuozedataMap)
      .filter(new RuozedataFilter(4000)).print()
    // FunctionFilter(stream)

    env.execute(getClass.getCanonicalName)
  }

  // Approach 1: a plain lambda filter
  def FunctionFilter(stream: DataStream[Access]): Unit = {
    stream.filter(_.traffic > 4000).print()
  }
}

// Approach 2: a custom FilterFunction
class RuozedataFilter(traffic: Long) extends FilterFunction[Access] {
  override def filter(value: Access): Boolean = value.traffic > traffic
}

class RuozedataMap extends RichMapFunction[String, Access] {
  override def map(value: String): Access = {
    println("-----------------aaaaaaaaaaaaaaa--------------------")
    val strings = value.split(",")
    Access(strings(0).trim.toLong, strings(1).trim, strings(2).trim.toDouble)
  }
}
From the output you can see that the parallelism is 5.
If you set it to 2, nothing changes in the result.
Now let's introduce the lifecycle (life cycle) methods of rich functions:
override def open(parameters: Configuration): Unit = {
  println("-----------------open invoked...--------------------")
}
Here the lifecycle method open is invoked only twice.
So initialization work, such as opening connections, should be done inside open.
Now add the close method:
override def close(): Unit = {
  println("-----------------close invoked...--------------------")
}
close was executed 4 times. Why is that?
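Putting the pieces together, RuozedataMap with both lifecycle hooks would look roughly like this (a minimal sketch; the comments about connection setup and teardown are illustrative, not part of the original code):

import com.ruozedata.flink.bean.Domain.Access
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

class RuozedataMap extends RichMapFunction[String, Access] {

  // invoked once per parallel subtask before any map() call:
  // the right place to initialize expensive resources such as a DB connection
  override def open(parameters: Configuration): Unit = {
    println("-----------------open invoked...--------------------")
  }

  // invoked once per record
  override def map(value: String): Access = {
    val strings = value.split(",")
    Access(strings(0).trim.toLong, strings(1).trim, strings(2).trim.toDouble)
  }

  // invoked when the subtask shuts down: release resources here
  override def close(): Unit = {
    println("-----------------close invoked...--------------------")
  }
}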
2. Data source programming
package com.ruozedata.flink.source

import com.ruozedata.flink.bean.Domain.Access
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object SourceApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val stream = env.fromCollection(List(
      Access(202112120010L, "ruozedata.com", 2000),
      Access(202112120010L, "ruozedata.ke.qq.com", 6000),
      Access(202112120010L, "github.com/ruozedata", 5000),
      Access(202112120010L, "ruozedata.com", 4000),
      Access(202112120010L, "ruozedata.ke.qq.com", 1000)
    ))
    println("--------" + stream.parallelism)

    val filterStream = stream.filter(x => x.traffic > 4000)
    println("--------" + filterStream.parallelism)

    env.execute(getClass.getCanonicalName)
  }
}
You can see that after the filter, the parallelism becomes the number of CPU cores.
What happens if you set a global parallelism?
env.setParallelism(4)
val stream = env.socketTextStream("hadoop001", 9527)
The socket source is still single-parallelism.
Reading a text file, however, is parallel,
and fromParallelCollection is parallel as well.
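A quick way to verify this is to print the parallelism of each kind of source. This is a minimal sketch; NumberSequenceIterator is just one example of a SplittableIterator that fromParallelCollection accepts:

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.NumberSequenceIterator

object SourceParallelismCheckApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)

    // socketTextStream is a non-parallel source: parallelism stays 1
    println("socket:   " + env.socketTextStream("hadoop001", 9527).parallelism)

    // readTextFile can run with the configured parallelism
    println("file:     " + env.readTextFile("data/access.log").parallelism)

    // fromParallelCollection takes a SplittableIterator and is parallel too
    println("parallel: " + env.fromParallelCollection(new NumberSequenceIterator(1, 100)).parallelism)

    // no env.execute(): we only inspect the planned parallelism here
  }
}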
2.1 SourceFunction: single parallelism
SourceFunction: single parallelism only
ParallelSourceFunction: supports parallelism
RichParallelSourceFunction: supports parallelism, plus the rich lifecycle methods
How do we use a SourceFunction?
It is handy for generating mock data:
package com.ruozedata.flink.source

import com.ruozedata.flink.bean.Domain.Access
import org.apache.flink.streaming.api.functions.source.SourceFunction

import scala.util.Random

class AccessSource extends SourceFunction[Access] {

  var isRunning = true

  override def run(sourceContext: SourceFunction.SourceContext[Access]): Unit = {
    val random = new Random()
    val domains = Array("ruozedata.com", "ruoze.ke.qq.com", "github.com/ruozedata")

    while (isRunning) {
      val timestamp = System.currentTimeMillis()
      (1 to 10).foreach { x =>
        sourceContext.collect(Access(timestamp, domains(random.nextInt(domains.length)), random.nextInt(1000 + x)))
      }
      Thread.sleep(5000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}
val stream = env.addSource(new AccessSource)
println("--------" + stream.parallelism)
val filterStream = stream.filter(x => x.traffic > 200)
println("--------" + filterStream.parallelism)
stream.print()
Of course, the parallelism can also be set directly on the source:
val stream = env.addSource(new AccessSource).setParallelism(2)
An error is thrown: the parallelism of this source must be 1.
2.2 SourceFunction: multiple parallelism
To make the source parallel, simply extend ParallelSourceFunction instead of SourceFunction, as sketched below.
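Here is a minimal sketch of a parallel version of AccessSource (assuming the same Access case class; the body is unchanged, only the parent trait differs):

import com.ruozedata.flink.bean.Domain.Access
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}

import scala.util.Random

// each parallel subtask runs its own copy of run(), so setParallelism(2) is now allowed
class AccessParallelSource extends ParallelSourceFunction[Access] {

  @volatile var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Access]): Unit = {
    val random = new Random()
    val domains = Array("ruozedata.com", "ruoze.ke.qq.com", "github.com/ruozedata")

    while (isRunning) {
      val timestamp = System.currentTimeMillis()
      (1 to 10).foreach { x =>
        ctx.collect(Access(timestamp, domains(random.nextInt(domains.length)), random.nextInt(1000 + x)))
      }
      Thread.sleep(5000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

With this, env.addSource(new AccessParallelSource).setParallelism(2) no longer complains.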