Flink 1.7.2 DataSet: Source Code Analysis of File Split Computation and Split Data Reading
Source code
Overview
- Understand how the file or directory being read is actually split into input splits
- Understand the rules a task follows when reading the data in a split
Conclusions on reading the data file
When the split's start offset is 0
- Actual start position: 0
- End position: read line by line until the read offset is greater than or equal to the split size, then read up to 1 MB of the next split's data; since the current split's data has been fully consumed at that point, overLimit=true, but one more line of the next split's data is still read (to finish the record that crosses the split boundary)
When the split's start offset is greater than 0
- Actual start position: starting from the offset assigned to the split, find the first newline and begin at the position right after it (+1)
- End position: once the read offset is greater than or equal to the split size, this split is done; if the next split still has data, keep reading from the next split up to the first newline; if there is no next split, stop at the current position (a simulation of both rules follows this list)
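The two rules can be made concrete with a small, self-contained simulation. This is a hedged sketch, not Flink source: `readSplit` and its signature are hypothetical, and it only models the two rules above (skip to the byte after the first newline when the start offset is greater than 0; finish the last line by reading into the next split).

```scala
object SplitReadSimulation {

  /** Simulates reading the records of split [offset, offset + length) of `data`,
    * following the two rules above (hypothetical helper, not the Flink API). */
  def readSplit(data: Array[Byte], offset: Int, length: Int): Seq[String] = {
    val NL = '\n'.toByte
    val splitEnd = offset + length

    // Rule for start offset > 0: begin at the byte right after the first newline.
    val start =
      if (offset == 0) 0
      else {
        val nl = data.indexOf(NL, offset)
        if (nl < 0) return Seq.empty // no newline: no record starts in this split
        nl + 1
      }

    // Emit lines as long as the record STARTS inside this split; the line itself
    // may run past splitEnd into the next split (the "one more line" rule).
    val records = Seq.newBuilder[String]
    var pos = start
    while (pos < splitEnd && pos < data.length) {
      val nl = data.indexOf(NL, pos)
      val end = if (nl < 0) data.length else nl
      records += new String(data.slice(pos, end), "UTF-8")
      pos = end + 1
    }
    records.result()
  }

  def main(args: Array[String]): Unit = {
    val data = "c a a\nb c".getBytes("UTF-8") // the 9-byte input described below
    println(readSplit(data, 0, 5)) // List(c a a) -- finishes the line in the next split
    println(readSplit(data, 5, 4)) // List(b c)   -- starts after the first newline
  }
}
```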
Diagrams
- https://github.com/opensourceteams/flink-maven-scala/blob/master/md/images/wordCount/dataset/切分数据案例三.png
- https://github.com/opensourceteams/flink-maven-scala/blob/master/md/images/wordCount/dataset/切分数据案例一.png
- https://github.com/opensourceteams/flink-maven-scala/blob/master/md/images/wordCount/dataset/切安数据案例二.png
Input data
- Note the spaces: the first line is 6 bytes and the second line is 3 bytes (9 bytes in total, one of which is the newline byte)
```
c a a
b c
```
- Converted to integer byte values (the snippet after this block verifies them):
```
99 32 97 32 97 10
98 32 99
```
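If you want to reproduce these byte values yourself, a short snippet like the following works; the path is the same one WordCount.scala reads, and this is only a verification aid, not part of the job:

```scala
import java.nio.file.{Files, Paths}

object DumpBytes {
  def main(args: Array[String]): Unit = {
    // Read the raw file and print every byte as an unsigned integer.
    val bytes = Files.readAllBytes(Paths.get(
      "/opt/n_001_workspaces/bigdata/flink/flink-maven-scala-2/src/main/resources/data/line.txt"))
    println(bytes.map(_ & 0xff).mkString(" ")) // expected: 99 32 97 32 97 10 98 32 99
    println(s"total ${bytes.length} bytes")    // expected: total 9 bytes
  }
}
```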
WordCount.scala
- A Java version would not change the analysis; only the way WordCount.scala is written differs, the overall process and logic are the same
```scala
package com.opensourceteams.module.bigdata.flink.example.dataset.worldcount

import com.opensourceteams.module.bigdata.flink.common.ConfigurationUtil
import org.apache.flink.api.scala.ExecutionEnvironment

/**
  * Batch processing: DataSet WordCount analysis
  */
object WordCountRun {

  def main(args: Array[String]): Unit = {

    // Debug configuration to work around timeouts while stepping through the code
    val env : ExecutionEnvironment = ExecutionEnvironment.createLocalEnvironment(ConfigurationUtil.getConfiguration(true))
    env.setParallelism(2)

    val dataSet = env.readTextFile("file:/opt/n_001_workspaces/bigdata/flink/flink-maven-scala-2/src/main/resources/data/line.txt")

    import org.apache.flink.streaming.api.scala._
    val result = dataSet.flatMap(x => x.split(" ")).map((_,1)).groupBy(0).sum(1)

    result.print()
  }
}
```
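For the 9-byte input above, the job should print the tuples (a,2), (b,1) and (c,2); with parallelism 2 the order in which they appear may vary.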
Source code analysis (splitting the file into input splits)
- Pre-splitting the data: it is called "pre" because it is not final; the actual read takes more factors into account and can deviate from it, as detailed below
- The file is split into a number of FileInputSplit objects derived from the parallelism. It is not simply one FileInputSplit per degree of parallelism; the count comes out of a concrete algorithm, but it is always the parallelism or the parallelism + 1, because the computation divides the file size by the parallelism: if it divides evenly, the number of FileInputSplit objects equals the parallelism; if there is a remainder, it is the parallelism or the parallelism + 1
- The split result for this example (a sketch of this computation follows the listing):
```
[0] file:/opt/n_001_workspaces/bigdata/flink/flink-maven-scala-2/src/main/resources/data/line.txt:0+5
[1] file:/opt/n_001_workspaces/bigdata/flink/flink-maven-scala-2/src/main/resources/data/line.txt:5+4
```
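As a sketch of how these boundaries come about, the following simplifies the logic of FileInputFormat.createInputSplits under the assumption of a single local file, ignoring block-size and tail-merging corrections; it is an illustration, not the full Flink implementation:

```scala
object PreSplitSketch {

  /** Returns (start, length) pairs, mirroring the "file size / parallelism,
    * rounded up" rule described above (simplified, not the full Flink logic). */
  def preSplit(totalLength: Long, minNumSplits: Int): Seq[(Long, Long)] = {
    // ceil(totalLength / minNumSplits): for 9 bytes and 2 splits -> 5
    val maxSplitSize =
      totalLength / minNumSplits + (if (totalLength % minNumSplits == 0) 0 else 1)

    // Cut [0, totalLength) into chunks of maxSplitSize; the last chunk is shorter.
    Iterator.iterate(0L)(_ + maxSplitSize)
      .takeWhile(_ < totalLength)
      .map(start => (start, math.min(maxSplitSize, totalLength - start)))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    println(preSplit(9, 2)) // List((0,5), (5,4)) -- matches line.txt:0+5 and line.txt:5+4
  }
}
```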
ExecutionGraphBuilder.buildGraph
- When the JobMaster is instantiated, it builds the ExecutionGraph by calling ExecutionGraphBuilder.buildGraph(jobGraph)
- The JobGraph consists of JobVertex objects; calling executionGraph.attachJobGraph(sortedTopology) converts the JobGraph into an ExecutionGraph, which consists of ExecutionJobVertex objects, i.e. each JobVertex is turned into an ExecutionJobVertex (a conceptual sketch of this conversion follows the debugger output below)
```
executionGraph.attachJobGraph(sortedTopology);

sortedTopology = {ArrayList@5177} size = 3
 0 = {InputFormatVertex@5459} "CHAIN DataSource (at com.opensourceteams.module.bigdata.flink.example.dataset.worldcount.WordCountRun$.main(WordCountRun.scala:19) (org.apache.flink.api.java.io.TextInp) -> FlatMap (FlatMap at com.opensourceteams.module.bigdata.flink.example.dataset.worldcount.WordCountRun$.main(WordCountRun.scala:23)) -> Map (Map at com.opensourceteams.module.bigdata.flink.example.dataset.worldcount.WordCountRun$.main(WordCountRun.scala:23)) -> Combine (SUM(1)) (org.apache.flink.runtime.operators.DataSourceTask)"
 1 = {JobVertex@5460} "Reduce (SUM(1)) (org.apache.flink.runtime.operators.BatchTask)"
 2 = {OutputFormatVertex@5461} "DataSink (collect()) (org.apache.flink.runtime.operators.DataSinkTask)"
```
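Conceptually, the conversion performed by attachJobGraph can be pictured like this. This is a hypothetical sketch with stand-in case classes, not the Flink API; it only illustrates that each JobVertex becomes an ExecutionJobVertex, which in turn holds one parallel subtask per degree of parallelism:

```scala
// Stand-in types; the real classes are org.apache.flink.runtime.jobgraph.JobVertex
// and org.apache.flink.runtime.executiongraph.ExecutionJobVertex.
case class JobVertexSketch(name: String, parallelism: Int)
case class ExecutionJobVertexSketch(name: String, subtasks: Seq[String])

object AttachJobGraphSketch {

  // One ExecutionJobVertex per JobVertex, one subtask label per parallel instance.
  def attach(sortedTopology: Seq[JobVertexSketch]): Seq[ExecutionJobVertexSketch] =
    sortedTopology.map { jv =>
      ExecutionJobVertexSketch(
        jv.name,
        (1 to jv.parallelism).map(i => s"${jv.name} (${i}/${jv.parallelism})"))
    }

  def main(args: Array[String]): Unit = {
    // The three vertices from the debugger output above, assuming parallelism 2 for each.
    attach(Seq(
      JobVertexSketch("CHAIN DataSource -> FlatMap -> Map -> Combine (SUM(1))", 2),
      JobVertexSketch("Reduce (SUM(1))", 2),
      JobVertexSketch("DataSink (collect())", 2)
    )).foreach(println)
  }
}
```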
- The source of ExecutionGraphBuilder.buildGraph, which eventually calls ExecutionGraph.attachJobGraph:
```java
/**
 * Builds the ExecutionGraph from the JobGraph.
 * If a prior execution graph exists, the JobGraph will be attached. If no prior execution
 * graph exists, then the JobGraph will become attached to a new empty execution graph.
 */
@Deprecated
public static ExecutionGraph buildGraph(
        @Nullable ExecutionGraph prior,
        JobGraph jobGraph,
        Configuration jobManagerConfig,
        ScheduledExecutorService futureExecutor,
        Executor ioExecutor,
        SlotProvider slotProvider,
        ClassLoader classLoader,
        CheckpointRecoveryFactory recoveryFactory,
        Time rpcTimeout,
        RestartStrategy restartStrategy,
        MetricGroup metrics,
        int parallelismForAutoMax,
        BlobWriter blobWriter,
        Time allocationTimeout,
        Logger log)
    throws JobExecutionException, JobException {

    checkNotNull(jobGraph, "job graph cannot be null");

    final St