1. textFile operator source code
1.1. Source code
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
1.2. textFile parameters
path: the location of the text file(s); it may be on HDFS, a local file system (available on all nodes), or any Hadoop-supported file system;
minPartitions: the minimum number of partitions; the default is determined by defaultMinPartitions;
1.3. Operator logic
Calls the hadoopFile operator to build the file RDD, then maps each (LongWritable, Text) pair to the value's string, i.e. one line of text;
Sets the file path as the RDD's name;
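A minimal usage sketch (the master URL and input path are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object TextFileDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("textFile-demo"))
    // Ask for at least 3 partitions; the actual count may be higher (see section 5).
    val lines = sc.textFile("data/words.txt", minPartitions = 3)
    println(s"partitions = ${lines.getNumPartitions}")
    sc.stop()
  }
}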
2. The defaultMinPartitions parameter
2.1. defaultMinPartitions
2.1.1. Source code
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
2.1.2. Logic
Takes the smaller of the defaultParallelism value (the default parallelism) and 2;
2.2. defaultParallelism
2.2.1. Source code
// SparkContext
def defaultParallelism: Int = {
  assertNotStopped()
  taskScheduler.defaultParallelism
}

// TaskSchedulerImpl: delegates to the scheduler backend
override def defaultParallelism(): Int = backend.defaultParallelism()

// CoarseGrainedSchedulerBackend (cluster mode)
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}

// LocalSchedulerBackend (local mode)
override def defaultParallelism(): Int =
  scheduler.conf.getInt("spark.default.parallelism", totalCores)
2.2.2. Logic
The taskScheduler's default parallelism is determined in one of two ways:
1. If the global spark.default.parallelism setting has a value, that value is used.
2. If spark.default.parallelism is not set:
in cluster mode, the larger of the total executor core count and 2;
in local mode, the total number of local cores.
2.3. defaultMinPartitions logic
If spark.default.parallelism is set to some value p:
when p > 2, defaultMinPartitions = 2, i.e. textFile()'s default minimum partition count is 2;
when p <= 2, defaultMinPartitions = p, i.e. textFile()'s default minimum partition count is p;
If spark.default.parallelism is not set:
in cluster mode, defaultMinPartitions = 2, since max(total cores, 2) >= 2;
in local mode, defaultMinPartitions = min(total cores, 2).
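The combined rule can be sketched as a small helper (a hypothetical function, not part of Spark's API):

// Hypothetical helper that mirrors the decision table above.
def defaultMinPartitions(
    sparkDefaultParallelism: Option[Int], // spark.default.parallelism, if set
    totalCores: Int,
    isLocalMode: Boolean): Int = {
  val defaultParallelism = sparkDefaultParallelism.getOrElse(
    if (isLocalMode) totalCores else math.max(totalCores, 2))
  math.min(defaultParallelism, 2)
}

// defaultMinPartitions(Some(8), totalCores = 16, isLocalMode = false) == 2
// defaultMinPartitions(Some(1), totalCores = 16, isLocalMode = false) == 1
// defaultMinPartitions(None, totalCores = 1, isLocalMode = true) == 1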
3. The hadoopFile operator
3.1. Source code
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()
  // Force the local FileSystem to initialize so Hadoop's configuration is loaded.
  FileSystem.getLocal(hadoopConfiguration)
  // Broadcast the Hadoop configuration once instead of shipping it with every task.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
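3.2. Logic
Broadcasts the Hadoop configuration so it is shipped to the executors only once;
Builds a HadoopRDD from the given InputFormat, key class, value class, and minimum partition count, and names the RDD after the input path;
As the section 1.1 source shows, textFile is this call plus a map over the values. A sketch of the equivalent direct call (the input path is hypothetical):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val lines = sc.hadoopFile("data/words.txt", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], minPartitions = 3)
  .map(_._2.toString) // keep only the line text, as textFile does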
4. RDD partitioning rules
4.1. Source code
override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()
  SparkHadoopUtil.get.addCredentials(jobConf)
  try {
    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
    val inputSplits = if (ignoreEmptySplits) {
      allInputSplits.filter(_.getLength > 0)
    } else {
      allInputSplits
    }
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  } catch {
    case e: InvalidInputException if ignoreMissingFiles =>
      logWarning(s"${jobConf.get(FileInputFormat.INPUT_DIR)} doesn't exist and no" +
        s" partitions returned from this path.", e)
      Array.empty[Partition]
  }
}
4.2. Logic
Asks the InputFormat to split the input files into an array of InputSplits according to the rules in section 5;
The size of that split array becomes the RDD's partition count (empty splits may be filtered out when ignoreEmptySplits is enabled);
Each split is wrapped in a HadoopPartition and stored in the RDD's partition list;
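A minimal sketch of the wrapping step (InputSplit and HadoopPartition here are simplified stand-ins, not the real classes):

// Simplified stand-in types for illustration only.
case class InputSplit(path: String, offset: Long, length: Long)
case class HadoopPartition(rddId: Int, index: Int, split: InputSplit)

def toPartitions(rddId: Int, splits: Seq[InputSplit],
    ignoreEmptySplits: Boolean): Array[HadoopPartition] = {
  val kept = if (ignoreEmptySplits) splits.filter(_.length > 0) else splits
  // The position of a split in the array becomes the partition index.
  kept.zipWithIndex.map { case (s, i) => HadoopPartition(rddId, i, s) }.toArray
}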
5. File split rules
5.1. Source code
public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    StopWatch sw = (new StopWatch()).start();
    FileStatus[] files = this.listStatus(job);
    job.setLong("mapreduce.input.fileinputformat.numinputfiles", (long) files.length);
    long totalSize = 0L;

    // Sum the sizes of all input files; directories are not allowed here.
    FileStatus[] var7 = files;
    int var8 = files.length;
    for (int var9 = 0; var9 < var8; ++var9) {
        FileStatus file = var7[var9];
        if (file.isDirectory()) {
            throw new IOException("Not a file: " + file.getPath());
        }
        totalSize += file.getLen();
    }

    // goalSize: total input size divided by the requested number of splits.
    long goalSize = totalSize / (long) (numSplits == 0 ? 1 : numSplits);
    // minSize: lower bound on the split size (mapreduce.input.fileinputformat.split.minsize).
    long minSize = Math.max(job.getLong("mapreduce.input.fileinputformat.split.minsize", 1L), this.minSplitSize);
    ArrayList<FileSplit> splits = new ArrayList(numSplits);
    NetworkTopology clusterMap = new NetworkTopology();

    FileStatus[] var13 = files;
    int var14 = files.length;
    for (int var15 = 0; var15 < var14; ++var15) {
        FileStatus file = var13[var15];
        Path path = file.getPath();
        long length = file.getLen();
        if (length == 0L) {
            splits.add(this.makeSplit(path, 0L, length, new String[0]));
        } else {
            FileSystem fs = path.getFileSystem(job);
            BlockLocation[] blkLocations;
            if (file instanceof LocatedFileStatus) {
                blkLocations = ((LocatedFileStatus) file).getBlockLocations();
            } else {
                blkLocations = fs.getFileBlockLocations(file, 0L, length);
            }

            if (!this.isSplitable(fs, path)) {
                // An unsplittable file becomes a single split.
                String[][] splitHosts = this.getSplitHostsAndCachedHosts(blkLocations, 0L, length, clusterMap);
                splits.add(this.makeSplit(path, 0L, length, splitHosts[0], splitHosts[1]));
            } else {
                long blockSize = file.getBlockSize();
                // splitSize = max(minSize, min(goalSize, blockSize)).
                long splitSize = this.computeSplitSize(goalSize, minSize, blockSize);

                long bytesRemaining;
                String[][] splitHosts;
                // Carve off full splits while the remainder exceeds 1.1x splitSize.
                for (bytesRemaining = length; (double) bytesRemaining / (double) splitSize > 1.1D; bytesRemaining -= splitSize) {
                    splitHosts = this.getSplitHostsAndCachedHosts(blkLocations, length - bytesRemaining, splitSize, clusterMap);
                    splits.add(this.makeSplit(path, length - bytesRemaining, splitSize, splitHosts[0], splitHosts[1]));
                }
                // The remainder (up to 1.1x splitSize) becomes the last split.
                if (bytesRemaining != 0L) {
                    splitHosts = this.getSplitHostsAndCachedHosts(blkLocations, length - bytesRemaining, bytesRemaining, clusterMap);
                    splits.add(this.makeSplit(path, length - bytesRemaining, bytesRemaining, splitHosts[0], splitHosts[1]));
                }
            }
        }
    }

    sw.stop();
    if (LOG.isDebugEnabled()) {
        LOG.debug("Total # of splits generated by getSplits: " + splits.size() + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
    }
    return (InputSplit[]) splits.toArray(new FileSplit[splits.size()]);
}
5.2. Logic
The minimum split size can be set with the mapreduce.input.fileinputformat.split.minsize parameter.
Split-size rule: take the smaller of the goal size (goalSize = totalSize / numSplits) and the block size, then the larger of that and the minimum size, i.e. splitSize = max(minSize, min(goalSize, blockSize)); because minSize is usually left very small, splitSize is in practice the smaller of goalSize and blockSize.
A split may end up larger than splitSize but no larger than 1.1 × splitSize: the loop only cuts a new split while the remaining bytes exceed 1.1 × splitSize, so the final split absorbs up to a 10% overhang.
The split count can therefore exceed the minimum partition count: if the remainder left after carving full splits is more than 0.1 × splitSize, it becomes an extra split, and the actual split count = minimum partition count + 1.
Worked example: file size totalSize = 1300MB, minimum partition count numSplits = 10.
goalSize = totalSize / numSplits = 1300MB / 10 = 130MB.
By the split-size rule, goalSize (130MB) > block size (128MB), so splitSize = 128MB.
After nine 128MB splits, 148MB remain; since (128 + 20 = 148MB) > (128 × 1.1 = 140.8MB), a tenth 128MB split is cut, leaving 20MB as an eleventh split.
The file therefore yields 11 splits, one more than the minimum partition count of 10.
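The arithmetic can be checked with a simplified model of getSplits for a single splittable file (a sketch; SPLIT_SLOP = 1.1 as in the Hadoop source):

import scala.collection.mutable.ArrayBuffer

// Simplified model: returns the sizes of the splits for one splittable file.
def splitSizes(totalSize: Long, numSplits: Int, blockSize: Long,
    minSize: Long = 1L): Seq[Long] = {
  val goalSize = totalSize / math.max(numSplits, 1)
  // splitSize = max(minSize, min(goalSize, blockSize)), as in computeSplitSize.
  val splitSize = math.max(minSize, math.min(goalSize, blockSize))
  val sizes = ArrayBuffer.empty[Long]
  var bytesRemaining = totalSize
  // Cut a full split only while the remainder exceeds 1.1x the split size.
  while (bytesRemaining.toDouble / splitSize > 1.1) {
    sizes += splitSize
    bytesRemaining -= splitSize
  }
  if (bytesRemaining != 0) sizes += bytesRemaining
  sizes.toSeq
}

// splitSizes(1300, 10, 128) => ten splits of 128 plus one of 20 (11 splits).
// splitSizes(140, 1, 128)   => a single split of 140 (<= 1.1 x 128).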
6. Summary
The textFile() operator can process a batch of files in one call; the path parameter accepts multiple comma-separated paths (see the snippet after this list);
The minimum partition count is derived from the spark.default.parallelism value, the total core count, and the constant 2;
The RDD's actual partition count is determined by the minimum partition count together with the file split count; the actual count can be one greater than the minimum, as in section 5.2;
When a file is split, a split may be up to 1.1 × the regular split size;
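For instance (hypothetical paths), several inputs, including globs, can be passed in a single call; FileInputFormat.setInputPaths splits the string on commas:

val rdd = sc.textFile("/data/logs/a.txt,/data/logs/b.txt,/data/logs/2024-*.txt")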