1: Spark's split mechanism when reading HDFS
Under the hood, Spark's sc.textFile calls Hadoop's code (the old-API org.apache.hadoop.mapred.FileInputFormat.getSplits), so the split mechanism is Hadoop's.
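A minimal usage sketch to make the entry point concrete (the HDFS path is hypothetical; the minPartitions argument to textFile is what getSplits later receives as numSplits):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TextFileSplits {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("splits").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // minPartitions = 2 here; this value flows down to
        // FileInputFormat.getSplits(job, numSplits) as numSplits
        JavaRDD<String> lines = sc.textFile("hdfs://ns1/user/demo/input.txt", 2);
        System.out.println("partitions = " + lines.getNumPartitions());
        sc.stop();
    }
}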
goalSize = totalSize / numSplits, where totalSize is the total size in bytes of the input files and numSplits is the number of partitions requested; if none is given, Spark's default minPartitions is 2 (strictly, min(defaultParallelism, 2)). In other words, goalSize is the target number of bytes per partition.
// In org.apache.hadoop.mapred.FileInputFormat.getSplits:
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE, 1),
                        minSplitSize);

// where SPLIT_MINSIZE = "mapreduce.input.fileinputformat.split.minsize"
// and the field default is:
private long minSplitSize = 1;

// then, for each input file:
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(goalSize, minSize, blockSize);
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
}
With the default minSize of 1, computeSplitSize reduces to min(goalSize, blockSize): when the per-partition target goalSize is larger than blockSize (128 MB by default), it returns blockSize; when goalSize is smaller, it returns goalSize. Only a minSize configured above these values can push splitSize higher.
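A small sketch of this arithmetic (the 300 MB file size and the minPartitions values are made-up numbers for illustration):

public class SplitSizeDemo {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB HDFS block
        long minSize = 1;                      // default minSplitSize

        // Case 1: 300 MB file, minPartitions = 2 -> goalSize = 150 MB > blockSize,
        // so splitSize is capped at the 128 MB block size.
        long goalSize1 = (300L * 1024 * 1024) / 2;
        System.out.println(computeSplitSize(goalSize1, minSize, blockSize)); // 134217728

        // Case 2: 300 MB file, minPartitions = 6 -> goalSize = 50 MB < blockSize,
        // so splitSize is goalSize and we get more, smaller splits.
        long goalSize2 = (300L * 1024 * 1024) / 6;
        System.out.println(computeSplitSize(goalSize2, minSize, blockSize)); // 52428800
    }
}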
while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
    String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, length - bytesRemaining,
                                                        splitSize, clusterMap);
    splits.add(makeSplit(path, length - bytesRemaining, splitSize, splitHosts[0], splitHosts[1]));
    bytesRemaining -= splitSize;
}
where

long bytesRemaining = file.getLen();
private static final double SPLIT_SLOP = 1.1; // 10% slop
The loop keeps carving off splits of length splitSize (blockSize or goalSize) for as long as the remaining bytes divided by splitSize exceed SPLIT_SLOP (1.1), i.e. while more than 1.1 splits' worth of data is left. After the loop, whatever remains (at most 1.1 × splitSize) is added as one final split, so a file only slightly larger than a multiple of splitSize does not get an extra tiny partition.
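To see how many splits a given file produces, here is a self-contained replay of the same loop plus the trailing-remainder step (the 130 MB and 300 MB file lengths are made-up values; in the real getSplits the tail split is added right after the loop):

import java.util.ArrayList;
import java.util.List;

public class SplitCountDemo {
    private static final double SPLIT_SLOP = 1.1;   // 10% slop

    // Carve off splitSize-sized chunks while the remainder is more than 10%
    // larger than splitSize, then add the tail as one final split.
    static List<Long> splitLengths(long fileLen, long splitSize) {
        List<Long> splits = new ArrayList<>();
        long bytesRemaining = fileLen;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(splitSize);
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits.add(bytesRemaining);   // the final, possibly smaller, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // 130 MB file with a 128 MB splitSize: 130/128 ≈ 1.016 < 1.1,
        // so the whole file becomes a single 130 MB split.
        System.out.println(splitLengths(130 * mb, 128 * mb));
        // 300 MB file: two 128 MB splits plus a 44 MB tail.
        System.out.println(splitLengths(300 * mb, 128 * mb));
    }
}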