Spark's Partitioning Strategy for Reading External Data
Let's start with a snippet of code that reads external data using textFile.
val conf: SparkConf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]")
val sc: SparkContext = new SparkContext(conf)
val lines: RDD[String] = sc.textFile("/Users/liyapeng/Spark/data",2)
textFile treats files as the data source for processing. When the user does not supply a value, the default minimum number of partitions for a Hadoop RDD is 2. Note that this minimum partition count is not necessarily the final partition count.
/**
* Default min number of partitions for Hadoop RDDs when not given by user
* Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
* The reasons for this are discussed in https://github.com/mesos/spark/pull/718
*/
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
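The capping rule above can be illustrated with a tiny sketch in Java. The method name and the sample core counts are hypothetical stand-ins, not Spark's API: whatever `defaultParallelism` is (for example, the number of cores under `local[*]`), `defaultMinPartitions` never exceeds 2.

```java
// Illustrative only: mirrors math.min(defaultParallelism, 2) from the
// Scala source above. The values 8 and 1 are hypothetical core counts.
public class DefaultMinPartitions {
    static int defaultMinPartitions(int defaultParallelism) {
        return Math.min(defaultParallelism, 2);
    }

    public static void main(String[] args) {
        System.out.println(defaultMinPartitions(8)); // 8-core local[*] machine -> 2
        System.out.println(defaultMinPartitions(1)); // single-core machine     -> 1
    }
}
```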
Under the hood, Spark reads files using Hadoop's input format machinery.
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
package org.apache.hadoop.mapred;
public class TextInputFormat extends FileInputFormat<LongWritable, Text>
Summing the total number of bytes to read:
/** Splits files returned by {@link #listStatus(JobConf)} when
 * they're too big.*/
public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
  StopWatch sw = new StopWatch().start();
  FileStatus[] stats = listStatus(job);

  // Save the number of input files for metrics/loadgen
  job.setLong(NUM_INPUT_FILES, stats.length);
  long totalSize = 0;                           // compute total size
  boolean ignoreDirs = !job.getBoolean(INPUT_DIR_RECURSIVE, false)
    && job.getBoolean(INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, false);

  List<FileStatus> files = new ArrayList<>(stats.length);
  for (FileStatus file: stats) {                // check we have valid files
    if (file.isDirectory()) {
      if (!ignoreDirs) {
        throw new IOException("Not a file: "+ file.getPath());
      }
    } else {
      files.add(file);
      totalSize += file.getLen();
    }
  }
Computing the target number of bytes per split:
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
Final result:
Note: Hadoop's split logic allows 10% slack, the SPLIT_SLOP constant of 1.1: a new split is carved off only while the remaining bytes exceed 1.1 × the split size.
totalSize = 7
goalSize = 7 / 2 = 3 (bytes)
7 / 3 = 2 remainder 1; since the 1 leftover byte is about 33% of goalSize, which exceeds the 10% slack, it gets its own split, so we end up with 2 + 1 = 3 partitions.
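The arithmetic above can be sketched end to end. This is a simplified sketch, not the real implementation: the loop mirrors the part of getSplits not quoted above, which keeps carving off goalSize-byte splits while the remainder exceeds SPLIT_SLOP (1.1) times the split size. It ignores blockSize and minSize (which the real FileInputFormat also factors into the split size) and assumes goalSize >= 1.

```java
// Simplified sketch of Hadoop's split-count arithmetic for the 7-byte example.
// Ignores blockSize/minSize; assumes goalSize >= 1.
public class SplitCount {
    private static final double SPLIT_SLOP = 1.1; // 10% slack, as in FileInputFormat

    static int countSplits(long totalSize, int numSplits) {
        long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
        int splits = 0;
        long bytesRemaining = totalSize;
        // Carve off goalSize-byte splits while the leftover exceeds 1.1 x goalSize.
        while (((double) bytesRemaining) / goalSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= goalSize;
        }
        if (bytesRemaining != 0) {
            splits++; // tail split for the remaining bytes
        }
        return splits;
    }

    public static void main(String[] args) {
        // totalSize = 7, minPartitions = 2 -> goalSize = 3 -> 3 splits
        System.out.println(countSplits(7, 2));
    }
}
```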