Chapter 1. RDD Overview
1. What is an RDD
RDD (Resilient Distributed Dataset) means "resilient distributed dataset" and is the most fundamental data abstraction in Spark.
In code it is an abstract class; it represents a resilient, immutable, partitioned collection whose elements can be computed in parallel.
1) RDD compared to a factory production line
2) The WordCount workflow
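The workflow can be illustrated with a minimal WordCount sketch (assuming an existing SparkContext named sc, as created in Chapter 2, and a small input file such as datas/wc.txt; the path and variable names are illustrative only):
// Minimal WordCount sketch; file path and names are placeholders
val lines  = sc.textFile("datas/wc.txt")        // read the file as an RDD of lines
val words  = lines.flatMap(_.split(" "))        // split each line into words
val pairs  = words.map(word => (word, 1))       // pair every word with a count of 1
val counts = pairs.reduceByKey(_ + _)           // sum the counts for each word
println(counts.collect().toList)                // collect the result to the driver and print it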
2. The five main properties of an RDD
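The five properties are documented in the header of RDD.scala: a list of partitions, a function for computing each partition, a list of dependencies on other RDDs, an optional Partitioner for key-value RDDs, and an optional list of preferred locations for computing each partition. Abridged member signatures from the RDD abstract class, shown here only as a reading aid:
// Abridged from org.apache.spark.rdd.RDD (signatures only)
def compute(split: Partition, context: TaskContext): Iterator[T]     // a function for computing each partition
protected def getPartitions: Array[Partition]                        // a list of partitions
protected def getDependencies: Seq[Dependency[_]]                    // dependencies on other RDDs
val partitioner: Option[Partitioner]                                  // optional Partitioner for key-value RDDs
protected def getPreferredLocations(split: Partition): Seq[String]   // optional preferred locations per partition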
Chapter 2. Creating RDDs
1. Creating an RDD from a local collection
- Add the junit dependency to the pom file
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>
- Write the code
package com.atguigu.spark.day02

import org.apache.spark.{SparkConf, SparkContext}
import org.junit.Test

class $01_RDDCreate {

  val conf = new SparkConf().setMaster("local[4]").setAppName("test")
  val sc = new SparkContext(conf)

  /**
   * Ways to create an RDD
   *   1. From a local collection
   *   2. By reading a file
   *   3. By deriving from another RDD
   */

  /**
   * Creating an RDD from a local collection
   *   makeRDD
   *   parallelize
   * makeRDD just delegates to parallelize under the hood
   */
  @Test
  def createRDDByLocalCollection(): Unit = {
    val list = List(1, 4, 3, 2, 8, 10)
    val rdd = sc.makeRDD(list)
    val rdd2 = sc.parallelize(list)
    println(rdd.collect().toList)
    println(rdd2.collect().toList)
  }
}
2. Creating an RDD by reading a file
/**
 * How Spark reads files
 * 1. If HADOOP_CONF_DIR is configured for Spark [it usually is in production]
 *    HDFS files are read by default.
 *    Reading HDFS files:
 *      1. sc.textFile("/.../...")
 *      2. sc.textFile("hdfs:///.../...")
 *      3. sc.textFile("hdfs://<namenode host>:<port>/.../...")
 *    Reading local files: sc.textFile("file:///.../...")
 * 2. If HADOOP_CONF_DIR is not configured for Spark
 *    local files are read by default.
 *    Reading HDFS files: sc.textFile("hdfs://<namenode host>:<port>/.../...")
 *    Reading local files:
 *      1. sc.textFile("file:///.../...")
 *      2. sc.textFile("/.../...")
 */
@Test
def createRDDByFile(): Unit = {
  val rdd = sc.textFile("hdfs://hadoop102:9820/input/hello.txt")
  println(rdd.collect().toList)
}
3. Deriving from another RDD
/**
 * Deriving a new RDD from an existing RDD
 */
@Test
def createRDDByRDD(): Unit = {
  val rdd = sc.textFile("hdfs://hadoop102:9820/input/hello.txt")
  val rdd2 = rdd.flatMap(_.split(" "))
  println(rdd2.collect().toList)
}
Chapter 3. Creation of RDD Partitions
1. Number of partitions when creating an RDD from a local collection
package com.atguigu.spark.day02

import org.apache.spark.{SparkConf, SparkContext}
import org.junit.Test

class $02_RDDPartitions {

  /**
   * Number of partitions when creating an RDD from a local collection: sc.parallelize(data, numSlices)
   * 1. If numSlices is not given, number of partitions = defaultParallelism
   *    The value of defaultParallelism:
   *    1. If spark.default.parallelism is set, defaultParallelism = the value of spark.default.parallelism
   *    2. If spark.default.parallelism is not set:
   *       1. master = local       -> defaultParallelism = 1
   *       2. master = local[N]    -> defaultParallelism = N
   *       3. master = local[*]    -> defaultParallelism = number of CPU cores
   *       4. master = spark://... -> defaultParallelism = math.max(total CPU cores of the application, 2)
   * 2. If numSlices is given, number of partitions = numSlices
   */
  val sc = new SparkContext(new SparkConf().setAppName("test").set("spark.default.parallelism", "10").setMaster("local[4]"))

  @Test
  def createRDDByCollectionPartitions(): Unit = {
    val rdd = sc.parallelize(List(1, 4, 3, 7, 2, 9, 10, 33, 22), 6)
    println(rdd.getNumPartitions)
    // Print the data that lands in each partition (materialize the iterator before printing,
    // then hand back a fresh iterator so the downstream RDD still has the data)
    val func = (index: Int, it: Iterator[Int]) => {
      val data = it.toList
      println(s"index=${index} data=${data}")
      data.iterator
    }
    rdd.mapPartitionsWithIndex(func).collect()
  }
}
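A quick sanity check of the rules above, written as a hypothetical extra test method inside the same $02_RDDPartitions class (the method name and the expected values are illustrative; with spark.default.parallelism set to 10 as above, parallelize without numSlices should use 10 partitions):
// Hypothetical check of the defaultParallelism rules (illustrative only)
@Test
def checkDefaultParallelism(): Unit = {
  println(sc.defaultParallelism)                               // 10, taken from spark.default.parallelism
  println(sc.parallelize(List(1, 2, 3)).getNumPartitions)      // numSlices omitted -> defaultParallelism = 10
  println(sc.parallelize(List(1, 2, 3), 6).getNumPartitions)   // numSlices given -> 6
}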
2. Number of partitions when creating an RDD by reading a file
/**
 * Number of partitions when creating an RDD by reading a file
 * 1. If minPartitions is specified, number of partitions >= the specified minPartitions value
 * 2. If minPartitions is not specified, number of partitions >= math.min(defaultParallelism, 2)
 * (The result can be larger than minPartitions because the actual number of partitions is
 *  decided by the FileInputFormat slicing logic walked through in section 5.)
 */
@Test
def createRDDByFile(): Unit = {
  val rdd = sc.textFile("datas/wc.txt", 4)
  println(rdd.getNumPartitions)
}
3. Number of partitions of an RDD derived from another RDD
/**
 * The number of partitions of an RDD derived from another RDD = the number of partitions of the parent RDD
 */
@Test
def createRDDByRDD(): Unit = {
  val rdd = sc.textFile("datas/wc.txt", 4)
  println(rdd.getNumPartitions)
  val rdd2 = rdd.flatMap(_.split(" "))
  println(rdd2.getNumPartitions)
}
4. How data is assigned to partitions when creating an RDD from a collection
- Source code walkthrough
new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
// seq = List(1,4,3,7,2,9,10,33,22)  numSlices = 6

override def getPartitions: Array[Partition] = {
  // data = List(1,4,3,7,2,9,10,33,22)  numSlices = 6
  val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
  // Array( Seq(1), Seq(4,3), Seq(7), Seq(2,9), Seq(10), Seq(33,22) )
  slices.indices
    // [0,1,2,3,4,5]
    .map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
}

def slice[T: ClassTag](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  // seq = List(1,4,3,7,2,9,10,33,22)  numSlices = 6
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
    // length = 9  numSlices = 6
    (0 until numSlices)
      // [0,1,2,3,4,5]
      .iterator
      .map { i =>
        // 1st iteration: i = 0  start = 0  end = 1  result: (0,1)
        // 2nd iteration: i = 1  start = 1  end = 3  result: (1,3)
        // 3rd iteration: i = 2  start = 3  end = 4  result: (3,4)
        // 4th iteration: i = 3  start = 4  end = 6  result: (4,6)
        // 5th iteration: i = 4  start = 6  end = 7  result: (6,7)
        // 6th iteration: i = 5  start = 7  end = 9  result: (7,9)
        val start = ((i * length) / numSlices).toInt
        val end = (((i + 1) * length) / numSlices).toInt
        (start, end)
      }
  }
  seq match {
    case _ =>
      val array = seq.toArray
      // array = Array(1,4,3,7,2,9,10,33,22)
      positions(array.length, numSlices)
        // Iterator[ (0,1), (1,3), (3,4), (4,6), (6,7), (7,9) ]
        .map {
          case (start, end) =>
            // 1st iteration: start=0 end=1  result: Seq(1)
            // 2nd iteration: start=1 end=3  result: Seq(4,3)
            // 3rd iteration: start=3 end=4  result: Seq(7)
            // 4th iteration: start=4 end=6  result: Seq(2,9)
            // 5th iteration: start=6 end=7  result: Seq(10)
            // 6th iteration: start=7 end=9  result: Seq(33,22)
            array.slice(start, end).toSeq
        }.toSeq
      // Seq( Seq(1), Seq(4,3), Seq(7), Seq(2,9), Seq(10), Seq(33,22) )
  }
}
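The start/end arithmetic can also be tried outside of Spark; a minimal standalone sketch (plain Scala, names are illustrative) that reproduces the six slices above:
// Standalone re-implementation of the slicing arithmetic above (illustration only)
object SliceDemo {
  def main(args: Array[String]): Unit = {
    val data = List(1, 4, 3, 7, 2, 9, 10, 33, 22)
    val numSlices = 6
    val length = data.length.toLong
    val slices = (0 until numSlices).map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      data.slice(start, end)
    }
    slices.zipWithIndex.foreach { case (s, i) => println(s"partition $i -> $s") }
    // partition 0 -> List(1), partition 1 -> List(4, 3), partition 2 -> List(7),
    // partition 3 -> List(2, 9), partition 4 -> List(10), partition 5 -> List(33, 22)
  }
}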
- Verification in code
@Test
def createRDDByCollectionPartitions(): Unit = {
  val rdd = sc.parallelize(List(1, 4, 3, 7, 2, 9, 10, 33, 22), 6)
  println(rdd.getNumPartitions)
  /*
  val func = (index: Int, it: Iterator[Int]) => {
    println(s"index=${index} data=${it.toList}")
    it
  }
  rdd.mapPartitionsWithIndex(func).collect()
  */
  rdd.mapPartitionsWithIndex((index, it) => {
    // Materialize the iterator before printing, then return a fresh iterator,
    // because it.toList would otherwise exhaust the iterator handed downstream
    val data = it.toList
    println(s"index=${index} data=${data}")
    data.iterator
  }).collect()
}
5. Split computation source code when creating an RDD by reading a file
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minPartitions)

new HadoopRDD(this, confBroadcast, Some(setInputPathsFunc), inputFormatClass, keyClass, valueClass, minPartitions)

public InputSplit[] getSplits(JobConf job, int numSplits)
  // numSplits = 4
  throws IOException {
  StopWatch sw = new StopWatch().start();
  // Get all files to be read
  FileStatus[] files = listStatus(job);

  job.setLong(NUM_INPUT_FILES, files.length);
  // Compute the total size of all files to be read
  long totalSize = 0;                               // compute total size
  for (FileStatus file: files) {                    // check we have valid files
    if (file.isDirectory()) {
      throw new IOException("Not a file: "+ file.getPath());
    }
    totalSize += file.getLen();
  }
  // totalSize = 75B
  // goalSize = 75B / 4 = 18B (integer division)
  long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
  // minSize = 1
  long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
    FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

  // Create the container that holds the splits
  ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
  NetworkTopology clusterMap = new NetworkTopology();
  // Iterate over the files to be read
  for (FileStatus file: files) {
    Path path = file.getPath();
    // Get the size of the current file
    // length = 75B
    long length = file.getLen();
    if (length != 0) {
      FileSystem fs = path.getFileSystem(job);
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(fs, path)) {
        // Get the block size of the file: blockSize = 32M
        long blockSize = file.getBlockSize();
        // goalSize = 18B  minSize = 1B  blockSize = 32M
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        // splitSize = Math.max(minSize, Math.min(goalSize, blockSize)) = 18B

        // Split the current file in a loop
        long bytesRemaining = length;
        // bytesRemaining = 75B
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          // 1st iteration: 75/18 = 4.16 > 1.1, cut 18B starting at offset 0B,  bytesRemaining = 57B
          // 2nd iteration: 57/18 = 3.16 > 1.1, cut 18B starting at offset 18B, bytesRemaining = 39B
          // 3rd iteration: 39/18 = 2.16 > 1.1, cut 18B starting at offset 36B, bytesRemaining = 21B
          // 4th iteration: 21/18 = 1.16 > 1.1, cut 18B starting at offset 54B, bytesRemaining = 3B
          String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
            length-bytesRemaining, splitSize, clusterMap);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
            splitHosts[0], splitHosts[1]));
          bytesRemaining -= splitSize;
        }

        // The last split (the remaining 3B)
        if (bytesRemaining != 0) {
          String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, length
            - bytesRemaining, bytesRemaining, clusterMap);
          splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
            splitHosts[0], splitHosts[1]));
        }
      } else {
        String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, 0, length, clusterMap);
        splits.add(makeSplit(path, 0, length, splitHosts[0], splitHosts[1]));
      }
    } else {
      // Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
      + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
  }
  return splits.toArray(new FileSplit[splits.size()]);
}
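The split arithmetic for the 75-byte example can be reproduced as a short standalone sketch (Scala, illustration only; the real getSplits additionally resolves block locations per split):
// Standalone illustration of the split-size arithmetic for a 75B file with numSplits = 4
object SplitSizeDemo {
  val SPLIT_SLOP = 1.1  // same slack factor as FileInputFormat

  def main(args: Array[String]): Unit = {
    val totalSize = 75L
    val numSplits = 4
    val goalSize  = totalSize / numSplits                              // 18
    val minSize   = 1L
    val blockSize = 32L * 1024 * 1024                                  // 32M, as in the walkthrough
    val splitSize = math.max(minSize, math.min(goalSize, blockSize))   // 18

    var bytesRemaining = totalSize
    while (bytesRemaining.toDouble / splitSize > SPLIT_SLOP) {
      println(s"split: offset=${totalSize - bytesRemaining} length=$splitSize")
      bytesRemaining -= splitSize
    }
    if (bytesRemaining != 0)
      println(s"split: offset=${totalSize - bytesRemaining} length=$bytesRemaining")
    // Prints four 18B splits at offsets 0, 18, 36, 54 and one final 3B split at offset 72,
    // which is why textFile with minPartitions = 4 ends up with 5 partitions here.
  }
}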