I. RDD Creation
1. Creating from a collection
1) parallelize
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
Creates an RDD from a Seq collection.
Parameter 1: the Seq collection (required).
Parameter 2: the number of partitions; defaults to the number of CPU cores allocated to this application.
scala> var rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24
scala> rdd.collect
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> rdd.partitions.size
res4: Int = 40
//Create with 3 partitions
scala> var rdd = sc.parallelize(1 to 10,3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.partitions.size
res0: Int = 3
2) makeRDD
def makeRDD[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
This overload behaves exactly the same as parallelize.
def makeRDD[T](seq: Seq[(T, Seq[String])])(implicit arg0: ClassTag[T]): RDD[T]
This overload lets you specify the preferredLocations of each partition.
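No example is given for this overload; a minimal sketch, assuming a running spark-shell (`sc` available) and hypothetical host names:

```scala
// Each element is paired with its list of preferred hosts;
// this overload creates one partition per element.
// The host names below are hypothetical.
val rdd = sc.makeRDD(Seq(
  (1, Seq("host1", "host2")),
  (2, Seq("host3"))
))
rdd.partitions.size                        // 2: one partition per element
rdd.preferredLocations(rdd.partitions(0))  // the hosts recorded for partition 0
```

The scheduler uses these hints to try to run each partition's task on one of its preferred hosts.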
scala> var rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at makeRDD at <console>:21
scala> rdd.partitions.size
res4: Int = 15
//Create the RDD with 3 partitions
scala> var rdd2 = sc.makeRDD(1 to 10,3)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at makeRDD at <console>:21
scala> rdd2.collect
res5: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> rdd2.partitions.size
res6: Int = 3
2. Creating an RDD from external storage
1) textFile
//Create from an HDFS file
scala> var rdd = sc.textFile("hdfs:///tmp/1.txt")
rdd: org.apache.spark.rdd.RDD[String] = hdfs:///tmp/1.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> rdd.count
res1: Long = 4
//Create from a local file
scala> var rdd = sc.textFile("file:///etc/hadoop/conf/core-site.xml")
rdd: org.apache.spark.rdd.RDD[String] = file:///etc/hadoop/conf/core-site.xml MapPartitionsRDD[2] at textFile at <console>:24
scala> rdd.count
res1: Long = 145
Note that a local file path must exist on both the Driver and the Executors.
RDDs can also be created from other Hadoop file formats (hadoopFile, sequenceFile, objectFile, newAPIHadoopFile) or directly from the Hadoop APIs (hadoopRDD, newAPIHadoopRDD).
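No examples are given for these; a minimal sketch of two of them, assuming a running spark-shell and hypothetical HDFS paths:

```scala
// Save and reload an RDD as a Hadoop SequenceFile (key/value pairs)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.saveAsSequenceFile("hdfs:///tmp/seq-demo")
val seqRdd = sc.sequenceFile[String, Int]("hdfs:///tmp/seq-demo")

// Save and reload an RDD of serialized Java objects
sc.parallelize(1 to 5).saveAsObjectFile("hdfs:///tmp/obj-demo")
val objRdd = sc.objectFile[Int]("hdfs:///tmp/obj-demo")
```

sequenceFile and objectFile both need the element type supplied explicitly, since it cannot be inferred from the file.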
II. Basic RDD Transformations (1) – map, flatMap, distinct
1. map
map passes each element of an RDD through the given function to produce a new element; input and output partitions correspond one to one, i.e. the output RDD has exactly as many partitions as the input.
[root@hadoop211 python]# hadoop fs -cat /tmp/1.txt
hello world
hello spark
hello hive
Example:
//Read data from HDFS into an RDD
scala> var data = sc.textFile("/tmp/1.txt")
data: org.apache.spark.rdd.RDD[String] = /tmp/1.txt MapPartitionsRDD[4] at textFile at <console>:24
//Apply the map operator ("\s" matches whitespace such as spaces, tabs and newlines; "+" means one or more)
scala> var mapresult = data.map(line => line.split("\\s+"))
mapresult: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:26
//Collect the result of the map operator
scala> mapresult.collect
res2: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive))
2. flatMap
flatMap is a transformation: it first applies the function to every element, just like map, and then flattens the resulting collections into a single RDD of elements.
Example:
//Apply the flatMap operator
scala> var flatmapresult = data.flatMap(line => line.split("\\s+"))
flatmapresult: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at flatMap at <console>:26
//Collect the result of the flatMap operator
scala> flatmapresult.collect
res5: Array[String] = Array(hello, world, hello, spark, hello, hive)
When using flatMap, note that it treats a String as a sequence of characters. Example:
scala> data.map(_.toUpperCase).collect //Uppercase each element of data
res6: Array[String] = Array(HELLO WORLD, HELLO SPARK, HELLO HIVE)
scala> data.flatMap(_.toUpperCase).collect
res7: Array[Char] = Array(H, E, L, L, O, , W, O, R, L, D, H, E, L, L, O, , S, P, A, R, K, H, E, L, L, O, , H, I, V, E)
scala> data.map(x => x.split("\\s+")).collect
res8: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive))
scala> data.flatMap(x => x.split("\\s+")).collect
res9: Array[String] = Array(hello, world, hello, spark, hello, hive)
This time the result is as expected: the strings are not broken into characters. The reason is that the map function here returns Array[String], not String; flatMap flattens only one level, so it flattens a String into its characters but flattens an Array[String] only into its String elements.
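Another way to look at it: the function passed to flatMap may return any collection (TraversableOnce), and flatMap flattens exactly one level of it. A sketch, assuming the same data RDD as above:

```scala
// Returning zero-or-one elements: flatMap acts as a combined filter + map
data.flatMap(line => if (line.contains("spark")) Some(line) else None).collect
// Array(hello spark)

// Returning several elements per input: each Seq is flattened one level
sc.parallelize(Seq(1, 2, 3)).flatMap(x => Seq(x, x * 10)).collect
// Array(1, 10, 2, 20, 3, 30)
```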
3. distinct
Removes duplicate elements from the RDD.
Example:
scala> data.flatMap(x => x.split("\\s+")).collect
res9: Array[String] = Array(hello, world, hello, spark, hello, hive)
scala> data.flatMap(x => x.split("\\s+")).distinct.collect
res10: Array[String] = Array(hive, hello, world, spark)
III. Basic RDD Transformations (2) – coalesce, repartition (repartitioning)
1. coalesce
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]
This function repartitions the RDD. The first parameter is the target number of partitions; the second controls whether a shuffle is performed and defaults to false. When shuffle is true, the data is redistributed across the new partitions using a hash-based partitioner.
Example:
scala> var data = sc.textFile("/tmp/1.txt")
data: org.apache.spark.rdd.RDD[String] = /tmp/1.txt MapPartitionsRDD[17] at textFile at <console>:24
scala> data.collect
res11: Array[String] = Array(hello world, hello spark, hello hive, hi spark, hi hadoop)
scala> data.partitions.size
res12: Int = 2 //RDD data has 2 partitions by default
scala> var rdd1 = data.coalesce(1)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[18] at coalesce at <console>:26
scala> rdd1.partitions.size
res13: Int = 1 //rdd1 now has 1 partition
scala> var rdd1 = data.coalesce(4)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[19] at coalesce at <console>:26
scala> rdd1.partitions.size
res14: Int = 2 //To increase the partition count beyond the original, shuffle must be set to true; otherwise the partition count stays unchanged
scala> var rdd1 = data.coalesce(4,true)
rdd1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[23] at coalesce at <console>:26
scala> rdd1.partitions.size
res15: Int = 4
2. repartition
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
This function is simply coalesce with the shuffle parameter fixed to true.
Example:
scala> var rdd2 = data.repartition(1)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[27] at repartition at <console>:26
scala> rdd2.partitions.size
res16: Int = 1
scala> var rdd2 = data.repartition(3)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[31] at repartition at <console>:26
scala> rdd2.partitions.size
res17: Int = 3
IV. Basic RDD Transformations (3) – randomSplit, glom (splitting an RDD)
1. randomSplit
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
This function splits one RDD into multiple RDDs according to the weights array (an Array[Double]). The second parameter is the random seed and can usually be left at its default.
Example:
scala> var rdd = sc.makeRDD(1 to 10,10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:24
scala> rdd.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val splitRDD = rdd.randomSplit(Array(1.0,2.0,3.0,4.0))
splitRDD: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[1] at randomSplit at <console>:26, MapPartitionsRDD[2] at randomSplit at <console>:26, MapPartitionsRDD[3] at randomSplit at <console>:26, MapPartitionsRDD[4] at randomSplit at <console>:26)
//Note: randomSplit returns an array of RDDs
scala> splitRDD.size
res1: Int = 4
//Since the weights array has 4 entries, the RDD is split into 4 RDDs.
//Elements of the original rdd are randomly assigned to these 4 RDDs according to the weights 1.0, 2.0, 3.0, 4.0 (1, 2, 3, 4 would work as well); an RDD with a higher weight is more likely to receive each element.
//Note: the weights do not have to sum to 1; Spark normalizes them internally.
scala> splitRDD(0).collect
res2: Array[Int] = Array(1, 6)
scala> splitRDD(1).collect
res3: Array[Int] = Array(7)
scala> splitRDD(2).collect
res4: Array[Int] = Array(3, 5, 8, 9)
scala> splitRDD(3).collect
res5: Array[Int] = Array(2, 4, 10)
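randomSplit is commonly used for train/test splits; because the split is random, pass a fixed seed when reproducible results are needed. A sketch, assuming a spark-shell session:

```scala
val rdd = sc.makeRDD(1 to 100)
// With a fixed seed, every run produces the same split
val Array(train, test) = rdd.randomSplit(Array(0.8, 0.2), seed = 42L)
// train holds roughly 80% of the elements, test roughly 20%
```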
2. glom
def glom(): RDD[Array[T]]
This function gathers the elements of type T in each partition into a single Array[T], so each partition becomes exactly one array element.
Example:
scala> var rdd = sc.makeRDD(1 to 10,3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at makeRDD at <console>:24
scala> rdd.partitions.size
res12: Int = 3
scala> rdd.collect
res13: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> rdd.glom().collect
res14: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
//glom puts each partition's elements into one array, so the result is 3 arrays
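glom is handy for per-partition computation, e.g. first computing one value per partition and then reducing. A sketch, continuing with the 3-partition rdd above:

```scala
// One maximum per partition (skip empty partitions to avoid .max on an empty array)
val partMax = rdd.glom().filter(_.nonEmpty).map(_.max)
partMax.collect          // one maximum per partition, e.g. Array(3, 6, 10)
partMax.reduce(_ max _)  // global maximum: 10
```

Only one value per partition is transferred, instead of every element, which keeps the final reduction cheap.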
V. Basic RDD Transformations (4) – union, intersection, subtract
1. union
def union(other: RDD[T]): RDD[T]
This function is straightforward: it merges two RDDs without removing duplicates.
Example:
scala> var rdd1 = sc.makeRDD(1 to 2,1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at makeRDD at <console>:24
scala> rdd1.collect
res15: Array[Int] = Array(1, 2)
scala> var rdd2 = sc.makeRDD(2 to 3,1)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at makeRDD at <console>:24
scala> rdd2.collect
res16: Array[Int] = Array(2, 3)
scala> rdd1.union(rdd2).collect
res17: Array[Int] = Array(1, 2, 2, 3)
2. intersection
def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
This function returns the intersection of the two RDDs, with duplicates removed.
The numPartitions parameter specifies the number of partitions of the returned RDD.
The partitioner parameter specifies the partitioning function.
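A sketch of the two variants that control partitioning, assuming a spark-shell session:

```scala
import org.apache.spark.HashPartitioner

val a = sc.makeRDD(1 to 5)
val b = sc.makeRDD(3 to 8)

// Fix the number of partitions of the result
a.intersection(b, numPartitions = 2).partitions.size  // 2

// Or pass a partitioner explicitly
a.intersection(b, new HashPartitioner(4)).collect.sorted  // Array(3, 4, 5)
```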
Example:
scala> rdd1.union(rdd2).collect
res17: Array[Int] = Array(1, 2, 2, 3)
scala> rdd1.intersection(rdd2).collect
res18: Array[I