wholeTextFiles
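For reference, SparkContext declares the method roughly as follows (minPartitions defaults to defaultMinPartitions, i.e. math.min(defaultParallelism, 2)):
def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]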
Parameters:
path: String is a URI. It can point to HDFS, the local file system (the path must be accessible on every node), or any other Hadoop-supported file system. Each file is read as a single record, and the return value is a key-value pair RDD whose key is the file's path
and whose value is the file's full content; in other words the RDD's elements are (String, String) pairs.
minPartitions = math.min(defaultParallelism, 2) specifies the number of partitions. If you do not set it and more than 2 cores are available, the default is 2.
When the data is larger than 128 MB, Spark creates one split per block (a block is 128 MB from Hadoop 2.x onward); note, however, that wholeTextFiles always reads each file as a single record.
Notes:
(1) wholeTextFiles is efficient for large numbers of small files; for large files the gain is much smaller.
(2) On some file systems, passing a path with a wildcard is more efficient than listing every file name individually; for example,
- sc.wholeTextFiles("/root/application/temp/*",2)
is more efficient than
- sc.wholeTextFiles("/root/application/temp/people1.txt,/root/application/temp/people2.txt",2)
1. Reading a single file from HDFS
- val path = "hdfs://master:9000/examples/examples/src/main/resources/people.txt"
- val rdd1 = sc.wholeTextFiles(path,3)
- rdd1.foreach(println)
- (hdfs://master:9000/examples/examples/src/main/resources/people.txt,Michael, 29
- Andy, 30
- Justin, 19
- )
The key is hdfs://master:9000/examples/examples/src/main/resources/people.txt
and the value is:
Michael, 29
Andy, 30
Justin, 19
2. Reading multiple files from the local file system
- val path = "/usr/local/spark/spark-1.6.0-bin-hadoop2.6/data/*/*.txt" //local file
- val rdd1 = sc.wholeTextFiles(path,2)
Reading multiple local files this way differs from textFile in how the result is partitioned.
An RDD produced by wholeTextFiles for these files has only 2 partitions: conceptually, all of the file data is gathered together first and then divided according to the requested number of partitions, two in this case.
Even though there are only two partitions, the number of key-value pairs inside the RDD still equals the number of files; the sketch below makes this concrete.
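A minimal sketch of the difference between textFile and wholeTextFiles on the same directory (the exact counts naturally depend on which files the pattern matches):
val dir = "/usr/local/spark/spark-1.6.0-bin-hadoop2.6/data/*/*.txt"
val lines = sc.textFile(dir)           // one element per line; typically at least one partition per file
val files = sc.wholeTextFiles(dir, 2)  // one element per file, packed into (roughly) the 2 requested partitions
println(lines.partitions.size)         // usually >= the number of matched files
println(files.partitions.size)         // 2, as described above
println(files.count())                 // equals the number of matched files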
map(function)
map applies the given function to every element of the RDD to produce a new RDD. Every element of the original RDD has exactly one corresponding element in the new RDD.
Example:
val a = sc.parallelize(1 to 9, 3)
val b = a.map(x => x*2) // x => x*2 is a function: x is the parameter (each element of the RDD) and x*2 is the return value
a.collect
// result: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
b.collect
// result: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)
map can also, of course, turn each element into a key-value pair:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", " eagle"), 2)
val b = a.map(x => (x, 1))
b.collect.foreach(println(_))
/*
(dog,1)
(tiger,1)
(lion,1)
(cat,1)
(panther,1)
( eagle,1)
*/
mapPartitions(function)
Whereas map()'s input function is applied to each element of the RDD, mapPartitions()'s input function is applied once to each partition (it receives an Iterator over that partition's elements).
package test

import scala.Iterator
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object TestRdd {
  // Sum all elements of one partition
  def sumOfEveryPartition(input: Iterator[Int]): Int = {
    var total = 0
    input.foreach { elem =>
      total += elem
    }
    total
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Rdd Test")
    val spark = new SparkContext(conf)
    val input = spark.parallelize(List(1, 2, 3, 4, 5, 6), 2) // the RDD has 6 elements, split into 2 partitions
    val result = input.mapPartitions(
      // partition is the Iterator over one partition's elements; the function must
      // also return an Iterator, hence Iterator(sumOfEveryPartition(partition))
      partition => Iterator(sumOfEveryPartition(partition)))
    result.collect().foreach {
      println(_) // prints 6 and 15
    }
    spark.stop()
  }
}
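A common reason to prefer mapPartitions over map is to pay a per-partition setup cost only once (opening a database connection, compiling a parser, and so on) instead of once per element. A minimal sketch of that pattern; makeConnection, lookup and close are hypothetical stand-ins for a real client API:
val ids = sc.parallelize(1 to 100, 4)
val enriched = ids.mapPartitions { iter =>
  val conn = makeConnection()                             // hypothetical: created once per partition
  val out = iter.map(id => (id, conn.lookup(id))).toList  // materialize before closing, because iter.map is lazy
  conn.close()
  out.iterator
}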
mapValues(function)
The keys of the original RDD stay the same and are paired with the new values to form the elements of the new RDD. This function therefore only applies to RDDs whose elements are key-value pairs.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", " eagle"), 2)
val b = a.map(x => (x.length, x))
b.mapValues("x" + _ + "x").collect
//"x" + _ + "x"
等同于everyInput =>"x" + everyInput + "x"
//结果
Array(
(3,xdogx),
(5,xtigerx),
(4,xlionx),
(3,xcatx),
(7,xpantherx),
(6,x eaglex)
)
mapWith and flatMapWith
These seem to be rarely used; see http://blog.csdn.net/jewes/article/details/39896301
flatMap(function)
Similar to map; the difference is that each element of the original RDD produces exactly one element under map, whereas under flatMap it may produce multiple (zero or more) elements.
val a = sc.parallelize(1 to 4, 2)
val b = a.flatMap(x => 1 to x) // each element x expands to the sequence 1 to x
b.collect
/*
result: Array[Int] = Array( 1,
1, 2,
1, 2, 3,
1, 2, 3, 4)
*/
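Another very common use of flatMap is splitting lines of text into words, since each line can yield any number of words. A minimal sketch on an in-memory collection:
val lines = sc.parallelize(List("hello spark", "hello world"))
val words = lines.flatMap(line => line.split(" "))  // one line -> many words
words.collect
// Array(hello, spark, hello, world)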
flatMapValues(function)
Like flatMap, but the function is applied only to the value of each key-value pair; the key is repeated for every element that the function produces.
val a = sc.parallelize(List((1,2),(3,4),(5,6)))
val b = a.flatMapValues(x=>1 to x)
b.collect.foreach(println(_))
/*
(1,1)
(1,2)
(3,1)
(3,2)
(3,3)
(3,4)
(5,1)
(5,2)
(5,3)
(5,4)
(5,5)
(5,6)
*/
groupByKey
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
This function collects, for each key K in an RDD[K,V], all of the corresponding V values into a single collection Iterable[V].
The parameter numPartitions specifies the number of partitions;
the parameter partitioner specifies the partitioning function.
- scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
- rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[89] at makeRDD at <console>:21
- scala> rdd1.groupByKey().collect
- res81: Array[(String, Iterable[Int])] = Array((A,CompactBuffer(0, 2)), (B,CompactBuffer(2, 1)), (C,CompactBuffer(1)))
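The Iterable[Int] attached to each key can be processed further, for example summed (a sketch reusing rdd1 from above; output ordering may vary). For a plain sum, reduceByKey below is normally preferred because it combines values on the map side before the shuffle:
rdd1.groupByKey().mapValues(_.sum).collect
// Array((A,2), (B,3), (C,1))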
reduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
This function combines, for each key K in an RDD[K,V], the corresponding V values using the given function.
The parameter numPartitions specifies the number of partitions;
the parameter partitioner specifies the partitioning function.
- scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
- rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at <console>:21
- scala> rdd1.partitions.size
- res82: Int = 15
- scala> var rdd2 = rdd1.reduceByKey((x,y) => x + y)
- rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[94] at reduceByKey at <console>:23
- scala> rdd2.collect
- res85: Array[(String, Int)] = Array((A,2), (B,3), (C,1))
- scala> rdd2.partitions.size
- res86: Int = 15
- scala> var rdd2 = rdd1.reduceByKey(new org.apache.spark.HashPartitioner(2),(x,y) => x + y)
- rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[95] at reduceByKey at <console>:23
- scala> rdd2.collect
- res87: Array[(String, Int)] = Array((B,3), (A,2), (C,1))
- scala> rdd2.partitions.size
- res88: Int = 2
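reduceByKey is the heart of the classic word count, which also ties together flatMap and map. A minimal sketch on an in-memory collection (ordering of the output may vary):
val text = sc.parallelize(List("dog cat", "dog dog"))
val counts = text.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect
// Array((dog,3), (cat,1))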
reduceByKeyLocally
def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
This function combines, for each key K in an RDD[K,V], the corresponding V values using the given function, but the results are returned as an ordinary Map[K,V] on the driver rather than as an RDD[K,V].
- scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
- rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at <console>:21
- scala> rdd1.reduceByKeyLocally((x,y) => x + y)
- res90: scala.collection.Map[String,Int] = Map(B -> 3, A -> 2, C -> 1)
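Because the result is a plain Scala Map on the driver, reduceByKeyLocally behaves like an action rather than a transformation. A roughly equivalent formulation (a sketch reusing rdd1 from above) is to reduce distributedly and then collect the pairs:
val m: scala.collection.Map[String, Int] = rdd1.reduceByKey(_ + _).collectAsMap()
// Map(B -> 3, A -> 2, C -> 1)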