Problem statement: given an array in which exactly one value appears once and every other value appears twice, find the value that appears only once.
1. Example description
The input consists of 3 files:
1.txt contains:
1,2,1,3,3
2.txt contains:
4,5,4,6,5
3.txt contains:
6,7,8,8,7
XOR all of the IDs in the list together; the result is the ID we are looking for. Concretely, the data in each partition is XORed first, and the per-partition results are then XORed together.
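This works because XOR is commutative and associative, with a ^ a = 0 and a ^ 0 = a, so the grouping of the operations does not matter: XORing per partition and then combining gives the same answer as XORing the whole list at once. A minimal sketch using the values from the three files above:

```scala
object XorDemo {
  def main(args: Array[String]): Unit = {
    // All values from 1.txt, 2.txt and 3.txt flattened into one list
    val all = List(1, 2, 1, 3, 3, 4, 5, 4, 6, 5, 6, 7, 8, 8, 7)

    // XOR of everything at once
    val direct = all.reduce(_ ^ _)

    // XOR within each "partition" (file) first, then XOR the partial results
    val partial = List(List(1, 2, 1, 3, 3), List(4, 5, 4, 6, 5), List(6, 7, 8, 8, 7))
      .map(_.reduce(_ ^ _))
      .reduce(_ ^ _)

    println(direct)  // prints 2
    println(partial) // prints 2
  }
}
```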
package com.fly.spark

import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionDemo {

  // XOR all comma-separated integers on a single line
  def lineXor(line: String): Int = {
    val array = line.trim.split(",")
    var temp = array(0).toInt
    for (i <- 1 until array.length) {
      temp ^= array(i).toInt
    }
    temp
  }

  // XOR all lines of one partition into a single (key, partialXor) pair;
  // an empty partition contributes nothing
  def myfunc(iter: Iterator[String]): Iterator[(Int, Int)] = {
    if (iter.isEmpty) {
      Iterator.empty
    } else {
      var temp = lineXor(iter.next())
      while (iter.hasNext) {
        temp ^= lineXor(iter.next())
      }
      Iterator((1, temp))
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapPartitionDemo").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val data = sc.textFile("hdfs://master:9000/xor")
    // XOR within each partition, then XOR the per-partition results together
    val result = data.mapPartitions(myfunc).reduceByKey(_ ^ _)
    val lastResult = result.collect()
    println(lastResult(0))
    sc.stop()
  }
}
Output: (1,2), where the second element, 2, is the ID that appears only once.
We could also use map here, but map and mapPartitions differ in meaningful ways. One explanation found online:
As a note, a presentation provided by a speaker at the 2013 San Francisco Spark Summit (goo.gl/JZXDCR) highlights that tasks with high per-record overhead perform better with a mapPartition than with a map transformation. This is, according to the presentation, due to the high cost of setting up a new task.
That said, not sure if there is a difference in parallel execution and memory usage between map and mapPartitions. For instance, map could work in parallel implicitly, mapPartitions forces you to iterate. Thus computation could be faster with map but if your execution on a single tuple uses a lot of temporary memory, mapPartitions could avoid GC and memory issues. No idea if this is the way it actually works, but my anecdotal evidence seems to imply this. Would love to have confirmation.
In any case, the two clearly differ in how they are invoked: map applies the function to the elements of every partition of the RDD one element at a time, whereas mapPartitions applies the function once to each whole partition of the RDD.
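The per-record-overhead point from the quote above can be made concrete without a cluster. This is only a plain-Scala simulation of the cost model, not Spark itself; the partition layout and the setup() counter are made up for illustration. If each invocation pays some setup cost (say, opening a database connection), a map-style function pays it once per element, while a mapPartitions-style function pays it once per partition:

```scala
object MapVsMapPartitionsSim {
  def main(args: Array[String]): Unit = {
    // Two "partitions", analogous to sc.parallelize(1 to 6, 2) in Spark
    val partitions = List(List(1, 2, 3), List(4, 5, 6))

    var setupCalls = 0
    def setup(): Unit = setupCalls += 1 // stands in for expensive per-call setup

    // map-style: the function (and its setup) runs once per element
    partitions.flatten.foreach { _ => setup() }
    println(s"map-style setups: $setupCalls") // prints 6

    // mapPartitions-style: the function runs once per partition iterator
    setupCalls = 0
    partitions.foreach { part => setup(); part.foreach(identity) }
    println(s"mapPartitions-style setups: $setupCalls") // prints 2
  }
}
```

This is exactly why myfunc above XORs an entire partition inside one call: the per-task work (building the result pair, the shuffle record) is paid once per partition instead of once per line.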