CountOnce

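Problem statement: given an array in which exactly one value appears once and every other value appears twice, find the value that appears once.
1. Example
The input consists of 3 files:
1.txt contains:
1,2,1,3,3
2.txt contains:
4,5,4,6,5
3.txt contains:
6,7,8,8,7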

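2. Design approach
   XOR all the IDs in the list together; the resulting value is the ID we are looking for. This works because XOR is commutative and associative, with x ^ x = 0 and x ^ 0 = x, so every value that appears twice cancels out and only the single value remains. First XOR the data within each partition, then XOR the per-partition results together. For the sample input the three files XOR to 2, 6 and 6, and 2 ^ 6 ^ 6 = 2.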
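
The two-stage idea can be checked without Spark. The following is a minimal sketch in plain Scala, assuming the three sample files are held in memory as sequences; the names `XorSketch` and `xorAll` are made up for illustration and are not part of the original program.

```scala
object XorSketch {
  // XOR every value together, starting from 0 (the identity element of XOR).
  def xorAll(xs: Seq[Int]): Int = xs.foldLeft(0)(_ ^ _)

  def main(args: Array[String]): Unit = {
    // Stand-ins for 1.txt, 2.txt and 3.txt from the example.
    val files = Seq(
      Seq(1, 2, 1, 3, 3),
      Seq(4, 5, 4, 6, 5),
      Seq(6, 7, 8, 8, 7)
    )
    // Stage 1: XOR within each file (the role a partition plays in Spark).
    val partial = files.map(xorAll) // 2, 6, 6
    // Stage 2: XOR the partial results together.
    val answer = xorAll(partial)
    println(answer) // prints 2
  }
}
```

Because XOR is associative and commutative, the grouping into files does not affect the final value, which is what makes the per-partition split safe.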

3. Code example

package com.fly.spark

import org.apache.spark.{SparkConf, SparkContext}
// Older Spark releases need this import to bring reduceByKey into scope for pair RDDs.
import org.apache.spark.SparkContext._

object MapPartitionDemo {
  // XOR all comma-separated integers on one line; a blank line contributes 0.
  def lineXor(line: String): Int =
    line.trim.split(",").map(_.trim).filter(_.nonEmpty).map(_.toInt).foldLeft(0)(_ ^ _)

  // XOR every line in a partition and emit a single (key, value) pair.
  // Starting from 0, the identity of XOR, keeps an empty partition from throwing.
  def myfunc(iter: Iterator[String]): Iterator[(Int, Int)] = {
    var temp = 0
    while (iter.hasNext) {
      temp ^= lineXor(iter.next())
    }
    Iterator((1, temp))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapPartitionDemo").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val data = sc.textFile("hdfs://master:9000/xor")
    // All per-partition results share key 1, so reduceByKey XORs them into one value.
    val result = data.mapPartitions(myfunc).reduceByKey(_ ^ _)
    val lastResult = result.collect()
    println(lastResult(0))
    sc.stop()
  }
}

4. Output
(1,2)

5. Notes on the program
map could also be used here, but map and mapPartitions differ in some respects. One explanation found online reads:
As a note, a presentation provided by a speaker at the 2013 San Francisco Spark Summit (goo.gl/JZXDCR) highlights that tasks with high per-record overhead perform better with a mapPartition than with a map transformation. This is, according to the presentation, due to the high cost of setting up a new task.
That said, not sure if there is a difference in parallel execution and memory usage between map and mapPartitions. For instance, map could work in parallel implicitly, mapPartitions forces you to iterate. Thus computation could be faster with map but if your execution on a single tuple uses a lot of temporary memory, mapPartitions could avoid GC and memory issues. No idea if this is the way it actually works, but my anecdotal evidence seems to imply this. Would love to have confirmation.

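The two also differ clearly in how they are used: the function passed to map is applied to the elements of every partition of the RDD one at a time, whereas the function passed to mapPartitions is applied once to each partition as a whole.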
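
To make that usage difference concrete, here is a rough model in plain Scala rather than Spark. It assumes a nested `Seq[Seq[String]]` as a stand-in for an RDD, with each inner sequence playing the role of one partition; `MapVsMapPartitionsSketch`, `mapStyle`, `partitionXor` and `mapPartitionsStyle` are hypothetical names used only for this illustration.

```scala
object MapVsMapPartitionsSketch {
  // XOR of the comma-separated integers on one line (same idea as lineXor above).
  def lineXor(line: String): Int =
    line.trim.split(",").map(_.trim.toInt).foldLeft(0)(_ ^ _)

  // Hypothetical stand-in for an RDD: each inner Seq acts as one partition.
  val partitions: Seq[Seq[String]] =
    Seq(Seq("1,2,1,3,3"), Seq("4,5,4,6,5"), Seq("6,7,8,8,7"))

  // map style: the user function (lineXor) is called once per element.
  def mapStyle(parts: Seq[Seq[String]]): Int =
    parts.flatMap(_.map(lineXor)).foldLeft(0)(_ ^ _)

  // mapPartitions style: the user function is called once per partition
  // and receives an Iterator over all elements of that partition.
  def partitionXor(iter: Iterator[String]): Iterator[Int] =
    Iterator(iter.map(lineXor).foldLeft(0)(_ ^ _))

  def mapPartitionsStyle(parts: Seq[Seq[String]]): Int =
    parts.flatMap(p => partitionXor(p.iterator)).foldLeft(0)(_ ^ _)

  def main(args: Array[String]): Unit = {
    println(mapStyle(partitions))           // prints 2
    println(mapPartitionsStyle(partitions)) // prints 2
  }
}
```

Both styles give the same answer for this problem. In Spark itself the map-based variant would presumably look like `data.map(lineXor).fold(0)(_ ^ _)`, which is offered here as an untested suggestion rather than the author's code.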