Problem statement: given an array in which exactly one value appears once and every other value appears twice, find the value that appears only once.
1. Example description
The input consists of 3 files:
1.txt contains:
1,2,1,3,3
2.txt contains:
4,5,4,6,5
3.txt contains:
6,7,8,8,7
XOR all of the IDs in the list together; the result is the ID we are looking for. Concretely, the data in each partition is XORed first, and the per-partition results are then XORed together.
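This works because XOR is commutative and associative, with a ^ a = 0 and a ^ 0 = a, so the grouping of the operations does not matter: XORing per partition and then combining gives the same answer as XORing the whole list at once. A minimal sketch using the values from the three files above:

```scala
object XorDemo {
  def main(args: Array[String]): Unit = {
    // All values from 1.txt, 2.txt and 3.txt flattened into one list
    val all = List(1, 2, 1, 3, 3, 4, 5, 4, 6, 5, 6, 7, 8, 8, 7)

    // XOR of everything at once
    val direct = all.reduce(_ ^ _)

    // XOR within each "partition" (file) first, then XOR the partial results
    val partial = List(List(1, 2, 1, 3, 3), List(4, 5, 4, 6, 5), List(6, 7, 8, 8, 7))
      .map(_.reduce(_ ^ _))
      .reduce(_ ^ _)

    println(direct)  // prints 2
    println(partial) // prints 2
  }
}
```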
package com.fly.spark

import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionDemo {

  // XOR all comma-separated integers on a single line
  def lineXor(line: String): Int = {
    val array = line.trim.split(",")
    var temp = array(0).toInt
    for (i <- 1 until array.length) {
      temp ^= array(i).toInt
    }
    temp
  }

  // XOR all lines of one partition into a single (key, partialXor) pair;
  // an empty partition contributes nothing
  def myfunc(iter: Iterator[String]): Iterator[(Int, Int)] = {
    if (iter.isEmpty) {
      Iterator.empty
    } else {
      var temp = lineXor(iter.next())
      while (iter.hasNext) {
        temp ^= lineXor(iter.next())
      }
      Iterator((1, temp))
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapPartitionDemo").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val data = sc.textFile("hdfs://master:9000/xor")
    // XOR within each partition, then XOR the per-partition results together
    val result = data.mapPartitions(myfunc).reduceByKey(_ ^ _)
    val lastResult = result.collect()
    println(lastResult(0))
    sc.stop()
  }
}
Output: (1,2), where the second element, 2, is the ID that appears only once.
We could also use map here, but map and mapPartitions differ in meaningful ways. One explanation found online:
As a note, a presentation provided by a speaker at the 2013 San Francisco Spark Summit (goo.gl/JZXDCR) highlights that tasks with high per-record overhead perform better with a mapPartition than with a map transformation. This is, according to the presentation, due to the high cost of setting up a new task.
That said, not sure if there is a difference in parallel execution and memory usage between map and mapPartitions. For instance, map could work in parallel implicitly, mapPartitions forces you to iterate. Thus computation could be faster with map but if your execution on a single tuple uses a lot of temporary memory, mapPartitions could avoid GC and memory issues. No idea if this is the way it actually works, but my anecdotal evidence seems to imply this. Would love to have confirmation.
In any case, the two clearly differ in how they are invoked: map applies the function to the elements of every partition of the RDD one element at a time, whereas mapPartitions applies the function once to each whole partition of the RDD.
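The per-record-overhead point from the quote above can be made concrete without a cluster. This is only a plain-Scala simulation of the cost model, not Spark itself; the partition layout and the setup() counter are made up for illustration. If each invocation pays some setup cost (say, opening a database connection), a map-style function pays it once per element, while a mapPartitions-style function pays it once per partition:

```scala
object MapVsMapPartitionsSim {
  def main(args: Array[String]): Unit = {
    // Two "partitions", analogous to sc.parallelize(1 to 6, 2) in Spark
    val partitions = List(List(1, 2, 3), List(4, 5, 6))

    var setupCalls = 0
    def setup(): Unit = setupCalls += 1 // stands in for expensive per-call setup

    // map-style: the function (and its setup) runs once per element
    partitions.flatten.foreach { _ => setup() }
    println(s"map-style setups: $setupCalls") // prints 6

    // mapPartitions-style: the function runs once per partition iterator
    setupCalls = 0
    partitions.foreach { part => setup(); part.foreach(identity) }
    println(s"mapPartitions-style setups: $setupCalls") // prints 2
  }
}
```

This is exactly why myfunc above XORs an entire partition inside one call: the per-task work (building the result pair, the shuffle record) is paid once per partition instead of once per line.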