Why custom classes cannot be compared in Spark RDD set operations (intersection, union, difference): an analysis

Latest recommended article published on 2023-03-11 15:33:28

labracy

Category: spark. Tags: spark, intersection, scala, source code, custom class

Article link: https://blog.csdn.net/u013516079/article/details/80223567

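This article analyzes a problem that appears when running the RDD intersection operation on a custom class in Spark, and points out that the custom class must override hashCode and equals for intersection to work. Based on how CoGroupedRDD is implemented, it explains that two objects are treated as the same key only when their hashCode values are equal and equals is implemented correctly. A test case demonstrates the fix.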

1. First, look at the implementation of the intersection function:
def intersection(other: RDD[T]): RDD[T] = withScope {
  this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
      .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
      .keys
}
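The intersection of two RDDs works as follows: map turns every element v into the key-value pair (v, null), cogroup groups the two RDDs by key, and filter keeps only the keys whose groups are non-empty on both sides, that is, the keys present in both RDDs. Finally, keys drops the null values.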
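The three steps above can be imitated without Spark to see why the key class matters. The following is a minimal sketch, not Spark's code: `localCogroup` and `localIntersection` are hypothetical helpers built on plain Scala collections, and `PlainUser` and `KeyedUser` are made-up classes. The only assumption carried over from Spark is that keys are grouped through hash-based maps, which rely on hashCode and equals.

```scala
object IntersectionSketch {
  // No equals/hashCode overrides: instances compare by reference identity.
  class PlainUser(val id: Int, val name: String)

  // Overrides both methods, so two instances with the same fields are one key.
  class KeyedUser(val id: Int, val name: String) {
    override def equals(other: Any): Boolean = other match {
      case that: KeyedUser => id == that.id && name == that.name
      case _               => false
    }
    override def hashCode(): Int = 31 * id + name.hashCode
  }

  // Hypothetical local stand-in for cogroup: groups both sides by key using
  // hash-based maps, which depend on hashCode and equals of the key type.
  def localCogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val l = left.groupBy(_._1)
    val r = right.groupBy(_._1)
    (l.keySet ++ r.keySet).map { k =>
      (k, (l.getOrElse(k, Seq.empty).map(_._2), r.getOrElse(k, Seq.empty).map(_._2)))
    }.toMap
  }

  // Same shape as RDD.intersection: map to (v, null), cogroup, filter, keys.
  def localIntersection[T](a: Seq[T], b: Seq[T]): Set[T] =
    localCogroup(a.map(v => (v, null)), b.map(v => (v, null)))
      .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
      .keySet

  // Two distinct PlainUser instances never land under the same key.
  val plain = localIntersection(Seq(new PlainUser(1, "a")), Seq(new PlainUser(1, "a")))
  // Two KeyedUser instances with equal fields collapse into one key.
  val keyed = localIntersection(Seq(new KeyedUser(1, "a")), Seq(new KeyedUser(1, "a")))
}
```

In this sketch `plain` is empty while `keyed` contains one element. A Scala case class would behave like `KeyedUser`, since the compiler generates field-based equals and hashCode for it.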

2. Next, look at the implementation of cogroup:
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}
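Most of the work is delegated to the CoGroupedRDD class. cogroup then calls mapValues to turn the two-element array stored for each key into a pair, casting the two elements to Iterable[V] and Iterable[W].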
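The mapValues step can be illustrated on a plain Scala Map. This is only a sketch of the shape conversion: `grouped` is a hand-written, hypothetical stand-in for what CoGroupedRDD produces (key -> array of per-RDD value groups), and the value types Int and String are chosen purely for illustration.

```scala
object CogroupShapeSketch {
  // Hypothetical stand-in for CoGroupedRDD output: one untyped group per input RDD.
  val grouped: Map[String, Array[Iterable[_]]] = Map(
    "a" -> Array[Iterable[_]](Seq(1, 2), Seq("x")),
    "b" -> Array[Iterable[_]](Seq(3), Seq.empty[String])
  )

  // Same pattern as in cogroup: destructure the two-element array and cast
  // each group back to its element type.
  val typed: Map[String, (Iterable[Int], Iterable[String])] =
    grouped.map { case (k, Array(vs, ws)) =>
      (k, (vs.asInstanceOf[Iterable[Int]], ws.asInstanceOf[Iterable[String]]))
    }
}
```

Key "b" keeps an empty right-hand group here, which is exactly the case that the filter in intersection discards.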

3. Now look at the implementation of the CoGroupedRDD class:

@DeveloperApi
class CoGroupedRDD[K: ClassTag](
    @transient var rdds: Seq[RDD[_ <: Product2[K, _]]],
    part: Partitioner)
  extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil) {

  // For example, `(k, a) cogroup (k, b)` produces k -> Array(ArrayBuffer as, ArrayBuffer bs).
  // Each ArrayBuffer is represented as a CoGroup, and the resulting Array as a CoGroupCombiner.
  // CoGroupValue is the intermediate state of each value before being merged in compute.
  private type CoGroup = CompactBuffer[Any]
  private type CoGroupValue = (Any, Int) // Int is dependency number
  private type CoGroupCombiner = Array[CoGroup]

  override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {
    val split = s.asInstanceOf[CoGroupPartition]
    val numRdds = dependencies.length

    // A list of (rdd iterator, dependency number) pairs
    val rddIterators = new ArrayBuffer[(Itera