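Spark: transformations on two RDDs (double-value type)
union(otherDataset) example
1. Purpose: returns a new RDD holding the union of the source RDD and the argument RDD. Duplicate elements are kept.
2. Task: create two RDDs and compute their union.
union() source: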
/**
 * Return the union of this RDD and another one. Any identical elements will appear multiple
 * times (use `.distinct()` to eliminate them).
 */
def union(other: RDD[T]): RDD[T] = withScope {
  sc.union(this, other)
}
demo:
val rdd1 = sc.parallelize(1 to 5)
val rdd2 = sc.parallelize(4 to 8)
val rdd3 = rdd1.union(rdd2)
result:
rdd3.collect
Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6, 7, 8)
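The duplicate-keeping behaviour shown in the result can be imitated without a cluster. The following is only a local sketch with plain Scala collections, not Spark itself; UnionSketch and unionLike are hypothetical names introduced for illustration.

```scala
object UnionSketch {
  // Hypothetical local stand-in for RDD.union: plain concatenation, duplicates kept.
  def unionLike[T](left: Seq[T], right: Seq[T]): Seq[T] = left ++ right

  def main(args: Array[String]): Unit = {
    val merged = unionLike(1 to 5, 4 to 8)
    // 4 and 5 occur in both inputs, so they appear twice.
    println(merged.mkString(", "))          // 1, 2, 3, 4, 5, 4, 5, 6, 7, 8
    // distinct removes the repeats, as the doc comment of union() suggests.
    println(merged.distinct.mkString(", ")) // 1, 2, 3, 4, 5, 6, 7, 8
  }
}
```

On a real RDD the matching call would be rdd1.union(rdd2).distinct(), which, unlike union alone, has to compare elements across partitions.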
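subtract(otherDataset) example
1. Purpose: computes the difference of two RDDs. It returns the elements of the source RDD that do not occur in the argument RDD; elements that occur in both are removed.
2. Task: create two RDDs and compute the difference of the first RDD with respect to the second.
Source: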
/**
 * Return an RDD with the elements from `this` that are not in `other`.
 *
 * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
 * RDD will be <= us.
 */
def subtract(other: RDD[T]): RDD[T] = withScope {
  subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
}
demo:
val rdd1 = sc.parallelize(1 to 5)
val rdd2 = sc.parallelize(4 to 6)
val rdd3 = rdd1.subtract(rdd2)
result:
Array[Int] = Array(1, 2, 3)
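The difference semantics can be modelled locally. This is a sketch with plain Scala collections under the assumption that an element is dropped whenever it occurs in the other collection at all; SubtractSketch and subtractLike are hypothetical names, and the element order of a real RDD result may differ because the data is redistributed by a partitioner.

```scala
object SubtractSketch {
  // Hypothetical local stand-in for RDD.subtract:
  // keep only the elements of `left` that never occur in `right`.
  def subtractLike[T](left: Seq[T], right: Seq[T]): Seq[T] = {
    val exclude = right.toSet
    left.filterNot(exclude)
  }

  def main(args: Array[String]): Unit = {
    println(subtractLike(1 to 5, 4 to 6).mkString(", ")) // 1, 2, 3

    // Contrast with Seq.diff, which removes only one occurrence per match.
    println(Seq(1, 1, 2).diff(Seq(1)).mkString(", "))          // 1, 2
    println(subtractLike(Seq(1, 1, 2), Seq(1)).mkString(", ")) // 2
  }
}
```

The comparison with Seq.diff is included because the two are easy to confuse: diff is a multiset difference, while the sketch removes every occurrence.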
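intersection(otherDataset) example
1. Purpose: returns a new RDD holding the intersection of the source RDD and the argument RDD. The result contains no duplicates.
2. Task: create two RDDs and compute their intersection.
Source: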
/**
 * Return the intersection of this RDD and another one. The output will not contain any duplicate
 * elements, even if the input RDDs did.
 *
 * @note This method performs a shuffle internally.
 */
def intersection(other: RDD[T]): RDD[T] = withScope {
  this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
    .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
    .keys
}
demo:
val rdd = sc.parallelize(1 to 6)
val rdd2 = sc.parallelize(4 to 7)
val rdd3 = rdd.intersection(rdd2)
result:
Array[Int] = Array(4, 5, 6)
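The cogroup-then-filter idea in the source can be imitated on local collections. This is a rough sketch rather than Spark's implementation: groupBy stands in for cogroup, and IntersectionSketch and intersectionLike are hypothetical names.

```scala
object IntersectionSketch {
  // Hypothetical local stand-in for RDD.intersection.
  // Group each side by value, then keep the keys that have a group on both sides.
  def intersectionLike[T](left: Seq[T], right: Seq[T]): Set[T] = {
    val leftGroups  = left.groupBy(identity)
    val rightGroups = right.groupBy(identity)
    val allKeys     = leftGroups.keySet ++ rightGroups.keySet
    allKeys.filter(k => leftGroups.contains(k) && rightGroups.contains(k))
  }

  def main(args: Array[String]): Unit = {
    // Repeated inputs (2 and 4) still yield each common value once.
    val common = intersectionLike(Seq(1, 2, 2, 3, 4, 5, 6), Seq(4, 4, 5, 6, 7))
    println(common.toSeq.sorted.mkString(", ")) // 4, 5, 6
  }
}
```

Because every key appears once after grouping, the result is free of duplicates, which matches the behaviour stated in the doc comment.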
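cartesian(otherDataset) example
1. Purpose: computes the Cartesian product of the two RDDs. Avoid it where possible, because the result holds one pair for every combination of elements.
2. Task: create two RDDs and compute their Cartesian product.
demo:
val rdd = sc.parallelize(1 to 5)
val rdd2 = sc.parallelize(6 to 10)
val rdd3 = rdd.cartesian(rdd2)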
result:
Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10),
(4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))
Source:
/**
 * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
 * elements (a, b) where a is in `this` and b is in `other`.
 */
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
  new CartesianRDD(sc, this, other)
}
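To see why the Cartesian product gets expensive, a local sketch with plain Scala collections is enough. It is not Spark code; CartesianSketch and cartesianLike are hypothetical names used only for illustration.

```scala
object CartesianSketch {
  // Hypothetical local stand-in for RDD.cartesian: every (a, b) combination.
  def cartesianLike[A, B](left: Seq[A], right: Seq[B]): Seq[(A, B)] =
    for (a <- left; b <- right) yield (a, b)

  def main(args: Array[String]): Unit = {
    val pairs = cartesianLike(1 to 5, 6 to 10)
    println(pairs.size) // 25, i.e. 5 * 5
    println(pairs.head) // (1,6)
    println(pairs.last) // (5,10)
  }
}
```

The result size is the product of the two input sizes, so two inputs of a million elements each would already produce a trillion pairs.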
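zip(otherDataset) example
1. Purpose: combines two RDDs into a single RDD of key/value pairs. Both RDDs must have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown.
2. Task: create two RDDs and combine them into one (k, v) RDD.
Source: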
/**
 * Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
 * second element in each RDD, etc. Assumes that the two RDDs have the *same number of
 * partitions* and the *same number of elements in each partition* (e.g. one was made through
 * a map on the other).
 */
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
  zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>
    new Iterator[(T, U)] {
      def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
        case (true, true) => true
        case (false, false) => false
        case _ => throw new SparkException("Can only zip RDDs with " +
          "same number of elements in each partition")
      }
      def next(): (T, U) = (thisIter.next(), otherIter.next())
    }
  }
}
demo:
val rdd1 = sc.parallelize(Array(1, 2, "big"))
val rdd2 = sc.parallelize(Array("one", "two", "cat"))
val rdd3 = rdd1.zip(rdd2)
result:
Array[(Any, String)] = Array((1,one), (2,two), (big,cat))
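The per-partition requirement of zip is easier to see when partitions are made explicit. The sketch below models each partition as an inner Seq; it is a local illustration, not Spark's implementation. ZipSketch and zipLike are hypothetical names, and require (which throws IllegalArgumentException) stands in for the SparkException raised in the source.

```scala
import scala.util.Try

object ZipSketch {
  // Hypothetical local stand-in for RDD.zip; each inner Seq plays the role of one partition.
  def zipLike[A, B](left: Seq[Seq[A]], right: Seq[Seq[B]]): Seq[Seq[(A, B)]] = {
    require(left.length == right.length, "different number of partitions")
    left.zip(right).map { case (l, r) =>
      require(l.length == r.length, "different number of elements in a partition")
      l.zip(r)
    }
  }

  def main(args: Array[String]): Unit = {
    // Same layout on both sides: two elements, then one element.
    val ok = zipLike(Seq(Seq(1, 2), Seq(3)), Seq(Seq("one", "two"), Seq("three")))
    println(ok.flatten.mkString(", ")) // (1,one), (2,two), (3,three)

    // Same total element count, but a different layout per partition: rejected.
    val bad = Try(zipLike(Seq(Seq(1, 2), Seq(3)), Seq(Seq("a"), Seq("b", "c"))))
    println(bad.isFailure) // true
  }
}
```

The second call shows that equal totals are not sufficient: the counts have to match partition by partition, which is what the hasNext check in the source enforces.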