1. PySpark version
Version 2.3.0
2. Explanation
union(): union
intersection(): intersection
subtract(): difference
cartesian(): Cartesian product
union
Official documentation:
union(other)
Return the union of this RDD and another one.
In other words, the result contains every element of both RDDs, and duplicates are kept.
>>> rdd = sc.parallelize([1, 1, 2, 3])
>>> rdd.union(rdd).collect()
[1, 1, 2, 3, 1, 1, 2, 3]
Example 1
from pyspark import SparkContext, SparkConf
# Run Spark locally; the application name is arbitrary
conf = SparkConf().setMaster("local").setAppName("union")
sc = SparkContext(conf=conf)
x = sc.parallelize(['A','A','B'])
y = sc.parallelize(['D','C','A'])
# union() concatenates the two RDDs and keeps duplicates
z = x.union(y)
print(z.collect())
# Output: ['A', 'A', 'B', 'D', 'C', 'A']
Example 2
# The element types of the two RDDs do not have to match
x1 = sc.parallelize(['A','A','B'])
y1 = sc.parallelize([['1', '2'], ['3', '4'], ['5', '6']])
z1 = x1.union(y1)
print(z1.collect())
# Output: ['A', 'A', 'B', ['1', '2'], ['3', '4'], ['5', '6']]
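Because union() keeps duplicates, the result behaves like a bag (multiset) union rather than a mathematical set union. The following is a minimal plain-Python sketch of that behaviour, not Spark code: lists stand in for RDDs, and the helper names bag_union and set_union are hypothetical. In PySpark itself, chaining distinct() after union() is the usual way to drop duplicates.

```python
from itertools import chain

def bag_union(xs, ys):
    """Model of RDD.union(): concatenate both inputs, keeping duplicates."""
    return list(chain(xs, ys))

def set_union(xs, ys):
    """Model of union() followed by distinct(): each element appears once."""
    # dict.fromkeys keeps first-seen order; Spark's distinct() guarantees no order
    return list(dict.fromkeys(chain(xs, ys)))

x = ['A', 'A', 'B']
y = ['D', 'C', 'A']
print(bag_union(x, y))   # ['A', 'A', 'B', 'D', 'C', 'A']
print(set_union(x, y))   # ['A', 'B', 'D', 'C']
```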
intersection
Official documentation:
Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.
In other words, each common element appears only once in the result, even when it is repeated in the inputs.
>>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
>>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
>>> rdd1.intersection(rdd2).collect()
[1, 2, 3]
Example 1
x = sc.parallelize(['A','A','B'])
y = sc.parallelize(['D','C','A'])
# Intersection: 'A' appears twice in x but only once in the result
z = x.intersection(y)
print('Intersection of x and y:', z.collect())
# Output: Intersection of x and y: ['A']
Example 2
# Nested lists fail here (lists are not hashable), but nested tuples work fine
# x1 = sc.parallelize([["a", 1], ["b", 4], ["a", 3]])
# y1 = sc.parallelize([["a", 3], ["c", None]])
x1 = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
y1 = sc.parallelize([("a", 3), ("c", None)])
# Intersection
z1 = x1.intersection(y1)
print('Intersection of x1 and y1:', z1.collect())
# Output: Intersection of x1 and y1: [('a', 3)]
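The nested-list failure noted in Example 2 is most likely due to hashing: intersection() has to hash each element in order to group equal values, and Python lists are unhashable while tuples are hashable. The minimal plain-Python sketch below (no Spark required; the helper name to_hashable is hypothetical) shows the difference and one way to convert rows beforehand.

```python
def to_hashable(row):
    """Convert a list row into a tuple so that it can be hashed."""
    return tuple(row)

rows_x = [["a", 1], ["b", 4], ["a", 3]]
rows_y = [["a", 3], ["c", None]]

try:
    hash(rows_x[0])          # lists are mutable, so hash() raises TypeError
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# After conversion, the rows can be placed in sets and intersected
common = set(map(to_hashable, rows_x)) & set(map(to_hashable, rows_y))
print(list_is_hashable)  # False
print(common)            # {('a', 3)}
```

With an RDD, the same conversion could be applied with map(tuple) before calling intersection().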
subtract
Official documentation:
subtract(other, numPartitions=None)
Return each value in self that is not contained in other.
In other words, the result keeps the elements of this RDD that do not appear in the other RDD.
>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]
Example 1
x = sc.parallelize(['A','A','B'])
y = sc.parallelize(['D','C','A'])
# Difference: elements of x that are not in y
z = x.subtract(y)
print('Difference of x and y:', z.collect())
# Output: Difference of x and y: ['B']
Example 2
x1 = sc.parallelize([("a", 1), ("b", 4), ("a", 3)])
y1 = sc.parallelize([("a", 3), ("c", None)])
# Difference: whole tuples are compared, not just their keys
z1 = x1.subtract(y1)
print('Difference of x1 and y1:', z1.collect())
# Output: Difference of x1 and y1: [('a', 1), ('b', 4)]
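Unlike the intersection() description, the documentation quoted for subtract() does not promise that duplicates are removed: it returns each value of self that is not contained in other. The operation is also not symmetric. A plain-Python model of that reading follows; the helper subtract_model is hypothetical, lists stand in for RDDs, and Spark does not guarantee the output order.

```python
def subtract_model(xs, ys):
    """Keep every element of xs (duplicates included) that is absent from ys."""
    exclude = set(ys)   # elements must be hashable for this lookup
    return [v for v in xs if v not in exclude]

x = ['A', 'A', 'B', 'B']
y = ['D', 'C', 'A']
print(subtract_model(x, y))   # ['B', 'B']
print(subtract_model(y, x))   # ['D', 'C']
```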
cartesian
Official documentation:
Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.
In other words, every element of this RDD is paired with every element of the other RDD.
>>> rdd = sc.parallelize([1, 2])
>>> sorted(rdd.cartesian(rdd).collect())
[(1, 1), (1, 2), (2, 1), (2, 2)]
Example 1
x = sc.parallelize(['A','A','B'])
y = sc.parallelize(['D','C','A'])
# Cartesian product: 3 x 3 = 9 pairs, duplicates included
z = x.cartesian(y)
print('Cartesian product of x and y:', z.collect())
# Output: Cartesian product of x and y: [('A', 'D'), ('A', 'C'), ('A', 'A'), ('A', 'D'), ('A', 'C'), ('A', 'A'), ('B', 'D'), ('B', 'C'), ('B', 'A')]
Example 2
x1 = sc.parallelize([("a", 1), ("b", 4), ("a", 3)])
y1 = sc.parallelize([("a", 3), ("c", None)])
# Cartesian product: each result element is a pair of tuples
z1 = x1.cartesian(y1)
print('Cartesian product of x1 and y1:', z1.collect())
# Output: Cartesian product of x1 and y1: [(('a', 1), ('a', 3)), (('a', 1), ('c', None)), (('b', 4), ('a', 3)), (('b', 4), ('c', None)), (('a', 3), ('a', 3)), (('a', 3), ('c', None))]
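Since the Cartesian product pairs every element of the first RDD with every element of the second, the result has len(x) * len(y) elements and can grow very quickly on large inputs. Here is a minimal plain-Python sketch using itertools.product, with lists standing in for RDDs; it illustrates the semantics only, not how Spark computes the result.

```python
from itertools import product

x = ['A', 'A', 'B']
y = ['D', 'C', 'A']

# Every element of x is paired with every element of y, duplicates included
pairs = list(product(x, y))
print(len(pairs))   # 9, i.e. len(x) * len(y)
print(pairs[:3])    # [('A', 'D'), ('A', 'C'), ('A', 'A')]
```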