1. pyspark version
2.3.0
2. Official documentation
filter(f) [source]
Return a new RDD containing only the elements that satisfy a predicate.
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]
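filter keeps an element exactly when the predicate returns a truthy value for it. Since pyspark may not be installed everywhere, the same semantics can be sketched with Python's built-in filter as a local stand-in (this is not an RDD, just the same predicate logic):

```python
# Local stand-in for RDD.filter: keep elements where the predicate is truthy.
data = [1, 2, 3, 4, 5]

# Same predicate as the official example: keep even numbers.
evens = list(filter(lambda x: x % 2 == 0, data))
print(evens)  # [2, 4]
```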
3. My code
Example 1
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local").setAppName("filter")
sc = SparkContext(conf=conf)
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
new_rdd1 = rdd1.filter(lambda x: x > 2)
print('new_rdd1 = ', new_rdd1.collect())
>>> new_rdd1 = [3, 4, 5]
Example 2
rdd2 = sc.parallelize([[1, 'a'], [2, 'b'], [3, 'c']])
new_rdd2 = rdd2.filter(lambda x: x[0] > 1)
print('new_rdd2 = ', new_rdd2.collect())
>>> new_rdd2 = [[2, 'b'], [3, 'c']]
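In Spark, key-value records are more commonly stored as tuples than as two-element lists, but the predicate logic is identical: index into the record and compare. A local plain-Python sketch (pyspark assumed unavailable here):

```python
# Key-value records as tuples instead of two-element lists.
pairs = [(1, 'a'), (2, 'b'), (3, 'c')]

# Same predicate as Example 2: keep records whose key is greater than 1.
kept = [kv for kv in pairs if kv[0] > 1]
print(kept)  # [(2, 'b'), (3, 'c')]
```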
Example 3: keep the elements whose keys contain 'a'
rdd3 = sc.parallelize([{'a': 1}, {'b': 2}, {'c': 3}])
def myfilter(x):
    print('x = ', x, type(x))
    # Returning the dict itself works because filter only needs a truthy
    # value; elements for which this returns None are dropped.
    if 'a' in x.keys():
        return x
new_rdd3 = rdd3.filter(myfilter)
print('new_rdd3 = ', new_rdd3.collect())
>>> new_rdd3 = [{'a': 1}]
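myfilter works only because filter checks truthiness: a non-empty dict is truthy, and the implicit None returned for the other branch is falsy. A predicate that returns an actual boolean is cleaner, and `'a' in x` already tests the dict's keys, so `.keys()` is unnecessary. A local plain-Python sketch of that variant (no pyspark required):

```python
dicts = [{'a': 1}, {'b': 2}, {'c': 3}]

# 'a' in d tests membership among the dict's keys directly.
with_a = [d for d in dicts if 'a' in d]
print(with_a)  # [{'a': 1}]
```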
The print output from inside myfilter appeared in the notebook's DOS (console) window: