RDD Operations

transformations: create a new dataset from an existing one
    RDDA ---transformation--> RDDB
    analogous to y = f(x), e.g. rddb = rdda.map(....)
    lazy (important!)
    rdda.map().filter()......collect()
    map / filter / groupByKey / distinct / .....

actions: return a value to the driver program after running a computation on the dataset
    count / reduce / collect......

1) transformations are lazy: nothing actually happens until an action is called;
2) an action triggers the computation;
3) an action returns values to the driver or writes data to external storage;
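A minimal sketch of the lazy behavior, assuming a SparkContext named sc has already been created as in the script further below (the rdd/doubled names are illustrative):

rdd = sc.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2)  # transformation: only recorded, nothing is computed yet
print(doubled.collect())            # action: triggers the job and returns [2, 4, 6, 8, 10]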
map:
map(func)
Applies func to every element of the dataset and returns a new distributed dataset, e.g. word => (word, 1)
#!/usr/bin/python
# -*- coding: utf-8 -*-
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setMaster("local[2]").setAppName("spark0401")
    sc = SparkContext(conf=conf)

    def my_map():
        data = [1, 2, 3, 4, 5]
        rdd1 = sc.parallelize(data)
        rdd2 = rdd1.map(lambda x: x * 2)
        print(rdd2.collect())

    def my_map2():
        a = sc.parallelize(["dog", "tiger", "lion", "cat", "panther", "eagle"])
        b = a.map(lambda x: (x, 1))
        print(b.collect())

    my_map2()

    sc.stop()
filter:
filter(func)
Selects the elements for which func returns true and returns a new distributed dataset.
def my_filter():
    data = [1, 2, 3, 4, 5]
    rdd1 = sc.parallelize(data)
    mapRdd = rdd1.map(lambda x: x * 2)
    filterRdd = mapRdd.filter(lambda x: x > 5)
    print(filterRdd.collect())
    # the same pipeline written in a chained style
    print(sc.parallelize(data).map(lambda x: x * 2).filter(lambda x: x > 5).collect())
flatMap:
flatMap(func)
Each input item can be mapped to 0 or more output items; func should return a sequence.
def my_flatMap():
    data = ["hello spark", "hello world", "hello world"]
    rdd = sc.parallelize(data)
    rdd2 = rdd.flatMap(lambda line: line.split(" "))
    print(rdd2.collect())
groupByKey: groups the values that share the same key
['hello', 'spark', 'hello', 'world', 'hello', 'world']
('hello', 1) ('spark', 1) ........
def my_groupBy():
    data = ["hello spark", "hello world", "hello world"]
    rdd = sc.parallelize(data)
    mapRdd = rdd.flatMap(lambda line: line.split(" ")).map(lambda x: (x, 1))
    groupByRdd = mapRdd.groupByKey()
    # print(groupByRdd.collect())
    print(groupByRdd.map(lambda x: {x[0]: list(x[1])}).collect())
reduceByKey: groups the values that share the same key and combines them with the given function
mapRdd.reduceByKey(lambda a, b: a + b)
[1, 1]    => 1 + 1 = 2
[1, 1, 1] => (1 + 1) + 1 = 3
[1]       => 1
def my_reduceByKey():
    data = ["hello spark", "hello world", "hello world"]
    rdd = sc.parallelize(data)
    mapRdd = rdd.flatMap(lambda line: line.split(" ")).map(lambda x: (x, 1))
    reduceByKeyRdd = mapRdd.reduceByKey(lambda a, b: a + b)
    print(reduceByKeyRdd.collect())
Requirement: sort the word-count result in descending order of count, using sortByKey
Expected output: ('hello', 3), ('world', 2), ('spark', 1)
def my_sort():
    data = ["hello spark", "hello world", "hello world"]
    rdd = sc.parallelize(data)
    mapRdd = rdd.flatMap(lambda line: line.split(" ")).map(lambda x: (x, 1))
    reduceByRdd = mapRdd.reduceByKey(lambda a, b: a + b)
    sortByKeyRdd = reduceByRdd.sortByKey(False)  # False = descending; omit it for the default ascending order (sorts by the key, i.e. the word)
    # to sort by count instead, swap (word, count) to (count, word), sort, then swap back
    print(reduceByRdd.map(lambda x: (x[1], x[0])).sortByKey(False).map(lambda x: (x[1], x[0])).collect())
union:
def my_union():
    a = sc.parallelize([1, 2, 3, 4])
    b = sc.parallelize([5, 6, 7, 8])
    c = a.union(b)
    print(c.collect())
distinct: remove duplicates
def my_distinct():
    a = sc.parallelize([1, 2, 3, 4])
    b = sc.parallelize([5, 3, 2, 8])
    c = a.union(b).distinct()
    print(c.collect())
join: inner join
def my_join():
    a = sc.parallelize([("A", "a1"), ("C", "c1"), ("D", "d1"), ("F", "f1"), ("F", "f2")])
    b = sc.parallelize([("A", "a2"), ("C", "c2"), ("C", "c3"), ("E", "e1")])
    c = a.join(b)            # inner join
    d = a.leftOuterJoin(b)   # left outer join
    e = a.rightOuterJoin(b)  # right outer join
    f = a.fullOuterJoin(b)   # full outer join
    print(c.collect())
    print(f.collect())
action:
def my_action():
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = sc.parallelize(a)
    print(b.collect())  # all elements
    print(b.count())    # number of elements
    print(b.take(3))    # first 3 elements
    print(b.max())      # maximum
    print(b.min())      # minimum
    print(b.sum())      # sum
    print(b.reduce(lambda x, y: x + y))  # sum via pairwise reduction
    b.foreach(lambda x: print(x))  # print each element (executed on the executors)
Word count example: wc
1) input: one or more files / a directory / files with a given suffix
hello spark
hello hadoop
hello welcome
2) Development steps (see the sketch after this list)
Split each line of the text into individual words: flatMap
word ==> (word, 1): map
Sum the counts of all identical words to get the final result: reduceByKey
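A minimal word-count sketch following the three steps above; the hard-coded data list is an assumption for illustration, and sc.textFile(path) can replace sc.parallelize(data) when reading real files:

def my_wordcount():
    data = ["hello spark", "hello hadoop", "hello welcome"]
    rdd = sc.parallelize(data)
    counts = rdd.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda a, b: a + b)
    print(counts.collect())  # e.g. [('hello', 3), ('spark', 1), ('hadoop', 1), ('welcome', 1)]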
TopN
1) input: one or more files / a directory / files with a given suffix
2) find the top N items along some dimension
3) Development steps (see the sketch after this list)
From each line of the text, extract the fields you need: map
word ==> (word, 1): map
Sum the counts of all identical words to get the final result: reduceByKey
Sort by count in descending order and take the top entries: sortByKey
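A minimal TopN sketch following the steps above; the data list and n are illustrative assumptions, and it reuses the swap-sort-swap trick from my_sort:

def my_topn(n=2):
    data = ["hello spark", "hello hadoop", "hello spark"]
    rdd = sc.parallelize(data)
    counts = rdd.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda a, b: a + b)
    # swap to (count, word), sort by count descending, swap back, then take the top n
    print(counts.map(lambda x: (x[1], x[0])).sortByKey(False).map(lambda x: (x[1], x[0])).take(n))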
Average: compute the average age
id  age
3   96
4   44
5   67
6   4
7   98
Development steps:
1) extract the age: map
2) sum the ages: reduce
3) count the number of records: count
4) average = sum of ages / number of records
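A minimal sketch of the average-age computation following the four steps above, assuming each input line uses the "id age" layout shown in the table:

def my_avg_age():
    data = ["3 96", "4 44", "5 67", "6 4", "7 98"]
    rdd = sc.parallelize(data)
    ages = rdd.map(lambda line: int(line.split(" ")[1]))  # 1) extract the age
    total_age = ages.reduce(lambda a, b: a + b)           # 2) sum the ages
    num = ages.count()                                    # 3) number of records
    print(total_age / num)                                # 4) average age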