RDD —— Resilient Distributed Dataset: the data abstraction that carries distributed computation.
The five main properties of an RDD are:
1. It is made up of partitions.
2. A compute function is applied to every partition.
3. RDDs depend on their predecessor RDDs, forming a lineage.
4. Key-value RDDs may have a partitioner (hash partitioning by default).
5. Partitions are preferably computed close to where their data is stored (data locality).
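The first three properties can be pictured with a toy, purely local sketch (plain Python lists standing in for a real RDD; all names here are made up for illustration, this is not Spark code):

```python
# A toy model of an RDD: a list of partitions (property 1).
parent = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Property 2: the same compute function is applied to every partition.
def compute(partition):
    return [x * 10 for x in partition]

# Property 3: the child "RDD" is derived from (depends on) its parent.
child = [compute(p) for p in parent]
print(child)  # [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
```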
Program entry point: the SparkContext object
The entry object for Spark RDD programming is SparkContext (regardless of the programming language); only after a SparkContext has been built can the subsequent API calls and computation run.
Essentially, as far as RDD programming is concerned, the main job of SparkContext is to create the first RDD.
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    # conf = SparkConf().setMaster("local[*]").setAppName("wordCountHelloWorld")
    conf = SparkConf().setAppName("wordCountHelloWorld")
    # Build the SparkContext object from the SparkConf object
    sc = SparkContext(conf=conf)
There are two ways to create an RDD:
1. Parallelize a collection: turn a local collection into a distributed RDD
# coding:utf8
# Import the relevant Spark packages
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    # 0. Initialize the execution environment: build the SparkContext object
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # Demonstrate creating an RDD by parallelizing a collection: local collection -> distributed object (RDD)
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
    # If parallelize is given no partition count, what is the default? It depends on the CPU cores
    print("Default partition count:", rdd.getNumPartitions())
    rdd = sc.parallelize([1, 2, 3], 3)
    print("Partition count:", rdd.getNumPartitions())
    # The collect method sends the data of every partition of the RDD (a distributed object) to the Driver, forming one Python list
    # collect: distributed -> local collection
    print("RDD contents:", rdd.collect())
2. Read data directly from a local file or from HDFS
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # The second argument of textFile is the minimum partition count
    RDD1 = sc.textFile("../data/input/words.txt", 5)
    print("Partition count:", RDD1.getNumPartitions())
    print("RDD1 contents:", RDD1.collect())
    RDD2 = sc.textFile("hdfs://192.168.88.161:8020/test/input/wordcount.txt")
    print("Default partition count:", RDD2.getNumPartitions())
    print("RDD2 contents:", RDD2.collect())
Common operators:
The following are Transformation operators. They are lazily evaluated and return an RDD; like stages of a pipeline, they do not compute any result by themselves — the work only runs when an Action operator is encountered.
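Lazy evaluation can be mimicked locally with Python generators (a rough analogy only, not Spark itself): building the "transformation" does no work, and only the "action" at the end forces the computation:

```python
data = [1, 2, 3, 4]
log = []  # records when each element is actually processed

def traced_double(x):
    log.append(x)
    return x * 2

# Like rdd.map(...): building the generator runs nothing yet.
mapped = (traced_double(x) for x in data)
print(log)        # [] -- no element has been touched so far

# Like an Action (e.g. collect): consuming the generator triggers the work.
result = list(mapped)
print(result)     # [2, 4, 6, 8]
print(log)        # [1, 2, 3, 4]
```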
map:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 3)

    # Define a method to serve as the function passed to the operator
    def add(data):
        return data * 10

    print(rdd.map(add).collect())
    # A simpler way is to write an anonymous function as a lambda expression
    print(rdd.map(lambda data: data * 10).collect())
    """
    Either way of supplying the function works.
    A lambda expression suits a one-line function body; if the body takes multiple lines, define a standalone method.
    """
flatMap:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize(["hadoop spark hadoop", "spark hadoop hadoop", "hadoop flink spark"])
    rdd2 = rdd.map(lambda line: line.split(" "))
    print(rdd2.collect())
    print("--------------------------------------------------")
    rdd = sc.parallelize(["hadoop spark hadoop", "spark hadoop hadoop", "hadoop flink spark"])
    # Get all the words as a single flat RDD. flatMap takes the same argument as map
    # (it is used for the map logic); the un-nesting itself needs no extra parameter
    rdd2 = rdd.flatMap(lambda line: line.split(" "))
    print(rdd2.collect())
reduceByKey:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([('bob', 1), ('alice', 1), ('bob', 3), ('lily', 5), ('alice', 2)])
    # reduceByKey aggregates the values of records that share the same key (here: summing them)
    print(rdd.reduceByKey(lambda a, b: a + b).collect())
mapValues:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # Processes only the values, leaving the keys untouched
    rdd = sc.parallelize([('bob', 1), ('alice', 1), ('bob', 3), ('lily', 5), ('alice', 2)])
    print(rdd.mapValues(lambda data: data * 10).collect())
Combined use of map, flatMap and reduceByKey (word count):
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("words_count").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    file_rdd = sc.textFile("../data/input/words.txt")
    word_rdd = file_rdd.flatMap(lambda data: data.split(" "))
    word_with_one_rdd = word_rdd.map(lambda word: (word, 1))
    result_rdd = word_with_one_rdd.reduceByKey(lambda x1, x2: x1 + x2)
    print(result_rdd.collect())
groupBy:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([('alice', 1), ('alice', 1), ('bob', 1), ('bob', 2), ('bob', 3)])
    # Group the data with groupBy
    # The function passed to groupBy decides what to group by (just return it)
    # The grouping rule is the same as in SQL: equal values end up in one group (hash grouping)
    result = rdd.groupBy(lambda t: t[0])
    print(result.collect())
    print(result.map(lambda t: (t[0], list(t[1]))).collect())
filter:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
    # Use the filter operator to pick out the multiples of 3
    result = rdd.filter(lambda x: x % 3 == 0)
    print(result.collect())
distinct:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 1, 1, 2, 2, 2, 3, 3, 3])
    # distinct deduplicates the RDD's data
    print(rdd.distinct().collect())
    rdd2 = sc.parallelize([('a', 1), ('a', 1), ('a', 3)])
    print(rdd2.distinct().collect())
union:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd1 = sc.parallelize([1, 1, 3, 3])
    rdd2 = sc.parallelize(["a", "b", "a"])
    rdd3 = sc.parallelize([('a', 1), ('a', 1), ('a', 3)])
    rdd4 = rdd1.union(rdd2).union(rdd3)
    print(rdd4.collect())
    """
    1. As the output shows, the union operator does not deduplicate
    2. RDDs with different element types can still be merged
    """
join:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd1 = sc.parallelize([(1001, ""), (1002, "lisi"), (1003, "wangwu"), (1004, "zhaoliu")])
    rdd2 = sc.parallelize([(1001, "Sales"), (1002, "Technology")])
    # Use the join operator to relate the two RDDs
    # For the join operator, the join condition is the key of the two-element tuples
    print(rdd1.join(rdd2).collect())
    # For a left or right outer join, swap the order of the RDDs, or call rightOuterJoin
    print(rdd1.leftOuterJoin(rdd2).collect())
intersection:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd1 = sc.parallelize([('a', 1), ('a', 3)])
    rdd2 = sc.parallelize([('a', 1), ('b', 3)])
    # Use the intersection operator to compute the intersection of the two RDDs, returned as a new RDD
    rdd3 = rdd1.intersection(rdd2)
    print(rdd3.collect())
glom:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9], 4)
    # glom makes the partition structure visible: each partition becomes a list
    print(rdd.glom().collect())
    # flatMap undoes the nesting again
    print(rdd.glom().flatMap(lambda x: x).collect())
groupByKey:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([('a', 1), ('a', 1), ('b', 1), ('b', 1), ('b', 1)])
    # groupByKey groups the values by key (no function argument needed)
    rdd2 = rdd.groupByKey()
    print(rdd2.collect())
    print(rdd2.map(lambda x: (x[0], list(x[1]))).collect())
sortBy:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([('c', 3), ('f', 1), ('b', 11), ('c', 3), ('a', 1), ('c', 5), ('e', 1), ('n', 9), ('a', 1)], 3)
    # Sort the RDD with sortBy
    # Here: sort by the value (the number)
    # Argument 1 is a function that tells Spark which column of the data to sort by
    # Argument 2: True for ascending, False for descending
    # Argument 3: the number of partitions used for sorting
    """Note: for a globally ordered result, set the sort partition count to 1"""
    print(rdd.sortBy(lambda x: x[1], ascending=True, numPartitions=1).collect())
    # Sort by key instead
    print(rdd.sortBy(lambda x: x[0], ascending=False, numPartitions=1).collect())
sortByKey:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([('a', 1), ('E', 1), ('C', 1), ('D', 1), ('b', 1), ('g', 1), ('f', 1),
                          ('y', 1), ('u', 1), ('i', 1), ('o', 1), ('p', 1),
                          ('m', 1), ('n', 1), ('j', 1), ('k', 1), ('l', 1)], 3)
    # keyfunc transforms the key before comparison; here the keys are compared case-insensitively
    print(rdd.sortByKey(ascending=True, numPartitions=1, keyfunc=lambda key: str(key).lower()).collect())
The following are Action operators —— their return values are not RDDs
countByKey:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.textFile("../data/input/words.txt")
    rdd2 = rdd.flatMap(lambda x: x.split(" ")).map(lambda x: (x, 1))
    # countByKey counts occurrences per key; it is an Action operator
    result = rdd2.countByKey()
    print(result)
    print(type(result))
collect:
Purpose: gathers the data of all partitions of the RDD back to the Driver, forming a single Python list
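What collect does can be illustrated with a local stand-in (hypothetical sample data; nested lists play the role of the RDD's partitions):

```python
# Three "partitions" of an imaginary RDD.
partitions = [[1, 2, 3], [4, 5], [6]]

# collect: gather every partition's records into one list on the "Driver".
collected = [x for part in partitions for x in part]
print(collected)  # [1, 2, 3, 4, 5, 6]
```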
reduce:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    # reduce aggregates the whole RDD with the given function; here the sum: 15
    print(rdd.reduce(lambda a, b: a + b))
fold:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
    print(rdd.glom().collect())
    # fold works like reduce, but the initial value is applied once per partition
    # and once more when the partition results are merged:
    # (10+1+2+3) + (10+4+5+6) + (10+7+8+9) + 10 = 85
    print(rdd.fold(10, lambda a, b: a + b))
first, take, top, count:
first() returns the first element; take(n) returns the first n elements as a list; top(n) returns the n largest elements in descending order; count() returns the number of elements.
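A quick local emulation of what these four actions return (a plain Python list stands in for the RDD; the sample data is made up):

```python
import heapq

data = [1, 3, 2, 4, 7, 9, 6]    # stand-in for an RDD's contents

first = data[0]                 # first(): the first element
take3 = data[:3]                # take(3): the first 3 elements, as a list
top3 = heapq.nlargest(3, data)  # top(3): the 3 largest, descending
count = len(data)               # count(): the number of elements

print(first, take3, top3, count)  # 1 [1, 3, 2] [9, 7, 6] 7
```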
takeSample:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 3, 5, 3, 1, 3, 2, 6, 7, 8, 6], 1)
    # Argument 1: whether to sample with replacement; argument 2: the sample size
    # When the third argument (the random seed) is given, every run draws the same values
    print(rdd.takeSample(True, 22, 1))
    print(rdd.takeSample(False, 5))
takeOrdered:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 3, 2, 4, 7, 9, 6], 1)
    # The 3 smallest elements
    print(rdd.takeOrdered(3))
    # Negating the sort key yields the 3 largest instead
    print(rdd.takeOrdered(3, lambda x: -x))
foreach: runs the function on every partition directly in the Executors; nothing is collected back to the Driver
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 3, 2, 4, 7, 9, 6], 3)
    rdd2 = rdd.glom().collect()
    print(rdd2)
    # The output comes straight from the partitions; foreach itself returns nothing
    rdd.foreach(lambda x: print(x * 10))
saveAsTextFile: writes one output file per partition
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 3, 2, 4, 7, 9, 6], 3)
    # Like foreach, the write is performed by the Executors without going through the Driver
    rdd.saveAsTextFile("../data/output/saveAsTestFile")
Partition-level operators -------- the function processes a whole partition and hands the results on in one batch, whereas ordinary operators pass records one by one; this reduces the per-record transfer overhead
mapPartitions:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 3, 2, 4, 7, 9, 6], 3)

    # The function receives an iterator over one whole partition and returns an iterable
    def process(iterator):
        result = list()
        for it in iterator:
            result.append(it * 10)
        return result

    print(rdd.mapPartitions(process).collect())
foreachPartition:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 3, 2, 4, 7, 9, 6], 3)

    # Like foreach, but the function is called once per partition and returns nothing
    def process(iterator):
        result = list()
        for it in iterator:
            result.append(it * 10)
        print(result)
        print("----------")

    rdd.foreachPartition(process)
partitionBy:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([('hadoop', 1), ('spark', 1), ('hello', 1), ('flink', 1), ('hadoop', 1), ('spark', 1)])

    # Custom partitioning with partitionBy: the function maps a key to a partition number
    def process(k):
        if 'hadoop' == k or 'hello' == k: return 0
        if 'spark' == k: return 1
        return 2

    print(rdd.partitionBy(3, process).glom().collect())
repartition and coalesce:
# coding:utf8
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize([1, 2, 3, 4, 5], 3)
    # repartition changes the partition count (always with a shuffle)
    print(rdd.repartition(1).getNumPartitions())
    print(rdd.repartition(5).getNumPartitions())
    # coalesce also changes the partition count; without shuffle=True it can only decrease it
    print(rdd.coalesce(1).getNumPartitions())
    print(rdd.coalesce(5, shuffle=True).getNumPartitions())
1. What are the ways to create an RDD?
By parallelizing a collection (turning a local collection into a distributed one),
or by reading data (textFile / wholeTextFiles).
2. How do you check an RDD's partition count?
With the getNumPartitions API; it returns an int.
3. What is the difference between Transformation and Action?
A Transformation operator always returns an RDD, while an Action operator never returns an RDD.
Transformation operators are lazy and only execute when an Action is encountered; the Action is the switch that starts the chain of Transformations.
4. Which two Action operators produce output without going through the Driver?
foreach and saveAsTextFile are executed and output directly by the Executors; their results are not sent back to the Driver.
5. What is the difference between reduceByKey and groupByKey?
reduceByKey has aggregation logic built in; groupByKey does not.
For aggregation, reduceByKey is more efficient, because it can pre-aggregate before the shuffle and finish the aggregation afterwards, so less IO is transferred.
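This efficiency argument can be sketched locally (pure Python, made-up data): pre-aggregating inside each partition before the "shuffle" shrinks the number of records that must travel:

```python
from collections import defaultdict

# Two hypothetical partitions of (key, value) records.
partitions = [
    [('a', 1), ('a', 1), ('b', 1)],
    [('a', 1), ('b', 1), ('b', 1)],
]

# groupByKey-style: every record crosses the shuffle unchanged.
shuffled_raw = [rec for part in partitions for rec in part]

# reduceByKey-style: combine inside each partition first (map-side combine) ...
def combine(records):
    acc = defaultdict(int)
    for k, v in records:
        acc[k] += v
    return list(acc.items())

pre_aggregated = [rec for part in partitions for rec in combine(part)]

# ... then finish the aggregation after the shuffle.
final = sorted(combine(pre_aggregated))

print(len(shuffled_raw), len(pre_aggregated))  # 6 4 -> fewer records shuffled
print(final)  # [('a', 3), ('b', 3)]
```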
6. What is the difference between mapPartitions and foreachPartition?
mapPartitions has a return value; foreachPartition does not.
7. What should you watch out for with partition operations?
Try not to increase the partition count, as this may break the in-memory iteration pipeline (it forces a shuffle).