Spark - Notes 3

此心光明-超然

Article link: https://blog.csdn.net/weixin_43364172/article/details/95758866

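Figure: The architecture of a Spark Application

An RDD is an immutable, distributed collection of objects: it holds only references to the objects, while the actual objects live on the nodes of the cluster.
RDDs are resilient and fault-tolerant.
Transformations: every operation produces a new RDD; the original RDD is never modified once it has been created.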

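By default, an RDD is partitioned with a hash algorithm.
The number of partitions depends on the number of nodes and the size of the data.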
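
To make the idea of hash partitioning concrete, here is a minimal pure-Python sketch. It is not Spark's implementation: the function name `hash_partition` and the sample records are made up for illustration, and small integer keys are used so that Python's `hash` is deterministic.

```python
from collections import defaultdict

def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in partition hash(key) % num_partitions."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return dict(partitions)

# Hypothetical sample data; small non-negative ints hash to themselves in CPython.
records = [(1, "a"), (2, "b"), (4, "c"), (5, "d")]
layout = hash_partition(records, 3)
for partition_id in sorted(layout):
    print(partition_id, layout[partition_id])
```

Records with the same key always land in the same partition, which is what later makes per-key operations possible without looking at every partition.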

Figure: Spark uses different workers to load different partitions

RDD Creation

Parallelizing a collection: the collection is split into partitions, which are then distributed across the cluster
Reading data from an external source
Transformation of an existing RDD
Streaming API
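
As a rough illustration of the first option, the sketch below splits a local list into contiguous, nearly equal chunks, one per partition. The helper `slice_collection` is hypothetical, and the exact slicing rule is an assumption made for illustration rather than a statement about Spark's internals.

```python
def slice_collection(items, num_slices):
    """Split a list into num_slices contiguous, nearly equal chunks."""
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

# Ten elements split into three "partitions".
chunks = slice_collection(list(range(10)), 3)
print(chunks)
```

In a real cluster each chunk would then be shipped to a worker; here they simply stay in one list.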

Transformations

A transformation creates a new RDD from an existing one, for example by splitting, filtering, or sorting.
Several transformations can be executed in sequence.
coalesce

Actions

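An action triggers the entire DAG (Directed Acyclic Graph) of transformations.
To trigger the computation, we run an action.
An action instructs Spark to compute the result of a series of transformations.

There are two types:

Driver: for example, collect and count
Distributed: for example, saveAsTextFile

Actions can:

view the results in the console
collect data into native objects in the respective language
write data to a data source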

reduce
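
A reduce-style action can be pictured as two steps: each partition is reduced locally, and the partial results are then combined on the driver. The sketch below models this in plain Python; `distributed_reduce` and the sample partitions are hypothetical, and the function passed in is assumed to be associative and commutative.

```python
from functools import reduce
from operator import add

def distributed_reduce(partitions, func):
    """Reduce each partition locally, then combine the partial results."""
    # Each "executor" reduces its own non-empty partition.
    partials = [reduce(func, part) for part in partitions if part]
    # The "driver" merges the partial results into one value.
    return reduce(func, partials)

partitions = [[1, 2, 3], [4, 5], [], [6]]
total = distributed_reduce(partitions, add)
print(total)
```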

flightData2015 = spark\
    .read\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .csv("/data/flight-data/csv/2015-summary.csv")

flightData2015.sort("count").explain()

== Physical Plan ==
*Sort [count#195 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#195 ASC NULLS FIRST, 200)
    +- *FileScan csv [DEST_COUNTRY_NAME#193,ORIGIN_COUNTRY_NAME#194,count#195] ...

Figure: Reading, sorting, and collecting a DataFrame

Shuffling

Moving data around in order to repartition it is called shuffling.
Figure: How a Spark job is split into stages

For example, with the following setting:

spark.conf.set("spark.sql.shuffle.partitions", "5")
flightData2015.sort("count").take(2)

Figure: The process of logical and physical DataFrame manipulation

The more shuffling a job requires, the more stages it has, and the more its performance suffers.

Narrow Dependencies

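Simple one-to-one transformations such as filter, map, and flatMap: the child RDD depends one-to-one on the parent RDD.
The data is transformed on the same node (the one that holds the parent) and is not transferred across executors.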

Narrow dependencies are in the same stage of the job execution.

Wide Dependencies

Transformations that repartition or redistribute data, such as aggregateByKey and reduceByKey.

Wide dependencies introduce new stages in the job execution.
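
The relationship between dependencies and stages can be sketched as follows: narrow operations stay in the current stage, while a wide operation starts a new one. This is a deliberate simplification (in Spark the map side of a shuffle belongs to the earlier stage), and both `WIDE_OPERATIONS` and `split_into_stages` are names invented for this example.

```python
# Assumed, incomplete set of operations that require a shuffle.
WIDE_OPERATIONS = {"reduceByKey", "aggregateByKey", "sortByKey", "repartition"}

def split_into_stages(operations):
    """Group a linear list of operations into stages at shuffle boundaries."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPERATIONS and stages[-1]:
            stages.append([])  # a wide dependency opens a new stage
        stages[-1].append(op)
    return stages

pipeline = ["map", "filter", "reduceByKey", "map", "sortByKey"]
stages = split_into_stages(pipeline)
for number, stage in enumerate(stages):
    print(number, stage)
```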

Broadcast variables

Broadcast variables are shared variables across all executors.

The driver can only broadcast data that it holds itself; it cannot broadcast RDDs, which it accesses only by reference.
Figure: How broadcast works

Accumulators

Accumulators are variables shared across executors.
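
One way to picture an accumulator: each task adds to its own local copy, and the driver merges the copies when the tasks finish. The class `SumAccumulator` and the function `run_task` below are hypothetical stand-ins written for this note, not Spark's API.

```python
class SumAccumulator:
    """A toy add-only counter that can be merged with other copies."""

    def __init__(self, value=0):
        self.value = value

    def add(self, amount):
        self.value += amount

    def merge(self, other):
        self.value += other.value

def run_task(partition):
    """Count the negative numbers in one partition using a local copy."""
    local = SumAccumulator()
    for x in partition:
        if x < 0:
            local.add(1)
    return local

driver_acc = SumAccumulator()
for partition in [[1, -2, 3], [-4, -5], [6]]:
    driver_acc.merge(run_task(partition))  # the driver merges task results
print(driver_acc.value)
```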

Lazy Evaluation

Spark waits until the very last moment to execute the graph of computation instructions.
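
The following toy class illustrates the idea: transformations only record a step, and nothing runs until an action (`collect` here) is called. `LazyDataset` is a made-up name for this sketch and is far simpler than what Spark actually does.

```python
class LazyDataset:
    """Records transformations and runs them only when collect() is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps

    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        data = iter(self._source)
        for kind, func in self._steps:
            data = map(func, data) if kind == "map" else filter(func, data)
        return list(data)

calls = []

def double(x):
    calls.append(x)  # record every time the function really runs
    return x * 2

plan = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before = list(calls)   # still empty: only a plan has been built
result = plan.collect()      # the action triggers the computation
print(calls_before, result)
```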

You can register any DataFrame as a table or view (a temporary table)
and query it with pure SQL.
There is no performance difference between writing SQL queries and writing DataFrame code:
both compile to the same underlying plan.

For example:

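# Register the DataFrame as a temporary view so that SQL can refer to it by name.
flightData2015.createOrReplaceTempView("flight_data_2015")

sqlWay = spark.sql("""
    SELECT DEST_COUNTRY_NAME, count(1)
    FROM flight_data_2015
    GROUP BY DEST_COUNTRY_NAME
    """)
sqlWay.explain()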

and

dataFrameWay = flightData2015\
    .groupBy("DEST_COUNTRY_NAME")\
    .count()
    
dataFrameWay.explain()

both produce the same execution plan:

*HashAggregate(keys=[DEST_COUNTRY_NAME#182], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#182, 5)
    +- *HashAggregate(keys=[DEST_COUNTRY_NAME#182], functions=[partial_count(1)])
        +- *FileScan csv [DEST_COUNTRY_NAME#182] ...
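
Read from the bottom up, the plan counts in two phases: a partial count inside each partition, an exchange that groups equal keys together, and a final count. The sketch below imitates those three steps in plain Python. The function names mirror the plan nodes but are invented for this example, and the sample partitions are made up.

```python
from collections import Counter

def partial_count(partition):
    """Phase 1: count keys inside a single partition."""
    return Counter(partition)

def exchange(partials, num_partitions):
    """Route every (key, count) pair to the bucket chosen by the key's hash."""
    buckets = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, count in partial.items():
            buckets[hash(key) % num_partitions].append((key, count))
    return buckets

def final_count(bucket):
    """Phase 2: add up the partial counts that arrived in one bucket."""
    totals = Counter()
    for key, count in bucket:
        totals[key] += count
    return totals

partitions = [["US", "FR", "US"], ["FR", "DE"], ["US"]]
partials = [partial_count(p) for p in partitions]
buckets = exchange(partials, 5)

result = {}
for bucket in buckets:
    result.update(final_count(bucket))  # each key lives in exactly one bucket
print(result)
```

Counting partially before the exchange means only one (key, count) pair per key and partition has to move, rather than every row.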

Consider the query:

maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

maxSql.show()

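With the DataFrame API, the same query is:

from pyspark.sql.functions import desc

# Parentheses let the method chain span several lines in Python.
(flightData2015
    .groupBy("DEST_COUNTRY_NAME")
    .sum("count")
    .withColumnRenamed("sum(count)", "destination_total")
    .sort(desc("destination_total"))
    .limit(5)
    .show())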

The logic of the program is:
Figure: The entire DataFrame transformation flow

The execution plan is:

TakeOrderedAndProject(limit=5, orderBy=[destination_total#16194L DESC], outpu...
+- *HashAggregate(keys=[DEST_COUNTRY_NAME#7323], functions=[sum(count#7325L)])
    +- Exchange hashpartitioning(DEST_COUNTRY_NAME#7323, 5)
        +- *HashAggregate(keys=[DEST_COUNTRY_NAME#7323], functions=[partial_sum...
            +- InMemoryTableScan [DEST_COUNTRY_NAME#7323, count#7325L]
                +- InMemoryRelation [DEST_COUNTRY_NAME#7323, ORIGIN_COUNTRY_NA...
                    +- *Scan csv [DEST_COUNTRY_NAME#7578,ORIGIN_COUNTRY_NAME...
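
The top node of this plan combines the sort and the limit. One plausible way to compute such a "top k" result without sorting everything is to take the top k of each partition and then merge those short lists. The sketch below shows that idea in plain Python; `take_ordered` and the sample rows are hypothetical, and this is not a description of Spark's actual code.

```python
import heapq

def take_ordered(partitions, k, key):
    """Return the k largest rows overall from a list of partitions."""
    # Each partition keeps only its own k largest rows.
    local_tops = [heapq.nlargest(k, part, key=key) for part in partitions]
    # The short lists are merged and cut down to k rows again.
    merged = [row for top in local_tops for row in top]
    return heapq.nlargest(k, merged, key=key)

partitions = [[("US", 10), ("FR", 7)], [("DE", 9), ("JP", 3)], [("CA", 8)]]
top2 = take_ordered(partitions, 2, key=lambda row: row[1])
print(top2)
```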

DataFrames Versus Datasets

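The types of Datasets are checked at compile time.
The types of DataFrames are checked at runtime.

Datasets are available only in JVM-based languages.
In most cases you will probably prefer DataFrames, which are equivalent to Datasets of type Row.
The "Row" type is an internal representation optimized for in-memory computation (the Catalyst format), and it incurs lower GC costs than JVM types.

The partitioning scheme defines where data is stored (its physical distribution).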

Columns

Simple types, such as integer or string
Complex types, such as array or map
May be null

Schemas define the names and types of columns.

Structured API Execution

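Write DataFrame/Dataset/SQL code
If the code is valid, Spark converts it to a Logical Plan
The Logical Plan is converted to a Physical Plan, with optimizations applied along the way
The Physical Plan is executed on the cluster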

Logical Planning

Figure: The logical planning process

The logical plan contains only transformations; it makes no reference to executors or drivers.
The result is passed to the Catalyst Optimizer.

Physical Planning

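The physical plan specifies how the logical plan will be executed.
Different strategies are compared using a cost model. The result is a series of RDDs and transformations.
Figure: The physical planning process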

Execution

Spark runs all of this code over RDDs.
Further optimizations are performed at runtime, generating native Java bytecode.

StructType: made up of fields (StructField), each of which has:

name
type
nullable flag
metadata (optional)
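
To show how these four attributes fit together, the sketch below models a schema as a list of field descriptions and checks a row against it. The `Field` class and the `validate` function are hypothetical and only imitate the idea; they are not PySpark classes.

```python
from dataclasses import dataclass, field

@dataclass
class Field:
    """One column description: name, type, nullability, optional metadata."""
    name: str
    data_type: type
    nullable: bool = True
    metadata: dict = field(default_factory=dict)

def validate(row, schema):
    """Return a list of problems found in row (a dict) under the schema."""
    problems = []
    for f in schema:
        value = row.get(f.name)
        if value is None:
            if not f.nullable:
                problems.append(f"{f.name} must not be null")
        elif not isinstance(value, f.data_type):
            problems.append(f"{f.name} should be {f.data_type.__name__}")
    return problems

schema = [Field("DEST_COUNTRY_NAME", str), Field("count", int, nullable=False)]
ok = validate({"DEST_COUNTRY_NAME": "US", "count": 15}, schema)
bad = validate({"DEST_COUNTRY_NAME": None, "count": "15"}, schema)
print(ok, bad)
```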

Columns and Expressions

Columns provide a subset of expression functionality.
An expression is a logical tree, for example (((col("someCol") + 5) * 200) - 6) < col("otherCol")
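
The example expression can be written out as an explicit tree and evaluated against a single row. The classes `Col`, `Lit`, and `BinOp` below are invented for this sketch (they are not Spark classes), and the row values are made up.

```python
import operator
from dataclasses import dataclass

@dataclass
class Col:
    name: str

    def evaluate(self, row):
        return row[self.name]

@dataclass
class Lit:
    value: object

    def evaluate(self, row):
        return self.value

@dataclass
class BinOp:
    op: object
    left: object
    right: object

    def evaluate(self, row):
        return self.op(self.left.evaluate(row), self.right.evaluate(row))

# (((someCol + 5) * 200) - 6) < otherCol, built from the leaves upward.
expr = BinOp(operator.lt,
             BinOp(operator.sub,
                   BinOp(operator.mul,
                         BinOp(operator.add, Col("someCol"), Lit(5)),
                         Lit(200)),
                   Lit(6)),
             Col("otherCol"))

result = expr.evaluate({"someCol": 1, "otherCol": 2000})
print(result)
```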

DataFrame Transformations

Figure: Different kinds of transformations

How Spark Performs Joins

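Two concepts need to be understood first:

node-to-node communication strategy
per node computation strategy

There are two communication strategies: the shuffle join (all-to-all communication) and the broadcast join.
These internal optimizations will change over time as the cost-based optimizer and the communication strategies improve.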

Big table–to–big table

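Spark uses a shuffle join.
Figure: Joining two big tables

Every worker node talks to every other node, and they share data across nodes according to which node holds a given key.
If the data is not partitioned well, this communication is expensive.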
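
A shuffle join can be pictured as: repartition both tables by the join key so that matching rows meet in the same partition, then join each pair of partitions locally. The sketch below does this in plain Python with tuples as rows. The names `shuffle` and `shuffle_join` and the sample tables are made up, and the local join uses a simple hash lookup purely for brevity.

```python
from collections import defaultdict

def shuffle(rows, key_index, num_partitions):
    """Send every row to the partition chosen by the hash of its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions

def shuffle_join(left, right, left_key, right_key, num_partitions=4):
    """Inner join: shuffle both sides, then join matching partitions."""
    left_parts = shuffle(left, left_key, num_partitions)
    right_parts = shuffle(right, right_key, num_partitions)
    joined = []
    for lpart, rpart in zip(left_parts, right_parts):
        index = defaultdict(list)
        for r in rpart:
            index[r[right_key]].append(r)
        for l in lpart:
            for r in index.get(l[left_key], []):
                joined.append(l + r)
    return joined

person = [("Ann", 0), ("Bob", 1), ("Cid", 1)]
graduate_program = [(0, "Masters"), (1, "Ph.D.")]
rows = shuffle_join(person, graduate_program, 1, 0)
print(sorted(rows))
```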

Big table–to–small table

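When one table is small enough to fit in the memory of a single worker node, Spark can use a broadcast join: the small DataFrame is replicated to every worker node.
Figure: A broadcast join

The SQL interface supports join hints (written in comment syntax). They are not enforced, however; the optimizer may choose to ignore them.
The available options include MAPJOIN, BROADCAST, and BROADCASTJOIN.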

SELECT /*+ MAPJOIN(graduateProgram) */ * FROM person JOIN graduateProgram
    ON person.graduate_program = graduateProgram.id

Broadcasting data that is too large can crash the driver node.
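
In contrast to a shuffle join, a broadcast join leaves the big table where it is: every partition receives a full copy of the small table and joins against it locally. The sketch below illustrates this in plain Python; `broadcast_join` and the sample data are hypothetical.

```python
def broadcast_join(big_partitions, small_table, big_key, small_key):
    """Inner join of a partitioned big table with a fully replicated small one."""
    # Build the lookup table once; conceptually every worker gets a copy.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[small_key], []).append(row)

    result = []
    for partition in big_partitions:  # each partition joins locally, no shuffle
        for row in partition:
            for match in lookup.get(row[big_key], []):
                result.append(row + match)
    return result

person_partitions = [[("Ann", 0), ("Bob", 1)], [("Cid", 1), ("Dee", 2)]]
graduate_program = [(0, "Masters"), (1, "Ph.D.")]
joined = broadcast_join(person_partitions, graduate_program, 1, 0)
print(joined)
```

The row for "Dee" has no matching program and is dropped, as expected for an inner join.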

Little table–to–little table

It is best to let Spark decide how to join them.

Broadcast Variables

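Broadcast variables are shared across the cluster, are immutable, and are not encapsulated in a function closure.
Normally you access an object by referencing it inside a closure. For large objects, however, the worker nodes may have to deserialize the object many times.
If the same object is used in multiple actions and jobs, it is re-sent to the workers with every job.

Broadcast variables, by contrast, are shared on each machine instead of being sent every time.

Broadcast variables are sent only when we trigger an action. The value is accessed through the value method.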
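
A back-of-the-envelope model shows why this matters. It assumes, as a simplification, that a closure-captured object is serialized once per task, while a broadcast object is shipped once per executor. The function `bytes_shipped`, the lookup table, and all the numbers are made up for illustration.

```python
import pickle

# Hypothetical lookup table standing in for a large shared object.
lookup = {i: str(i) for i in range(1000)}
payload = len(pickle.dumps(lookup))  # serialized size in bytes

def bytes_shipped(num_jobs, tasks_per_job, num_executors, broadcast):
    """Estimate the bytes sent over the network under the simplified model."""
    if broadcast:
        return payload * num_executors          # once per executor
    return payload * num_jobs * tasks_per_job   # once per task, in every job

via_closure = bytes_shipped(3, 8, 2, broadcast=False)
via_broadcast = bytes_shipped(3, 8, 2, broadcast=True)
print(via_closure, via_broadcast)
```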
