Some Spark Issues I'm Following

Latest recommended article published on 2023-02-21 18:02:44

Rilakkuma

Category: spark. Tags: Spark

Article link: https://blog.csdn.net/CRISPY_RICE/article/details/79354091

Issues

spark.storage.replication.proactive

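For RDDs, this enables proactive block replication: when replicas of a cached RDD block are lost because an executor fails, and at least one replica is still available, Spark re-replicates the block until it is back at its replication factor.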
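The behaviour described above can be pictured with a small toy model. This is only a sketch in plain Python, not Spark's implementation: the function name `replenish`, the block ids, and the random choice of target executors are all assumptions made for illustration. In Spark itself the setting only matters for blocks cached with a replicated storage level such as `MEMORY_ONLY_2`.

```python
import random


def replenish(block_locations, failed, live_executors, replication, rng=None):
    """Toy model of proactive replication after one executor is lost.

    block_locations: dict of block id -> set of executors holding a replica.
    Returns a new dict with replicas restored where a survivor exists.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch deterministic
    result = {}
    for block, holders in block_locations.items():
        survivors = set(holders) - {failed}
        if not survivors:
            # No replica left to copy from; the block would have to be
            # recomputed from lineage instead.
            result[block] = set()
            continue
        # Executors that are alive and do not already hold this block.
        candidates = sorted(set(live_executors) - survivors - {failed})
        missing = max(0, replication - len(survivors))
        new_holders = rng.sample(candidates, min(missing, len(candidates)))
        result[block] = survivors | set(new_holders)
    return result


if __name__ == "__main__":
    locations = {"rdd_0_0": {"e1", "e2"}, "rdd_0_1": {"e1"}}
    print(replenish(locations, "e1", ["e2", "e3", "e4"], replication=2))
```

The second block in the usage example has no surviving replica, so it cannot be restored by replication alone.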

spark.storage.replication.topologyMapper

spark.storage.exceptionOnPinLeak

References:
1. https://github.com/apache/spark/pull/17519/files

PipedRDD

References:
* SPARK-13793
* SPARK-14110
* SPARK-14542
* SPARK-15826

SparkStreaming

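SparkStreaming Dynamic Allocation

How do you correctly enable dynamic allocation for SparkStreaming?

This question was discussed on the mailing list. To enable the dynamic allocation policy for SparkStreaming, use the configuration below. You must set spark.dynamicAllocation.enabled=false and spark.executor.instances=0, and then turn on spark.streaming.dynamicAllocation.enabled=true: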

spark-submit run-example \
  --conf spark.streaming.dynamicAllocation.enabled=true \
  --conf spark.executor.instances=0 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.master=yarn \
  --conf spark.submit.deployMode=client \
  org.apache.spark.examples.streaming.HdfsWordCount /foo
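The three constraints above are easy to get wrong when a job is launched from a script, so a small guard can help. The following is a minimal sketch under stated assumptions: the helper names `check_conf` and `streaming_dynamic_allocation_args` are hypothetical, and only the three settings named in this post are checked.

```python
REQUIRED = {
    "spark.streaming.dynamicAllocation.enabled": "true",
    "spark.executor.instances": "0",
    "spark.dynamicAllocation.enabled": "false",
}


def check_conf(conf):
    """Raise ValueError if a setting contradicts the required combination."""
    for key, expected in REQUIRED.items():
        actual = conf.get(key)
        if actual != expected:
            raise ValueError(f"{key} must be {expected}, got {actual!r}")


def streaming_dynamic_allocation_args(extra_conf=None):
    """Build the list of --conf arguments to pass to spark-submit."""
    conf = dict(REQUIRED)
    conf.update(extra_conf or {})
    check_conf(conf)
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(streaming_dynamic_allocation_args({"spark.master": "yarn"})))
```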

How SparkStreaming dynamic allocation differs from the original policy

There are currently two dynamic allocation policies:
* org.apache.spark.ExecutorAllocationManager, enabled by spark.dynamicAllocation.enabled=true;
* org.apache.spark.streaming.scheduler.ExecutorAllocationManager, enabled by spark.streaming.dynamicAllocation.enabled=true;

Policy 1 targets ordinary ETL jobs and SparkSQL jobs. Policy 2 refines the original policy for SparkStreaming jobs.

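Policy 1

Policy 1 is built on SparkListener. It uses the onStageSubmitted()/onStageCompleted()/onTaskStart()/onTaskEnd() callbacks to track whether each executor is idle.
If no task is running on an executor, the executor is added to the removeExecutors list and watched for spark.dynamicAllocation.executorIdleTimeout (default 60s). Executors that hold cached RDD blocks use spark.dynamicAllocation.cachedExecutorIdleTimeout instead. If the executor stays idle for the whole timeout, removeExecutor() is called to remove it.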
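The idle-tracking idea in Policy 1 can be sketched in a few lines of plain Python. This is a simplified model, not Spark's code: the class `IdleTracker` and its method names are assumptions for illustration, time is passed in explicitly, and the separate timeout for executors with cached blocks is left out.

```python
class IdleTracker:
    """Tracks which executors have been idle longer than a timeout."""

    def __init__(self, idle_timeout=60.0):
        self.idle_timeout = idle_timeout
        self.running = {}     # executor id -> number of running tasks
        self.idle_since = {}  # executor id -> time it became idle

    def on_executor_added(self, executor, now):
        # A new executor has no tasks yet, so it starts out idle.
        self.running[executor] = 0
        self.idle_since[executor] = now

    def on_task_start(self, executor, now):
        self.running[executor] = self.running.get(executor, 0) + 1
        # A busy executor is no longer a removal candidate.
        self.idle_since.pop(executor, None)

    def on_task_end(self, executor, now):
        remaining = max(0, self.running.get(executor, 0) - 1)
        self.running[executor] = remaining
        if remaining == 0:
            self.idle_since[executor] = now

    def executors_to_remove(self, now):
        """Executors whose idle time has reached the timeout."""
        return sorted(
            e for e, since in self.idle_since.items()
            if now - since >= self.idle_timeout
        )


if __name__ == "__main__":
    tracker = IdleTracker(idle_timeout=60)
    tracker.on_executor_added("e1", now=0)
    tracker.on_executor_added("e2", now=0)
    tracker.on_task_start("e1", now=10)
    print(tracker.executors_to_remove(now=60))
```

In this model an executor that keeps receiving tasks before the timeout elapses is never returned by `executors_to_remove`, which is the situation a streaming job creates when a batch runs every few seconds.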

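Policy 2: SPARK-12133

SPARK-12133 proposes a dynamic allocation policy for streaming and points out the following drawbacks of Policy 1:
* In a SparkStreaming job, executors are never idle for long, because a batch runs at every interval;
* Policy 1 does not take the state of the batch queue into account;
* Executors that run receivers should be handled differently;

The PR therefore proposes a dynamic allocation algorithm for SparkStreaming, described in three parts:
1. Basic functionality;
2. Independent scheduling of receiver executors;
3. Changing the number of receivers;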

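Scheduling is based on the ratio $R = \text{BatchProcessTime} / \text{BatchDuration}$, where $R_{avg}$ is the average of $R$ over recent batches. The policy is:
* If $R_{avg} > 0.9$, add executors, $\max(1, \operatorname{round}(R_{avg}))$ executors at a time;
* If $R_{avg} < 0.5$, remove executors, one executor at a time.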
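The scaling rule above can be written out as a short function. This is a sketch under assumptions: the function name `executor_delta` and its parameters are hypothetical, the thresholds 0.9 and 0.5 are the ones quoted in this post, and rounding is done half-up rather than with Python's built-in `round`, which rounds halves to the nearest even number.

```python
import math


def executor_delta(batch_proc_times, batch_duration,
                   scale_up=0.9, scale_down=0.5):
    """Return how many executors to add (positive) or remove (negative).

    batch_proc_times: processing times of recent batches, in the same
    unit as batch_duration.
    """
    if not batch_proc_times:
        return 0  # nothing observed yet, so leave the cluster as it is
    avg_time = sum(batch_proc_times) / len(batch_proc_times)
    r_avg = avg_time / batch_duration
    if r_avg > scale_up:
        # Batches take nearly as long as (or longer than) the interval.
        return max(1, math.floor(r_avg + 0.5))
    if r_avg < scale_down:
        # Plenty of headroom, so release a single executor.
        return -1
    return 0


if __name__ == "__main__":
    print(executor_delta([25.0, 27.0], batch_duration=10.0))
```

Scaling up by roughly $R_{avg}$ executors lets a badly overloaded job catch up quickly, while scaling down one executor at a time keeps the reduction conservative.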

For the implementation, see: https://github.com/apache/spark/pull/12154/files

SparkLauncher

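DataSource V2 Design

First, the DataSource API is designed mainly for developers. Which developers? Anyone who wants Spark DataFrame/DataSet/SQL to connect to and process mysql/csv/cassandra/mongodb/redis and similar systems needs to implement some of the DataSource APIs:
* RelationProvider: creates a Relation of the given type, which can be resolved during the Analyzer phase;
* SchemaRelationProvider: creates a Relation with a user-supplied schema;
* TableScan: scans all rows with all columns;
* PrunedScan: scans with column pruning;
* PrunedFilteredScan: scans with column pruning and filter pushdown;
* CatalystScan: scans using Catalyst expressions;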
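The difference between the three main scan interfaces comes down to how much information the data source receives. The following plain-Python toy model illustrates this. It is not the Spark API: the function names are hypothetical, rows are dicts, and Python predicates stand in for the filter objects that Spark passes to a PrunedFilteredScan.

```python
def table_scan(rows):
    """TableScan: the source returns every row with every column."""
    return [dict(row) for row in rows]


def pruned_scan(rows, required_columns):
    """PrunedScan: the source is told which columns are needed."""
    return [{col: row[col] for col in required_columns} for row in rows]


def pruned_filtered_scan(rows, required_columns, filters):
    """PrunedFilteredScan: the source also receives filters to apply."""
    kept = [row for row in rows if all(f(row) for f in filters)]
    return pruned_scan(kept, required_columns)


if __name__ == "__main__":
    data = [
        {"id": 1, "name": "a", "age": 30},
        {"id": 2, "name": "b", "age": 17},
    ]
    adults = pruned_filtered_scan(data, ["name"], [lambda r: r["age"] >= 18])
    print(adults)
```

The more a source can prune and filter on its own side, the less data has to be transferred into Spark before the rest of the query runs.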

References:
* Introduction to DataSource V1: http://www.spark.tc/exploring-the-apache-spark-datasource-api/
* Design and implementation of the Cassandra connector: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md
* DataSource V2: https://github.com/cloud-fan/spark/pull/10
* DataSource V2 design: https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit#heading=h.mi1fbff5f8f9

Checkpoint

Difference between Checkpoint and persist()/cache()

References:
* https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk

Some Spark Issues

[SPARK-18085] HistoryServer persistence

https://issues.apache.org/jira/browse/SPARK-18085

Consolidate Shuffle File

https://issues.apache.org/jira/browse/SPARK-751

Sort-Based Shuffle

https://issues.apache.org/jira/browse/SPARK-2045

[SPARK-4550]

URL: https://issues.apache.org/jira/browse/SPARK-4550

[SPARK-16026]
Spark CBO: https://issues.apache.org/jira/browse/SPARK-16026

Reuse Exchange

https://issues.apache.org/jira/browse/SPARK-13523

Structured Streaming

https://spark.apache.org/docs/2.1.1/structured-streaming-kafka-integration.html

Introduction to SparkStreaming and the Kafka Receiver-based / Direct Approach implementations
