Getting Started with Spark at Lightning Speed (to be continued)

Generalzy

Last modified 2024-05-15 22:51:30

Category: AI. Tags: spark, big data, distributed

First published 2024-02-27 22:59:26

Article link: https://blog.csdn.net/General_zy/article/details/136221416

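Spark Overview

What Is Spark

The official website defines it as follows: Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Spark is a fast, general-purpose, scalable engine for big data analytics that computes in memory.

Spark originated in the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", published by Matei Zaharia and colleagues at the University of California, Berkeley. The paper introduced the concept of the Resilient Distributed Dataset (RDD):

An RDD is a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. It is the core data structure of Spark, and the whole platform is built around it.

Spark can process structured, semi-structured, and unstructured data, and applications can be written in Python, Java, Scala, R, and SQL. (Scala and Python are the mainstream choices.)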

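Spark VS Hadoop (MapReduce)

How do the Spark and Hadoop technology stacks differ?

Hadoop:

Hadoop is an open-source framework written in Java that stores massive amounts of data on a cluster of distributed servers and runs distributed analysis applications on it
HDFS, the Hadoop Distributed File System, sits at the bottom of the Hadoop ecosystem. It stores all the data and underpins every Hadoop service. Its design is based on Google's paper "The Google File System", and it is an open-source implementation of GFS.
MapReduce is a programming model. Hadoop implemented it following Google's MapReduce paper, and it serves as Hadoop's distributed computing model and is at the core of Hadoop. With this framework, writing distributed parallel programs becomes much simpler. By combining the distributed storage of HDFS with the distributed computation of MapReduce, Hadoop can scale out easily when processing massive data.
HBase is an open-source implementation of Google's Bigtable, although it differs from Bigtable in many ways. HBase is a distributed database built on HDFS that is good at real-time random reads and writes over very large datasets. It is also a very important Hadoop component.

Spark:

Spark is a fast, general-purpose, scalable big data analytics engine developed in Scala
Spark Core provides Spark's most basic and essential functionality
Spark SQL is the Spark component for working with structured data. With Spark SQL, users can query data using SQL or the Apache Hive dialect of SQL (HQL).
Spark Streaming is the component for stream processing of real-time data on the Spark platform, and it provides a rich API for handling data streams.

In short, Spark appeared later (it is the younger project), and its main purpose is data computation.

A more direct comparison:

[Figure: comparison table of Hadoop and Spark]
Although Spark has significant advantages over Hadoop, it cannot replace Hadoop entirely:

At the computation layer, Spark far outperforms MR (MapReduce), yet many computing tools are still built on MR, such as the very mature Hive
Spark only does computation, whereas the Hadoop ecosystem covers computation (MR), storage (HDFS), and resource management and scheduling (YARN). HDFS and YARN remain at the core of many big data architectures.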

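Spark or Hadoop

Hadoop's MR framework and Spark are both data processing frameworks, so how should we choose between them?

Hadoop MapReduce was not designed for iterative data flows, so it is inefficient in scenarios where data is reused across many parallel operations (for example machine learning, graph mining algorithms, and interactive data mining). Spark was created to address this: building on the traditional MapReduce model, it optimizes the computation process to speed up data analysis and mining as well as reads and writes, and it shrinks the unit of computation to the RDD model, which is better suited to parallel computation and reuse.
Machine learning algorithms such as ALS and gradient descent for convex optimization repeatedly query and operate on a dataset or on data derived from it. The MR model is a poor fit: even when several MR jobs are chained, performance and run time suffer because data is shared through disk. Interactive data mining is another workload that MR handles poorly. Scala, the language Spark is written in, is well suited to functional processing.
Spark is a project for fast distributed data analysis. Its core technology is the Resilient Distributed Dataset (RDD), which offers a richer model than MapReduce and can iterate over a dataset in memory many times quickly, supporting complex data mining and graph computation algorithms.
The fundamental difference between Spark and Hadoop lies in how data is passed between jobs: Spark passes data between jobs through memory, whereas Hadoop goes through disk.
Spark Tasks start quickly: Spark runs each Task as a thread, whereas Hadoop creates a new process.
Spark writes data to disk mainly during shuffles, whereas data exchange between MR jobs in Hadoop always goes through disk.
Spark's caching mechanism is more efficient than that of HDFS.

The comparison above shows that Spark has the advantage over MapReduce in the vast majority of data computation scenarios. However, Spark relies on memory, so in a real production environment a Job may fail when memory is insufficient. In that situation MapReduce is actually the better choice, which is why Spark cannot replace MR entirely.

In Hadoop MR, every map/reduce task runs as a separate Java process. The benefit is that processes are independent: each task has its own process resources, tasks do not interfere with one another, and monitoring is easy. The drawback is that tasks cannot easily share data, so execution is less efficient. For example, when multiple map tasks read the same source files, the data has to be loaded into every map task, which causes repeated loading and wastes memory. A thread-based design shares data and improves efficiency, which is why Spark uses the thread as its smallest unit of execution; the downside is that threads compete for resources.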

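Four Key Features of Spark

Speed

Because Apache Spark supports in-memory computing and its DAG (directed acyclic graph) execution engine supports acyclic data flows, the project has claimed that it runs up to 100 times faster than Hadoop MapReduce in memory and 10 times faster on disk.

Compared with MapReduce, Spark differs in two ways when processing data:

Spark can keep intermediate results in memory;
Spark provides a rich set of operators (APIs), so a complex job can be completed within a single Spark program.

Ease of Use

As of 2021-10-13 the latest release was Spark 3.2.0, which supports Java, Scala, Python, R, and SQL. To remain compatible with enterprise deployments on Spark 2.x, the Spark 2 line continued to receive updates.

Generality

On top of its core, Spark provides several libraries, including Spark SQL, Spark Streaming, MLlib, and GraphX.

Runs Everywhere

Spark supports several deployment options: it runs on Hadoop and Mesos, in its own Standalone mode, and on Kubernetes (supported since Spark 2.3).

As for data sources, Spark can read from HDFS, HBase, Cassandra, Kafka, and many others.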

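Spark Framework Modules (Architecture)

Spark Core: the heart of Spark. All core functionality is provided by the Spark Core module, which is the foundation Spark runs on. Spark Core uses the RDD as its data abstraction and offers APIs in Python, Java, Scala, and R for offline batch processing of massive data.
Spark SQL: built on Spark Core, this module handles structured data. Spark SQL lets you process data with SQL and is itself aimed at offline computation. Spark also provides the Structured Streaming module, which builds on Spark SQL to perform stream processing.
Spark Streaming: built on Spark Core, it provides stream processing of data.
MLlib: built on Spark Core, it performs machine learning computation and ships with a large set of machine learning algorithms and APIs, making it convenient to run machine learning in a distributed fashion.
GraphX: built on Spark Core, it performs graph computation and provides a large set of graph APIs, making it convenient to run graph computation in a distributed fashion.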

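Spark Run Modes

Spark offers several run modes, including:

Local mode (single machine): a single independent process that uses multiple internal threads to simulate the whole Spark runtime environment
Standalone mode (cluster): each Spark role exists as an independent process, and together they form the Spark cluster
Hadoop YARN mode (cluster): each Spark role runs inside a YARN container, and together they form the Spark cluster
Kubernetes mode (container cluster): each Spark role runs inside a Kubernetes container, and together they form the Spark cluster

Spark Architecture Roles

[Figure: Spark architecture roles]
Note: normally the Executor is the role that does the work, but in a special case (Local mode) the Driver both manages and does the work.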

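Setting Up a Spark Environment

Local

In essence, Local mode starts a single JVM process (containing multiple threads) that executes the Tasks. The number of threads used to simulate a Spark cluster can be limited with local[N] or local[*], where N means N threads may be used, each thread taking one CPU core.

If N is not specified, the default is one thread (with one core). A common choice is to use as many threads as the CPU has cores, to make full use of the machine. local[*] means "Run Spark locally with as many worker threads as logical cores on your machine", that is, the thread count is set to the number of logical CPU cores.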
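The rule for reading a local master string can be written down as a small function. This is only an illustrative sketch in plain Python, not Spark's own parser: the helper name local_threads is hypothetical, and it handles only the three forms discussed above (local, local[N], and local[*]).

```python
import os
import re


def local_threads(master: str) -> int:
    """Return the number of worker threads implied by a local master string."""
    if master == "local":
        return 1  # no N given: a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local master string: {master!r}")
    n = match.group(1)
    if n == "*":
        # local[*]: as many threads as logical cores on this machine
        return os.cpu_count() or 1
    return int(n)


print(local_threads("local"))     # 1
print(local_threads("local[4]"))  # 4
print(local_threads("local[*]"))  # depends on the machine
```

In a real program the same string is what gets passed to setMaster on a SparkConf.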

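Role layout in Local mode:

Resource management:
- Master: the Local process itself
- Worker: the Local process itself
Task execution:
- Driver: the Local process itself
- Executor: none. There is no separate Executor role; the threads inside the Local process (that is, the Driver) provide the computing capacity

The Driver can be viewed as a special kind of Executor, but most of the time we treat Executors purely as workers so that they are easy to tell apart from the Driver (one manages, the other does the work).

Download the package from the official website, extract it, and add its bin directory to the PATH environment variable. Then run spark-shell:

By default spark-shell is an interactive environment based on Scala. To develop in Python, run the pyspark command instead to open an interactive Python session.

Interactive shells are rarely used in real development. For Python development, for example, you install the library with pip install pyspark and then submit jobs with spark-submit. spark-shell itself is just a job submitted through spark-submit with extra arguments:

[Figure: spark-shell submitted through spark-submit]
So what we need to learn is: 1. how to write a job, and 2. how to submit a job.

Comparison of pyspark, spark-shell, and spark-submit:

[Figure: comparison table of pyspark, spark-shell, and spark-submit]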

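Standalone

Standalone mode is the cluster mode that ships with Spark. Unlike Local mode, which simulates a cluster with multiple threads inside a single process, Standalone mode sets up a real Spark cluster across multiple machines and can be used for actual big data processing.

Standalone is a complete Spark runtime environment, in which:

The Master role exists as a Master process, and the Worker role exists as a Worker process
The Driver and Executors run on Worker nodes, with the Worker providing the resources they run on

A Standalone cluster has three main kinds of processes:

Master process on the master node: the Master role. It manages the resources of the whole cluster and hosts the Drivers of the running jobs
Workers on the worker nodes: the Worker role. Each Worker manages the resources of its machine, namely memory and CPU cores, and allocates them to run Executors (Tasks)
HistoryServer (optional): after a Spark Application finishes, its event logs are saved to HDFS; starting the HistoryServer lets you inspect information about completed applications.

In cluster mode the program runs on the cluster, so do not read local files directly; read from an external data source instead. We do not know which node the program will run on, and other nodes may not have the file.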

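Hierarchy of a Running Spark Program

From submission to final execution, a user program goes through the following stages:

When the user program creates a SparkContext, the new SparkContext instance connects to the Cluster Manager. The Cluster Manager allocates computing resources for the submission according to the CPU and memory settings the user supplied, and starts the Executors.
The Driver divides the user program into execution stages (Stages). Each Stage consists of a set of identical Tasks, each of which operates on a different partition of the data to be processed. Once the Stages have been determined and the Tasks created, the Driver sends the Tasks to the Executors;
After receiving a Task, an Executor downloads the Task's runtime dependencies, prepares its execution environment, runs the Task, and reports the Task's status to the Driver;
The Driver handles status updates according to the Task status it receives. There are two kinds of Task: a Shuffle Map Task reshuffles data and stores the shuffle output in the file system of the node where the Executor runs; a Result Task produces the result data;
The Driver keeps scheduling Tasks and sending them to Executors, and stops when all Tasks have completed successfully or when a Task still fails after exceeding the retry limit;

So a Spark Application contains multiple Jobs, each Job consists of multiple Stages, and each Job is executed according to its DAG.

Each Stage contains multiple Tasks. Each Task runs as a thread and by default needs one CPU core.

The three core concepts of a running Spark Application are Job, Stage, and Task:

Job: a parallel computation made up of multiple Tasks. An action in Spark (such as save or collect, explained further later) generally produces one Job.
Stage: the building block of a Job. A Job is split into multiple Stages that depend on one another and run in order, and each Stage is a set of Tasks, similar to the map and reduce stages.
Task: the unit of work assigned to an Executor and the smallest unit of execution in Spark. In general there are as many Tasks as there are Partitions (a physical-level concept: the data is divided into parts that are processed in parallel), and each Task processes the data of a single partition only.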
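The relationship between Partitions, Tasks, and threads can be imitated in plain Python. The sketch below does not use Spark at all: the helpers split_into_partitions and run_stage are hypothetical names, and a thread pool merely stands in for an Executor's cores. It only shows the idea that one Task is created per partition and that each Task touches its own partition only.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions roughly equal, contiguous chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_stage(partitions, task_fn, cores):
    """Run one task per partition on a pool of `cores` threads."""
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(task_fn, partitions))


data = list(range(1, 11))                    # the numbers 1..10
partitions = split_into_partitions(data, 3)  # 3 partitions, hence 3 tasks
partial_sums = run_stage(partitions, sum, cores=2)
total = sum(partial_sums)                    # combine the per-task results

print(partitions)    # [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]
print(partial_sums)  # [10, 18, 27]
print(total)         # 55
```

With two threads and three partitions, the third task has to wait for a free thread, which mirrors how Tasks queue when there are more partitions than available cores.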

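A Spark Standalone cluster uses a Master-Slaves architecture and, like most clusters of this kind, has the problem of the Master being a single point of failure (SPOF).

Spark on YARN

YARN is a resource scheduling framework that manages and schedules resources for the computing frameworks running inside it. As a typical computing framework, Spark can run directly on YARN and be scheduled by it.

What Spark on YARN amounts to:

The Master role is played by YARN's ResourceManager.
The Worker role is played by YARN's NodeManager.
The Driver role runs inside a YARN container or in the client process that submits the job
The Executors, which do the actual work, run inside containers provided by YARN

What does Spark on YARN require?

A YARN cluster;
A Spark client tool such as spark-submit, which can submit Spark programs to YARN
The program to submit, for example the sample spark/examples/src/main/python/pi.py, or Spark jobs we develop ourselves later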

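Deploy Mode

Spark on YARN has two run modes, Cluster mode and Client mode. They differ in where the Driver runs.

Cluster mode: the Driver runs inside a YARN container, the same container as the ApplicationMaster
Client mode: the Driver runs in the client process, for example in the process of the spark-submit program

[Figure: Cluster mode, with the Driver running inside a container]
[Figure: Client mode, with the Driver running in the client process (spark-submit as the example)]
Differences between the two modes:

[Figure: comparison table of Cluster mode and Client mode]
Suppose we run the Pi program. In client mode the command is: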

SPARK_HOME=/export/server/spark
${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 512m \
  --executor-memory 512m \
  --num-executors 1 \
  --total-executor-cores 2 \
  ${SPARK_HOME}/examples/src/main/python/pi.py \
  10

In cluster mode the command is:

SPARK_HOME=/export/server/spark
${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 512m \
  --executor-memory 512m \
  --num-executors 1 \
  --total-executor-cores 2 \
  --conf "spark.pyspark.driver.python=/root/anaconda3/bin/python3" \
  --conf "spark.pyspark.python=/root/anaconda3/bin/python3" \
  ${SPARK_HOME}/examples/src/main/python/pi.py \
  10

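In short, the essential difference between Client mode and Cluster mode is where the Driver program runs.

Client mode: used for learning and testing; not recommended for production (it can be used, but performance and stability are somewhat lower)
1. The Driver runs on the client, so communication with the cluster is more expensive
2. The Driver's output is displayed on the client
Cluster mode: the mode used in production
1. The Driver runs inside the YARN cluster, so communication with the cluster is cheap
2. The Driver's output cannot be displayed on the client
3. The Driver runs on the node hosting the ApplicationMaster and is managed by YARN; if it fails, YARN restarts the ApplicationMaster (Driver)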
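When jobs are submitted from a script, the two spark-submit invocations for the Pi example can be assembled programmatically. The following is a minimal sketch in plain Python: the helper build_submit_command is a hypothetical name, the paths and flag values are simply the ones used in the commands of this section, and the sketch only builds the argument list without running anything.

```python
import shlex


def build_submit_command(deploy_mode, python_path=None):
    """Assemble a spark-submit argument list for the pi.py example on YARN."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    spark_home = "/export/server/spark"  # SPARK_HOME used in this section
    cmd = [
        f"{spark_home}/bin/spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--driver-memory", "512m",
        "--executor-memory", "512m",
        "--num-executors", "1",
        "--total-executor-cores", "2",
    ]
    if python_path is not None:
        # Interpreter settings, as in the cluster-mode command of this section
        cmd += ["--conf", f"spark.pyspark.driver.python={python_path}",
                "--conf", f"spark.pyspark.python={python_path}"]
    cmd += [f"{spark_home}/examples/src/main/python/pi.py", "10"]
    return cmd


client_cmd = build_submit_command("client")
cluster_cmd = build_submit_command("cluster", "/root/anaconda3/bin/python3")
print(shlex.join(client_cmd))
print(shlex.join(cluster_cmd))
```

The resulting list could be handed to subprocess.run on a machine where Spark and YARN are configured; apart from the interpreter settings, the only thing that changes between the two modes is the value after --deploy-mode.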

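Further Reading: Detailed Flow of the Two Modes

In YARN Client mode, the Driver runs on the local machine from which the job is submitted:
[Figure: YARN Client mode flow]
The steps are as follows:

The Driver runs on the local machine from which the job is submitted. After starting, it contacts the ResourceManager and requests that an ApplicationMaster be started;
The ResourceManager then allocates a Container and starts the ApplicationMaster on a suitable NodeManager. In this mode the ApplicationMaster acts as an ExecutorLauncher and is only responsible for requesting Executor resources from the ResourceManager;
On receiving the ApplicationMaster's resource request, the ResourceManager allocates Containers, and the ApplicationMaster then starts Executor processes on the NodeManagers designated by the allocation;
Once started, the Executor processes register back with the Driver. After all Executors have registered, the Driver proceeds with the main function;
When execution reaches an action operator, a Job is triggered and divided into Stages at wide dependencies. Each Stage produces a corresponding TaskSet, and the Tasks are then distributed to the Executors for execution.

In YARN Cluster mode, the Driver runs in a NodeManager Container, and the Driver and the AppMaster are one and the same:

[Figure: YARN Cluster mode flow]
The steps are as follows:

After the job is submitted, the client contacts the ResourceManager and requests that an ApplicationMaster be started;
The ResourceManager then allocates a Container and starts the ApplicationMaster on a suitable NodeManager. In this mode the ApplicationMaster is the Driver;
After starting, the Driver requests Executor resources from the ResourceManager. On receiving the ApplicationMaster's resource request, the ResourceManager allocates Containers, and Executor processes are then started on suitable NodeManagers;
Once started, the Executor processes register back with the Driver;
After all Executors have registered, the Driver proceeds with the main function. When execution reaches an action operator, a Job is triggered and divided into Stages at wide dependencies. Each Stage produces a corresponding TaskSet, and the Tasks are then distributed to the Executors for execution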
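The phrase "divided into Stages at wide dependencies" can be made concrete with a toy model. The sketch below is a deliberate simplification and not how Spark's scheduler is implemented: it assumes a purely linear chain of operations, walks it front to back, and uses a hypothetical helper split_into_stages together with a hand-labeled lineage. It only illustrates that every operation requiring a shuffle (a wide dependency) opens a new Stage.

```python
def split_into_stages(operations):
    """Group a linear chain of (name, dependency) steps into stages.

    A step with a "wide" dependency needs a shuffle, so it starts a new
    stage; steps with "narrow" dependencies stay in the current stage.
    """
    stages = [[]]
    for name, dependency in operations:
        if dependency == "wide" and stages[-1]:
            stages.append([])  # shuffle boundary: open a new stage
        stages[-1].append(name)
    return stages


# Hand-labeled word-count style lineage; reduceByKey is the only shuffle step.
lineage = [
    ("textFile", "narrow"),
    ("flatMap", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),
    ("mapValues", "narrow"),
]
stages = split_into_stages(lineage)
print(stages)
# [['textFile', 'flatMap', 'map'], ['reduceByKey', 'mapValues']]
```

Each inner list corresponds to one Stage, and in Spark each Stage would then yield one TaskSet with one Task per partition.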

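Further Reading: YARN

YARN is the resource management layer of a Hadoop cluster. It allows different data processing engines (such as graph processing, interactive SQL, stream processing, and batch processing) to run in a Hadoop cluster and process data in HDFS (moving computation rather than data). Besides resource management, YARN is also used for job scheduling.

For details, see: Apache Hadoop YARN

YARN uses a traditional master-slave architecture and consists of four main components, with the following responsibilities:

ResourceManager (RM): the global resource manager, responsible for managing and allocating resources across the whole system;
- Handles client requests
- Starts and monitors ApplicationMasters
- Monitors NodeManagers
- Allocates and schedules resources
ApplicationMaster (AM): responsible for managing an application;
- Requests resources for the application and assigns them to its internal tasks
- Task scheduling, monitoring, and fault tolerance
NodeManager (NM): responsible for managing and using the resources of a slave node;
- Manages the resources of a single node
- Handles commands from the ResourceManager
- Handles commands from the ApplicationMaster
Container: an abstraction of the environment in which a task runs.

External blog references: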

https://www.cnblogs.com/gentlescholar/p/15048301.html
http://kentt.top/2018/09/16/Yarn-Architecture/

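Spark on K8s

Running Spark on Kubernetes combines Apache Spark for distributed data processing with Kubernetes for container orchestration. Spark applications running on Kubernetes benefit from the scalability, elasticity, and flexibility of the platform: Kubernetes provides the containerized infrastructure for running and managing applications, while Spark performs large-scale data processing and analysis. The combination makes it easier to scale Spark applications, use resources efficiently, and integrate with other Kubernetes-native tools and services. This section walks through setting up and running Spark applications on a Kubernetes cluster, touching on cluster configuration, application deployment, monitoring, debugging, and scaling.

We assume that a working K8s cluster is already available, with the corresponding ServiceAccount and PVC configured.

Official documentation: https://spark.apache.org/docs/latest/running-on-kubernetes.html

spark-submit can be used directly to submit a Spark application to a Kubernetes cluster. The submission mechanism works as follows:

Spark creates a Spark driver running within a Kubernetes pod.
The driver creates executors, which also run within Kubernetes pods, connects to them, and executes the application code.
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists its logs and remains in "completed" state in the Kubernetes API until it is eventually garbage collected or manually cleaned up. (In the completed state, the driver pod does not use any computational or memory resources.)

Submitting Applications to Kubernetes

Cluster Mode

To launch Spark Pi in cluster mode: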

$ ./bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///path/to/examples.jar

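The Spark master, specified either by passing the --master command line argument to spark-submit or by setting spark.master in the application's configuration, must be a URL of the form k8s://<api_server_host>:<k8s-apiserver-port>. The port must always be specified, even if it is the HTTPS port 443.

Prefixing the master string with k8s:// causes the Spark application to launch on the Kubernetes cluster. If no HTTP protocol is specified in the URL, it defaults to https. For example, setting the master to k8s://example.com:443 is equivalent to setting it to k8s://https://example.com:443, but to connect without TLS on a different port, the master would be set to k8s://http://example.com:8080.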
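The defaulting rule for the protocol is easy to express in code. The function below is an illustrative sketch in plain Python rather than Spark's own logic; the name normalize_k8s_master is hypothetical, and it covers only the scheme handling described above (it does not check that a port is present).

```python
def normalize_k8s_master(master: str) -> str:
    """Expand a k8s:// master string so the API server scheme is explicit."""
    prefix = "k8s://"
    if not master.startswith(prefix):
        raise ValueError("a Kubernetes master must start with k8s://")
    address = master[len(prefix):]
    if not address.startswith(("http://", "https://")):
        address = "https://" + address  # no scheme given: default to https
    return prefix + address


print(normalize_k8s_master("k8s://example.com:443"))
# k8s://https://example.com:443
print(normalize_k8s_master("k8s://http://example.com:8080"))
# k8s://http://example.com:8080
```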

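Client Mode

Spark executors must be able to connect to the Spark driver over a hostname and port that is routable from the executors.

The specific network configuration required for Spark to work in client mode varies per setup. A headless service can be used to make the driver pod routable from the executors by a stable hostname.

When deploying the headless service, make sure that the service's label selector matches only the driver pod and no other pods. (Assign the driver pod a sufficiently unique label and use that label in the label selector of the headless service.) Specify the driver's hostname via spark.driver.host and the driver's port via spark.driver.port.

Secret Management

Kubernetes Secrets can be used to provide credentials for a Spark application to access secured services. To mount a secret named spark-secret onto the path /etc/secrets in both the driver and executor containers, add the following options to the spark-submit command: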

--conf spark.kubernetes.driver.secrets.spark-secret=/etc/secrets
--conf spark.kubernetes.executor.secrets.spark-secret=/etc/secrets

To use a secret through an environment variable, pass the following options to spark-submit:

--conf spark.kubernetes.driver.secretKeyRef.ENV_NAME=name:key
--conf spark.kubernetes.executor.secretKeyRef.ENV_NAME=name:key

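Namespaces

Spark on Kubernetes can use namespaces to launch Spark applications. This is configured through spark.kubernetes.namespace.

RBAC

In Kubernetes clusters with RBAC enabled, users can configure the Kubernetes RBAC roles and service accounts used by the various Spark on Kubernetes components to access the Kubernetes API server.

The Spark driver pod uses a Kubernetes service account to access the Kubernetes API server to create and watch executor pods. At a minimum, the service account must be granted a Role or ClusterRole that allows driver pods to create pods and services.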

--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark

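Other Spark on K8s Configuration

For information about Spark configuration in general, see the configuration page. The following settings are specific to Spark on Kubernetes.

Spark Properties:

Property Name	Default	Meaning
`spark.kubernetes.namespace`	`default`	The namespace used for running the driver and executor pods.
`spark.kubernetes.container.image`	`(none)`	Container image to use for the Spark application. Required, unless an explicit image is provided for each container type (see the two properties below).
`spark.kubernetes.driver.container.image`	`(value of spark.kubernetes.container.image)`	Container image to use for the driver.
`spark.kubernetes.executor.container.image`	`(value of spark.kubernetes.container.image)`	Container image to use for the executors.
`spark.kubernetes.container.image.pullPolicy`	`IfNotPresent`	Container image pull policy used when pulling images within Kubernetes.
`spark.kubernetes.container.image.pullSecrets`		Kubernetes secrets used to pull images from private image registries.
`spark.kubernetes.allocation.batch.size`	`5`	Number of pods to launch at once in each round of executor pod allocation.
`spark.kubernetes.allocation.batch.delay`	`1s`	Time to wait between each round of executor pod allocation. Specifying values less than 1 second may lead to excessive CPU usage on the Spark driver.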
`spark.kubernetes.authenticate.submission.caCertFile`	(none)	Path to the CA cert file for connecting to the Kubernetes API server over TLS when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). In client mode, use `spark.kubernetes.authenticate.caCertFile` instead.
`spark.kubernetes.authenticate.submission.clientKeyFile`	(none)	Path to the client key file for authenticating against the Kubernetes API server when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). In client mode, use `spark.kubernetes.authenticate.clientKeyFile` instead.
`spark.kubernetes.authenticate.submission.clientCertFile`	(none)	Path to the client cert file for authenticating against the Kubernetes API server when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). In client mode, use `spark.kubernetes.authenticate.clientCertFile` instead.
`spark.kubernetes.authenticate.submission.oauthToken`	(none)	OAuth token to use when authenticating against the Kubernetes API server when starting the driver. Note that unlike the other authentication options, this is expected to be the exact string value of the token to use for the authentication. In client mode, use `spark.kubernetes.authenticate.oauthToken` instead.
`spark.kubernetes.authenticate.submission.oauthTokenFile`	(none)	Path to the OAuth token file containing the token to use when authenticating against the Kubernetes API server when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). In client mode, use `spark.kubernetes.authenticate.oauthTokenFile` instead.
`spark.kubernetes.authenticate.driver.caCertFile`	(none)	Path to the CA cert file for connecting to the Kubernetes API server over TLS from the driver pod when requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). In client mode, use `spark.kubernetes.authenticate.caCertFile` instead.
`spark.kubernetes.authenticate.driver.clientKeyFile`	(none)	Path to the client key file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod as a Kubernetes secret. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). In client mode, use `spark.kubernetes.authenticate.clientKeyFile` instead.
`spark.kubernetes.authenticate.driver.clientCertFile`	(none)	Path to the client cert file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod as a Kubernetes secret. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). In client mode, use `spark.kubernetes.authenticate.clientCertFile` instead.
`spark.kubernetes.authenticate.driver.oauthToken`	(none)	OAuth token to use when authenticating against the Kubernetes API server from the driver pod when requesting executors. Note that unlike the other authentication options, this must be the exact string value of the token to use for the authentication. This token value is uploaded to the driver pod as a Kubernetes secret. In client mode, use `spark.kubernetes.authenticate.oauthToken` instead.
`spark.kubernetes.authenticate.driver.oauthTokenFile`	(none)	Path to the OAuth token file containing the token to use when authenticating against the Kubernetes API server from the driver pod when requesting executors. Note that unlike the other authentication options, this file must contain the exact string value of the token to use for the authentication. This token value is uploaded to the driver pod as a secret. In client mode, use `spark.kubernetes.authenticate.oauthTokenFile` instead.
`spark.kubernetes.authenticate.driver.mounted.caCertFile`	(none)	Path to the CA cert file for connecting to the Kubernetes API server over TLS from the driver pod when requesting executors. This path must be accessible from the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). In client mode, use `spark.kubernetes.authenticate.caCertFile` instead.
`spark.kubernetes.authenticate.driver.mounted.clientKeyFile`	(none)	Path to the client key file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This path must be accessible from the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). In client mode, use `spark.kubernetes.authenticate.clientKeyFile` instead.
`spark.kubernetes.authenticate.driver.mounted.clientCertFile`	(none)	Path to the client cert file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This path must be accessible from the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). In client mode, use `spark.kubernetes.authenticate.clientCertFile` instead.
`spark.kubernetes.authenticate.driver.mounted.oauthTokenFile`	(none)	Path to the file containing the OAuth token to use when authenticating against the Kubernetes API server from the driver pod when requesting executors. This path must be accessible from the driver pod. Note that unlike the other authentication options, this file must contain the exact string value of the token to use for the authentication. In client mode, use `spark.kubernetes.authenticate.oauthTokenFile` instead.
`spark.kubernetes.authenticate.driver.serviceAccountName`	`default`	Service account that is used when running the driver pod. The driver pod uses this service account when requesting executor pods from the API server. Note that this cannot be specified alongside a CA cert file, client key file, client cert file, and/or OAuth token. In client mode, use `spark.kubernetes.authenticate.serviceAccountName` instead.
`spark.kubernetes.authenticate.caCertFile`	(none)	In client mode, path to the CA cert file for connecting to the Kubernetes API server over TLS when requesting executors. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
`spark.kubernetes.authenticate.clientKeyFile`	(none)	In client mode, path to the client key file for authenticating against the Kubernetes API server when requesting executors. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
`spark.kubernetes.authenticate.clientCertFile`	(none)	In client mode, path to the client cert file for authenticating against the Kubernetes API server when requesting executors. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
`spark.kubernetes.authenticate.oauthToken`	(none)	In client mode, the OAuth token to use when authenticating against the Kubernetes API server when requesting executors. Note that unlike the other authentication options, this must be the exact string value of the token to use for the authentication.
`spark.kubernetes.authenticate.oauthTokenFile`	(none)	In client mode, path to the file containing the OAuth token to use when authenticating against the Kubernetes API server when requesting executors.
`spark.kubernetes.driver.label.[LabelName]`	(none)	Add the label specified by `LabelName` to the driver pod. For example, `spark.kubernetes.driver.label.something=true`. Note that Spark also adds its own labels to the driver pod for bookkeeping purposes.
`spark.kubernetes.driver.annotation.[AnnotationName]`	(none)	Add the annotation specified by `AnnotationName` to the driver pod. For example, `spark.kubernetes.driver.annotation.something=true`.
`spark.kubernetes.executor.label.[LabelName]`	(none)	Add the label specified by `LabelName` to the executor pods. For example, `spark.kubernetes.executor.label.something=true`. Note that Spark also adds its own labels to the executor pods for bookkeeping purposes.
`spark.kubernetes.executor.annotation.[AnnotationName]`	(none)	Add the annotation specified by `AnnotationName` to the executor pods. For example, `spark.kubernetes.executor.annotation.something=true`.
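`spark.kubernetes.driver.pod.name`	(none)	Name of the driver pod. In cluster mode, if this is not set, the driver pod name is set to "spark.app.name" suffixed by the current timestamp to avoid name conflicts. In client mode, if your application is running inside a pod, it is highly recommended to set this to the name of the pod your driver is running in. Setting this value in client mode allows the driver to become the owner of its executor pods, which in turn allows the executor pods to be garbage collected by the cluster.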
`spark.kubernetes.executor.lostCheck.maxAttempts`	`10`	Number of times that the driver will try to ascertain the loss reason for a specific executor. The loss reason is used to ascertain whether the executor failure is due to a framework or an application error which in turn decides whether the executor is removed and replaced, or placed into a failed state for debugging.
`spark.kubernetes.submission.waitAppCompletion`	`true`	In cluster mode, whether to wait for the application to finish before exiting the launcher process. When changed to false, the launcher has a "fire-and-forget" behavior when launching the Spark job.
`spark.kubernetes.report.interval`	`1s`	Interval between reports of the current Spark job status in cluster mode.
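`spark.kubernetes.driver.limit.cores`	(none)	Specify the CPU limit for the driver pod.
`spark.kubernetes.executor.request.cores`	(none)	Specify the CPU request for each executor pod. Values conform to the Kubernetes convention. Example values include 0.1, 500m, 1.5, 5, etc., with the definition of CPU units documented in the Kubernetes CPU units documentation. This is distinct from `spark.executor.cores`: it is only used for the executor pod CPU request, and if set it takes precedence over `spark.executor.cores` for that purpose. Task parallelism, e.g., the number of tasks an executor can run concurrently, is not affected by this.
`spark.kubernetes.executor.limit.cores`	(none)	Specify the CPU limit for each executor pod.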
`spark.kubernetes.node.selector.[labelKey]`	(none)	Adds to the node selector of the driver pod and executor pods, with key `labelKey` and the value as the configuration's value. For example, setting `spark.kubernetes.node.selector.identifier` to `myIdentifier` will result in the driver pod and executors having a node selector with key `identifier` and value `myIdentifier`. Multiple node selector keys can be added by setting multiple configurations with this prefix.
`spark.kubernetes.driverEnv.[EnvironmentVariableName]`	(none)	Add the environment variable specified by `EnvironmentVariableName` to the Driver process. The user can specify multiple of these to set multiple environment variables.
`spark.kubernetes.driver.secrets.[SecretName]`	(none)	Add the Kubernetes Secret named `SecretName` to the driver pod on the path specified in the value. For example, `spark.kubernetes.driver.secrets.spark-secret=/etc/secrets`.
`spark.kubernetes.executor.secrets.[SecretName]`	(none)	Add the Kubernetes Secret named `SecretName` to the executor pod on the path specified in the value. For example, `spark.kubernetes.executor.secrets.spark-secret=/etc/secrets`.
`spark.kubernetes.driver.secretKeyRef.[EnvName]`	(none)	Add as an environment variable to the driver container with name EnvName (case sensitive), the value referenced by key `key` in the data of the referenced Kubernetes Secret. For example, `spark.kubernetes.driver.secretKeyRef.ENV_VAR=spark-secret:key`.
`spark.kubernetes.executor.secretKeyRef.[EnvName]`	(none)	Add as an environment variable to the executor container with name EnvName (case sensitive), the value referenced by key `key` in the data of the referenced Kubernetes Secret. For example, `spark.kubernetes.executor.secretKeyRef.ENV_VAR=spark-secret:key`.
`spark.kubernetes.driver.volumes.[VolumeType].[VolumeName].mount.path`	(none)	Add the Kubernetes Volume named `VolumeName` of the `VolumeType` type to the driver pod on the path specified in the value. For example, `spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpointpvc.mount.path=/checkpoint`.
`spark.kubernetes.driver.volumes.[VolumeType].[VolumeName].mount.readOnly`	(none)	Specify if the mounted volume is read only or not. For example, `spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpointpvc.mount.readOnly=false`.
`spark.kubernetes.driver.volumes.[VolumeType].[VolumeName].options.[OptionName]`	(none)	Configure Kubernetes Volume options passed to the Kubernetes with `OptionName` as key having specified value, must conform with Kubernetes option format. For example, `spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpointpvc.options.claimName=spark-pvc-claim`.
`spark.kubernetes.executor.volumes.[VolumeType].[VolumeName].mount.path`	(none)	Add the Kubernetes Volume named `VolumeName` of the `VolumeType` type to the executor pod on the path specified in the value. For example, `spark.kubernetes.executor.volumes.persistentVolumeClaim.checkpointpvc.mount.path=/checkpoint`.
`spark.kubernetes.executor.volumes.[VolumeType].[VolumeName].mount.readOnly`	false	Specify if the mounted volume is read only or not. For example, `spark.kubernetes.executor.volumes.persistentVolumeClaim.checkpointpvc.mount.readOnly=false`.
`spark.kubernetes.executor.volumes.[VolumeType].[VolumeName].options.[OptionName]`	(none)	Configure Kubernetes Volume options passed to the Kubernetes with `OptionName` as key having specified value. For example, `spark.kubernetes.executor.volumes.persistentVolumeClaim.checkpointpvc.options.claimName=spark-pvc-claim`.
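`spark.kubernetes.memoryOverheadFactor`	`0.1`	This sets the Memory Overhead Factor that will allocate memory to non-JVM memory, which includes off-heap memory allocations, non-JVM tasks, and various systems processes. For JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs. This is done as non-JVM tasks need more non-JVM heap space and such tasks commonly fail with "Memory Overhead Exceeded" errors. This preempts this error with a higher default.
`spark.kubernetes.pyspark.pythonVersion`	`"2"`	The major Python version to use, if the application is written in Python.

For the remaining Spark configuration options, see: https://spark.apache.org/docs/2.4.8/configuration.html.

Example: Submitting Spark to K8s

A job can be submitted as follows, setting the CPU and memory resource request and limit values for the driver and the executors (the driver's memory limit is 110% of its request). Judging by the image tags, this example targets an early Spark-on-Kubernetes build, so some option names (for example spark.kubernetes.driver.docker.image) differ from the properties listed in the table above.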

# Submit in cluster mode to the given Kubernetes API server and namespace,
# using the "spark" ServiceAccount. The spark.driver.*, spark.executor.* and
# spark.kubernetes.*.limit.cores options set the resource requests and limits,
# the *.docker.image options select the images used to create the pods, and
# the final line names the application jar and its argument.
./spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://172.20.0.113:6443 \
  --kubernetes-namespace spark-cluster \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.driver.memory=100G \
  --conf spark.executor.memory=10G \
  --conf spark.driver.cores=30 \
  --conf spark.executor.cores=2 \
  --conf spark.driver.maxResultSize=10240m \
  --conf spark.kubernetes.driver.limit.cores=32 \
  --conf spark.kubernetes.executor.limit.cores=3 \
  --conf spark.kubernetes.executor.memoryOverhead=2g \
  --conf spark.executor.instances=5 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=sz-pg-oam-docker-hub-001.tendcloud.com/library/spark-driver:v2.1.0-kubernetes-0.3.1-1 \
  --conf spark.kubernetes.executor.docker.image=sz-pg-oam-docker-hub-001.tendcloud.com/library/spark-executor:v2.1.0-kubernetes-0.3.1-1 \
  --conf spark.kubernetes.initcontainer.docker.image=sz-pg-oam-docker-hub-001.tendcloud.com/library/spark-init:v2.1.0-kubernetes-0.3.1-1 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0-SNAPSHOT.jar 10000000

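This launches a Spark job that computes pi with ten million tasks. While the job runs, the driver actually consumes about 3 CPU cores and 40G of memory, and each executor consumes less than 1 core and less than 4G of memory. The resource request values can be tuned step by step based on the observed consumption.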
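The 110% figure for the driver's memory limit follows from the memory overhead that is added on top of the configured memory. As a rough sketch of that arithmetic, assuming the documented default overhead factor of 0.1 and a minimum overhead of 384 MiB (both treated here as assumptions, and with memory_with_overhead being a hypothetical helper rather than a Spark API):

```python
MIN_OVERHEAD_MIB = 384  # assumed minimum overhead, in MiB


def memory_with_overhead(memory_mib, overhead_factor=0.1, explicit_overhead_mib=None):
    """Return (overhead, total) in MiB for a driver or executor container."""
    if explicit_overhead_mib is not None:
        overhead = explicit_overhead_mib  # an explicit setting wins over the factor
    else:
        overhead = max(round(memory_mib * overhead_factor), MIN_OVERHEAD_MIB)
    return overhead, memory_mib + overhead


# Driver: 100G of memory, overhead derived from the factor
driver = memory_with_overhead(100 * 1024)
# Executor: 10G of memory with an explicit 2g overhead, as in the command above
executor = memory_with_overhead(10 * 1024, explicit_overhead_mib=2 * 1024)

print(driver)    # (10240, 112640)
print(executor)  # (2048, 12288)
```

For the driver, 112640 MiB is exactly 110% of the 102400 MiB requested; for small containers the 384 MiB floor dominates instead of the factor.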

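One point deserves attention: the local:// scheme refers to a path inside the Spark pod, which means the jar has to be baked into the Spark image. That is clearly inconvenient. An alternative is to serve the jar from object storage: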

spark-submit --master k8s://{{your k8s apiserver addr}} \
  --deploy-mode cluster \
  --conf spark.hadoop.fs.s3a.access.key={{ your s3a access key}} \
  --conf spark.hadoop.fs.s3a.secret.key={{ your s3a secret key}} \
  --conf spark.hadoop.fs.s3a.endpoint={{ your s3a endpoint}} \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  --conf spark.kubernetes.memoryOverheadFactor=0.2 \
  --conf spark.kubernetes.namespace={{ your namespace }} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName={{ your sa }} \
  --conf spark.kubernetes.container.image={{ your image }} \
  --conf spark.kubernetes.driver.podTemplateFile=s3a://spark/drivertpl.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=s3a://spark/executortpl.yaml \
  --driver-memory 4G \
  --driver-cores 2 \
  --executor-memory 4G \
  --executor-cores 1 \
  --num-executors 2 \
  --class helloworld.Main \
  --name helloworld \
  s3a://spark/helloworld-1.0.jar --env "release" --id "8e3f19hj1j0888e4b0" --sql "select  * from csdn where username='generalzy' limit 1"

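In the command above, storage that implements the s3a protocol (such as MinIO) provides the resources Spark needs.

The PySpark Library

What Is PySpark

Earlier we used the bin/pyspark program, which provides a Python interpreter environment for running Spark jobs. PySpark here refers to the Python library, which can be imported in Python code with import pyspark.

PySpark is a Python library provided officially by Spark. It includes the full Spark API, so Spark applications can be written with the PySpark library and submitted to a Spark cluster to run.

A brief comparison of the PySpark library and the standard Spark framework:
[Figure: comparison of the PySpark library and the standard Spark framework]

It can be installed with pip, which ships with Python: pip install pyspark -i https://pypi.tuna.tsinghua.edu.cn/simple

Application Entry Point: SparkContext

The entry point of a Spark Application is the SparkContext. Every application first needs to build a SparkContext object, in two steps:

Step 1: create a SparkConf object and set the basic information of the Spark Application, such as the application name (AppName) and the Master it runs on
Step 2: create the SparkContext object from the SparkConf object

WordCount in Practice

First create a local word.txt file: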

Hello Spark
Hello World
Hello Hello

Then write the script that processes it.

If the following error appears:

24/03/17 19:46:06 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:192)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:708)
	at java.base/sun.nio.ch.NioSocketImpl.accept(NioSocketImpl.java:752)
	at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:675)
	at java.base/java.net.ServerSocket.platformImplAccept(ServerSocket.java:641)
	at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:617)
	at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:574)
	at java.base/java.net.ServerSocket.accept(ServerSocket.java:532)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:179)
	... 15 more

you need to set two environment variables:

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

These two lines set the environment variables that tell Spark which Python interpreter the workers use and which one the driver uses.

os.environ['PYSPARK_PYTHON'] = sys.executable assigns the path of the current Python interpreter to PYSPARK_PYTHON, so Spark executes Python code with the same interpreter as the current process.

os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable assigns the same path to PYSPARK_DRIVER_PYTHON, so the Spark driver uses that interpreter too.

The complete script:

from pyspark import SparkContext, SparkConf
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

spark_config = SparkConf().setAppName("minProject").setMaster("local[*]")
spark_ctx = SparkContext(conf=spark_config)

# Read the local file with Spark
rdd = spark_ctx.textFile("./word.txt")
print(rdd.collect())

# Split each line into words
flatMapRdd = rdd.flatMap(lambda line: line.split(" "))
print(flatMapRdd.collect())

# Turn each word into a (word, 1) pair, meaning the word occurred once
mapRDD = flatMapRdd.map(lambda x: (x, 1))
print(mapRDD.collect())

# Aggregate the counts by key
resultRDD = mapRDD.reduceByKey(lambda a, b: a + b)

# Step 3: output the data
print(resultRDD.collect())

How it works:
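The pipeline in the script can be traced without a cluster. The following is a plain-Python sketch, not Spark code: it mimics what flatMap, map, and reduceByKey each do to the three lines of word.txt, with ordinary lists and a dictionary standing in for distributed RDDs. The variable names are illustrative only.

```python
from collections import defaultdict
from itertools import chain

# The three lines of word.txt from the example above
lines = ["Hello Spark", "Hello World", "Hello Hello"]

# flatMap: split every line, then flatten the pieces into one list of words
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair each word with an initial count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: fold the values that share a key with the function a + b
counts = defaultdict(int)
for word, one in pairs:
    counts[word] = counts[word] + one

word_counts = dict(counts)
print(words)        # ['Hello', 'Spark', 'Hello', 'World', 'Hello', 'Hello']
print(word_counts)  # {'Hello': 4, 'Spark': 1, 'World': 1}
```

In Spark the same three steps run per partition and in parallel, and the order of the collected pairs is not guaranteed.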

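Finally, save the result locally:

# Step 3: output the data
print(resultRDD.collect())
resultRDD.saveAsTextFile("./output.txt")

Note that saveAsTextFile writes a directory named output.txt containing one part file per partition rather than a single text file, and it fails if that path already exists.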

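How Python on Spark Executes

PySpark aims to leave Spark's existing runtime architecture intact: it wraps a Python API around the Spark architecture and uses Py4J for the interaction between Python and Java, which makes it possible to write Spark applications in Python.

[Figure: PySpark runtime architecture]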

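Spark Core Programming

To process data with high concurrency and high throughput, the Spark framework provides three core data structures, each for a different scenario:

➢ RDD: resilient distributed dataset

➢ Accumulator: distributed shared write-only variable

➢ Broadcast variable: distributed shared read-only variable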

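RDD

An RDD (Resilient Distributed Dataset) is the most basic data processing model in Spark. In the source code it is an abstract class that represents a resilient, immutable, partitionable collection whose elements can be computed in parallel.

Resilient
- Storage resilience: automatic switching between memory and disk;
- Fault-tolerance resilience: lost data can be recovered automatically;
- Computation resilience: failed computations are retried;
- Partitioning resilience: data can be repartitioned as needed.
Distributed: the data is stored on different nodes of the cluster
Dataset: an RDD encapsulates computation logic and does not store the data itself
Data abstraction: RDD is an abstract class that concrete subclasses implement
Immutable: the computation logic an RDD encapsulates cannot be changed; to change it you must create a new RDD that encapsulates the new logic
Partitionable, computed in parallel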
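Two of the points above, that an RDD holds computation logic rather than data and that changing it means creating a new RDD, can be illustrated with a toy class. This is a hypothetical plain-Python model (ToyRDD and all of its methods are invented for illustration); it is not how Spark implements RDDs.

```python
class ToyRDD:
    """Hypothetical toy model of an RDD: it stores a recipe, not the data."""

    def __init__(self, compute, parent=None):
        self._compute = compute  # zero-argument function returning a list of partitions
        self.parent = parent     # lineage: the ToyRDD this one was derived from

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = list(data)
        n = len(data)
        # Cut the data into contiguous slices, one per partition
        bounds = [(i * n // num_partitions, (i + 1) * n // num_partitions)
                  for i in range(num_partitions)]
        return cls(lambda: [data[start:end] for start, end in bounds])

    def map(self, func):
        # Transformation: returns a NEW ToyRDD and computes nothing yet
        return ToyRDD(
            lambda: [[func(x) for x in part] for part in self._compute()],
            parent=self,
        )

    def collect(self):
        # Action: runs the chain of recipes and flattens the partitions
        return [x for part in self._compute() for x in part]


calls = []

def double(x):
    calls.append(x)  # record every invocation so laziness is observable
    return x * 2

base = ToyRDD.parallelize([1, 2, 3, 4], 2)
doubled = base.map(double)
calls_before_action = len(calls)  # still 0: map only recorded the recipe

print(doubled.collect())  # [2, 4, 6, 8]
print(base.collect())     # [1, 2, 3, 4], the original is unchanged
```

Because each derived object keeps a reference to its parent, a lost result can be rebuilt by re-running the recipes, which is the idea behind the fault-tolerance point in the list.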

In short, the RDD is Spark's most basic data abstraction, and all computations and operations are built on top of it.
You can think of an RDD as a distributed List or Array: an abstract data structure that is both an abstract class and a generic type.

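The Five Properties of an RDD

An RDD has partitions

An RDD's methods are applied to all of its partitions

RDDs have dependencies on one another

Key-value RDDs can have a partitioner

Partition planning tries to place partitions close to the servers that hold the data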

WordCount Case Analysis

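RDD Programming

Creating RDDs

In Spark RDD programming the entry-point object is SparkContext, whatever the programming language. Only after a SparkContext has been built can you make further API calls and run computations.

From the programmer's point of view, the main job of SparkContext is essentially to create the first RDD.

There are two main ways to create an RDD:

By parallelizing a collection (turning a local object into a distributed RDD)
By reading an external data source (reading a file)

Creating by Parallelizing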

from pyspark import SparkContext, SparkConf
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

spark_config = SparkConf().setAppName("create add").setMaster("local[4]")
spark_context = SparkContext(conf=spark_config)

data = [1, 2, 3, 4, 5, 6, 6, 7, 8]
rdd = spark_context.parallelize(data, 4)

print(rdd.collect())

Creating from a File

from pyspark import SparkContext, SparkConf
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

spark_config = SparkConf().setAppName("create add").setMaster("local[4]")
spark_context = SparkContext(conf=spark_config)

rdd = spark_context.textFile("./word.txt")
print(rdd.collect())

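RDD Operators

Transformation Operators

map operator

flatMap operator

reduceByKey operator

In a WordCount application, reduceByKey sums the (word, 1) pairs for each word, as in the earlier script: mapRDD.reduceByKey(lambda a, b: a + b)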
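To make the behaviour of reduceByKey concrete, here is a plain-Python model of it. It is a sketch under simplifying assumptions: partitions are ordinary lists, and reduce_by_key is a hypothetical helper, not a PySpark function. It does reflect one documented property of the real operator, namely that values are merged locally within each partition before the per-partition results are combined.

```python
def reduce_by_key(partitions, func):
    """Plain-Python model of reduceByKey: combine per partition, then merge."""
    # Step 1: local combine inside each partition, before any data would move
    local_results = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        local_results.append(local)

    # Step 2: merge the per-partition results key by key
    merged = {}
    for local in local_results:
        for key, value in local.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


# (word, 1) pairs spread over two partitions
partitions = [
    [("Hello", 1), ("Spark", 1), ("Hello", 1)],
    [("World", 1), ("Hello", 1), ("Hello", 1)],
]
totals = reduce_by_key(partitions, lambda a, b: a + b)
print(totals)  # {'Hello': 4, 'Spark': 1, 'World': 1}
```

The function passed in must be associative and commutative, since the order in which values meet depends on how the data is partitioned.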

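groupBy operator

Put simply, the return value of the function you pass in is the grouping key: elements for which the function returns the same value end up in the same group.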
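The rule that the function's return value becomes the grouping key can be shown with a small plain-Python model. group_by below is a hypothetical helper written for illustration; in PySpark the corresponding call is rdd.groupBy(func), which yields (key, iterable) pairs rather than a dictionary.

```python
from collections import defaultdict


def group_by(items, key_func):
    """Plain-Python model of groupBy: key_func's return value is the group key."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return dict(groups)


numbers = [1, 2, 3, 4, 5, 6]
by_parity = group_by(numbers, lambda n: "even" if n % 2 == 0 else "odd")
print(by_parity)  # {'odd': [1, 3, 5], 'even': [2, 4, 6]}
```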

filter operator

This works like Python's built-in filter: if the function returns True the element is kept, otherwise it is discarded.
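Since the comparison is with Python's own filter, the semantics can be checked directly in plain Python. The sample data and the predicate name is_non_empty are made up for this sketch; in PySpark the equivalent call would be rdd.filter(is_non_empty).

```python
def is_non_empty(word):
    # Predicate: return True to keep the element, False to drop it
    return word != ""


words = ["Hello", "", "Spark", "", "World"]

kept_by_builtin = list(filter(is_non_empty, words))
kept_by_comprehension = [w for w in words if is_non_empty(w)]

print(kept_by_builtin)  # ['Hello', 'Spark', 'World']
```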

distinct operator

union operator

join operator

intersection operator

glom operator

import os
import sys

from pyspark import SparkContext, SparkConf

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

spark_config = SparkConf().setAppName("create add").setMaster("local[4]")
spark_context = SparkContext(conf=spark_config)

rdd = spark_context.parallelize([1, 2, 3, 4, 5], 2)
rdd2 = rdd.glom()

print(rdd.collect())
# [1, 2, 3, 4, 5]
print(rdd2.collect())
# [[1, 2], [3, 4, 5]]
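The nested list printed by glom shows one inner list per partition. The plain-Python sketch below reproduces that layout for this input. The slicing rule in slice_into_partitions is an assumption chosen because it matches the output shown above; it is not taken from PySpark's source.

```python
def slice_into_partitions(data, num_partitions):
    """Assumed slicing rule: contiguous slices with boundaries at i * n // p."""
    n = len(data)
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


# glom view: one list per partition
partitions = slice_into_partitions([1, 2, 3, 4, 5], 2)
print(partitions)  # [[1, 2], [3, 4, 5]]

# collect view: the same elements with the partition boundaries flattened away
flattened = [x for part in partitions for x in part]
print(flattened)   # [1, 2, 3, 4, 5]
```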

groupByKey operator

sortBy operator

import os
import sys
from random import shuffle

from pyspark import SparkContext, SparkConf

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

spark_config = SparkConf().setAppName("create add").setMaster("local[4]")
spark_context = SparkContext(conf=spark_config)

data = [number for number in range(100)]
# Shuffle in place
shuffle(data)

rdd = spark_context.parallelize(data, 10)

rdd2 = rdd.glom()
print(rdd2.collect())

rdd3 = rdd2.sortBy(lambda x: x[-1], ascending=True, numPartitions=1)
print(rdd3.collect())

Examples from the official docstring:

 Examples
 --------
  >>> tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
  >>> sc.parallelize(tmp).sortBy(lambda x: x[0]).collect()
  [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
  >>> sc.parallelize(tmp).sortBy(lambda x: x[1]).collect()
  [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
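The key function in sortBy plays the same role as the key argument of Python's sorted, so the two docstring results can be reproduced locally. This is only a single-machine analogy: sorted works on one list, whereas sortBy sorts across partitions. The descending case maps to ascending=False in PySpark.

```python
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]

# Sort by the first field; digit characters order before letters
by_key = sorted(tmp, key=lambda x: x[0])
print(by_key)    # [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]

# Sort by the second field; the input is already in that order
by_value = sorted(tmp, key=lambda x: x[1])
print(by_value)  # [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]

# Descending order
by_value_desc = sorted(tmp, key=lambda x: x[1], reverse=True)
print(by_value_desc)  # [('2', 5), ('d', 4), ('1', 3), ('b', 2), ('a', 1)]
```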

sortByKey operator

Example

Action Operators

collect operator

reduce operator

fold operator

first operator

import os
import sys
from random import shuffle

from pyspark import SparkContext, SparkConf

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

spark_config = SparkConf().setAppName("create add").setMaster("local[4]")
spark_context = SparkContext(conf=spark_config)

data = [number for number in range(100)]
# Shuffle in place
shuffle(data)

rdd = spark_context.parallelize(data, 10)
print(rdd.getNumPartitions())
# Add a glom step so the partition layout is visible
print(rdd.glom().collect())

first_number = rdd.first()
print(first_number)

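As the output shows, first returns the first element of the first partition.

take operator

Likewise, take returns elements in order; when the current partition does not hold enough, it moves on to the next partition.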
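The partition-by-partition behaviour described for first and take can be modelled in plain Python. The helpers take and first below are hypothetical stand-ins that operate on a list of lists (the shape that glom().collect() prints); they describe the observable result, not how Spark schedules the work.

```python
def take(partitions, n):
    """Walk the partitions in order and stop once n elements are collected."""
    taken = []
    for part in partitions:
        for item in part:
            if len(taken) == n:
                return taken
            taken.append(item)
    return taken  # fewer than n elements were available


def first(partitions):
    """Return the first element of the first non-empty partition."""
    result = take(partitions, 1)
    if not result:
        raise ValueError("no elements to take")
    return result[0]


partitions = [[42, 7], [13, 99, 5], [8]]

print(first(partitions))     # 42
print(take(partitions, 3))   # [42, 7, 13], the third element comes from partition 2
print(take(partitions, 10))  # [42, 7, 13, 99, 5, 8]
```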

top operator

count operator

takeSample operator

takeOrdered operator

foreach operator

saveAsTextFile operator

Partition Operators

mapPartitions operator

foreachPartition operator

partitionBy operator
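partitionBy redistributes a key-value RDD according to a partition function; in PySpark the call is rdd.partitionBy(numPartitions, partitionFunc). The plain-Python sketch below models the rule "partition index = partitionFunc(key) modulo numPartitions". The helpers partition_by and stable_hash are hypothetical, and stable_hash uses CRC32 only to get repeatable results here; it is not PySpark's default hash.

```python
import zlib


def stable_hash(key):
    # Assumption for this sketch: a deterministic hash, unlike Python's
    # built-in hash() for strings, which varies between processes
    return zlib.crc32(str(key).encode("utf-8"))


def partition_by(pairs, num_partitions, partition_func=stable_hash):
    """Place each (key, value) pair in partition partition_func(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_func(key) % num_partitions].append((key, value))
    return partitions


pairs = [("Hello", 1), ("Spark", 1), ("Hello", 1), ("World", 1)]

# A custom rule: "Hello" goes to partition 0, everything else to partition 1
custom = partition_by(pairs, 2, lambda key: 0 if key == "Hello" else 1)
print(custom)  # [[('Hello', 1), ('Hello', 1)], [('Spark', 1), ('World', 1)]]

# With the hash-based default, equal keys always land in the same partition
hashed = partition_by(pairs, 3)
```

Keeping equal keys together is what lets later per-key operations run without moving the data again.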

repartition operator

coalesce operator

mapValues operator

join operator

Difference Between groupByKey and reduceByKey
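One commonly cited difference is how much data each operator sends through the shuffle: groupByKey moves every record and aggregates afterwards, while reduceByKey pre-aggregates inside each partition first. The plain-Python sketch below counts the records that would cross the shuffle in each case. It is a simplified model with made-up data, not a measurement of Spark.

```python
from collections import Counter

# (key, 1) records spread over two partitions
partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey then sum: every record crosses the shuffle unchanged
group_shuffled = [pair for part in partitions for pair in part]

# reduceByKey: sum inside each partition first, so each partition
# sends at most one record per key
reduce_shuffled = []
for part in partitions:
    local = Counter()
    for key, value in part:
        local[key] += value
    reduce_shuffled.extend(local.items())


def final_sum(records):
    # The aggregation that happens after the shuffle
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)


print(len(group_shuffled), len(reduce_shuffled))                  # 6 4
print(final_sum(group_shuffled) == final_sum(reduce_shuffled))    # True
```

Both routes give the same totals; the saving grows with the number of repeated keys per partition.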

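What to Watch Out for with Partition Operations

Avoid increasing the number of partitions where you can; doing so may break the in-memory iterative computation pipeline.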

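RDD Persistence

RDD Data Is Transient Process Data

RDD Caching

RDD Checkpointing

Differences Between Cache and Checkpoint
- Cache is a lightweight way to keep RDD data. It can be stored in memory or on disk, is stored in a distributed fashion, and is unsafe by design (the RDD lineage is retained).
- Checkpoint is a heavyweight way to keep RDD data. It is stored centrally, only on disk (HDFS), and is safe by design (the RDD lineage is not retained).
How do Cache and Checkpoint compare in performance?
- Cache performs better: storage is distributed, the Executors work in parallel, and the data can be kept in memory (at the cost of memory), which is faster.
- Checkpoint is slower because storage is centralized and involves network IO, but storing to HDFS is safer (multiple replicas).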
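The reason caching matters is that RDD data is transient: without persistence, every action re-runs the computation from the lineage. The toy class below counts how often the computation runs with and without caching. ToyLineage is a hypothetical plain-Python model invented for this sketch; it covers only in-memory caching and does not model checkpointing to HDFS.

```python
class ToyLineage:
    """Hypothetical model: a source list plus a function, recomputed on demand."""

    def __init__(self, source, func):
        self.source = source
        self.func = func
        self.compute_count = 0   # how many times the computation actually ran
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only marks the data for caching; it is filled by the first action
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return list(self._cached)       # served from the cache
        self.compute_count += 1
        result = [self.func(x) for x in self.source]
        if self._use_cache:
            self._cached = result
        return list(result)


uncached = ToyLineage([1, 2, 3], lambda x: x * 10)
uncached.collect()
uncached.collect()

cached = ToyLineage([1, 2, 3], lambda x: x * 10).cache()
cached.collect()
cached.collect()

print(uncached.compute_count, cached.compute_count)  # 2 1
```

The cached object still holds its source and function, which mirrors the point above that Cache retains the lineage so lost data can be recomputed.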

Spark Practice Cases

Search Engine Log Analysis

This case analyzes the Sogou user query dataset (I uploaded a copy to https://scripterbro.github.io/files/SogouQ.txt).

import os
import sys
import jieba

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
from pyspark import SparkContext, SparkConf

ctx = SparkContext(conf=SparkConf().setAppName("test").setMaster("local[*]"))

rdd = ctx.textFile("SogouQ.txt")

# Sample line: 23:59:59 12608738215214038  [痣相算命] 7  5  mm.httpcn.com/News/xiangxue.shtml
rdd2 = rdd.map(lambda line: line.split())

# Extract the user's search term, e.g. '[痣相算命]', '[魏宝晶]', '[纯净水设备]'
rdd3 = rdd2.map(lambda x: x[2])


# Tokenize the search term with jieba
def context_jieba(data):
    return list(jieba.cut_for_search(data))


rdd4 = rdd3.flatMap(context_jieba)


# Filter out noise tokens
def filter_words(data):
    return data not in ['谷', '帮', '客', "的", "[", "]"]


rdd5 = rdd4.filter(filter_words)


# Repair tokens that were split incorrectly,
# e.g. '传智播' is corrected to '传智播客'
def append_words(data):
    if data == '传智播':
        data = '传智播客'
    if data == '院校':
        data = '院校帮'
    if data == '博学':
        data = '博学谷'
    return data, 1


rdd6 = rdd5.map(append_words)

# Count per token, sort by count in descending order, and take the top 5 (take returns a plain list)
top5 = rdd6.reduceByKey(lambda a, b: a + b).sortBy(lambda x: x[1], ascending=False, numPartitions=1).take(5)
print(top5)

Submit to a cluster to run:

Further reading: templates for submitting Scala jobs with Spark:

SPARK_ON_K8S_SUBMIT_COMMAND = ("spark-submit --master k8s://%(k8s_apiserver_host)s "
                               "--deploy-mode cluster "
                               "--conf spark.kubernetes.executor.deleteOnTermination=false "
                               "--conf spark.kubernetes.memoryOverheadFactor=0.2 "
                               "--conf spark.kubernetes.namespace=%(k8s_namespace)s "
                               "--conf spark.kubernetes.authenticate.driver.serviceAccountName=%(k8s_service_account)s "
                               "--conf spark.kubernetes.container.image=%(spark_image)s "
                               "--conf spark.kubernetes.driver.podTemplateFile=%(spark_driver_yaml_abspath)s "
                               "--conf spark.kubernetes.executor.podTemplateFile=%(spark_executor_yaml_abspath)s "
                               "--driver-memory %(spark_driver_required_mem_per_pod)s "
                               "--driver-cores %(spark_driver_required_cpu_per_pod)s "
                               "--executor-cores %(spark_executor_required_cpu_per_pod)s "
                               "--executor-memory %(spark_executor_required_mem_per_pod)s "
                               "--num-executors %(spark_executor_num)s "
                               "--class %(run_class)s "
                               "--name %(jar_name)s "
                               "%(jar_path)s ")

SPARK_ON_LOCAL_SUBMIT_COMMAND = ("spark-submit --master local[*] "
                                 "--class %(run_class)s "
                                 "--name %(jar_name)s "
                                 "%(jar_path)s ")

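The templates above use Python's %(name)s placeholders, which are filled by applying the % operator with a dictionary. The following is a minimal sketch of that step for the local template. The template is repeated so the snippet runs on its own, the parameter values are borrowed from the helloworld command shown earlier purely as sample input, and the name LOCAL_SUBMIT_TEMPLATE is an assumption of this sketch.

```python
import shlex

# Same shape as SPARK_ON_LOCAL_SUBMIT_COMMAND, repeated here for self-containment
LOCAL_SUBMIT_TEMPLATE = ("spark-submit --master local[*] "
                         "--class %(run_class)s "
                         "--name %(jar_name)s "
                         "%(jar_path)s ")

# Sample values only; a missing key would raise KeyError during formatting
params = {
    "run_class": "helloworld.Main",
    "jar_name": "helloworld",
    "jar_path": "s3a://spark/helloworld-1.0.jar",
}

command = LOCAL_SUBMIT_TEMPLATE % params

# Split into an argument list, e.g. for passing to subprocess.run without a shell
argv = shlex.split(command)
print(argv)
```

Working with an argument list rather than one long string avoids quoting problems when a value contains spaces, as the --sql argument in the earlier command does.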
Shared Variables

Broadcast Variables

Accumulators

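FAQ

Error: HADOOP_HOME and hadoop.home.dir are unset

PySpark usually runs alongside a Hadoop environment (it depends on it). If no Hadoop runtime is installed on Windows, the error above is reported.

Hadoop releases can be downloaded from https://hadoop.apache.org/releases.html.

Extracting the downloaded archive requires administrator privileges; running WinRAR as administrator is enough.

In the environment variables, set: HADOOP_HOME = the Hadoop installation directory

Add the following to the Path environment variable: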

%HADOOP_HOME%\bin
%HADOOP_HOME%\sbin

Set JAVA_HOME in the hadoop-env.cmd script to the real JDK path.

Then restart the computer.

Error: Did not find winutils.exe

Download it from the GitHub repository https://github.com/cdarlint/winutils and choose a suitable version.

After downloading, put the winutils files into Hadoop's bin directory.

Error: cannot connect to the k8s apiserver during local testing

The kubectl proxy command makes the API server available on a local port. The following sets it to accept requests from all hosts:

kubectl proxy --address='0.0.0.0'  --accept-hosts='^*$' --port=8009

Access it with curl:

$ curl -X GET -L http://k8s-master:8009/
{
  "paths": [
    "/api",
    "/api/v1",
    ...
  ]
}

How do you find the Kubernetes API host and port? Every Kubernetes cluster has an API server, so the simplest way is to look at the output of kubectl cluster-info. For example:

$ kubectl cluster-info
Kubernetes control plane is running at https://192.168.58.2:8443
