spark-submit submission modes
Spark Core
RDD
- RDD definition and overview
RDD definition in the Spark source code:
abstract class RDD[T: ClassTag](
    @transient private var sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging
From the official docs:
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.
RDD: Resilient Distributed Dataset
The scaladoc describes it as: "Represents an immutable, partitioned collection of elements that can be operated on in parallel."
immutable: cannot be modified once created
partitioned collection of elements: e.g. Array(1,2,3,4,5,6,7,8,9,10) with 3 partitions: (1,2,3) (4,5,6) (7,8,9,10)
can be operated on in parallel: partitions are computed in parallel
1) RDD is an abstract class
2) It is generic, so it can hold many element types: String, Person, User
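The 3-partition split of Array(1..10) shown above can be sketched with the same start/end slicing arithmetic Spark applies to parallelized collections (a minimal sketch; `slice` here is a hypothetical helper, not Spark's API):

```scala
// Split a sequence into numSlices contiguous partitions.
// For 10 elements and 3 slices this yields (1,2,3) (4,5,6) (7,8,9,10),
// matching the example above: later slices absorb the remainder.
def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] =
  (0 until numSlices).map { i =>
    val start = (i * seq.length) / numSlices
    val end   = ((i + 1) * seq.length) / numSlices
    seq.slice(start, end)
  }
```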
- RDD creation methods
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
The two ways to create RDDs:
1. Parallelized Collections
2. External Datasets
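In spark-shell the two creation methods look roughly like this (a sketch assuming the shell's built-in `sc: SparkContext`; "data.txt" is a placeholder path, local or HDFS):

```scala
// 1. Parallelized Collections: distribute an existing driver-side collection,
// here explicitly requesting 3 partitions
val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)

// 2. External Datasets: reference a dataset in external storage
// ("data.txt" is a placeholder; any Hadoop-supported path works)
val rdd2 = sc.textFile("data.txt")
```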
- The five main properties of an RDD
Internally, each RDD is characterized by five main properties:
- A list of partitions
  (a set of partitions/splits)
- A function for computing each split/partition
  (y = f(x), e.g. rdd.map(_ + 1) is applied to every partition)
- A list of dependencies on other RDDs
  (lineage, e.g. rdd1 ==> rdd2 ==> rdd3 ==> rdd4;
   a map over rdda with 5 partitions produces rddb with 5 partitions)
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
  (schedule the task onto the node holding the data: moving computation is cheaper than moving data)
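The five properties above can be mapped onto a toy structure (a minimal sketch only; `ToyRDD` and `ToyPartition` are made-up names, NOT Spark's real classes):

```scala
// A partition knows its index and (in this toy version) holds its data directly.
final case class ToyPartition[T](index: Int, data: Seq[T])

final class ToyRDD[T](
    val partitions: Seq[ToyPartition[T]],                      // 1) a list of partitions
    val dependencies: Seq[ToyRDD[_]] = Nil,                    // 3) dependencies on parent RDDs
    val partitioner: Option[Any => Int] = None,                // 4) optional Partitioner (key-value RDDs)
    val preferredLocations: Map[Int, Seq[String]] = Map.empty  // 5) optional preferred locations per partition
) {
  // 2) a function computed against each partition: map applies f
  //    partition by partition, and the result records this RDD as its parent
  def map[U](f: T => U): ToyRDD[U] =
    new ToyRDD(
      partitions.map(p => ToyPartition(p.index, p.data.map(f))),
      dependencies = Seq(this)
    )

  def collect(): Seq[T] = partitions.flatMap(_.data)
}
```

For example, `new ToyRDD(Seq(ToyPartition(0, Seq(1, 2, 3)))).map(_ + 1).collect()` gathers the per-partition results back into one sequence, and the mapped RDD's `dependencies` points at the original, mirroring the rdd1 ==> rdd2 lineage above.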
https://blog.csdn.net/struct_slllp_main/article/details/76209056
https://blog.csdn.net/budong282712018/article/details/51458974