Resilient Distributed Dataset
RDDs -> Transformation -> ... -> Transformation -> RDDs -> Action -> Result/Persistent Storage
- Resilient means that Spark can automatically reconstruct a lost partition by RECOMPUTING IT FROM THE RDDS that it was computed from.
- Dataset means a read-only collection of objects.
- Distributed means partitioned across the cluster.
Loading an RDD or performing a TRANSFORMATION on one does not trigger any data processing; it merely creates a plan for performing the computation. The computation is triggered only when an ACTION is called.
- If the return type is an RDD, the function is a TRANSFORMATION.
- Otherwise, it is an ACTION.
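A minimal sketch of the lazy behaviour described above, assuming a local master (`local[2]`) and a hypothetical object name. The accumulator is there only to observe when the map function really runs; accumulator updates made inside transformations are not guaranteed to be exact if tasks are retried.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LazyEvaluationSketch {
  // Returns (map calls before the action, map calls after the action, reduce result).
  def run(sc: SparkContext): (Long, Long, Int) = {
    // Accumulator used only to observe when the map function really executes.
    val mapCalls = sc.longAccumulator("mapCalls")

    val numbers: RDD[Int] = sc.parallelize(1 to 10)
    // map returns an RDD, so it is a TRANSFORMATION: only a plan is created here.
    val doubled: RDD[Int] = numbers.map { n => mapCalls.add(1); n * 2 }
    val before = mapCalls.sum

    // reduce returns an Int, so it is an ACTION: this call triggers the computation.
    val total: Int = doubled.reduce(_ + _)
    val after = mapCalls.sum

    (before, after, total)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name are placeholders for illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

The count read before `reduce` should still be zero, since no task has run at that point.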
Java RDD API
- JavaRDDLike Interface
- JavaRDD
- JavaPairRDD (key-value pairs)
RDD Creation
- From an in-memory collection of objects (Parallelizing a Collection)
// RDD: 10 input values, i.e. 1 to 10, with a parallelism level of 5 (5 partitions)
val params = sc.parallelize(1 to 10, 5)
// Computation: each value is passed to the function, and the computation runs in parallel
val result = params.map(performExtensiveComputation)
- Using a dataset from external storage
- In the following example, Spark uses TextInputFormat (the same one as in the old MapReduce API) to split and read the file. So by default, in the case of HDFS, there is one Spark partition per HDFS block.
// TextInputFormat
val text: RDD[String] = sc.textFile(inputPath)
// Sequence File
sc.sequenceFile[IntWritable, Text](inputPath)
// For common Writable, Spark can map them to Java equivalents
sc.sequenceFile[Int, String](inputPath)
// Use newAPIHadoopFile() and newAPIHadoopRDD()
// to create RDDs from an arbitrary Hadoop InputFormat, such as the one for HBase
- Transforming an existing RDD
- Transformation: mapping, grouping, aggregating, repartitioning, sampling, and joining RDDs.
- Action: materializing an RDD as a collection, computing statistics on an RDD, sampling a fixed number of elements from an RDD, saving an RDD to external storage.
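The creation paths and the transformation/action split above can be sketched end to end as follows. This is an illustration under assumptions: the object name, the local master, and the small temporary file standing in for a real dataset are all hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddCreationSketch {
  // Creation from an in-memory collection: 10 values split into 5 partitions.
  // glom() turns each partition into an array so its size can be inspected.
  def partitionSizes(sc: SparkContext): Seq[Int] =
    sc.parallelize(1 to 10, 5).glom().map(_.length).collect().toSeq

  // Creation from external storage, followed by transformations and an action.
  def wordCounts(sc: SparkContext): Map[String, Int] = {
    // Hypothetical input: a temporary file stands in for a real dataset.
    val path = Files.createTempFile("rdd-sketch", ".txt")
    Files.write(path, "a b\nb c\nc c\n".getBytes(StandardCharsets.UTF_8))

    val lines: RDD[String] = sc.textFile(path.toUri.toString)
    val counts: RDD[(String, Int)] = lines
      .flatMap(_.split(" "))   // transformation: mapping lines to words
      .map(word => (word, 1))  // transformation: mapping to key-value pairs
      .reduceByKey(_ + _)      // transformation: aggregating by key

    counts.collect().toMap     // action: materializes the RDD as a local collection
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name are placeholders for illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-creation-sketch")
    val sc = new SparkContext(conf)
    try {
      println(partitionSizes(sc))
      println(wordCounts(sc))
    } finally sc.stop()
  }
}
```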
Cache
Spark can cache a dataset in a cross-cluster in-memory cache, which means any later computation on that dataset will be faster. MapReduce, by contrast, has to load the dataset from disk again to perform another calculation on the same input. Even if an intermediate dataset can be used as input, there is no getting away from the fact that it has to be loaded from disk again.
This turns out to be tremendously helpful for interactive exploration of data, for example, getting the max, min and average on the same dataset.
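A small sketch of that exploration pattern, assuming a local master and made-up sample values. The RDD is marked with cache() once, and the three statistics are then computed as three separate actions over the same dataset.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  // Returns (max, min, mean) computed by three separate actions on one cached RDD.
  def stats(sc: SparkContext): (Double, Double, Double) = {
    // Hypothetical sample values; cache() only marks the RDD, nothing is stored yet.
    val readings = sc.parallelize(Seq(4.0, 8.0, 15.0, 16.0, 23.0, 42.0)).cache()

    val max = readings.max()   // first action: computes the RDD and fills the cache
    val min = readings.min()   // later actions can read the cached partitions
    val mean = readings.mean()

    readings.unpersist()
    (max, min, mean)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name are placeholders for illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-sketch")
    val sc = new SparkContext(conf)
    try println(stats(sc)) finally sc.stop()
  }
}
```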
Storage level
- MEMORY_ONLY
- MEMORY_ONLY_SER
- MEMORY_AND_DISK
- MEMORY_AND_DISK_SER
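A sketch of how a storage level might be chosen, assuming a local master and a hypothetical object name. cache() uses MEMORY_ONLY, while persist() takes an explicit level; the _SER variants keep partitions in serialized form, and the _AND_DISK variants spill partitions that do not fit in memory to disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  // Returns (level set by cache(), level set by persist(...), level after unpersist()).
  def levels(sc: SparkContext): (StorageLevel, StorageLevel, StorageLevel) = {
    val cached = sc.parallelize(1 to 100).cache()
    val cacheLevel = cached.getStorageLevel

    val persisted = sc.parallelize(1 to 100)
      .map(_ * 2)
      .persist(StorageLevel.MEMORY_AND_DISK_SER)
    persisted.count() // an action is still needed to actually store the partitions
    val persistLevel = persisted.getStorageLevel

    persisted.unpersist()
    cached.unpersist()
    (cacheLevel, persistLevel, persisted.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name are placeholders for illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("storage-level-sketch")
    val sc = new SparkContext(conf)
    try println(levels(sc)) finally sc.stop()
  }
}
```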
Spark Job
- The application (SparkContext) serves to group RDDs and shared variables.
- A job always runs in the context of an Application.
- An Application can run more than one job, in series or in parallel.
- An Application provides the mechanism for a job to access an RDD that was cached by the previous job in the same application.
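The points above can be illustrated with one SparkContext that runs two jobs in series. This is a sketch under assumptions: the object name, the local master, and the sample values are hypothetical. A broadcast variable and an accumulator serve as the shared variables, and the second job reads the RDD cached by the first.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ApplicationJobsSketch {
  // Returns (number of map evaluations, result of job 1, result of job 2).
  def run(sc: SparkContext): (Long, Int, Long) = {
    val factor = sc.broadcast(3)                        // shared variable: read-only
    val evaluations = sc.longAccumulator("evaluations") // shared variable: add-only

    val scaled = sc.parallelize(1 to 5)
      .map { n => evaluations.add(1); n * factor.value }
      .cache()

    val sum = scaled.reduce(_ + _) // job 1: computes the RDD and caches it
    val count = scaled.count()     // job 2: same application, can reuse the cached RDD

    // If the cached partitions are still in memory, job 2 should not re-run the map.
    (evaluations.sum, sum, count)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name are placeholders for illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("application-jobs-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```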
Job Run
- Driver: hosts the Application (SparkContext) and schedules tasks for a job.
- Executor: executes the Application's tasks.
- Job Submission: (Application -> Job -> Stages -> Tasks)
- Calling any RDD.Action will submit the job automatically
- runJob() will be called against SparkContext.
- Scheduler will be called
- DAG Scheduler breaks the job into a DAG of stages.
- Task Scheduler submits the tasks from each stage to the cluster.
- Task Execution
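To see where a job would be split into stages, the lineage of an RDD can be inspected before any action is called. The sketch below assumes a local master and hypothetical names. It relies on the fact that stage boundaries fall at shuffle dependencies, so for a linear lineage like this one the stage count is the number of shuffles plus one.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object StageBoundarySketch {
  // Counts shuffle dependencies by walking the lineage back to the source RDDs.
  // In a branching lineage, shared ancestors would be counted once per path.
  def shuffleCount(rdd: RDD[_]): Int =
    rdd.dependencies.map { dep =>
      val here = if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) 1 else 0
      here + shuffleCount(dep.rdd)
    }.sum

  // map is a narrow dependency; reduceByKey introduces a shuffle.
  def wordCountLineage(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(Seq("a", "b", "a", "c"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name are placeholders for illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("stage-boundary-sketch")
    val sc = new SparkContext(conf)
    try {
      val counts = wordCountLineage(sc)
      println(counts.toDebugString)                       // textual view of the lineage
      println(s"shuffle boundaries: ${shuffleCount(counts)}")
      println(counts.collect().toMap)                     // action: submits the job
    } finally sc.stop()
  }
}
```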
Cluster Resource Manager
- Local
- Standalone
- Mesos
- YARN
- YARN Client Mode:
- Client -> driver -> SparkContext
- SparkContext -> YARN application -> YARN Resource Manager
- YARN Node Manager -> Spark ExecutorLauncher Application Master
- YARN Cluster Mode: driver runs in an Application Master process.
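The choice of cluster manager is expressed through the master URL. The sketch below is illustrative only: the host names and ports are placeholders, the object and function names are hypothetical, and Mesos support depends on the Spark version in use. In practice the YARN deploy mode is usually passed to spark-submit rather than set in code, so the last helper only shows which configuration key is involved.

```scala
import org.apache.spark.SparkConf

object MasterUrlSketch {
  // Maps a cluster manager name to a master URL. Hosts and ports are placeholders.
  def masterUrl(manager: String): String = manager match {
    case "local"      => "local[*]"                 // run in-process, one thread per core
    case "standalone" => "spark://master-host:7077" // Spark's own cluster manager
    case "mesos"      => "mesos://mesos-host:5050"
    case "yarn"       => "yarn"                     // cluster located via Hadoop config
    case other        => throw new IllegalArgumentException(s"unknown manager: $other")
  }

  def confFor(manager: String): SparkConf =
    new SparkConf().setAppName("cluster-manager-sketch").setMaster(masterUrl(manager))

  // deployMode is "client" (driver in the client process) or
  // "cluster" (driver inside the YARN application master).
  def yarnConf(deployMode: String): SparkConf =
    confFor("yarn").set("spark.submit.deployMode", deployMode)

  def main(args: Array[String]): Unit = {
    Seq("local", "standalone", "mesos", "yarn").foreach { manager =>
      println(s"$manager -> ${confFor(manager).get("spark.master")}")
    }
    println(yarnConf("client").get("spark.submit.deployMode"))
  }
}
```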