Hadoop Spark

chiruxu4359

Tags: java, big data

Original article: https://my.oschina.net/u/3551123/blog/1488538

Resilient Distributed Dataset

RDDs -> Transformation -> ... -> Transformation -> RDDs -> Action -> Result/Persistent Storage

Resilient means that Spark can automatically reconstruct a lost partition by RECOMPUTING IT FROM THE RDDS that it was derived from.
Dataset means a read-only collection of objects.
Distributed means partitioned across the cluster.

Loading an RDD or performing a TRANSFORMATION on one does not trigger any data processing; it merely creates a plan for performing the computation. The computation is triggered only when an ACTION is called.

If a function's return type is an RDD, the function is a TRANSFORMATION.
Otherwise, it is an ACTION.
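
The lazy behaviour described above can be observed with a small experiment. The following is a minimal sketch, assuming Spark 2.0 or later (for `longAccumulator`) and a local master; the object name `LazyEvaluationSketch` and the accumulator name are made up for illustration. It counts how many times the function passed to `map` has run, before and after an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (calls counted before the action, sum from the action, calls counted after).
  def run(sc: SparkContext): (Long, Int, Long) = {
    val calls = sc.longAccumulator("map calls")

    // Transformation: map returns an RDD, so this only records a plan.
    val doubled = sc.parallelize(1 to 10, 5).map { x => calls.add(1); x * 2 }
    val before = calls.value.longValue // no task has run yet

    // Action: reduce returns an Int, so Spark now executes the plan.
    val sum = doubled.reduce(_ + _)
    val after = calls.value.longValue // updated by the tasks that just ran

    (before, sum, after)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val sc = new SparkContext(new SparkConf().setAppName("lazy-sketch").setMaster("local[2]"))
    try println(run(sc)) finally sc.stop()
  }
}
```

Accumulator updates made inside a transformation can be applied more than once if a task is retried, so the "after" count is best read as a diagnostic rather than an exact figure.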

Java RDD API

JavaRDDLike Interface
- JavaRDD
- JavaPairRDD (key-value pairs)

RDD Creation

From an in-memory collection of objects (parallelizing a collection)

// RDD: 10 input values (1 to 10), split into 5 partitions (the level of parallelism)
val params = sc.parallelize(1 to 10, 5)

// Computation: each value is passed to the function, and the calls run in parallel.
// performExtensiveComputation stands for any expensive user-defined function.
val result = params.map(performExtensiveComputation)

Using a dataset from external storage
- In the following example, Spark uses TextInputFormat (the same one as in the old MapReduce API) to split and read the file. So by default, in the case of HDFS, there is one Spark partition per HDFS block.

// TextInputFormat
val text: RDD[String] = sc.textFile(inputPath)

// Sequence file
sc.sequenceFile[IntWritable, Text](inputPath)
// For common Writables, Spark can map them to their Java equivalents
sc.sequenceFile[Int, String](inputPath)

// Use newAPIHadoopFile() and newAPIHadoopRDD()
// to create RDDs from an arbitrary Hadoop InputFormat, such as HBase
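
As an illustration of `newAPIHadoopFile()`, here is a minimal sketch that reads a text file through the new-API `TextInputFormat`. The object name `NewApiHadoopFileSketch`, the temporary file and the local master are assumptions made so the example runs without a cluster; an HBase input format would need its own configuration, which is not shown.

```scala
import java.nio.file.Files
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object NewApiHadoopFileSketch {
  // Reads a text file as (byte offset, line) pairs and returns the lines.
  def readLines(sc: SparkContext, inputPath: String): Array[String] =
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](inputPath)
      // Hadoop reuses Writable instances, so copy each value to a String right away.
      .map { case (_, line) => line.toString }
      .collect()

  def main(args: Array[String]): Unit = {
    // A throwaway local file stands in for a real HDFS path.
    val tmp = Files.createTempFile("rdd-sketch", ".txt")
    Files.write(tmp, "first line\nsecond line\n".getBytes("UTF-8"))

    val sc = new SparkContext(new SparkConf().setAppName("hadoop-file-sketch").setMaster("local[2]"))
    try readLines(sc, tmp.toUri.toString).foreach(println) finally sc.stop()
  }
}
```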

Transforming an existing RDD
- Transformations: mapping, grouping, aggregating, repartitioning, sampling and joining RDDs.
- Actions: materializing an RDD as a collection, computing statistics on an RDD, taking a fixed number of elements from an RDD, saving an RDD to external storage.
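
A few of the transformations and actions listed above can be sketched together. This is only an illustrative example, not an exhaustive tour; the object name `TransformationsAndActionsSketch` and the sample words are invented for the purpose.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsAndActionsSketch {
  // Returns (word counts as a local Map, number of distinct words, up to two frequent words).
  def run(sc: SparkContext): (Map[String, Int], Long, Array[String]) = {
    val words = sc.parallelize(Seq("spark", "hadoop", "spark", "yarn", "spark", "hadoop"), 2)

    // Transformations: each call returns a new RDD and runs nothing yet.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)    // mapping, then aggregating by key
    val frequent = counts.filter { case (_, n) => n >= 2 }.keys

    // Actions: each call returns a plain value and triggers a job.
    val asMap = counts.collect().toMap   // materialize the RDD as a local collection
    val distinctWords = counts.count()   // compute a statistic
    val someFrequent = frequent.take(2)  // take a fixed number of elements (order not guaranteed)

    (asMap, distinctWords, someFrequent)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ops-sketch").setMaster("local[2]"))
    try {
      val (asMap, distinctWords, someFrequent) = run(sc)
      println(s"$asMap, $distinctWords distinct, frequent: ${someFrequent.mkString(",")}")
    } finally sc.stop()
  }
}
```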

Cache

Spark can cache a dataset in memory across the cluster, which means any later computation on that dataset will be faster. In MapReduce, by contrast, performing another calculation on the same input dataset means loading it from disk again. Even if an intermediate dataset can be used as input, there is no getting away from the fact that it has to be loaded from disk again.

This turns out to be tremendously helpful for interactive exploration of data, for example, getting the max, min and average on the same dataset.
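
The max/min/average case mentioned above might look like the following minimal sketch. The object name `CacheSketch` and the sample data are assumptions; the point is only that the three actions share one cached RDD instead of recomputing it each time.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  // Returns (max, min, mean) of the same cached dataset.
  def stats(sc: SparkContext): (Int, Int, Double) = {
    // cache() only marks the RDD for caching; nothing is stored until an action runs.
    val values = sc.parallelize(1 to 10, 5).cache()

    val max = values.max()   // first action: computes the partitions and caches them
    val min = values.min()   // later actions can reuse the cached partitions
    val mean = values.mean() // available on numeric RDDs through an implicit conversion

    (max, min, mean)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-sketch").setMaster("local[2]"))
    try println(stats(sc)) finally sc.stop()
  }
}
```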

Storage level

MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
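
A storage level is chosen by passing it to `persist()`; `cache()` on an RDD is shorthand for `persist(StorageLevel.MEMORY_ONLY)`. The sketch below, with the made-up object name `StorageLevelSketch`, shows one way to pick a level and release it again; the comments summarize what each level does.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  // MEMORY_ONLY         : deserialized objects in memory; partitions that do not fit are recomputed
  // MEMORY_ONLY_SER     : serialized bytes in memory; more compact, more CPU to read
  // MEMORY_AND_DISK     : like MEMORY_ONLY, but partitions that do not fit spill to disk
  // MEMORY_AND_DISK_SER : like MEMORY_ONLY_SER, but partitions that do not fit spill to disk

  // Returns (level while persisted, element count, level after unpersist).
  def run(sc: SparkContext): (StorageLevel, Long, StorageLevel) = {
    val rdd = sc.parallelize(1 to 1000, 4).map(_ * 2)

    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
    val chosen = rdd.getStorageLevel

    val n = rdd.count() // the first action materializes and stores the partitions

    rdd.unpersist()     // drop the cached blocks when they are no longer needed
    (chosen, n, rdd.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("storage-sketch").setMaster("local[2]"))
    try println(run(sc)) finally sc.stop()
  }
}
```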

Spark Job

The application (SparkContext) serves to group RDDs and shared variables.
A job always runs in the context of an application.
- An application can run more than one job, in series or in parallel.
- An application provides the mechanism for a job to access an RDD that was cached by a previous job in the same application.

Job Run

Driver: hosts the application (SparkContext) and schedules tasks for a job.
Executor: executes the application's tasks.
Job Submission: (Application -> Job -> Stages -> Tasks)
- Calling any action on an RDD submits a job automatically.
- runJob() is called on the SparkContext.
- The scheduler is called:
  - The DAG Scheduler breaks the job into a DAG of stages.
  - The Task Scheduler submits the tasks from each stage to the cluster.
- Task Execution
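
The submission path above can be poked at directly. In this minimal sketch (the object name `JobSubmissionSketch` is invented), `runJob()` is called by hand to run one task per partition, and `toDebugString` prints the lineage of an RDD whose `reduceByKey` requires a shuffle, which is where the DAG scheduler would split the job into two stages.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobSubmissionSketch {
  // Returns (one sum per partition, lineage description of a shuffled RDD).
  def run(sc: SparkContext): (Array[Int], String) = {
    val nums = sc.parallelize(1 to 10, 5)

    // Actions call runJob internally; calling it directly runs one task per partition.
    val partitionSums = sc.runJob(nums, (it: Iterator[Int]) => it.sum)

    // reduceByKey introduces a shuffle dependency, i.e. a stage boundary.
    val byParity = nums.map(n => (n % 2, n)).reduceByKey(_ + _)

    (partitionSums, byParity.toDebugString)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-sketch").setMaster("local[2]"))
    try {
      val (partitionSums, lineage) = run(sc)
      println(partitionSums.mkString(", "))
      println(lineage)
    } finally sc.stop()
  }
}
```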

Cluster Resource Manager

Local
Standalone
Mesos
YARN
- YARN client mode:
  - Client -> driver -> SparkContext
  - SparkContext -> YARN application -> YARN Resource Manager
  - YARN node -> Application Master for Spark (ExecutorLauncher)
- YARN cluster mode: the driver runs in an Application Master process.
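
From the application's side, the cluster manager is selected through the master URL. The sketch below maps the four options listed above to typical master URLs; the object name `MasterUrlSketch`, the host name `master-host` and the ports are placeholders, not values from the original notes. The YARN deploy mode (client or cluster) is normally chosen at submit time with spark-submit's `--deploy-mode` option rather than in code.

```scala
import org.apache.spark.SparkConf

object MasterUrlSketch {
  // Builds a SparkConf for the named cluster manager.
  // Host names and ports are placeholders for illustration only.
  def confFor(manager: String): SparkConf = {
    val master = manager match {
      case "local"      => "local[*]"                 // run in this JVM, one thread per core
      case "standalone" => "spark://master-host:7077" // Spark's own standalone cluster manager
      case "mesos"      => "mesos://master-host:5050"
      case "yarn"       => "yarn"                     // cluster location comes from the Hadoop configuration
      case other        => throw new IllegalArgumentException(s"unknown cluster manager: $other")
    }
    new SparkConf().setAppName("master-url-sketch").setMaster(master)
  }

  def main(args: Array[String]): Unit =
    Seq("local", "standalone", "mesos", "yarn").foreach { m =>
      println(s"$m -> ${confFor(m).get("spark.master")}")
    }
}
```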

Reposted from: https://my.oschina.net/u/3551123/blog/1488538