Hadoop Spark

Resilient Distributed Dataset

RDDs -> Transformation -> ... -> Transformation -> RDDs -> Action -> Result/Persistent Storage

  • Resilient means that Spark can automatically reconstruct a lost partition by RECOMPUTING IT FROM THE RDDS that it was computed from.
  • Dataset means it is a read-only collection of objects.
  • Distributed means partitioned across the cluster.

Loading an RDD or performing a TRANSFORMATION on one does not trigger any data processing; it merely creates a plan for performing the computation. The computation is triggered only when an ACTION is called.

  • If the return type is an RDD, the function is a TRANSFORMATION.
  • Otherwise, it is an ACTION (a minimal sketch follows this list).
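
For example, a minimal sketch (assuming a SparkContext sc and an input path, as in the snippets further down):

// textFile() and map() are TRANSFORMATIONS: nothing is read or computed yet,
// they only record the plan (the lineage) for building the RDDs
val lines = sc.textFile(inputPath)
val lengths = lines.map(line => line.length)

// count() is an ACTION: it triggers the actual read and computation
// and returns a plain Long to the driver
val total = lengths.count()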

Java RDD API

  • JavaRDDLike Interface
    • JavaRDD
    • JavaPairRDD (key-value pairs)

RDD Creation

  • From an in-memory collection of objects (Parallelizing a Collection)
// RDD: 10 input values, i.e. 1 to 10, split across 5 partitions (parallelization level 5)
val params = sc.parallelize(1 to 10, 5)

// Computation: the values are passed to the function and the computation runs in parallel
val result = params.map(performExtensiveComputation)
  • Using a dataset from external storage
    • In the following example, Spark uses TextInputFormat (the same as in the old MapReduce API) to split and read the file, so by default, in the case of HDFS, there is one Spark partition per HDFS block.
// TextInputFormat
val text: RDD[String] = sc.textFile(inputPath)

// Sequence File
sc.sequenceFile[IntWritable, Text](inputPath)
// For common Writable types, Spark can map them to their Scala equivalents
sc.sequenceFile[Int, String](inputPath)

// newAPIHadoopFile() and newAPIHadoopRDD() create RDDs from an arbitrary
// Hadoop InputFormat, such as HBase (see the sketch after this list)
  • Transforming an existing RDD
    • Transformation: mapping, grouping, aggregating, repartitioning, sampling and joining RDDs.
    • Action: materializing an RDD as a collection, computing statistics on an RDD, sampling a fixed number of elements from an RDD, saving an RDD to external storage.
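
A minimal sketch of the newAPIHadoopFile() route mentioned above, using the new-API TextInputFormat as the simplest case (an HBase RDD would instead use HBase's own InputFormat and configuration):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// key = byte offset within the file, value = the line of text
val hadoopRdd = sc.newAPIHadoopFile(
  inputPath,
  classOf[TextInputFormat],   // new-API InputFormat
  classOf[LongWritable],      // key class
  classOf[Text])              // value class

// copy the Writables into plain Strings before further use, since
// Hadoop record readers may reuse the same Writable objects
val lines = hadoopRdd.map { case (_, value) => value.toString }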

Cache

Spark can cache a dataset in a cross-cluster in-memory cache, which means any further computation on that dataset will be faster. With MapReduce, by contrast, performing another calculation on the same input dataset means loading it from disk again; even when an intermediate dataset could be used as input, there is no getting away from the fact that it has to be read from disk.

This turns out to be tremendously helpful for interactive exploration of data, for example, getting the max, min and average on the same dataset.
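
A minimal sketch of that pattern, assuming the input file holds one number per line:

// parse once and keep the result in the cross-cluster cache
val nums = sc.textFile(inputPath).map(_.toDouble).cache()

// the first action materializes and caches the RDD; the later actions
// reuse the cached partitions instead of re-reading the file from disk
val max = nums.max()
val min = nums.min()
val mean = nums.mean()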

Storage level

  • MEMORY_ONLY
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_SER
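
cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); the other levels are requested explicitly through persist(). A minimal sketch:

import org.apache.spark.storage.StorageLevel

// keep the partitions in memory in serialized form and spill to disk
// if they do not fit (saves memory at the cost of extra CPU)
val cached = sc.textFile(inputPath).persist(StorageLevel.MEMORY_AND_DISK_SER)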

Spark Job

  • The application (SparkContext) serves to group RDDs and shared variables.
  • A job always runs in the context of an application.
    • An application can run more than one job, in series or in parallel.
    • An application provides the mechanism for a job to access an RDD that was cached by a previous job in the same application (a sketch follows this list).
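
A minimal sketch of the points above (the application name is a placeholder): one SparkContext is one application, and each action submits a separate job that can reuse what an earlier job in the same application cached:

import org.apache.spark.{SparkConf, SparkContext}

// one application = one SparkContext; it owns the cached RDDs and shared variables
val conf = new SparkConf().setAppName("stats")
val sc = new SparkContext(conf)

val data = sc.textFile(inputPath).cache()

// two jobs in the same application, run in series; the second job
// reads the RDD that the first job cached instead of the original file
val lineCount = data.count()
val charCount = data.map(_.length.toLong).reduce(_ + _)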

Job Run

  • Driver: hosts the application (SparkContext) and schedules tasks for a job.
  • Executor: executes the application's tasks.
  • Job Submission: (Application -> Job -> Stages -> Tasks)
    • Calling any RDD action submits a job automatically.
    • runJob() is called on the SparkContext.
    • The schedulers are invoked:
      • The DAG scheduler breaks the job into a DAG of stages.
      • The task scheduler submits the tasks from each stage to the cluster.
    • Task execution: the executors run the tasks (a sketch of a two-stage job follows this list).
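
A minimal sketch of a job that the DAG scheduler splits into two stages, because reduceByKey() needs a shuffle (outputPath is a placeholder):

// stage 1: narrow transformations (read the input and emit (word, 1) pairs)
val pairs = sc.textFile(inputPath)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// the shuffle required by reduceByKey() is the boundary between the two stages
val counts = pairs.reduceByKey(_ + _)

// the action submits the job: runJob() is called on the SparkContext,
// the DAG scheduler builds the stages and the task scheduler launches
// one task per partition of each stage on the executors
counts.saveAsTextFile(outputPath)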

Cluster Resource Manager

  • Local
  • Standalone
  • Mesos
  • YARN
    • YARN Client Mode:
      • Client -> driver -> SparkContext
      • SparkContext -> YARN application -> YARN Resource Manager
      • YARN node manager -> Application Master (Spark's ExecutorLauncher) -> executor containers
    • YARN Cluster Mode: the driver runs inside the YARN Application Master process (the spark-submit sketch below shows how each mode is selected).
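
A sketch of how the two modes are selected at submission time (the class and jar names are placeholders):

# client mode: the driver runs in the client process; the Application Master
# (Spark's ExecutorLauncher) only requests executor containers
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

# cluster mode: the driver runs inside the YARN Application Master process
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar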

 

 

Reposted from: https://my.oschina.net/u/3551123/blog/1488538
