24: RDD Source Code Analysis

1: A First Look at Spark

From the official site http://spark.apache.org:
Apache Spark™ is a unified analytics engine for large-scale data processing.
Apache Spark is a unified engine for large-scale data processing and analytics, and it has the following four characteristics:

1.1: Fast Execution

Compared with Hadoop, the programming model is different: MapReduce computes with processes and basically every step has to be written out to disk, while Spark computes with threads and runs DAG-based, pipelined computation.
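As a rough illustration (a minimal sketch, not from the original post; the object name is made up), a chain of narrow transformations in Spark is pipelined over each partition inside one stage, with no intermediate result written to disk:

import org.apache.spark.{SparkConf, SparkContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PipelineSketch").setMaster("local[2]"))
    // map and filter are narrow transformations: Spark pipelines them over each
    // partition within a single stage, so nothing is materialized to disk between
    // the steps (unlike MapReduce, where intermediate results go back to disk).
    val count = sc.parallelize(1 to 1000000)
      .map(_ * 2)
      .filter(_ % 3 == 0)
      .count()            // only this action triggers execution of the DAG
    println(count)
    sc.stop()
  }
}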

1.2: Ease of Use

You can program in Scala, Python, Java, R, and SQL, and Spark offers more than 80 high-level operators.

1.3: Generality

This shows in the ecosystem stack: Spark SQL, streaming, MLlib, and GraphX all run on the same engine, so a wide range of problems can be handled effectively.

1.4: Runs Everywhere

Spark runs on Hadoop (on YARN), Apache Mesos, Kubernetes (supported since 2.3), standalone (Spark's own cluster manager), or in the cloud. It can access diverse data sources.

2: RDD Source Code

From the RDD scaladoc: A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join.

  • immutable: an RDD cannot be changed in place; transformations such as map produce a new collection
  • partitioned collection of elements: the data is split into partitions
  • can be operated on in parallel: the partitions are processed in parallel, even when developing and running on a single machine

2.1 Internally, each RDD is characterized by five main properties:

  • A list of partitions
  • A function for computing each split
  • A list of dependencies on other RDDs
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

RDD is an abstract class that extends Serializable and mixes in Logging:

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging { 

2.2 The five methods that implement the five RDD properties:

Concrete RDDs such as HadoopRDD and JdbcRDD implement these methods themselves; a sketch of a custom RDD follows the source excerpts below.

  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   *
   * The partitions in this array must satisfy the following property:
   *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
   */
  protected def getPartitions: Array[Partition]

  /**
   * Implemented by subclasses to compute a given partition.
   */
  def compute(split: Partition, context: TaskContext): Iterator[T]

  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getDependencies: Seq[Dependency[_]] = deps

  /**
   * Optionally overridden by subclasses to specify placement preferences.
   */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  /** Optionally overridden by subclasses to specify how they are partitioned. */
  @transient val partitioner: Option[Partitioner] = None

  // =======================================================================
  // Methods and fields available on all RDDs
  // =======================================================================

  /** The SparkContext that created this RDD. */
  def sparkContext: SparkContext = sc
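To make the five properties and methods concrete, here is a minimal sketch of a custom RDD that produces the numbers [start, end) across numSlices partitions. The class names MyRangeRDD and MyRangePartition are made up for illustration and are not part of Spark.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A Partition implementation that only needs to remember its index.
case class MyRangePartition(index: Int) extends Partition

// Hypothetical subclass of RDD, not taken from the Spark code base.
class MyRangeRDD(sc: SparkContext, start: Int, end: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {   // Nil: no parent RDDs, hence no dependencies

  // Property 1: a list of partitions.
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => MyRangePartition(i): Partition).toArray

  // Property 2: a function for computing each split.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val size = end - start
    val lo   = start + split.index * size / numSlices
    val hi   = start + (split.index + 1) * size / numSlices
    (lo until hi).iterator
  }

  // Properties 3-5 keep the defaults inherited from RDD: getDependencies returns
  // the Nil passed above, partitioner is None, getPreferredLocations returns Nil.
}

With a SparkContext sc (created as in section 3), new MyRangeRDD(sc, 0, 10, 3).collect() returns the numbers 0 through 9, computed by three tasks.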

3: Initializing Spark

The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.

Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.

3.1: SparkContext

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
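For example (a minimal sketch, not from the original post; the object name and values are made up), the three kinds of objects mentioned above look like this:

import org.apache.spark.{SparkConf, SparkContext}

object ContextFeatures {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ContextFeatures").setMaster("local[2]"))

    val rdd   = sc.parallelize(Seq(1, 2, 3, 4))            // an RDD from a local collection
    val evens = sc.longAccumulator("evens")                 // an accumulator, added to by tasks
    val bcast = sc.broadcast(Map(1 -> "one", 2 -> "two"))   // a broadcast variable, read-only on executors

    rdd.foreach(x => if (x % 2 == 0) evens.add(1))
    println(evens.value)      // the driver reads back the accumulated value: 2
    println(bcast.value(2))   // "two"

    sc.stop()
  }
}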

3.2: SparkConf

Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
Most of the time, you would create a SparkConf object with new SparkConf(), which will load
values from any spark.* Java system properties set in your application as well. In this case,
parameters you set directly on the SparkConf object take priority over system properties.
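As a small sketch of that precedence rule (the property key spark.master is real; the object name and values are illustrative and not from the original post): even if the same key arrives via a spark.* system property, an explicit set() on the SparkConf wins.

import org.apache.spark.SparkConf

object SparkConfSketch {
  def main(args: Array[String]): Unit = {
    // If the JVM were started with -Dspark.master=local[4], new SparkConf()
    // would pick that system property up automatically ...
    val conf = new SparkConf()
      .setAppName("ConfSketch")
      .set("spark.master", "local[2]")   // ... but a value set here takes priority
    println(conf.get("spark.master"))    // prints local[2]
  }
}

The full SparkContextApp example below then combines SparkConf and SparkContext: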

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextApp {

  def main(args: Array[String]): Unit = {
    // SparkConf carries the application settings; SparkContext is the entry point.
    val conf = new SparkConf().setAppName("First").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // TODO: create RDDs and run the job here

    sc.stop()
  }
}

In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass “local” to run Spark in-process.
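A minimal sketch of that (the object name, jar name, and master URL here are illustrative, not from the original post): omit setMaster in the code and supply the master through spark-submit instead.

import org.apache.spark.{SparkConf, SparkContext}

object SubmitApp {
  def main(args: Array[String]): Unit = {
    // No setMaster here: the master URL is taken from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("SubmitApp"))
    // ... job logic ...
    sc.stop()
  }
}

// Launched with something like:
//   spark-submit --class SubmitApp --master yarn submit-app.jar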
