Spark Introduction
Apache Spark is a big data processing framework. It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support the kinds of computation MapReduce handles poorly, including iterative, interactive, and streaming workloads.
Spark use cases:
- Complex batch data processing: the data volume exceeds what a single machine can handle, or heavy computation is required.
- Interactive queries over historical data (interactive query)
- Processing of real-time data streams (streaming data processing)
RDD[Resilient Distributed Dataset]
- Definition
- Distributed: the data is stored across multiple nodes, and the computation also runs on multiple nodes.
- Resilient: resilience shows up in the computation; lost partitions can be recomputed.
- Dataset: simply a block of data, i.e. a collection of records.
- immutable: an RDD cannot be changed once created; transformations produce new RDDs.
- parallel: operating on an RDD is equivalent to operating on all of its partitions at the same time (see the sketch below).
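For example, a minimal sketch of these properties, assuming a SparkContext named sc is available (e.g. the one provided by spark-shell): map returns a new RDD instead of modifying the original, and it is applied to each partition independently.

val nums = sc.parallelize(1 to 10, numSlices = 4)  // an RDD split into 4 partitions
val doubled = nums.map(_ * 2)                      // a new RDD; nums itself is never modified (immutable)
println(doubled.getNumPartitions)                  // 4: the map ran on each partition in parallel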
/* Represents an immutable,
* partitioned collection of elements that
* can be operated on in parallel. */
abstract class RDD[T: ClassTag]( // abstract class: it cannot be instantiated directly; every concrete RDD is a subclass
    @transient private var _sc: SparkContext, // @transient: this field is not serialized
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {} // RDDs themselves are Serializable
// An example of an RDD subclass
class JdbcRDD[T: ClassTag](
    sc: SparkContext,
    getConnection: () => Connection,
    sql: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int,
    mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _)
  extends RDD[T](sc, Nil) with Logging {}
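A hedged usage sketch of the JdbcRDD subclass above (the JDBC URL, credentials, table and column names are made-up placeholders; the SQL must contain two '?' markers, which JdbcRDD fills with each partition's lower and upper bound):

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val users = new JdbcRDD(
  sc,                                                       // an existing SparkContext
  () => DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "pass"), // hypothetical connection
  "SELECT id, name FROM users WHERE id >= ? AND id <= ?",   // the two '?' receive the partition bounds
  1, 1000, 3,                                               // lowerBound, upperBound, numPartitions
  rs => (rs.getLong(1), rs.getString(2)))                   // mapRow: convert each ResultSet row
// users is an RDD[(Long, String)] with 3 partitions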
The five main properties of an RDD:
1. A list of partitions
2. A function for computing each split/partition
(one operation applies the same function to every partition)
3. A list of dependencies on other RDDs
RDDA => RDDB => RDDC => RDDD
4. Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
5. Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
def compute(split: Partition, context: TaskContext): Iterator[T]
// corresponds to property 2
protected def getPartitions: Array[Partition]
// corresponds to property 1
protected def getDependencies: Seq[Dependency[_]] = deps
// corresponds to property 3
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
// corresponds to property 5
@transient val partitioner: Option[Partitioner] = None
// corresponds to property 4
- In Spark, the number of partitions determines the number of tasks: each partition is processed by one task (see the sketch below).
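To see how the five properties and the partition-to-task mapping fit together, here is a minimal, hypothetical RDD subclass (not taken from the Spark source; the class names are made up) that implements property 1 (getPartitions) and property 2 (compute). It has no parent RDDs, so its dependency list is Nil, and an action on it launches one task per partition.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A partition that only records its index
class SlicePartition(val index: Int) extends Partition

// A toy RDD that spreads the range [0, n) across numSlices partitions
class RangeLikeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {                        // Nil: empty dependency list (property 3)

  // Property 1: the list of partitions
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices)(i => new SlicePartition(i))

  // Property 2: how to compute one partition
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    (split.index until n by numSlices).iterator
}

// new RangeLikeRDD(sc, 10, 3).collect() launches 3 tasks, one per partition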
SparkContext
SparkContext tells Spark how to access a cluster. (To create a SparkContext, you first need to build a SparkConf object.)
- local
- standalone
- yarn
- mesos
A SparkConf object contains information about your application.
- Create a SparkConf (key-value pairs)
- Application name / cores / memory
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there.
- Use local mode for local testing whenever possible.
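A minimal end-to-end sketch putting SparkConf and SparkContext together; the application name, the local[*] master, and the memory setting are just illustrative choices for local testing:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {                                  // hypothetical application name
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountApp")
      .setMaster("local[*]")                           // local mode; on a cluster, pass --master to spark-submit instead
      .set("spark.executor.memory", "1g")              // arbitrary key-value configuration
    val sc = new SparkContext(conf)

    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)

    sc.stop()
  }
}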
spark-shell
- Use --help to see all available options
- Important options (an example invocation follows this list):
- --master
- --name
- --jars
- --conf
- --driver-memory
- --executor-memory
- --executor-cores
- --driver-cores
- --queue
- --num-executors
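A hedged example of what an invocation with these options might look like; the queue name, jar path, and resource sizes below are made-up placeholders:

spark-shell \
  --master yarn \
  --name my-shell \
  --queue default \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  --driver-memory 1g \
  --jars /path/to/extra.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer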