Spark Introduction
Apache Spark is a big data processing framework. It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support the kinds of computation MapReduce handles poorly, including iterative, interactive, and streaming workloads.
Spark use cases:
- Complex batch data processing: the data volume exceeds what a single machine can handle, or heavy computation is required.
- Interactive queries over historical data (interactive query)
- Processing of real-time data streams (streaming data processing)
RDD[Resilient Distributed Dataset]
- Definition
- Distributed: the data is stored across multiple nodes, and the computation also runs on multiple nodes.
- Resilient: resilience shows up in the computation; lost partitions can be recomputed.
- Dataset: simply a block of data, i.e. a collection of records.
- immutable: an RDD cannot be changed once created; transformations produce new RDDs.
- parallel: operating on an RDD is equivalent to operating on all of its partitions at the same time (see the sketch below).
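For example, a minimal sketch of these properties, assuming a SparkContext named sc is available (e.g. the one provided by spark-shell): map returns a new RDD instead of modifying the original, and it is applied to each partition independently.

val nums = sc.parallelize(1 to 10, numSlices = 4)  // an RDD split into 4 partitions
val doubled = nums.map(_ * 2)                      // a new RDD; nums itself is never modified (immutable)
println(doubled.getNumPartitions)                  // 4: the map ran on each partition in parallel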
/* Represents an immutable,
* partitioned collection of elements that
* can be operated on in parallel. */
abstract class RDD[T: ClassTag]( // abstract class: it cannot be instantiated directly; every concrete RDD is a subclass
    @transient private var _sc: SparkContext, // @transient: this field is not serialized
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {} // RDDs themselves are Serializable
// An example of an RDD subclass
class JdbcRDD[T: ClassTag](
    sc: SparkContext,
    getConnection: () => Connection,
    sql: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int,
    mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _)
  extends RDD[T](sc, Nil) with Logging {}
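A hedged usage sketch of the JdbcRDD subclass above (the JDBC URL, credentials, table and column names are made-up placeholders; the SQL must contain two '?' markers, which JdbcRDD fills with each partition's lower and upper bound):

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val users = new JdbcRDD(
  sc,                                                       // an existing SparkContext
  () => DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "pass"), // hypothetical connection
  "SELECT id, name FROM users WHERE id >= ? AND id <= ?",   // the two '?' receive the partition bounds
  1, 1000, 3,                                               // lowerBound, upperBound, numPartitions
  rs => (rs.getLong(1), rs.getString(2)))                   // mapRow: convert each ResultSet row
// users is an RDD[(Long, String)] with 3 partitions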
The five main properties of an RDD:
1. A list of partitions
2. A function for computing each split/partition
(one operation applies the same function to every partition)
3. A list of dependencies on other RDDs
RDDA => RDDB => RDDC => RDDD
4. Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
5. Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
def compute(split: Partition, context: TaskContext): Iterator[T]
// corresponds to property 2
protected def getPartitions: Array[Partition]
// corresponds to property 1
protected def getDependencies: Seq[Dependency[_]] = deps
// corresponds to property 3
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
// corresponds to property 5
@transient val partitioner: Option[Partitioner] = None
// corresponds to property 4
- In Spark, the number of partitions determines the number of tasks: each partition is processed by one task (see the sketch below).
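To see how the five properties and the partition-to-task mapping fit together, here is a minimal, hypothetical RDD subclass (not taken from the Spark source; the class names are made up) that implements property 1 (getPartitions) and property 2 (compute). It has no parent RDDs, so its dependency list is Nil, and an action on it launches one task per partition.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A partition that only records its index
class SlicePartition(val index: Int) extends Partition

// A toy RDD that spreads the range [0, n) across numSlices partitions
class RangeLikeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {                        // Nil: empty dependency list (property 3)

  // Property 1: the list of partitions
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices)(i => new SlicePartition(i))

  // Property 2: how to compute one partition
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    (split.index until n by numSlices).iterator
}

// new RangeLikeRDD(sc, 10, 3).collect() launches 3 tasks, one per partition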
SparkContext
SparkContext tells Spark how to access a cluster. (To create a SparkContext, you first need to build a SparkConf object.)
- local
- standalone
- yarn
- mesos
A SparkConf object contains information about your application.
- Create a SparkConf (key-value pairs)
- Application name / cores / memory
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there.
- Use local mode for local testing whenever possible.
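A minimal end-to-end sketch putting SparkConf and SparkContext together; the application name, the local[*] master, and the memory setting are just illustrative choices for local testing:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {                                  // hypothetical application name
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountApp")
      .setMaster("local[*]")                           // local mode; on a cluster, pass --master to spark-submit instead
      .set("spark.executor.memory", "1g")              // arbitrary key-value configuration
    val sc = new SparkContext(conf)

    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)

    sc.stop()
  }
}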
spark-shell
- Use --help to see all available options
- Important options (an example invocation follows this list):
- --master
- --name
- --jars
- --conf
- --driver-memory
- --executor-memory
- --executor-cores
- --driver-cores
- --queue
- --num-executors
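A hedged example of what an invocation with these options might look like; the queue name, jar path, and resource sizes below are made-up placeholders:

spark-shell \
  --master yarn \
  --name my-shell \
  --queue default \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  --driver-memory 1g \
  --jars /path/to/extra.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer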