Setting up Spark on a Mac
mkdir App //drop all the downloaded archives into this directory
cd App
Unpack Scala:
tar -xvf scala-2.12.5.tgz
Check it:
cd scala-2.12.5
cd bin
./scala //the REPL should start
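To confirm the REPL is working, evaluate a trivial expression (prompt and output shown roughly as a Scala 2.12 REPL prints them):
scala> 1 + 1
res0: Int = 2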
Next, configure the environment variables:
vi ~/.bash_profile
and add:
## scala
export SCALA_HOME=/Users/hyq/Desktop/app/scala-2.12.5
export PATH=$PATH:$SCALA_HOME/bin
Then reload the profile:
source ~/.bash_profile
scala -version //verify the installation
Install Spark
Unpack Spark:
tar -xvf spark-2.3.0-bin-hadoop2.6.tgz
Next, configure the environment variables:
vi ~/.bash_profile
and add:
## Spark
export SPARK_HOME=/Users/hyq/Desktop/app/spark-2.3.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
source ~/.bash_profile
spark-shell //verify the installation
If the big Spark ASCII banner appears, the installation succeeded.
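As a further sanity check, run a trivial job in the shell. This is a minimal sketch; sc is the SparkContext that spark-shell creates for you:
scala> sc.parallelize(1 to 9).map(_ + 1).sum()
res0: Double = 54.0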
RDD
RDDs dramatically lower the barrier to writing distributed applications and improve execution efficiency.
The RDD source code
Resilient Distributed Dataset (RDD)
* Resilient: the resilience is at the compute level; a lost partition can be recomputed from its lineage.
* Distributed: the data is stored across different nodes, and the code can run on one or more nodes (provided there are no dependencies between them).
* Dataset: the data set, analogous to the Blocks of a file in HDFS.
the basic abstraction in Spark
Represents an immutable, partitioned collection of elements that can be operated on in parallel.
* immutable: like a val, an RDD cannot be changed once created; transforming RDDA ⇒ RDDB always yields a new RDD.
* partitioned: splittable into partitions, analogous to HDFS Blocks.
* operated on in parallel: the partitions can be processed in parallel, as the sketch below shows.
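A minimal sketch in spark-shell illustrating all three points (the names rddA and rddB are illustrative): map does not modify rddA; it returns a brand-new RDD whose partitions are transformed in parallel.
val rddA = sc.parallelize(1 to 9)   // a partitioned collection
val rddB = rddA.map(_ + 1)          // a new, immutable RDD; rddA is untouched
rddB.collect()                      // Array(2, 3, 4, 5, 6, 7, 8, 9, 10)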
Internally, each RDD is characterized by five main properties:
- A list of partitions: an RDD is made up of multiple partitions.
- A function for computing each split/partition: the same compute function is applied to every partition.
- A list of dependencies on other RDDs: the lineage, e.g. load ⇒ RDDA ⇒ RDDB ⇒ RDDC.
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
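These properties all surface on the RDD API itself, so you can inspect them in spark-shell. A sketch (names illustrative); the compute function is the fifth property, invoked by each task on its partition:
val rdd = sc.parallelize(1 to 9, 3).map(_ + 1)
rdd.partitions.length                     // 3: the list of partitions
rdd.dependencies                          // the dependencies on the parent RDD
rdd.partitioner                           // None: the optional Partitioner (key-value RDDs only)
rdd.preferredLocations(rdd.partitions(0)) // the optional preferred locations for partition 0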
RDDA = (1,2,3,4,5,6,7,8,9), operation: +1
hadoop000: partition1: (1,2,3) +1
hadoop001: partition2: (4,5,6) +1
hadoop002: partition3: (7,8,9) +1
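The same computation as a sketch in spark-shell: 3 explicit partitions, and glom() gathers each partition into its own array so the split is visible.
val rddA = sc.parallelize(1 to 9, 3)
rddA.map(_ + 1).glom().collect()
// Array(Array(2, 3, 4), Array(5, 6, 7), Array(8, 9, 10))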
The definition of RDD:
abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext,
@transient private var deps: Seq[Dependency[_]]
) extends Serializable with Logging {}
- abstract class: RDD must be implemented by subclasses; in practice you work with the concrete subclasses directly (a minimal sketch follows this list)
- Serializable: an RDD must be serializable so it can be shipped to executors together with the tasks that use it
- Logging: Spark's internal trait that mixes in logging methods
- T: the element type parameter (bounded by a ClassTag)
- SparkContext: the entry point to Spark; every RDD holds a reference to the context that created it
- @transient: marks fields, such as the SparkContext, that must be excluded from serialization
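To make the contract concrete, here is a minimal, hypothetical subclass (all names are illustrative, not Spark API): it only has to provide getPartitions and compute, and it passes Nil for deps because it has no parent RDDs.
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A partition that simply carries its slice of the data.
class SeqPartition(override val index: Int, val values: Seq[Int]) extends Partition

// Hypothetical minimal RDD over a local Seq, split into numSlices partitions.
class SimpleSeqRDD(sc: SparkContext, data: Seq[Int], numSlices: Int)
  extends RDD[Int](sc, Nil) { // Nil: no dependencies on other RDDs

  // Property 1: the list of partitions.
  override protected def getPartitions: Array[Partition] =
    data.grouped(math.max(1, data.size / numSlices))
      .zipWithIndex
      .map { case (chunk, i) => new SeqPartition(i, chunk) }
      .toArray

  // Property 2: the function for computing each partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    split.asInstanceOf[SeqPartition].values.iterator
}

With that in place, new SimpleSeqRDD(sc, 1 to 9, 3).map(_ + 1).collect() behaves just like the parallelize example above.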
A subclass of RDD:
class JdbcRDD[T: ClassTag](
    sc: SparkContext,
    getConnection: () => Connection,  // factory that opens a JDBC connection on the executor
    sql: String,                      // query text; must contain two '?' placeholders for the bounds
    lowerBound: Long,                 // lowest value of the bound column to read
    upperBound: Long,                 // highest value of the bound column to read
    numPartitions: Int,               // how many partitions (and hence queries) to split the range into
    mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _) // converts each row into a T
  extends RDD[T](sc, Nil) with Logging {}
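A hedged usage sketch (the JDBC URL, credentials, table, and columns are placeholders): JdbcRDD issues numPartitions queries, each binding its slice of [lowerBound, upperBound] to the two '?' placeholders in the SQL.
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val users = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "pass"),
  "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
  1, 100, 3, // lowerBound, upperBound, numPartitions
  rs => (rs.getLong("id"), rs.getString("name"))
)
users.collect().foreach(println)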