Table of Contents
- Apache Spark
- I. Overview
- II. Environment Setup
- III. Developing a First Spark Application
- IV. Spark Architecture
- V. Spark RDD
- VI. Tracing the Spark Source Code
- VII. RDD Programming API
  - Transformations
    - **map**(*func*) √
    - **filter**(*func*) √
    - **flatMap**(*func*) √
    - **mapPartitions**(*func*) √
    - **mapPartitionsWithIndex**(*func*) √
    - **sample**(*withReplacement*, *fraction*, *seed*)
    - **intersection**(*otherDataset*)
    - **distinct**([*numPartitions*]) √
    - **coalesce**(*numPartitions*) √
    - **repartition**(*numPartitions*) √
    - **groupByKey**([*numPartitions*])
    - **reduceByKey**(*func*, [*numPartitions*])
    - **aggregateByKey**(*zeroValue*)(*seqOp*, *combOp*, [*numPartitions*])
    - **join**(*otherDataset*, [*numPartitions*])
    - **cogroup**(*otherDataset*, [*numPartitions*]) - awareness only
  - Actions
- VIII. Shared Variables
Apache Spark

I. Overview

"Lightning-fast unified analytics engine": Spark's official tagline.

Lightning-fast:

- Spark is a distributed parallel computing framework built on in-memory computation. This contrasts with the MapReduce framework, which computes on disk, coarsely splits a Job into MapTasks and ReduceTasks, and must exchange intermediate data over the network.
- When a Spark job executes, a complex computation is divided into Stages, and every Stage supports distributed parallel computation.
- The result of every Stage can be cached, which makes fault recovery and result reuse straightforward.

Unified: Spark gathers the mainstream approaches to big-data processing under one engine:

- Batch processing (RDD: replaces MapReduce)
- Stream processing (Streaming: replaces Storm and Kafka Streaming)
- Machine learning (Machine Learning: replaces Mahout)
- Interactive queries (SQL: replaces Hive)
- Graph computation (GraphX)

Analytics engine: a replacement for MapReduce.
Features

- Speed: far more efficient than MapReduce computation. Spark decomposes a complex Job into Stages, each of which can be computed in a distributed, parallel fashion; the resulting execution plan is a DAG (Directed Acyclic Graph), similar to a Kafka Streaming topology. Spark's underlying physical engine also applies heavy optimization to memory management and fault monitoring and recovery, outperforming traditional distributed parallel computing frameworks.
- Multiple programming languages: Spark is a framework developed in Scala, and applications are usually written in Scala; Python, R, SQL, and others are supported as well. Spark ships with a large number of built-in operators, which greatly simplifies the development of complex applications.
- Generality: the Spark project contains several sub-projects that respectively address batch processing, stream processing, SQL, machine learning, graph computation, and other big-data problems.
- Multiple runtime environments: a Spark application can run on Hadoop YARN, Mesos, Kubernetes, in the cloud, or in a Standalone cluster.
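The operator style behind these features is easiest to see in the classic word count. As a minimal sketch, the snippet below imitates the RDD chain (flatMap, map, reduceByKey) on plain Scala collections, so it runs without any cluster; the sample lines and all names here are purely illustrative, not part of the Spark API.

```scala
object WordCountSketch {
  // Mirrors the RDD operator chain on ordinary Scala collections:
  // flatMap -> map -> (groupBy stands in for the shuffle) -> per-key sum
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))  // like RDD.flatMap: line -> words
      .map(word => (word, 1))    // like RDD.map: word -> (word, 1)
      .groupBy(_._1)             // the shuffle implied by reduceByKey
      .map { case (word, ones) => (word, ones.map(_._2).sum) } // like reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // Illustrative input; on a cluster these lines would come from a file or HDFS
    val demo = Seq("good good study", "day day up")
    println(wordCount(demo))
  }
}
```

On a real cluster the same chain is written against an RDD instead of a `Seq`, and each step becomes part of the DAG that Spark splits into Stages.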
II. Environment Setup

Standalone mode

Note: this builds a pseudo-distributed cluster on a single machine.

Preparation

- Prepare one virtual machine
  Version: CentOS 7

- Configure the network

```shell
# If a second NIC is needed, add the NIC hardware support first
vi /etc/sysconfig/network-scripts/ifcfg-ens33
# 1. Change the IP assignment method to static
# 2. Add IPADDR=<static IP address>
# 3. Add NETMASK=255.255.255.0
# 4. Bring the NIC up at boot: ONBOOT=yes
```
- Disable the firewall

```shell
[root@bogon ~]# systemctl stop firewalld
[root@bogon ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
```
- Change the hostname

```shell
[root@bogon ~]# vi /etc/hostname
Spark
```
- Configure the hosts mapping

```shell
[root@Spark ~]# vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.11.100 Spark
[root@Spark ~]# ping Spark
```
- Install the vim editor

```shell
[root@Spark ~]# yum install -y vim
```
Install Hadoop
- Configure passwordless SSH login

```shell
[root@Spark ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:KSQFcIduz5Fp5gS+C8FGJMq6v6GS+3ouvjcu4yT0qM4 root@Spark
The key's randomart image is:
+---[RSA 2048]----+
|..o.ooo          |
|o...oo           |
|.+ o...o         |
|. + +oB .        |
|.o o O..S        |
|..+ . +.         |
|o+.o .           |
|O=.+.            |
|XE@o.            |
+----[SHA256]-----+
[root@Spark ~]# ssh-copy-id Spark
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'spark (192.168.11.100)' can't be established.
ECDSA key fingerprint is SHA256:nLUImdu8ZPpknaVGhMQLRROdZ8ZfJCRC+lhqCv6QuF0.
ECDSA key fingerprint is MD5:7e:a3:1f:b0:e3:c7:51:7c:24:70:e3:24:b9:ac:45:27.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@spark's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'Spark'"
and check to make sure that only the key(s) you wanted were added.

[root@Spark ~]# ssh Spark
Last login: Tue Oct 29 11:32:59 2019 from 192.168.11.1
[root@Spark ~]# exit
logout
Connection to spark closed.
```
- Install the JDK

```shell
[root@Spark ~]# rpm -ivh jdk-8u171-linux-x64.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:jdk1.8-2000:1.8.0_171-fcs        ################################# [100%]
Unpacking JAR files...
        tools.jar...
        plugin.jar...
        javaws.jar...
        deploy.jar...
        rt.jar...
        jsse.jar...
        charsets.jar...
        localedata.jar...
```
- Install Hadoop

```shell
[root@Spark ~]# tar -zxf hadoop-2.9.2.tar.gz -C /usr
[root@Spark ~]# cd /usr/hadoop-2.9.2/
[root@Spark hadoop-2.9.2]# ll
total 128
drwxr-xr-x. 2 501 dialout    194 Nov 13 2018 bin
drwxr-xr-x. 3 501 dialout     20 Nov 13 2018 etc
drwxr-xr-x. 2 501 dialout    106 Nov 13 2018 include
drwxr-xr-x. 3 501 dialout     20 Nov 13 2018 lib
drwxr-xr-x. 2 501 dialout    239 Nov 13 2018 libexec
-rw-r--r--. 1 501 dialout 106210 Nov 13 2018 LICENSE.txt
-rw-r--r--. 1 501 dialout  15917 Nov 13 2018 NOTICE.txt
-rw-r--r--. 1 501 dialout   1366 Nov 13 2018 README.txt
drwxr-xr-x. 3 501 dialout   4096 Nov 13 2018 sbin
drwxr-xr-x. 4 501 dialout     31 Nov 13 2018 share
```
- Edit the HDFS configuration files

```shell
[root@Spark hadoop-2.9.2]# vim /usr/hadoop-2.9.2/etc/hadoop/core-site.xml
```

```xml
<!-- NameNode access URI -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://Spark:9000</value>
</property>
<!-- HDFS working base directory -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-2.9.2/hadoop-${user.name}</value>
</property>
```

```shell
[root@Spark hadoop-2.9.2]# vim /usr/hadoop-2.9.2/etc/hadoop/hdfs-site.xml
```

```xml
<!-- block replication factor -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<!-- host of the Secondary NameNode -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>Spark:50090</value>
</property>
<!-- maximum number of concurrent file operations per DataNode -->
<property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
</property>
<!-- DataNode request-handling parallelism -->
<property>
    <name>dfs.datanode.handler.count</name>
    <value>6</value>
</property>
```

```shell
[root@Spark hadoop-2.9.2]# vim /usr/hadoop-2.9.2/etc/hadoop/slaves
Spark
```
- Configure the JDK and Hadoop environment variables

```shell
[root@Spark hadoop-2.9.2]# vim /root/.bashrc
JAVA_HOME=/usr/java/latest
HADOOP_HOME=/usr/hadoop-2.9.2
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export HADOOP_HOME
export JAVA_HOME
export PATH
export CLASSPATH
[root@Spark hadoop-2.9.2]# source /root/.bashrc
```
- Format and start the HDFS distributed file system (YARN does not need to be configured for now)

```shell
[root@Spark hadoop-2.9.2]# hdfs namenode -format
[root@Spark hadoop-2.9.2]# start-dfs.sh
Starting namenodes on [Spark]
Spark: starting namenode, logging to /usr/hadoop-2.9.2/logs/hadoop-root-namenode-Spark.out
Spark: starting datanode, logging to /usr/hadoop-2.9.2/logs/hadoop-root-datanode-Spark.out
Starting secondary namenodes [Spark]
Spark: starting secondarynamenode, logging to /usr/hadoop-2.9.2/logs/hadoop-root-secondarynamenode-Spark.out
[root@Spark hadoop-2.9.2]# jps
10400 Jps
9874 NameNode
10197 SecondaryNameNode
10029 DataNode
```
- Visit the HDFS WebUI: http://192.168.11.100:50070
Install Spark
- Install

```shell
[root@Spark ~]# tar -zxf spark-2.4.4-bin-without-hadoop.tgz -C /usr
[root@Spark ~]# mv /usr/spark-2.4.4-bin-without-hadoop/ /usr/spark-2.4.4
[root@Spark ~]# cd /usr/spark-2.4.4/
[root@Spark spark-2.4.4]# ll
total 100
drwxr-xr-x. 2 1000 1000  4096 Aug 28 05:52 bin         # Spark command-line tools
drwxr-xr-x. 2 1000 1000   230 Aug 28 05:52 conf        # configuration files
drwxr-xr-x. 5 1000 1000    50 Aug 28 05:52 data        # sample test data
drwxr-xr-x. 4 1000 1000    29 Aug 28 05:52 examples    # example demos
drwxr-xr-x. 2 1000 1000  8192 Aug 28 05:52 jars        # runtime dependency libraries
drwxr-xr-x. 4 1000 1000    38 Aug 28 05:52 kubernetes  # Spark container support
-rw-r--r--. 1 1000 1000 21316 Aug 28 05:52 LICENSE
drwxr-xr-x. 2 1000 1000  4096 Aug 28 05:52 licenses
-rw-r--r--. 1 1000 1000 42919 Aug 28 05:52 NOTICE
drwxr-xr-x. 7 1000 1000   275 Aug 28 05:52 python
drwxr-xr-x. 3 1000 1000    17 Aug 28 05:52 R
-rw-r--r--. 1 1000 1000  3952 Aug 28 05:52 README.md
-rw-r--r--. 1 1000 1000   142 Aug 28 05:52 RELEASE
drwxr-xr-x. 2 1000 1000  4096 Aug 28 05:52 sbin        # Spark system-service scripts
drwxr-xr-x. 2 1000 1000    42 Aug 28 05:52 yarn        # Spark YARN cluster support
```
- Basic Spark configuration

```shell
[root@Spark spark-2.4.4]# cp conf/spark-env.sh.template conf/spark-env.sh
[root@Spark spark-2.4.4]# cp conf/slaves.template conf/slaves
[root@Spark spark-2.4.4]# vim conf/spark-env.sh
SPARK_WORKER_INSTANCES=2
SPARK_MASTER_HOST=Spark
SPARK_MASTER_PORT=7077
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=2g
LD_LIBRARY_PATH=/usr/hadoop-2.9.2/lib/native
SPARK_DIST_CLASSPATH=$(hadoop classpath)
export SPARK_MASTER_HOST
export SPARK_MASTER_PORT
export SPARK_WORKER_CORES
export SPARK_WORKER_MEMORY
export LD_LIBRARY_PATH
export SPARK_DIST_CLASSPATH
export SPARK_WORKER_INSTANCES
[root@Spark spark-2.4.4]# vim conf/slaves
Spark
```
- Start the Spark Standalone cluster

```shell
[root@Spark spark-2.4.4]# sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/spark-2.4.4/logs/spark-root-org.apache.spark.deploy.master.Master-1-Spark.out
Spark: starting org.apache.spark.deploy.worker.Worker, logging to /usr/spark-2.4.4/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-Spark.out
Spark: starting org.apache.spark.deploy.worker.Worker, logging to /usr/spark-2.4.4/logs/spark-root-org.apache.spark.deploy.worker.Worker-2-Spark.out
[root@Spark spark-2.4.4]# jps
16433 Jps
9874 NameNode
16258 Worker
10197 SecondaryNameNode
16136 Master
16328 Worker
10029 DataNode
```
- Visit the Spark WebUI: http://192.168.11.100:8080/
- The Spark shell

Key options of `bin/spark-shell`:

--master                    URL of the environment the shell's Spark application connects to: spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (default: local[*])
--total-executor-cores NUM  total number of cores (threads) across all Executor JVM processes

```shell
[root@Spark spark-2.4.4]# bin/spark-shell --master spark://Spark:7077 --total-executor-cores 2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://Spark:4040
Spark context available as 'sc' (master = spark://Spark:7077, app id = app-20191029144958-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
```
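With the shell connected, a quick smoke test confirms the cluster works end to end. The sketch below is a minimal word count typed at the `scala>` prompt; it relies on the `sc` SparkContext the shell predefines, and the sample data is invented for illustration.

```scala
// Distribute a small in-memory collection across the cluster
val lines = sc.parallelize(List("good good study", "day day up"))

// The classic RDD word count: flatMap -> map -> reduceByKey
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.collect().foreach(println)
```

The same chain of transformations is covered in detail in section VII (RDD Programming API).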