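Introduction to Spark
Apache Spark is an open-source cluster computing framework for large-scale data processing, covering batch, interactive, and near-real-time streaming workloads. It is one of the most active projects of the Apache Software Foundation and has become a leading engine for big data processing. Major companies such as Amazon, eBay, and Yahoo use Spark today, and many organizations run it on clusters with thousands of nodes.
MapReduce (MR) handles two common kinds of data workloads inefficiently. The first is iterative algorithms, such as ALS in machine learning or gradient descent in convex optimization, which query and operate on a data set, or on data derived from it, again and again. The MR model fits this poorly: even when several MR jobs are chained in sequence, performance and running time remain a problem, because the jobs can share data only through disk. The second is interactive data mining, which MR is clearly not good at.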
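To make the iterative pattern concrete, here is a minimal sketch in plain Scala (no Spark involved). The toy data set, the object name IterativeSketch, and the learning rate are assumptions chosen for illustration. The point is that every iteration makes a full pass over the same data, which is cheap when the data stays in memory and expensive when each pass goes through disk.

```scala
object IterativeSketch {
  // Hypothetical toy data set: three points on the line y = 2x.
  val data: Vector[(Double, Double)] = Vector((1.0, 2.0), (2.0, 4.0), (3.0, 6.0))

  // One full pass over the data: gradient of the mean squared error for y ≈ w * x.
  def gradient(w: Double): Double =
    data.map { case (x, y) => 2 * (w * x - y) * x }.sum / data.size

  // Gradient descent starting from w = 0; each iteration rereads the whole data set.
  def fit(iterations: Int, learningRate: Double): Double =
    (1 to iterations).foldLeft(0.0)((w, _) => w - learningRate * gradient(w))
}
```

Calling IterativeSketch.fit(50, 0.05) makes 50 passes over the data and converges towards w = 2.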
Installing Spark
Configuration:
slaves
hadoop-senior01.zhangbk.com
hadoop-senior02.zhangbk.com
spark-env.sh
SPARK_MASTER_HOST=hadoop-senior01.zhangbk.com
SPARK_MASTER_PORT=7077
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://ns1/spark-history"
vi spark-defaults.conf
spark.master spark://hadoop-senior01.zhangbk.com:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://ns1/spark-history
spark.eventLog.compress true
Create the directory: hdfs dfs -mkdir /spark-history
Because the Hadoop cluster runs in HA mode, copy hdfs-site.xml into the ./conf directory and distribute it to the other nodes.
Starting Spark
sbin/start-all.sh
sbin/start-history-server.sh
Open the web UI
http://192.168.159.21:8080
Running the first Spark program
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop-senior01.zhangbk.com:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
examples/jars/spark-examples_2.11-2.3.0.jar \
100
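Parameters:
--class CLASS_NAME Your application's main class (for Java / Scala apps)
--master spark://master01:7077 Address of the master, i.e. where the job is submitted for execution, e.g. local, spark://host:port, yarn
--executor-memory 1G Memory available to each executor: 1 GB
--total-executor-cores 2 Total number of CPU cores used by all executors together: 2
100 Application argument passed to SparkPi: the number of slices (partitions) the sampling is split into
This example estimates Pi with the Monte Carlo method.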
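The Monte Carlo idea can be sketched in plain Scala without a cluster. This is an illustration of the technique, not the source of the SparkPi example. The object name MonteCarloPi and the seed parameter are assumptions. Points are sampled uniformly in the square [-1, 1] x [-1, 1], and the fraction that lands inside the unit circle approximates Pi / 4.

```scala
import scala.util.Random

object MonteCarloPi {
  // Estimate Pi from `samples` random points; `seed` makes the run repeatable.
  def estimate(samples: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val inside = (1 to samples).count { _ =>
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      x * x + y * y <= 1 // true when the point lies inside the unit circle
    }
    4.0 * inside / samples
  }
}
```

More samples give a tighter estimate. On a cluster, the sampling loop is what gets split across partitions.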
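Submitting Spark applications
Once the application is packaged, it can be launched with the bin/spark-submit script. The script sets up the classpath and dependencies that Spark uses, and supports the different cluster managers and deploy modes: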
bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
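Some commonly used options:
1) --class: the entry point of your application (e.g. org.apache.spark.examples.SparkPi)
2) --master: the master URL of the cluster (e.g. spark://23.195.26.187:7077)
3) --deploy-mode: whether to deploy the driver on a worker node (cluster) or locally as an external client (client) (default: client)
4) --conf: an arbitrary Spark configuration property in key=value format. If the value contains spaces, wrap it in quotes: "key=value".
5) application-jar: path to the packaged application jar, including its dependencies. The URL must be globally visible inside the cluster, for instance an hdfs:// path on shared storage, or a file:// path that exists on every node.
6) application-arguments: arguments passed to the main() method of the main class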
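The order of the pieces on the command line can be sketched as a small Scala helper that assembles the argument list. SubmitSpec and SubmitSpecExample are hypothetical names introduced for illustration and are not part of Spark. The helper only builds the argument vector and does not launch anything.

```scala
// Hypothetical helper (not a Spark API): holds the options listed above.
final case class SubmitSpec(
    mainClass: String,
    master: String,
    deployMode: String = "client",
    conf: Map[String, String] = Map.empty,
    appJar: String,
    appArgs: Seq[String] = Seq.empty
) {
  // Options come first, then the application jar, then the application arguments.
  def toArgv: Seq[String] = {
    val confArgs = conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
    Seq("bin/spark-submit", "--class", mainClass, "--master", master, "--deploy-mode", deployMode) ++
      confArgs ++ Seq(appJar) ++ appArgs
  }
}

object SubmitSpecExample {
  // The SparkPi submission from this document, expressed with the helper.
  val piJob: SubmitSpec = SubmitSpec(
    mainClass = "org.apache.spark.examples.SparkPi",
    master = "spark://hadoop-senior01.zhangbk.com:7077",
    conf = Map("spark.eventLog.enabled" -> "true"),
    appJar = "examples/jars/spark-examples_2.11-2.3.0.jar",
    appArgs = Seq("100")
  )
}
```

Everything after the jar path is handed to the application unchanged, which is why the jar has to come after all spark-submit options.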
Starting the Spark shell
bin/spark-shell \
--master spark://hadoop-senior01.zhangbk.com:7077 \
--executor-memory 2g \
--total-executor-cores 2
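Note:
If spark-shell is started without a master address, it still starts and runs programs normally. It then uses the spark.master value from conf/spark-defaults.conf, which in this setup points at the standalone cluster, so the shell runs in cluster mode. If no master is configured anywhere, for example on a single node with no slaves file, spark-shell defaults to local mode.
In local mode the master and worker run inside the same process.
In cluster mode the master and workers run in separate processes.
The Spark shell already initializes a SparkContext as the object sc. User code that needs it can use sc directly.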
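The fallback behaviour in the note can be sketched as a small lookup in plain Scala. MasterUrl, classify, and effective are hypothetical names for illustration and do not exist in Spark. The sketch assumes the usual precedence: a --master flag wins over spark.master from spark-defaults.conf, and local[*] is used when neither is set.

```scala
object MasterUrl {
  sealed trait Mode
  case object Local extends Mode
  case object Standalone extends Mode
  case object Yarn extends Mode
  case object Other extends Mode

  // Pick the master: command-line flag first, then spark-defaults.conf, then local[*].
  def effective(cliMaster: Option[String], defaultsConf: Option[String]): String =
    cliMaster.orElse(defaultsConf).getOrElse("local[*]")

  // Classify a master URL by its prefix.
  def classify(master: String): Mode = master match {
    case m if m == "local" || m.startsWith("local[") => Local
    case m if m.startsWith("spark://")               => Standalone
    case "yarn"                                      => Yarn
    case _                                           => Other
  }
}
```

With the spark-defaults.conf from this document, MasterUrl.effective(None, Some("spark://hadoop-senior01.zhangbk.com:7077")) resolves to the standalone cluster, while MasterUrl.effective(None, None) resolves to local[*].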
Writing the WordCount program in the Spark shell
Write the Spark program in Scala inside the Spark shell:
sc.textFile("hdfs://ns1/spark/input/RELEASE").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("hdfs://ns1/spark/output/out1")
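Explanation:
sc is the SparkContext object, the entry point for submitting Spark programs
textFile("hdfs://ns1/spark/input/RELEASE") reads the data from HDFS
flatMap(_.split(" ")) maps each line to its words first, then flattens the result
map((_,1)) pairs each word with 1 to form a tuple
reduceByKey(_+_) reduces by key, summing the values
saveAsTextFile("hdfs://ns1/spark/output/out1") writes the result to HDFS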
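The same steps can be tried on an ordinary Scala collection, without a cluster or HDFS. This is a sketch for illustration: LocalWordCount is a hypothetical name, and groupBy followed by a per-key reduce stands in for reduceByKey, which ordinary collections do not provide.

```scala
object LocalWordCount {
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))  // split every line into words, then flatten
      .map((_, 1))            // pair each word with 1
      .groupBy(_._1)          // collect the pairs that share a word
      .map { case (word, pairs) => word -> pairs.map(_._2).reduce(_ + _) } // sum per word
}
```

For example, LocalWordCount.count(Seq("a b a", "b c")) counts a twice, b twice, and c once.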
Writing the WordCount program in IDEA
package com.zhangbk.spark

import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.LoggerFactory

object WordCount {
  val logger = LoggerFactory.getLogger(WordCount.getClass)

  // args(0): input path, args(1): output path
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WC")
    val sc = new SparkContext(conf)
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _, 1) // reduce into a single partition
      .sortBy(_._2, false)   // sort by count, descending
      .saveAsTextFile(args(1))
    logger.info("=========================completed================================")
    sc.stop()
  }
}
Configuring pom.xml
wordcount pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>spark</artifactId>
        <groupId>com.zhangbk</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>wordcount</artifactId>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <!-- Logging -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>jcl-over-slf4j</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>${log4j.version}</version>
        </dependency>
        <!-- Logging End -->
    </dependencies>
    <build>
        <finalName>wordcount</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.2-beta-5</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>com.zhangbk.spark.WordCount</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
spark pom.xml (parent):
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.zhangbk</groupId>
    <artifactId>spark</artifactId>
    <packaging>pom</packaging>
    <version>1.0-SNAPSHOT</version>
    <modules>
        <module>wordcount</module>
    </modules>
    <properties>
        <mysql.version>6.0.5</mysql.version>
        <spring.version>4.3.6.RELEASE</spring.version>
        <spring.data.jpa.version>1.11.0.RELEASE</spring.data.jpa.version>
        <log4j.version>1.2.17</log4j.version>
        <quartz.version>2.2.3</quartz.version>
        <slf4j.version>1.7.22</slf4j.version>
        <hibernate.version>5.2.6.Final</hibernate.version>
        <camel.version>2.18.2</camel.version>
        <config.version>1.10</config.version>
        <jackson.version>2.8.6</jackson.version>
        <servlet.version>3.0.1</servlet.version>
        <net.sf.json.version>2.4</net.sf.json.version>
        <activemq.version>5.14.3</activemq.version>
        <spark.version>2.1.1</spark.version>
        <scala.version>2.11.8</scala.version>
        <hadoop.version>2.5.0</hadoop.version>
    </properties>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <version>2.2-beta-5</version>
                    <executions>
                        <execution>
                            <id>make-assembly</id>
                            <phase>package</phase>
                            <goals>
                                <goal>single</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>
Running the WordCount program
bin/spark-submit \
--class com.zhangbk.spark.WordCount \
--master spark://hadoop-senior01.zhangbk.com:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
spark-jars/wordcount-jar-with-dependencies.jar \
hdfs://ns1/spark/input/RELEASE \
hdfs://ns1/spark/output/out5