Spark Notes 1 (Environment Setup and a WordCount Starter Example)

不忘初心$$

Article link: https://blog.csdn.net/qq_42786792/article/details/102879993

I. Spark Framework Overview

1. Official site:
		http://spark.apache.org/
2. Source repository:
		https://github.com/apache/spark
3. Company founded by Spark's creators:
		https://databricks.com/
4. Official blog:
		https://databricks.com/blog/
		https://databricks.com/blog/category/engineering/spark

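Spark originated at the AMPLab of the University of California, Berkeley (AMP stands for Algorithms, Machines, and People). Its creators later founded Databricks.

2009: born at UC Berkeley's AMPLab
2010: open-sourced under the BSD license
2013: donated to the Apache Software Foundation and relicensed under Apache 2.0
February 2014: became an Apache top-level project
November 2014: the Databricks team used Spark to set a new world record in large-scale data sorting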

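1. Official definition of Spark

What is Spark?
Apache Spark is a unified analytics engine for large-scale data processing, comparable to the MapReduce framework.

Spark computes in memory wherever possible, which improves the timeliness of data processing in big-data environments while retaining high fault tolerance and scalability. It can be deployed on large amounts of hardware to form a cluster.
Official site:
http://spark.apache.org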

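2. Types of big data analysis

Analysis of massive data falls into three main types (divided by business need):

1. Offline (batch) analysis
	The data being analyzed is static, as with MapReduce, Hive, and similar frameworks
2. Interactive analysis
	Ad hoc queries with prompt responses, as with Impala
3. Real-time analysis
	Processing of streaming data as it arrives, typically feeding result dashboards and reports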

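3. Four generations of big data frameworks

First generation
The Hadoop ecosystem: N+1 data, with latency measured in hours or days
Second generation
Storm, a real-time stream computing framework, now gradually being abandoned by companies
Typical case: Alibaba's Singles' Day (Double 11) big-screen statistics used JStorm before 2017 and Blink afterwards
Third generation
The Spark ecosystem, which is strongest at offline analysis but handles both offline batch processing and real-time streaming analysis
Spark became an Apache top-level project in February 2014 and released version 1.0 in May 2014. Major internet companies use it, and the Spark 2.x series performs noticeably better
Fourth generation
Flink, which excels at real-time analysis (real-time data warehouses)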
In January 2019, Alibaba acquired data Artisans, the company behind Flink, for a reported 90 million euros

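4. Introduction to the Spark framework

After Spark appeared, it was naturally compared with MapReduce. The official site publishes a comparison:

[Figure: official comparison between Spark and Hadoop MapReduce]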

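Why is Spark so fast?

  1. Data structure (programming model): the core of Spark

  		RDD: Resilient Distributed Dataset, which can be thought of as a List

  		Spark wraps the data to be processed in an RDD and processes it by calling functions on the RDD

  		Word count in Scala:
  					List -> flatMap -> map -> groupBy -> map (reduce)

  2. How Tasks run: as threads
  		In MapReduce each Task runs as a process,

  		whereas in Spark each Task runs as a thread inside a process,

  		and threads are much cheaper to start and destroy than processes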
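
The List pipeline mentioned above (flatMap, map, groupBy, then a map that reduces each group) can be sketched on an ordinary Scala collection, with no Spark involved. This is only an illustrative sketch: the object name `ListWordCount` and the sample lines are assumptions, not code from the original notes.

```scala
object ListWordCount {
  // Count word occurrences in a local List, mirroring
  // List -> flatMap -> map -> groupBy -> map(reduce)
  def count(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(line => line.trim.split("\\s+")) // lines -> words
      .filter(_.nonEmpty)                       // drop empty tokens produced by blank lines
      .map(word => (word, 1))                   // word -> (word, 1)
      .groupBy(_._1)                            // group the pairs by word
      .map { case (word, pairs) =>              // sum the 1s inside each group
        word -> pairs.map(_._2).reduce(_ + _)
      }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input
    val lines = List("hadoop spark hadoop", "spark hive spark")
    count(lines).toList.sortBy(_._1).foreach(println)
  }
}
```

The RDD version later in these notes has the same shape; the difference is that an RDD is partitioned across machines, while this List lives in a single JVM.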


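5. Spark framework features

Fast
Compared with Hadoop MapReduce, Spark is officially claimed to be up to 100x faster for in-memory computation and up to 10x faster on disk. Spark implements an efficient DAG execution engine and can process data flows efficiently in memory.
Easy to use
Spark provides APIs for Java, Python, R, and Scala, along with more than 80 high-level operators, so users can build applications quickly. It also offers interactive Python and Scala shells, which make it convenient to try out solutions against a Spark cluster.

General
Spark offers a unified solution: batch processing, interactive queries (Spark SQL), real-time stream processing (Spark Streaming), machine learning (Spark MLlib), and graph computation (GraphX) can all be combined seamlessly within one application. A unified stack is attractive because every company wants a single platform for the problems it faces, which reduces the labor cost of development and maintenance and the material cost of deploying platforms.
Compatible
Spark integrates easily with other open-source products. For example, it can use Hadoop YARN or Apache Mesos as its resource manager and scheduler, and it can process all data that Hadoop supports, including HDFS and HBase.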

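II. Spark Framework Modules

The Spark ecosystem contains many submodules (subsystems, sub-frameworks) that handle different types of data for different business needs

Spark Core: the core module
- The core of the framework; the main concept is the RDD
- Offline analysis of massive data, similar to MapReduce
Spark SQL: the most widely used module
- Similar to Hive: provides SQL for analyzing data, and goes well beyond SQL with a DSL (similar to the Pandas library in Python)
Spark Streaming: the module for processing streaming data
- Stable at present; a reasonable choice when latency requirements are not strict
Structured Streaming: a newer stream processing framework introduced in Spark 2.x
- A structured stream processing framework intended to replace the Spark Streaming module
Spark MLlib: machine learning library
- Implementations of commonly used machine learning algorithms
Spark GraphX: graph processing
- A library providing graph data structures and graph algorithms
PySpark
- The module for developing Spark applications in Python
SparkR (for the R language)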

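III. Spark Run Modes

Local mode (single machine): for development and testing
local runs with a single thread, local[N] with N threads, and local[*] with one thread per CPU core
Standalone cluster mode: for development and testing
A typical Master/Slave architecture
Standalone HA (high availability) mode: for production
Built on standalone mode, with ZooKeeper providing high availability so that the Master is not a single point of failure
On YARN cluster mode: for production
Runs on a YARN cluster; YARN manages resources while Spark handles task scheduling and computation
On Mesos cluster mode: rarely used in China
Runs on the Mesos resource manager; Mesos manages resources while Spark handles task scheduling and computation
On cloud cluster mode: small and medium-sized companies are expected to rely more on cloud services in the future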

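IV. Spark Quick Start

1. Spark local mode

1. Unpack the Spark installation package

2. Enter the conf directory
mv spark-env.sh.template spark-env.sh

3. Add the following configuration
vi spark-env.sh
Contents:
    JAVA_HOME=/export/servers/jdk
    SCALA_HOME=/export/servers/scala
    HADOOP_CONF_DIR=/export/servers/hadoop/etc/hadoop # Hadoop configuration directory; with it set, paths are read from HDFS by default
4. Start the spark-shell command line in local mode
bin/spark-shell --master local[2]

Note: because data is read from HDFS by default, start the Hadoop (HDFS) cluster before starting Spark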

2. Word count (WordCount)

Read a text file from HDFS and count how many times each word occurs.
Prepare the data: wordcount.data

hadoop spark hadoop spark spark 
mapreduce spark spark hive
hive spark hadoop mapreduce spark
spark hive sql sql spark hive hive spark
hdfs hdfs mapreduce mapreduce spark hive

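# Upload the data to HDFS (run in a system shell, not in spark-shell)
hdfs dfs -put wordcount.data /datas

// The remaining commands run inside spark-shell
// Read the HDFS text file into an RDD; each line of the file becomes one element of the RDD
val inputRDD = sc.textFile("/datas/wordcount.data")

// Split each line into words on whitespace
val wordsRDD = inputRDD.flatMap(line => line.split("\\s+"))

// Convert each word into a pair (word, 1), meaning the word occurred once
val tuplesRDD = wordsRDD.map(word => (word, 1))
// Equivalent shorthand: wordsRDD.map((_, 1))

// Group by key and aggregate the values
// A Scala 2-tuple plays the role of a Java key/value pair
// reduceByKey: group first, then aggregate
// Equivalent: val wordcountsRDD = tuplesRDD.reduceByKey((a, b) => a + b)
val wordcountSRDD = tuplesRDD.reduceByKey((tmp, item) => tmp + item)
// Show the result
wordcountSRDD.foreach(println)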

(hive,6)
(mapreduce,4)
(sql,2)
(spark,11)
(hadoop,3)
(hdfs,2)
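
To make "group first, then aggregate" more concrete, here is a simplified local model of what reduceByKey does: values are first combined per key inside each partition, and the partial results are then merged across partitions. It is a sketch under assumptions, not Spark's implementation: partitions are modelled as plain `Seq`s, there is no hash-based shuffle, and the names `ReduceByKeySketch` and `combine` are hypothetical.

```scala
object ReduceByKeySketch {
  // Aggregate the values of each key with `op` within a single sequence of pairs
  def combine(pairs: Seq[(String, Int)], op: (Int, Int) => Int): Map[String, Int] =
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      // First value for a key is stored as-is; later values are folded in with op
      acc.updated(k, acc.get(k).map(op(_, v)).getOrElse(v))
    }

  // Model of reduceByKey: combine inside each partition, then merge the partial results
  def reduceByKey(partitions: Seq[Seq[(String, Int)]],
                  op: (Int, Int) => Int): Map[String, Int] = {
    val partials = partitions.map(p => combine(p, op)) // per-partition pre-aggregation
    combine(partials.flatMap(_.toSeq), op)             // merge across partitions
  }
}
```

Because the same `op` is applied at both stages, it should be associative and commutative, which addition is.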

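3. The reduce aggregation function on lists

Inspect the reduce function via the API hint

// Higher-order function: if function A takes a function as a parameter, A is a higher-order function
def reduce[A1 >: Int](op: (A1, A1) => A1): A1

// Requirement on the function
op: (A1: first argument, A1: second argument) => A1
/*
	Takes two arguments and returns a value, all of the same type
	First argument: the intermediate accumulator
	Second argument: the current element of the collection
*/
scala> val list = (1 to 10).toList
list: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
/*
	Conceptually (reduce actually seeds tmp with the first element; 0 is used here only for illustration):
	var tmp = 0
	tmp = tmp + item
	return tmp

	tmp: the intermediate accumulator during aggregation

	Exercise: compute the average
		sum / count
*/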

list.reduce((tmp, item) => {
    println(s"tmp = $tmp, item = $item, sum = ${tmp + item}")
    tmp + item
})

list.reduceLeft((tmp, item) => {
    println(s"tmp = $tmp, item = $item, sum = ${tmp + item}")
    tmp + item
})

list.reduceRight((item, tmp) => {
    println(s"tmp = $tmp, item = $item, sum = ${tmp + item}")
    tmp + item
})
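
The average exercise mentioned in the comment above, and the difference between reduceLeft and reduceRight, can be sketched as follows. With addition the direction does not change the result, so subtraction is used to make it visible. The object name `ReduceDemo` is an assumption for illustration.

```scala
object ReduceDemo {
  val list: List[Int] = (1 to 10).toList

  // Average = sum / count; the sum is produced by reduce
  def average(xs: List[Int]): Double = xs.reduce(_ + _).toDouble / xs.size

  // With a non-commutative operator the direction of the reduction matters
  val left: Int  = List(1, 2, 3, 4).reduceLeft(_ - _)  // ((1 - 2) - 3) - 4
  val right: Int = List(1, 2, 3, 4).reduceRight(_ - _) // 1 - (2 - (3 - 4))
}
```

Note that reduce throws an exception on an empty list; fold or reduceOption is the usual alternative when the input may be empty.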

4. Running the Pi example in local mode

	Run the Pi calculation from the official examples, which estimates Pi with the Monte Carlo method.

SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master local[2] \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.2.0.jar \
10

# The argument 10 is the number of slices (partitions, and therefore tasks); SparkPi samples 100000 points per slice
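
For intuition, the Monte Carlo idea behind the Pi example can be sketched in plain Scala without Spark: sample random points in the square [-1, 1] x [-1, 1] and count the fraction that falls inside the unit circle, which approaches Pi / 4. This is a single-threaded sketch, not the source of the bundled example; the object name `LocalPi` and the fixed seed are assumptions.

```scala
import scala.util.Random

object LocalPi {
  // Estimate Pi from n random points; the seed makes runs repeatable
  def estimate(n: Int, seed: Long): Double = {
    val rnd = new Random(seed)
    val inside = (1 to n).count { _ =>
      val x = rnd.nextDouble() * 2 - 1 // uniform in [-1, 1)
      val y = rnd.nextDouble() * 2 - 1
      x * x + y * y <= 1               // true when the point lies inside the unit circle
    }
    4.0 * inside / n
  }

  def main(args: Array[String]): Unit = {
    val slices = 10          // plays the role of the argument 10 in the command above
    val n = 100000 * slices  // total number of sampled points
    println(s"Pi is roughly ${estimate(n, seed = 42L)}")
  }
}
```

In the Spark version the sampling is spread over the slices as parallel tasks and the per-task counts are summed with reduce.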

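V. Spark Standalone Cluster

	Spark applications can run on a cluster. The following cluster managers are currently supported:

1. Spark Standalone cluster
	http://spark.apache.org/docs/2.2.0/spark-standalone.html
	Goal: understand how the Spark framework works and how it schedules programs
2. Hadoop YARN cluster
	http://spark.apache.org/docs/2.2.0/running-on-yarn.html
	Goal: in enterprises, programs run on YARN; this is the harder part
3. Apache Mesos
	http://spark.apache.org/docs/2.2.0/running-on-mesos.html

	A Spark Standalone cluster, like Hadoop YARN, manages and schedules cluster resources. It is a distributed master/slave architecture:

1. Master node:
	Master: manages the resources of the whole cluster, accepts submitted applications, and allocates resources to each application so that its Tasks can run
2. Slave nodes:
	Workers: manage the resources of their own machine and allocate them to run Tasks

The overall architecture is shown below:

[Figure: overall architecture of a Spark Standalone cluster]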

Running the Pi example:

# Run Spark's bundled Pi program on the Spark Standalone cluster
SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master spark://hadoop01:7077 \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.2.0.jar \
10

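VI. Components of a Running Spark Application

When a Spark Application runs on a cluster, it is composed as shown below:
[Figure: components of a Spark Application running on a cluster]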

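1. Components of a MapReduce application

A MapReduce program running on YARN is composed as follows:

First: one run of a MapReduce application is one Job

Second: the running MapReduce application consists of
		1) AppMaster
				The manager of the application, responsible for executing all of its Tasks
		2) MapTask or ReduceTask
				Each Task runs as a process, starting its own JVM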

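2. Components of a Spark application

A single Spark Application can contain many Jobs

When a Spark Application runs on a cluster, it also consists of two parts:
	First: the Driver Program
			Comparable to the AppMaster: it manages the whole application and schedules the execution of all its Jobs
			Key point: it is a JVM process that runs the program's main function and must create the SparkContext object
	Second: Executors
			Each Executor is a JVM process that acts like a thread pool: it holds many threads, and each thread runs one Task
			By default a Task needs 1 CPU core, so the number of threads in an Executor can be regarded as equal to its number of CPU cores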
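
The "Executor as a thread pool" picture can be illustrated with the JDK's own thread pool. This is only an analogy under assumptions, not Spark's Executor code: `ExecutorSketch`, `cores`, and `taskCount` are hypothetical names, and each "task" simply reports which thread ran it.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

object ExecutorSketch {
  // Run `taskCount` tasks on a pool of `cores` threads and return the
  // names of the distinct threads that executed them
  def run(cores: Int, taskCount: Int): Set[String] = {
    val pool = Executors.newFixedThreadPool(cores) // one thread per "core"
    try {
      val futures = (1 to taskCount).map { _ =>
        pool.submit(new Callable[String] {
          override def call(): String = Thread.currentThread().getName
        })
      }
      futures.map(_.get()).toSet // wait for every task to finish
    } finally {
      pool.shutdown()
      pool.awaitTermination(10, TimeUnit.SECONDS)
    }
  }
}
```

Calling `ExecutorSketch.run(2, 8)` runs eight tasks but never uses more than two threads, which mirrors how the core count caps the number of Tasks an Executor runs at once.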


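VII. Spark Standalone HA

In a Spark Standalone cluster the Master accepts applications submitted by users. If the Master goes down, no applications can be submitted.

A high-availability (HA) setup is therefore needed. As with the HBase Master, several Masters can be started (not just two), and ZooKeeper, a distributed coordination service, elects the Active Master and monitors the state of every Master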

VIII. Spark Starter Example: WordCount

Reference code:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // Create a SparkConf holding the application's settings, such as its name and run mode
    val sparkConf = new SparkConf()
      .setMaster("local[2]")
      .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
    // Build the SparkContext, which reads data and schedules Job execution
    val sc: SparkContext = new SparkContext(sparkConf)
    // Set the log level. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
    sc.setLogLevel("WARN")
    // Step 1: read the data
    val inputRDD: RDD[String] = sc.textFile("hdfs://hadoop01:8020/datas/wordcount.data")
    // Step 2: process the data
    val wordcountRDD = inputRDD
      // Filter out null and blank lines
      .filter(line => line != null && line.trim.length > 0)
      // Split each line into words (trim first so leading whitespace yields no empty word)
      .flatMap(line => line.trim.split("\\s+"))
      // Convert each word into a pair (word, 1), meaning the word occurred once
      .mapPartitions(iter => iter.map(word => (word, 1)))
      // Group by key and aggregate
      .reduceByKey((tmp, item) => tmp + item)
    // Step 3: output the result
    wordcountRDD.foreach(println)
    // The application has finished; release resources
    sc.stop()
  }
}

IX. Submitting Spark Applications

1. spark-submit

Applications are submitted with spark-submit; see the official documentation:

http://spark.apache.org/docs/2.2.0/submitting-applications.html

View the spark-submit usage information:

# bin/spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

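The core usage, submitting an application:

Usage: spark-submit [options] <app jar | python file> [app arguments]

1) options
	Optional parameters that configure how the application runs, for example where it runs (local mode or cluster mode)
	This is the most important part
2) <app jar | python file>
	For Java or Scala, the program packaged as a jar; for Python, the script file

3) [app arguments]
	Arguments for the application itself; optional

2. Submitting the word count application

	Submit WordCount in local mode: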

SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master local[2] \
--class cn.bigdata.spark.submit.SparkSubmit \
${SPARK_HOME}/core_2.11-1.0-SNAPSHOT.jar \
/datas/wordcount.data /datas/swcs

Submit WordCount to the Standalone cluster:

SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master spark://hadoop01:7077,hadoop02:7077 \
--driver-memory 512m \
--executor-memory 512m \
--total-executor-cores 2 \
--class cn.bigdata.spark.submit.SparkSubmit \
${SPARK_HOME}/core_2.11-1.0-SNAPSHOT.jar \
/datas/wordcount.data /datas/swcs

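X. Spark on YARN

Submitting Spark Applications to a YARN cluster is essential, because enterprises generally run them on YARN

Documentation: http://spark.apache.org/docs/2.2.0/running-on-yarn.html

To submit a Spark Application to YARN, Spark must locate the ResourceManager, whose address is configured in yarn-site.xml.

It is enough to tell Spark where the YARN configuration files are by setting either of these in spark-env.sh:
HADOOP_CONF_DIR
YARN_CONF_DIR

Run the Pi example on the YARN cluster with the following command: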

SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.2.0.jar \
10

Appendix. Maven Dependencies

Create a Maven project and a module, then configure pom.xml as follows:

<!-- Repository locations, in order: aliyun, cloudera, and jboss -->
<repositories>
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>jboss</id>
        <url>http://repository.jboss.com/nexus/content/groups/public</url>
    </repository>
</repositories>

<properties>
    <scala.version>2.11.8</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <spark.version>2.2.0</spark.version>
    <hadoop.version>2.6.0-cdh5.14.0</hadoop.version>
</properties>

<dependencies>
    <!-- Scala language dependency -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <!-- Spark Core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Hadoop Client dependency -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>

<build>
    <outputDirectory>target/classes</outputDirectory>
    <testOutputDirectory>target/test-classes</testOutputDirectory>
    <resources>
        <resource>
            <directory>${project.basedir}/src/main/resources</directory>
        </resource>
    </resources>
    <!-- Maven compiler plugins -->
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                    <configuration>
                        <args>
                            <arg>-dependencyfile</arg>
                            <arg>${project.build.directory}/.scala_dependencies</arg>
                        </args>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
