2. Spark Basics: Running Spark Programs

1 Running the First Spark Program

This example estimates Pi with the Monte Carlo method.

/home/hadoop/software/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://harvey:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
/home/hadoop/software/spark/examples/jars/spark-examples_2.11-2.2.1.jar

Parameter description:

--master spark://harvey:7077   specifies the address of the Master
--executor-memory 1G           specifies 1 GB of memory available to each executor
--total-executor-cores 2       specifies a total of 2 CPU cores across all executors

Result:

Pi is roughly 3.140955704778524
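
For reference, the core of such a Monte Carlo estimate looks roughly like the following Scala snippet. This is a minimal sketch of the idea, not the exact source of the bundled SparkPi example; it assumes an existing SparkContext sc (for example in spark-shell), and the sample count n is illustrative.

// Sample random points in the square [-1, 1) x [-1, 1) and count how many
// fall inside the unit circle.
val n = 1000000                         // number of random samples (illustrative)
val inside = sc.parallelize(1 to n).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0      // 1 if the point lies inside the circle
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * inside / n}")

The fraction of points inside the circle approaches Pi/4, so multiplying by 4 gives the estimate.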

2 Submitting a Spark Application

Once the application is packaged, it can be launched with the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and supports the different cluster managers and deploy modes.

./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]

Commonly used options:
1) --class: the entry point of your application (e.g. org.apache.spark.examples.SparkPi)
2) --master: the master URL of the cluster (e.g. spark://harvey:7077)
3) --deploy-mode: whether to deploy your driver on a worker node (cluster) or locally as an external client (client) (default: client)
4) --conf: arbitrary Spark configuration property in key=value format; if the value contains spaces, wrap "key=value" in quotes. Default Spark configuration values are read from conf/spark-defaults.conf
5) application-jar: path to the bundled jar including your application and all dependencies; the URL must be globally visible inside the cluster, e.g. an hdfs:// path on shared storage, or a file:// path that exists at the same location on every node
6) application-arguments: arguments passed to the main() method

The master URL can be in one of the following formats:

local                  run Spark locally with one worker thread
local[K]               run Spark locally with K worker threads
local[*]               run Spark locally with as many worker threads as logical cores
spark://HOST:PORT      connect to a Spark standalone cluster master
mesos://HOST:PORT      connect to a Mesos cluster
yarn                   connect to a YARN cluster (client or cluster mode, depending on --deploy-mode)
Full spark-submit options:

[hadoop@harvey bin]$ ./spark-submit 
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

3. Starting the Spark Shell

3.1 Starting the Spark Shell

spark-shell is the interactive shell that ships with Spark. It is convenient for interactive programming: you can write Spark programs in Scala directly at its prompt.
Start the Spark shell with the following command:

/home/hadoop/software/spark/bin/spark-shell \
--master spark://harvey:7077 \
--executor-memory 1G \
--total-executor-cores 2

Note:
        If spark-shell is started without specifying a master address, it still starts and runs programs normally, but it is not connected to the cluster: on a single node with no slaves file (and no master configured in spark-defaults.conf), spark-shell falls back to local mode by default.
        In Local mode, the master and worker run inside the same process.
        In Cluster mode, the master and worker run in separate processes.

        The Spark shell initializes the SparkContext as an object named sc by default; user code that needs it can use sc directly.
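
For example, once the shell is up, sc can be used directly (a trivial sanity check, not part of the original walkthrough):

sc.parallelize(1 to 100).sum()   // distributed sum over the cluster; expected result: 5050.0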

3.2 Writing a WordCount Program in the Spark Shell

Upload the RELEASE file in the Spark directory to hdfs://harvey:9000/RELEASE:

[hadoop@harvey spark]$ hadoop fs -put RELEASE /

View the file contents:

[hadoop@harvey spark]$ hadoop fs -text /RELEASE
Spark 2.2.1 built for Hadoop 2.7.3
Build flags: -Phadoop-2.7 -Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos -DzincPort=3036

Write the WordCount program in Scala in the Spark shell; the code is as follows:

sc.textFile("hdfs://harvey:9000/RELEASE").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).saveAsTextFile("hdfs://harvey:9000/out")

View the result with the hdfs command:

$ hadoop fs -text /out/*


Explanation:
sc is the SparkContext object, the entry point for submitting Spark programs
textFile("hdfs://harvey:9000/RELEASE") reads the data from HDFS
flatMap(_.split(" ")) maps each line to words and then flattens the result
map((_, 1)) turns each word into a (word, 1) tuple
reduceByKey(_+_) reduces by key, accumulating the values
saveAsTextFile("hdfs://harvey:9000/out") writes the result to HDFS
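
The same pipeline can also be written step by step, which makes the intermediate RDDs easier to inspect. This is an equivalent sketch of the one-liner above, not additional code from the article:

val lines  = sc.textFile("hdfs://harvey:9000/RELEASE")   // RDD[String]: one element per line
val words  = lines.flatMap(_.split(" "))                 // RDD[String]: one element per word
val pairs  = words.map((_, 1))                           // RDD[(String, Int)]: (word, 1) tuples
val counts = pairs.reduceByKey(_ + _)                    // RDD[(String, Int)]: counts summed per word
counts.saveAsTextFile("hdfs://harvey:9000/out")          // write the result back to HDFS

Note that saveAsTextFile fails if the output directory already exists, so remove /out (hadoop fs -rm -r /out) before re-running.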

4. Writing the WordCount Program in IDEA

Create a project named spark-wordcount.

  • pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.spark.harvey</groupId>
    <artifactId>spark-wordcount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <spark.version>2.2.1</spark.version>
        <hadoop.version>3.0.1</hadoop.version>
        <scala.version>2.11.12</scala.version>
        <log4j.version>1.2.12</log4j.version>
        <slf4j.version>1.7.25</slf4j.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>${log4j.version}</version>
        </dependency>
        <dependency>
            <artifactId>slf4j-log4j12</artifactId>
            <groupId>org.slf4j</groupId>
            <version>${slf4j.version}</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <!--<testSourceDirectory>src/test/scala</testSourceDirectory>-->
        <resources>
            <!-- Make sure all .properties files under resources are filtered -->
            <resource>
                <directory>src/main/resources</directory>
                <includes>
                    <include>**/*.properties</include>
                </includes>
                <filtering>true</filtering>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.10</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.spark.harvey.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    <reporting>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
            </plugin>
        </plugins>
    </reporting>
</project>
  • Logging configuration (create log4j.xml under resources)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration>
    <!-- Output log messages to the console -->
    <appender name="ConsoleAppender" class="org.apache.log4j.ConsoleAppender">
        <!-- Log output layout -->
        <layout class="org.apache.log4j.PatternLayout">
            <!-- Log output pattern -->
            <param name="ConversionPattern" value="[%d{yyyy-MM-dd HH:mm:ss:SSS}] [%-5p] [method:%l]%n%m%n%n" />
        </layout>
        <!-- Filter that restricts the output level range -->
        <filter class="org.apache.log4j.varia.LevelRangeFilter">
            <!-- Minimum level to output -->
            <param name="levelMin" value="debug" />
            <!-- Maximum level to output -->
            <param name="levelMax" value="debug" />
            <!-- Accept the event when it falls within the range; default is false -->
            <param name="AcceptOnMatch" value="true" />
        </filter>
    </appender>

    <!-- Root logger configuration -->
    <root>
        <level value ="INFO"/>
        <appender-ref ref="ConsoleAppender"/>
    </root>
</log4j:configuration>
  • Program code
package com.spark.harvey

import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.LoggerFactory

object WordCount {

  val logger = LoggerFactory.getLogger(WordCount.getClass)

  def main(args: Array[String]): Unit = {
    // Create the SparkConf and set the application name
    val conf: SparkConf = new SparkConf().setAppName("WordCount")

    // Create the Spark context
    val sc: SparkContext = new SparkContext(conf)

    // Count words with RDDs, sort by count in descending order, and save the result
    sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_, 1).sortBy(_._2, false).saveAsTextFile(args(1))

    logger.info("wordcount complete!")

    sc.stop()
  }
}
  • Upload to the cluster and run the job
/home/hadoop/software/spark/bin/spark-submit \
--master spark://harvey:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
/home/hadoop/jar/hadoop-mr-wordcount-jar-with-dependencies.jar \
hdfs://harvey:9000/RELEASE \
hdfs://harvey:9000/out

View the result:

$ hadoop fs -text /out/*


5. Debugging the WordCount Program Locally in IDEA

To debug a Spark program locally, use the local submit mode: the local machine is the runtime environment, acting as both Master and Worker.
Set the program arguments to the input and output directories, then add breakpoints and run the program in debug mode.
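
For reference, forcing local mode in code looks roughly like the following. This is a sketch, not the article's exact setup: the object name WordCountLocalDebug is hypothetical, and the input and output paths are still taken from the program arguments configured in IDEA.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocalDebug {
  def main(args: Array[String]): Unit = {
    // local[*] runs everything (driver, master and worker) inside this one JVM,
    // using all available local cores.
    val conf = new SparkConf().setAppName("WordCount-local").setMaster("local[*]")
    val sc = new SparkContext(conf)

    sc.textFile(args(0))          // args(0): input file or directory (local path or hdfs://)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))    // args(1): output directory (must not exist yet)

    sc.stop()
  }
}

Alternatively, leave the code unchanged and add -Dspark.master=local[*] to the run configuration's VM options; SparkConf picks up spark.* system properties automatically.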

Issues encountered

  • 1) Running a Hadoop program on Windows without a Hadoop environment configured:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

Reference: see the blog post "MapReduce 工作机制 错误记录及解决" (MapReduce working mechanism: error notes and fixes).

  • 2) No write permission when IDEA on Windows connects to HDFS remotely:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=admin, access=WRITE, inode="/":hadoop:supergroup:drwxr-xr-x

The error says that writing to HDFS is not allowed. To fix it, add the following configuration to Hadoop's hdfs-site.xml (note that this disables HDFS permission checking, so it is only appropriate for a test environment):

<property>    
	<name>dfs.permissions</name>
	<value>false</value>
</property>

Then restart Hadoop and Spark and rerun the program; the error disappears.

6. Remote Debugging of the WordCount Program in IDEA

Remote debugging through IDEA essentially uses IDEA as the Driver to submit the application. The configuration is as follows:
Modify the SparkConf to add the jar that will ultimately be run and the address of the Driver host, and set the Master URL to submit to; a sketch of such a configuration is shown below.
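
This is a hedged sketch based on the earlier sections, not the article's exact code: the jar path and the spark.driver.host value are assumptions that must be adapted to the local environment.

import org.apache.spark.{SparkConf, SparkContext}

// IDEA acts as the driver: point the conf at the standalone master and ship the
// assembled application jar to the executors.
val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("spark://harvey:7077")                      // submit to the cluster master
  .setJars(List("target/spark-wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar"))  // assumed jar path
  .setIfMissing("spark.driver.host", "192.168.1.100")    // IP of the IDEA host, reachable from the cluster (placeholder)

val sc = new SparkContext(conf)
// ... the WordCount pipeline from section 4 follows unchanged ...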
Running the program from IDEA then produces the same word-count output as before.

7. Spark Core Concepts

        Every Spark application is driven by a driver program, which launches the various parallel operations on the cluster. The driver program contains the application's main function, defines distributed datasets on the cluster, and applies operations to those datasets.

        The driver program accesses Spark through a SparkContext object, which represents a connection to the computing cluster. The shell automatically creates a SparkContext when it starts, available as a variable named sc.

        The driver program typically manages a number of executor nodes.
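
As a small illustration (not from the article): in the shell the driver is the shell process itself and sc already exists, whereas a standalone application's driver creates its own SparkContext, as in section 4.

// In spark-shell: the shell process is the driver and sc is pre-created.
sc.parallelize(1 to 10).count()

// In a standalone application: the driver builds its own SparkContext
// (the master URL is normally supplied by spark-submit).
import org.apache.spark.{SparkConf, SparkContext}
val ownSc = new SparkContext(new SparkConf().setAppName("demo"))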
