Spark: Developing Spark Programs in IDEA ★★★★★

韩家小志

Article link: https://blog.csdn.net/qq_46893497/article/details/113920145

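Developing Spark Programs in IDEA

Project setup
- Create the project
- Add the pom dependencies
Create WordCount
Supplement: command reference
- spark-shell and spark-submit
- Command parameters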

Project setup

Create the project

Add the pom dependencies

 <!-- Repository locations, in order: aliyun and cloudera -->
    <repositories>
        <repository>
            <id>aliyun</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <scala.version>2.11.8</scala.version>
        <scala.compat.version>2.11</scala.compat.version>
        <hadoop.version>2.7.4</hadoop.version>
        <spark.version>2.2.0</spark.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

       <!-- <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive-thriftserver_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
             <groupId>org.apache.spark</groupId>
             <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
             <version>${spark.version}</version>
         </dependency>
         <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
       <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.0-mr1-cdh5.14.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.2.0-cdh5.14.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>1.2.0-cdh5.14.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.3.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>1.3.1</version>
        </dependency>
        <dependency>
            <groupId>com.typesafe</groupId>
            <artifactId>config</artifactId>
            <version>1.3.3</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.47</version>
        </dependency>-->
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <!-- Plugin for compiling Java -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
            </plugin>
            <!-- Plugin for compiling Scala -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.18.1</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass></mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

Create WordCount

Write WordCount ★★★★★ (key point)

package cn.hanjiaxiaozhi.hello

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

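/**
 * Author hanjiaxiaozhi
 * Date 2020/7/20 15:49
 * Desc Demonstrates writing WordCount with Spark
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. Create sc, the Spark execution environment
    // local[*] runs Spark locally with multiple threads to simulate a cluster, similar to the local mode
    // demonstrated earlier in spark-shell; * means use all available local cores
    val conf: SparkConf = new SparkConf().setAppName("wc").setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    sc.setLogLevel("WARN") // set the log level to WARN from here on to reduce unnecessary output

    // 2. Read the file
    // A Resilient Distributed Dataset (RDD) is covered in detail later; for now, think of it as a
    // distributed collection that is as easy to use as a local collection
    // RDD[lines of words]
    val fileRDD: RDD[String] = sc.textFile("file:///D:\\data\\spark\\words.txt")

    // 3. Process the data: WordCount
    // 3.1 Split each line into words and flatten the result into one collection
    // RDD[individual words]
    val wordRDD: RDD[String] = fileRDD.flatMap(_.split(" "))
    // 3.2 Pair each word with 1
    // RDD[(hello, 1),(hello, 1),(hello, 1)...(you,1)..]
    val wordAndOneRDD: RDD[(String, Int)] = wordRDD.map((_,1))
    // 3.3 Group and aggregate: previously this took a groupBy followed by a sum; reduceByKey does it in one step
    // reduceByKey is covered separately later; for now, read it as aggregation by key, i.e. groupBy + aggregation
    val wordAndCount: RDD[(String, Int)] = wordAndOneRDD.reduceByKey(_+_)

    // 4. Output the result
    // The RDD[(String, Int)] above is a distributed collection, so collect it into a local collection before printing to the console
    val result: Array[(String, Int)] = wordAndCount.collect()
    result.foreach(println)

    // 5. Stop sc
    sc.stop()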
  }
}
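
The comments above describe reduceByKey as grouping by key followed by an aggregation. A minimal sketch of that equivalence on ordinary Scala collections (no Spark required) may make the pipeline easier to follow. The object name `LocalWordCount`, the extra empty-token filter, and the sample lines are assumptions made for illustration, and the sketch says nothing about how Spark actually executes the job.

```scala
// Illustrative sketch only: plain Scala collections, no Spark dependency.
// LocalWordCount and the sample input are hypothetical.
object LocalWordCount {
  // Same shape as the RDD pipeline: split, pair with 1, then sum per key.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // one element per word
      .filter(_.nonEmpty)      // drop empty tokens produced by repeated spaces
      .map((_, 1))             // (word, 1)
      .groupBy(_._1)           // word -> all of its (word, 1) pairs
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("hello me you her", "hello me you", "hello me", "hello")
    count(lines).foreach(println)
  }
}
```

On a local collection the groupBy step materializes every pair before summing; reduceByKey expresses the same result as a single per-key reduction.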

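Modify the code and package it to run on YARN

Modify the code
1. The master can be commented out and specified later as a submit command parameter: //.setMaster("local[*]")
2. Input path: sc.textFile(args(0)) // the input path is passed as an argument at submit time
3. Output path: wordAndCount.saveAsTextFile(args(1)) // the output path is passed as an argument at submit time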

package cn.hanjiaxiaozhi.hello

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Author hanjiaxiaozhi
 * Desc Demonstrates writing WordCount with Spark
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. Create sc, the Spark execution environment
    // setMaster is commented out here; the master is supplied by the submit command instead
    val conf: SparkConf = new SparkConf().setAppName("wc")//.setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    sc.setLogLevel("WARN") // set the log level to WARN from here on to reduce unnecessary output

    // 2. Read the file
    // A Resilient Distributed Dataset (RDD) is covered in detail later; for now, think of it as a
    // distributed collection that is as easy to use as a local collection
    // RDD[lines of words]
    val fileRDD: RDD[String] = sc.textFile(args(0)) // the input path is passed as an argument at submit time

    // 3. Process the data: WordCount
    // 3.1 Split each line into words and flatten the result into one collection
    // RDD[individual words]
    val wordRDD: RDD[String] = fileRDD.flatMap(_.split(" "))
    // 3.2 Pair each word with 1
    // RDD[(hello, 1),(hello, 1),(hello, 1)...(you,1)..]
    val wordAndOneRDD: RDD[(String, Int)] = wordRDD.map((_,1))
    // 3.3 Group and aggregate: previously this took a groupBy followed by a sum; reduceByKey does it in one step
    // reduceByKey is covered separately later; for now, read it as aggregation by key, i.e. groupBy + aggregation
    val wordAndCount: RDD[(String, Int)] = wordAndOneRDD.reduceByKey(_+_)

    // 4. Output the result
    // Instead of collecting to the driver and printing, write the result to the output path
    //val result: Array[(String, Int)] = wordAndCount.collect()
    //result.foreach(println)
    wordAndCount.saveAsTextFile(args(1)) // the output path is passed as an argument at submit time

    // 5. Stop sc
    sc.stop()
  }
}
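
Because this version reads `args(0)` and `args(1)` directly, running it with too few arguments fails with an `ArrayIndexOutOfBoundsException`. One possible guard is sketched below. `JobArgs` and `parse` are hypothetical helpers that are not part of the original program or of Spark, and the usage text is made up.

```scala
// Illustrative sketch only: JobArgs and parse are hypothetical helpers.
object JobArgs {
  // Returns Right((inputPath, outputPath)) when exactly two arguments are given,
  // otherwise Left with a usage message.
  def parse(args: Array[String]): Either[String, (String, String)] =
    args match {
      case Array(input, output) => Right((input, output))
      case _ => Left("Usage: WordCount <input path> <output path>")
    }

  def main(args: Array[String]): Unit =
    parse(args) match {
      case Right((input, output)) => println(s"input=$input output=$output")
      case Left(usage)            => System.err.println(usage)
    }
}
```

In the job itself, the `Right` branch would hold the `sc.textFile(input)` and `saveAsTextFile(output)` calls. Note also that `saveAsTextFile` by default refuses to write into a directory that already exists, so each run needs a fresh output path.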

Package the code

Upload the jar and submit it to YARN

/export/servers/spark/bin/spark-submit  \
--class cn.hanjiaxiaozhi.hello.WordCount \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
--queue default \
/root/wc.jar \
hdfs://node01:8020/wordcount/input/words.txt \
hdfs://node01:8020/wordcount/output43_yarn
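
The command follows the layout `spark-submit [options] <app jar> [app arguments]`: everything before the jar configures Spark, and everything after it is passed to `main` as `args`. As a rough illustration of that ordering, the sketch below assembles the same argument list in Scala. `SubmitCommand` is a hypothetical helper, not a Spark API, and it only builds the list without launching anything.

```scala
// Illustrative sketch only: SubmitCommand is a hypothetical helper, not part of Spark.
final case class SubmitCommand(
    mainClass: String,
    master: String,
    deployMode: String,
    options: Seq[(String, String)], // extra "--flag value" pairs
    appJar: String,
    appArgs: Seq[String]) {

  // Options come first, then the application jar, then the application's own arguments.
  def toArgs: Seq[String] =
    Seq("spark-submit", "--class", mainClass, "--master", master, "--deploy-mode", deployMode) ++
      options.flatMap { case (flag, value) => Seq(flag, value) } ++
      Seq(appJar) ++
      appArgs
}

object SubmitCommandDemo {
  def main(args: Array[String]): Unit = {
    val cmd = SubmitCommand(
      mainClass = "cn.hanjiaxiaozhi.hello.WordCount",
      master = "yarn",
      deployMode = "cluster",
      options = Seq(
        "--driver-memory" -> "1g",
        "--executor-memory" -> "1g",
        "--executor-cores" -> "2",
        "--queue" -> "default"),
      appJar = "/root/wc.jar",
      appArgs = Seq(
        "hdfs://node01:8020/wordcount/input/words.txt",
        "hdfs://node01:8020/wordcount/output43_yarn"))
    println(cmd.toArgs.mkString(" "))
  }
}
```

Spark also ships `org.apache.spark.launcher.SparkLauncher` for submitting applications programmatically; the sketch avoids it so that it stays free of dependencies.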

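Supplement: command reference

spark-shell and spark-submit

spark-shell: Spark's interactive shell, generally used for learning and testing. It can submit tasks to local mode, a Spark standalone cluster, or YARN (spark-shell is generally not used to submit to YARN).
spark-submit: the tool Spark provides for submitting a jar to local mode, a Spark standalone cluster, or YARN (usually YARN).
So from now on:
- use spark-shell for simple tests
- use spark-submit to submit jobs to YARN

Command parameters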

http://spark.apache.org/docs/latest/submitting-applications.html

[root@node01 bin]# ./spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
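
The `--master` line in the help output lists the accepted URL forms, and the WordCount code earlier uses `local[*]`. The sketch below summarizes how these values read. `MasterUrl.describe` is a hypothetical helper written for this article, not Spark's own parsing logic, and it covers only the forms mentioned in this article plus `local[N]`.

```scala
// Illustrative sketch only: MasterUrl is a hypothetical helper, not Spark's parser.
object MasterUrl {
  // Matches local[N] and local[*]; a Regex extractor must match the whole string.
  private val LocalN = """local\[(\d+|\*)\]""".r

  def describe(master: String): String = master match {
    case "local"                       => "local mode, one thread"
    case LocalN("*")                   => "local mode, one thread per available core"
    case LocalN(n)                     => s"local mode, $n threads"
    case "yarn"                        => "YARN cluster"
    case m if m.startsWith("spark://") => "Spark standalone cluster"
    case m if m.startsWith("mesos://") => "Mesos cluster"
    case other                         => s"unrecognized master: $other"
  }

  def main(args: Array[String]): Unit =
    Seq("local", "local[*]", "local[2]", "yarn", "spark://node01:7077")
      .foreach(m => println(s"$m -> ${describe(m)}"))
}
```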
