A Detailed Walkthrough of Developing Your First Spark Application (Java Version)

 

https://blog.csdn.net/boling_cavalry/article/details/86776746

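WordCount is the classic introductory demo for learning big data. In this article we develop a Java version of WordCount together and then submit it to a Spark 2.3.2 environment to run.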

Version information

  1. Operating system: CentOS 7
  2. JDK: 1.8.0_191
  3. Spark: 2.3.2
  4. Scala: 2.11.12
  5. Hadoop: 2.7.7
  6. Maven: 3.5.0

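About the Hadoop environment

This walkthrough uses Hadoop's HDFS. For deploying Hadoop, see the article 《Linux部署hadoop2.7.7集群》 (Deploying a Hadoop 2.7.7 cluster on Linux).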

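About the Spark environment

This walkthrough uses Spark 2.3.2. For deploying a Spark cluster, see the article 《部署spark2.2集群(standalone模式)》 (Deploying a Spark 2.2 cluster in standalone mode).
Note that the spark-core jar for this Spark version does not support Scala 2.12, so use Scala 2.11 when deploying Spark.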

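About the application developed here

The application is the classic WordCount: given a text file, it counts how many times each word occurs, takes the 10 most frequent words, prints them, and saves them to a file in HDFS.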
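
To make the goal concrete before involving Spark, the same computation can be sketched on a single machine with plain Java streams. This is only an illustration under assumptions: the class name LocalWordCountSketch and the method topWords are made up for this example and are not part of the sparkwordcount project, and ties are broken alphabetically here, which the Spark job does not promise.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical single-machine sketch; not part of the sparkwordcount project.
final class LocalWordCountSketch {

    /** Counts words split on single spaces and returns the n most frequent entries. */
    static List<Map.Entry<String, Long>> topWords(List<String> lines, int n) {
        // Split every line into words, then count occurrences per word.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        // Sort by count (descending), break ties by word, keep the first n.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "c a b");
        for (Map.Entry<String, Long> entry : topWords(lines, 2)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Running main on the two sample lines prints a with a count of 3 and b with a count of 2. The Spark version developed in this article follows the same three steps (split, count per word, sort by count) but distributes them across the cluster.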

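The text used for counting words

  1. The txt file used here comes from a PDF of the English edition of Gone with the Wind that I found online and exported to txt; you can test with any txt file you like.
  2. On the machine running the HDFS service, run the following command to create the input folder: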
<span style="color:#000000"><code class="language-shell">~/hadoop-2.7.7/bin/hdfs dfs -mkdir /input
</code></span>
  3. On the machine running the HDFS service, run the following command to create the output folder:
<span style="color:#000000"><code class="language-shell">~/hadoop-2.7.7/bin/hdfs dfs -mkdir /output
</code></span>
  4. Copy the text file to the machine running the HDFS service, then run the following command to upload it to the /input folder in HDFS:
<span style="color:#000000"><code class="language-shell">~/hadoop-2.7.7/bin/hdfs dfs -put ~/GoneWiththeWind.txt /input
</code></span>

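Source code download

The following sections describe the coding process in detail. If you would rather not write the code yourself, you can download the complete application source from GitHub; the addresses are listed in the table below: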

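| Name | Link | Notes |
| --- | --- | --- |
| Project home page | https://github.com/zq2599/blog_demos | The project's home page on GitHub |
| Git repository URL (https) | https://github.com/zq2599/blog_demos.git | Repository URL for the project source, HTTPS protocol |
| Git repository URL (ssh) | git@github.com:zq2599/blog_demos.git | Repository URL for the project source, SSH protocol |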

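This Git project contains several folders; the source code for this article is in the sparkwordcount folder.
[Figure: screenshot of the repository with the sparkwordcount folder outlined in red]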

Developing the application

  1. Create a Maven-based Java application named sparkwordcount. The content of pom.xml is as follows:
<span style="color:#000000"><code class="language-xml"><?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.bolingcavalry</groupId>
    <artifactId>sparkwordcount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.3.2</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/java</sourceDirectory>
        <testSourceDirectory>src/test/java</testSourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass></mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.2.1</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>exec</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <executable>java</executable>
                    <includeProjectDependencies>false</includeProjectDependencies>
                    <includePluginDependencies>false</includePluginDependencies>
                    <classpathScope>compile</classpathScope>
                    <mainClass>com.bolingcavalry.sparkwordcount.WordCount</mainClass>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
</code></span>
  2. Create the WordCount class. The key parts of the code are commented, so they are not explained again here:
<span style="color:#000000"><code class="language-java"><span style="color:#c678dd">package</span> com<span style="color:#999999">.</span>bolingcavalry<span style="color:#999999">.</span>sparkwordcount<span style="color:#999999">;</span>

<span style="color:#c678dd">import</span> org<span style="color:#999999">.</span>apache<span style="color:#999999">.</span>spark<span style="color:#999999">.</span>SparkConf<span style="color:#999999">;</span>
<span style="color:#c678dd">import</span> org<span style="color:#999999">.</span>apache<span style="color:#999999">.</span>spark<span style="color:#999999">.</span>api<span style="color:#999999">.</span>java<span style="color:#999999">.</span>JavaPairRDD<span style="color:#999999">;</span>
<span style="color:#c678dd">import</span> org<span style="color:#999999">.</span>apache<span style="color:#999999">.</span>spark<span style="color:#999999">.</span>api<span style="color:#999999">.</span>java<span style="color:#999999">.</span>JavaRDD<span style="color:#999999">;</span>
<span style="color:#c678dd">import</span> org<span style="color:#999999">.</span>apache<span style="color:#999999">.</span>spark<span style="color:#999999">.</span>api<span style="color:#999999">.</span>java<span style="color:#999999">.</span>JavaSparkContext<span style="color:#999999">;</span>
<span style="color:#c678dd">import</span> scala<span style="color:#999999">.</span>Tuple2<span style="color:#999999">;</span>

<span style="color:#c678dd">import</span> java<span style="color:#999999">.</span>text<span style="color:#999999">.</span>SimpleDateFormat<span style="color:#999999">;</span>
<span style="color:#c678dd">import</span> java<span style="color:#999999">.</span>util<span style="color:#999999">.</span>Arrays<span style="color:#999999">;</span>
<span style="color:#c678dd">import</span> java<span style="color:#999999">.</span>util<span style="color:#999999">.</span>Date<span style="color:#999999">;</span>
<span style="color:#c678dd">import</span> java<span style="color:#999999">.</span>util<span style="color:#999999">.</span>List<span style="color:#999999">;</span>

<span style="color:#5c6370">/**
 * @Description: spark的WordCount实战
 * @author: willzhao E-mail: zq2599@gmail.com
 * @date: 2019/2/8 17:21
 */</span>
<span style="color:#c678dd">public</span> <span style="color:#c678dd">class</span> WordCount <span style="color:#999999">{</span>

    <span style="color:#c678dd">public</span> <span style="color:#c678dd">static</span> <span style="color:#c678dd">void</span> <span style="color:#61aeee">main</span><span style="color:#999999">(</span>String<span style="color:#999999">[</span><span style="color:#999999">]</span> args<span style="color:#999999">)</span> <span style="color:#999999">{</span>
        String hdfsHost <span style="color:#669900">=</span> args<span style="color:#999999">[</span><span style="color:#98c379">0</span><span style="color:#999999">]</span><span style="color:#999999">;</span>
        String hdfsPort <span style="color:#669900">=</span> args<span style="color:#999999">[</span><span style="color:#98c379">1</span><span style="color:#999999">]</span><span style="color:#999999">;</span>
        String textFileName <span style="color:#669900">=</span> args<span style="color:#999999">[</span><span style="color:#98c379">2</span><span style="color:#999999">]</span><span style="color:#999999">;</span>

        SparkConf sparkConf <span style="color:#669900">=</span> <span style="color:#c678dd">new</span> SparkConf<span style="color:#999999">(</span><span style="color:#999999">)</span><span style="color:#999999">.</span><span style="color:#61aeee">setAppName</span><span style="color:#999999">(</span><span style="color:#669900">"Spark WordCount Application (java)"</span><span style="color:#999999">)</span><span style="color:#999999">;</span>

        JavaSparkContext javaSparkContext <span style="color:#669900">=</span> <span style="color:#c678dd">new</span> JavaSparkContext<span style="color:#999999">(</span>sparkConf<span style="color:#999999">)</span><span style="color:#999999">;</span>

        String hdfsBasePath <span style="color:#669900">=</span> <span style="color:#669900">"hdfs://"</span> <span style="color:#669900">+</span> hdfsHost <span style="color:#669900">+</span> <span style="color:#669900">":"</span> <span style="color:#669900">+</span> hdfsPort<span style="color:#999999">;</span>
        <span style="color:#5c6370">//文本文件的hdfs路径</span>
        String inputPath <span style="color:#669900">=</span> hdfsBasePath <span style="color:#669900">+</span> <span style="color:#669900">"/input/"</span> <span style="color:#669900">+</span> textFileName<span style="color:#999999">;</span>

        <span style="color:#5c6370">//输出结果文件的hdfs路径</span>
        String outputPath <span style="color:#669900">=</span> hdfsBasePath <span style="color:#669900">+</span> <span style="color:#669900">"/output/"</span>
                       <span style="color:#669900">+</span> <span style="color:#c678dd">new</span> SimpleDateFormat<span style="color:#999999">(</span><span style="color:#669900">"yyyyMMddHHmmss"</span><span style="color:#999999">)</span><span style="color:#999999">.</span><span style="color:#61aeee">format</span><span style="color:#999999">(</span><span style="color:#c678dd">new</span> Date<span style="color:#999999">(</span><span style="color:#999999">)</span><span style="color:#999999">)</span><span style="color:#999999">;</span>

        System<span style="color:#999999">.</span>out<span style="color:#999999">.</span><span style="color:#61aeee">println</span><span style="color:#999999">(</span><span style="color:#669900">"input path : "</span> <span style="color:#669900">+</span> inputPath<span style="color:#999999">)</span><span style="color:#999999">;</span>

        System<span style="color:#999999">.</span>out<span style="color:#999999">.</span><span style="color:#61aeee">println</span><span style="color:#999999">(</span><span style="color:#669900">"output path : "</span> <span style="color:#669900">+</span> outputPath<span style="color:#999999">)</span><span style="color:#999999">;</span>

        <span style="color:#5c6370">//导入文件</span>
        JavaRDD<span style="color:#61aeee"><span style="color:#999999"><</span>String<span style="color:#999999">></span></span> textFile <span style="color:#669900">=</span> javaSparkContext<span style="color:#999999">.</span><span style="color:#61aeee">textFile</span><span style="color:#999999">(</span>inputPath<span style="color:#999999">)</span><span style="color:#999999">;</span>

        JavaPairRDD<span style="color:#61aeee"><span style="color:#999999"><</span>String<span style="color:#999999">,</span> Integer<span style="color:#999999">></span></span> counts <span style="color:#669900">=</span> textFile
                <span style="color:#5c6370">//每一行都分割成单词,返回后组成一个大集合</span>
                <span style="color:#999999">.</span><span style="color:#61aeee">flatMap</span><span style="color:#999999">(</span>s <span style="color:#669900">-</span><span style="color:#669900">></span> Arrays<span style="color:#999999">.</span><span style="color:#61aeee">asList</span><span style="color:#999999">(</span>s<span style="color:#999999">.</span><span style="color:#61aeee">split</span><span style="color:#999999">(</span><span style="color:#669900">" "</span><span style="color:#999999">)</span><span style="color:#999999">)</span><span style="color:#999999">.</span><span style="color:#61aeee">iterator</span><span style="color:#999999">(</span><span style="color:#999999">)</span><span style="color:#999999">)</span>
                <span style="color:#5c6370">//key是单词,value是1</span>
                <span style="color:#999999">.</span><span style="color:#61aeee">mapToPair</span><span style="color:#999999">(</span>word <span style="color:#669900">-</span><span style="color:#669900">></span> <span style="color:#c678dd">new</span> Tuple2<span style="color:#669900"><</span><span style="color:#669900">></span><span style="color:#999999">(</span>word<span style="color:#999999">,</span> <span style="color:#98c379">1</span><span style="color:#999999">)</span><span style="color:#999999">)</span>
                <span style="color:#5c6370">//基于key进行reduce,逻辑是将value累加</span>
                <span style="color:#999999">.</span><span style="color:#61aeee">reduceByKey</span><span style="color:#999999">(</span><span style="color:#999999">(</span>a<span style="color:#999999">,</span> b<span style="color:#999999">)</span> <span style="color:#669900">-</span><span style="color:#669900">></span> a <span style="color:#669900">+</span> b<span style="color:#999999">)</span><span style="color:#999999">;</span>

        <span style="color:#5c6370">//先将key和value倒过来,再按照key排序</span>
        JavaPairRDD<span style="color:#61aeee"><span style="color:#999999"><</span>Integer<span style="color:#999999">,</span> String<span style="color:#999999">></span></span> sorts <span style="color:#669900">=</span> counts
                <span style="color:#5c6370">//key和value颠倒,生成新的map</span>
                <span style="color:#999999">.</span><span style="color:#61aeee">mapToPair</span><span style="color:#999999">(</span>tuple2 <span style="color:#669900">-</span><span style="color:#669900">></span> <span style="color:#c678dd">new</span> Tuple2<span style="color:#669900"><</span><span style="color:#669900">></span><span style="color:#999999">(</span>tuple2<span style="color:#999999">.</span><span style="color:#61aeee">_2</span><span style="color:#999999">(</span><span style="color:#999999">)</span><span style="color:#999999">,</span> tuple2<span style="color:#999999">.</span><span style="color:#61aeee">_1</span><span style="color:#999999">(</span><span style="color:#999999">)</span><span style="color:#999999">)</span><span style="color:#999999">)</span>
                <span style="color:#5c6370">//按照key倒排序</span>
                <span style="color:#999999">.</span><span style="color:#61aeee">sortByKey</span><span style="color:#999999">(</span><span style="color:#56b6c2">false</span><span style="color:#999999">)</span><span style="color:#999999">;</span>

        <span style="color:#5c6370">//取前10个</span>
        List<Tuple2<Integer, String>> top10 = sorts.take(10);

        // Print the results
        for (Tuple2<Integer, String> tuple2 : top10) {
            System.out.println(tuple2._2() + "\t" + tuple2._1());
        }

        // Merge into a single partition, then save as one text file on HDFS
        javaSparkContext.parallelize(top10).coalesce(1).saveAsTextFile(outputPath);

        // Close the context
        javaSparkContext.close();
    }
}
</code></span>
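
To sanity-check the count, sort, and take-10 logic without a cluster, the same steps can be sketched in plain Java on a few sample lines. This is only an illustration: the class name `LocalTopWords`, the whitespace tokenization, and the alphabetical tie-break are assumptions made for the sketch, not details taken from the WordCount class.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the article's project.
class LocalTopWords {

    // Counts whitespace-separated words and returns the n most frequent ones
    // as "word<TAB>count" strings, highest count first.
    // Ties are broken alphabetically so the result is deterministic (an assumption of this sketch).
    static List<String> topN(List<String> lines, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }

        Comparator<Map.Entry<String, Integer>> byCountDesc =
                Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey());

        return counts.entrySet().stream()
                .sorted(byCountDesc)
                .limit(n)
                .map(e -> e.getKey() + "\t" + e.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("the cat and the dog", "the end");
        topN(sample, 2).forEach(System.out::println);
    }
}
```

Running a sketch like this on a small excerpt of the input file gives a quick reference to compare against the Spark job's output for the same excerpt.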
  3. In the directory containing pom.xml, run the following command to compile and package the jar:
<span style="color:#000000"><code class="language-shell">mvn clean package -Dmaven.test.skip=true
</code></span>
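  4. After a successful build, the file sparkwordcount-1.0-SNAPSHOT.jar is generated in the target directory; upload it to the ~/jars/ directory on the Spark server.
  5. Assuming the Spark server's IP address is 192.168.119.163, run the following command on the Spark server to submit the job: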
<span style="color:#000000"><code class="language-shell">~/spark-2.3.2-bin-hadoop2.7/bin/spark-submit \
--master spark://192.168.119.163:7077 \
--class com.bolingcavalry.sparkwordcount.WordCount \
--executor-memory 512m \
--total-executor-cores 2 \
~/jars/sparkwordcount-1.0-SNAPSHOT.jar \
192.168.119.163 \
8020 \
GoneWiththeWind.txt
</code></span>

The last three arguments of the command above are passed to the Java main method; see the source of the WordCount class for how they are used.
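
As a rough illustration of how three such arguments could be turned into HDFS paths, here is a minimal sketch. It assumes the arguments are the HDFS host, the HDFS port, and the name of a file under /input, which matches the values in the command but is a guess about their meaning; the class `WordCountArgs` and its fields are hypothetical, and the WordCount source remains the authority.

```java
// Hypothetical helper, not part of the article's project.
class WordCountArgs {
    final String inputPath;   // full HDFS path of the text file to read
    final String outputRoot;  // HDFS folder under which results are written

    WordCountArgs(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                    "expected 3 arguments: <hdfs-host> <hdfs-port> <file-name>");
        }
        // Assumption: args[0] is the HDFS host and args[1] the HDFS port.
        String hdfsBase = "hdfs://" + args[0] + ":" + args[1];
        // Assumption: args[2] names a file that was uploaded to /input.
        this.inputPath = hdfsBase + "/input/" + args[2];
        this.outputRoot = hdfsBase + "/output";
    }

    public static void main(String[] args) {
        WordCountArgs parsed = new WordCountArgs(
                new String[] {"192.168.119.163", "8020", "GoneWiththeWind.txt"});
        System.out.println(parsed.inputPath);
        System.out.println(parsed.outputRoot);
    }
}
```

Validating the argument count up front makes a mistyped spark-submit command fail immediately with a readable message rather than later inside the job.
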
6. The job starts running as soon as it is submitted; output like the following indicates that it has finished:

<span style="color:#000000"><code class="language-shell">2019-02-08 21:26:04 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2019-02-08 21:26:04 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-02-08 21:26:04 INFO  SparkContext:54 - Successfully stopped SparkContext
2019-02-08 21:26:04 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-02-08 21:26:04 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-c3e2ea9e-7daf-4cab-a207-26f0a0394017
2019-02-08 21:26:04 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-d60e4d75-4189-4f33-a5e2-fbe9b06bdae7
</code></span>
  7. Scroll back through the console output; as shown below, the ten most frequent words have been printed to the console:
<span style="color:#000000"><code class="language-shell">2019-02-08 21:36:15 INFO  DAGScheduler:54 - Job 1 finished: take at WordCount.java:61, took 0.313008 s
the	18264
and	14150
to	10020
of	8615
a	7571
her	7086
she	6217
was	5912
in	5751
had	4502
2019-02-08 21:36:15 INFO  deprecation:1173 - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2019-02-08 21:36:15 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
</code></span>
  8. On the HDFS server, list the files; a new subdirectory 20190208213610 has been created under /output:
<span style="color:#000000"><code class="language-shell">[hadoop@node0 ~]$ ~/hadoop-2.7.7/bin/hdfs dfs -ls /output
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2019-02-08 21:36 /output/20190208213610
</code></span>
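
The subdirectory name reads like a yyyyMMddHHmmss timestamp (2019-02-08 21:36:10). A minimal sketch of producing such a per-run output path is shown below; the class `OutputDirName` and the method `outputPath` are hypothetical names, and the pattern is inferred from the directory name rather than taken from the WordCount source.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the article's project.
class OutputDirName {
    // Pattern inferred from the directory name 20190208213610 (an assumption).
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Appends a timestamped subdirectory to the given output root.
    static String outputPath(String outputRoot, LocalDateTime now) {
        return outputRoot + "/" + now.format(STAMP);
    }

    public static void main(String[] args) {
        // Prints /output/20190208213610
        System.out.println(outputPath("/output", LocalDateTime.of(2019, 2, 8, 21, 36, 10)));
    }
}
```

A fresh directory per run is useful because saveAsTextFile refuses to write into a directory that already exists, so a fixed output path would make the second run fail.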
  9. List the subdirectory; it contains two files:
<span style="color:#000000"><code class="language-shell">[hadoop@node0 ~]$ ~/hadoop-2.7.7/bin/hdfs dfs -ls /output/20190208213610
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2019-02-08 21:36 /output/20190208213610/_SUCCESS
-rw-r--r--   3 hadoop supergroup        108 2019-02-08 21:36 /output/20190208213610/part-00000
</code></span>
  10. The file /output/20190208213610/part-00000 listed above holds the result; view its contents with the cat command:
<span style="color:#000000"><code class="language-shell">[hadoop@node0 ~]$ ~/hadoop-2.7.7/bin/hdfs dfs -cat /output/20190208213610/part-00000
(18264,the)
(14150,and)
(10020,to)
(8615,of)
(7571,a)
(7086,her)
(6217,she)
(5912,was)
(5751,in)
(4502,had)
</code></span>

As you can see, this matches the console output shown earlier.
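
The two outputs differ only in layout: the console prints word, tab, count, while saveAsTextFile writes each element's toString(), which for a Tuple2 is "(count,word)". To compare them mechanically, a line from part-00000 can be converted to the console layout. The sketch below assumes that line format; `TupleLine` and `toConsoleFormat` are hypothetical names.

```java
// Hypothetical helper, not part of the article's project.
class TupleLine {

    // Converts a saved line such as "(18264,the)" to the console layout "the<TAB>18264".
    static String toConsoleFormat(String line) {
        String s = line.trim();
        if (s.length() < 2 || s.charAt(0) != '(' || s.charAt(s.length() - 1) != ')') {
            throw new IllegalArgumentException("not a tuple line: " + line);
        }
        String inner = s.substring(1, s.length() - 1);
        // Split at the first comma only: the count is numeric, but a word might contain commas.
        int comma = inner.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("missing comma: " + line);
        }
        int count = Integer.parseInt(inner.substring(0, comma));
        String word = inner.substring(comma + 1);
        return word + "\t" + count;
    }

    public static void main(String[] args) {
        System.out.println(toConsoleFormat("(18264,the)"));
    }
}
```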

11. The job that just ran is also visible on the Spark web page:
[Screenshot: the job's entry on the Spark web page]

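That completes the development and run of our first Spark application. In the following articles, we will work through more hands-on Spark exercises together.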

 
