Getting Started with Spark: Word Count

Author: 碣石观海

Article link: https://blog.csdn.net/weixin_39469127/article/details/90745027

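I. Environment

For setting up the Spark and Hadoop environments, see my earlier articles.

Development environment:
    OS: Windows 10
    IDE: scala-eclipse-IDE (Scala IDE for Eclipse)
    Build tool: Maven 3.6.0
    JDK 1.8
    Scala 2.11.11
    Spark 2.4.3

Spark runtime environment:
    OS: Linux CentOS 7 (two machines: one master node, one slave node)
        master : 192.168.190.200
        slave1 : 192.168.190.201
    JDK 1.8
    Hadoop 2.9.2
    Scala 2.11.11
    Spark 2.4.3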

II. Spark Flow Diagram for Word Frequency Counting

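III. Code (Maven project: wordFreqFileSpark)

1. Configure pom.xml:

1) Once pom.xml is configured, Maven builds the project, which can take a long time. If downloads are slow, configure a mirror repository to speed them up.

2) The Maven coordinates for Spark are listed on the official download page: http://spark.apache.org/downloads.html

3) Building the workspace failed with Spark's "cross-compiled" error, which is a Scala version problem: the project had previously been set to Scala 2.12.
Changing the project's Scala version to 2.11 (dynamic) fixed it, because the Spark 2.4.3 artifact used here (spark-core_2.11) is built against Scala 2.11.
Afterwards, right-click the project and choose Maven -> Update Project.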

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com</groupId>
  <artifactId>wordFreqFileSpark</artifactId>
  <version>0.1</version>
  <dependencies>
  	<dependency><!-- Spark dependency -->
  		<groupId>org.apache.spark</groupId>
  		<artifactId>spark-core_2.11</artifactId>
  		<version>2.4.3</version>
  	</dependency>
  	<dependency><!-- Log4j logging dependency -->
  		<groupId>log4j</groupId>
  		<artifactId>log4j</artifactId>
  		<version>1.2.17</version>
  	</dependency>
  	<dependency><!-- SLF4J binding for Log4j -->
  		<groupId>org.slf4j</groupId>
  		<artifactId>slf4j-log4j12</artifactId>
  		<version>1.7.12</version>
  	</dependency>
  </dependencies>
  <build>
  	<plugins>
  		<!-- Mixed Scala/Java compilation -->
  		<plugin><!-- Scala compiler plugin -->
  			<groupId>org.scala-tools</groupId>
  			<artifactId>maven-scala-plugin</artifactId>
  			<executions>
  				<execution>
  					<id>compile</id>
  					<goals>
  						<goal>compile</goal>
  					</goals>
  					<phase>compile</phase>
  				</execution>
  				<execution>
  					<id>test-compile</id>
  					<goals>
  						<goal>testCompile</goal>
  					</goals>
  					<phase>test-compile</phase>
  				</execution>
  				<execution>
  					<phase>process-resources</phase>
  					<goals>
  						<goal>compile</goal>
  					</goals>
  				</execution>
  			</executions>
  		</plugin>
  		<plugin>
  			<artifactId>maven-compiler-plugin</artifactId>
  			<configuration>
  				<source>1.8</source><!-- Java source level -->
  				<target>1.8</target>
  			</configuration>
  		</plugin>
  		<!-- for fatjar -->
  		<plugin><!-- Bundle all dependencies into a single jar -->
  			<groupId>org.apache.maven.plugins</groupId>
  			<artifactId>maven-assembly-plugin</artifactId>
  			<version>2.4</version>
  			<configuration>
  				<descriptorRefs>
  					<descriptorRef>jar-with-dependencies</descriptorRef>
  				</descriptorRefs>
  			</configuration>
  			<executions>
  				<execution>
  					<id>assemble-all</id>
  					<phase>package</phase>
  					<goals>
  						<goal>single</goal>
  					</goals>
  				</execution>
  			</executions>
  		</plugin>
  		<plugin>
  			<groupId>org.apache.maven.plugins</groupId>
  			<artifactId>maven-jar-plugin</artifactId>
  			<configuration>
  				<archive>
  					<manifest>
  						<addClasspath>true</addClasspath>
  						<!-- Entry-point class of the program -->
  						<mainClass>sparkstreaming_action.wordfreq.WordFreq</mainClass>
  					</manifest>
  				</archive>
  			</configuration>
  		</plugin>
  	</plugins>
  </build>
  <repositories>
  	<repository>  
		<id>alimaven</id>  
		<name>aliyun maven</name>  
		<url>http://maven.aliyun.com/nexus/content/groups/public/</url>  
		<releases>  
			<enabled>true</enabled>  
		</releases>  
		<snapshots>  
			<enabled>false</enabled>  
		</snapshots>  
	</repository>
  </repositories>
</project>
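
The Scala version mismatch described in note 3 can be checked from code. The following is a minimal sketch, not part of the original project: the object name ScalaVersionCheck and the helper binaryVersion are hypothetical, and the expected value "2.11" is taken from the spark-core_2.11 artifact in the POM above. It relies only on scala.util.Properties.versionNumberString from the standard library.

```scala
// Hypothetical helper (not from the article): reports the Scala version
// on the classpath and compares it with the one the Spark artifact expects.
object ScalaVersionCheck {
  // Reduce a full version such as "2.11.11" to its binary version "2.11".
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    // Assumption: spark-core_2.11 requires the Scala 2.11 binary version.
    val expected = "2.11"
    val actual = binaryVersion(scala.util.Properties.versionNumberString)
    if (actual == expected)
      println(s"Scala $actual matches the Spark artifact suffix _$expected")
    else
      println(s"Scala $actual does not match the Spark artifact suffix _$expected")
  }
}
```

If the two versions differ, either the project's Scala version or the artifact suffix in pom.xml needs to change so that they agree.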

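2. Main program:

1) txtFile points to a path on HDFS, which every node can access.
This Spark cluster runs on a master node and a slave node, so a local path would not work: after the job is submitted on the master,
the worker node would not have the file on its local disk and could not read it.
(In a single-machine setup this is not an issue, and a local path can be used.)

2) The input file must also be deployed separately; it cannot be packaged with the project (packaging it would have no effect).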

package sparkstreaming_action.wordfreq

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordFreq {
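  def main(args: Array[String]): Unit = {
    // Build the Spark configuration
    val conf = new SparkConf()
      .setAppName("WordFreq_Spark")
      .setMaster("spark://master:7077")
    // Create the Spark context
    val sc = new SparkContext(conf)
    // Path of the input text file on HDFS
    val txtFile = "hdfs://master:9000/user/spark/input.txt"
    // Read the text file into an RDD of lines
    val txtData = sc.textFile(txtFile)
    // Cache the RDD, since two actions below use it
    txtData.cache()
    // Count the lines (this action also populates the cache)
    txtData.count()
    // Split each line on spaces and count the occurrences of each word
    val wcData = txtData.flatMap { line => line.split(" ") }
      .map { word => (word, 1) }
      .reduceByKey(_ + _)
    // Collect the results from all workers to the driver and print them
    wcData.collect().foreach(println)
    sc.stop()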
  }
}
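
To see what the flatMap, map and reduceByKey steps compute without a cluster, the same pipeline can be imitated with plain Scala collections. This is only an illustrative sketch: the object WordFreqLocal and the function countWords are hypothetical names, and groupBy followed by a sum stands in for Spark's reduceByKey.

```scala
// Hypothetical local imitation of the word-count pipeline (no Spark needed).
object WordFreqLocal {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // one element per word
      .map(word => (word, 1))             // pair each word with a count of 1
      .groupBy { case (word, _) => word } // gather the pairs of each word
      .map { case (word, pairs) =>        // sum the 1s, as reduceByKey(_ + _) does
        (word, pairs.map(_._2).sum)
      }

  def main(args: Array[String]): Unit = {
    val counts = countWords(Seq("This is a book about spark and spark streaming"))
    counts.foreach(println)
  }
}
```

Note that split(" ") yields empty strings when a line contains consecutive spaces; splitting on "\\s+" and filtering out empty strings is one way to avoid counting them.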

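IV. Packaging

1) Open a command-line window in the project root directory (Shift + right-click in the folder and choose the PowerShell window),
then run the following command:

Build the project:
    > mvn install
    After a successful build, two jar files are created under ".\target\" in the current directory;
    wordFreqFileSpark-0.1-jar-with-dependencies.jar is the one to submit to the Spark cluster.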

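2) Upload the input file to HDFS (assuming the Hadoop cluster is already running)

Create the input file, with words separated by spaces:
    $ vi /opt/input.txt
    This is a book about spark and spark streaming
Upload it to HDFS: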
    $ hadoop fs -put /opt/input.txt /user/spark/input.txt

3) Copy the jar to the master node and run the Spark job:

Submit the Spark job:
    $ /opt/spark-2.4.3-bin-hadoop2.7/bin/spark-submit \
      --class sparkstreaming_action.wordfreq.WordFreq \
      /opt/wordFreqFileSpark-0.1-jar-with-dependencies.jar
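    Note 1: a trailing "\" is a line continuation; the command continues on the next line.
    Note 2: the jar to be submitted is placed in /opt/.
After a successful run, the following counts are printed to the screen: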
    (is,1)
    (This,1)
    (streaming,1)
    (book,1)
    (spark,2)
    (about,1)
    (a,1)
    (and,1)
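
The pairs above are printed in no particular order. If a sorted listing is preferred, the collected pairs can be sorted on the driver before printing. The sketch below is an assumption layered on the article's program, not part of it: SortCounts and sortByCount are hypothetical names, and the input sequence simply repeats the output shown above.

```scala
// Hypothetical post-processing step: order the collected (word, count) pairs.
object SortCounts {
  // Highest count first; ties are broken by the word itself.
  def sortByCount(counts: Seq[(String, Int)]): Seq[(String, Int)] =
    counts.sortBy { case (word, n) => (-n, word) }

  def main(args: Array[String]): Unit = {
    // Same pairs as the job output above, standing in for wcData.collect()
    val collected = Seq(
      ("is", 1), ("This", 1), ("streaming", 1), ("book", 1),
      ("spark", 2), ("about", 1), ("a", 1), ("and", 1)
    )
    sortByCount(collected).foreach(println)
  }
}
```

Sorting after collect() is reasonable here because the result is tiny; for large results, sorting the RDD itself before collecting would be the more usual choice.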