Maven Dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.4</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <!-- Packaging plugin for a Scala project: builds the Spark application jar to run on the standalone cluster -->
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>4.0.1</version>
            <executions>
                <execution>
                    <id>scala-compile-first</id>
                    <phase>process-resources</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Writing the WordCount Application
package com.baizhi

import org.apache.spark.{SparkConf, SparkContext}

/**
 * A Spark application
 */
object WordCountApplication {
  def main(args: Array[String]): Unit = {
    // 1. Initialize SparkConf and SparkContext
    val conf = new SparkConf().setAppName("wordCount").setMaster("spark://Spark:7077")
    val sc = new SparkContext(conf)

    // 2. Batch-process the dataset
    sc
      .textFile("hdfs://Spark:9000/test.txt")
      .flatMap(_.split(" "))      // split each line into words
      .map((_, 1))
      .groupBy(_._1)
      .map(t => (t._1, t._2.size))
      .sortBy(_._2, false, 1)     // sort by word count, descending
      .saveAsTextFile("hdfs://Spark:9000/result2")

    // 3. Release resources
    sc.stop()
  }
}
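The groupBy/map pair above materializes every word's full group before counting. A common alternative is reduceByKey, which combines counts per partition before the shuffle. A minimal sketch of the same pipeline, assuming the same input path and master URL; the output path /result2-reduce is hypothetical to avoid colliding with the existing /result2:

package com.baizhi

import org.apache.spark.{SparkConf, SparkContext}

/**
 * WordCount variant using reduceByKey instead of groupBy + size.
 */
object WordCountReduceByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordCount").setMaster("spark://Spark:7077")
    val sc = new SparkContext(conf)

    sc.textFile("hdfs://Spark:9000/test.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)                            // sum the per-word counts
      .sortBy(_._2, false, 1)                        // sort by count, descending
      .saveAsTextFile("hdfs://Spark:9000/result2-reduce") // hypothetical output path

    sc.stop()
  }
}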
Package the project
Upload the jar to the Linux server
Submit the jar to run on the Standalone cluster
[root@Spark spark-2.4.4]# bin/spark-submit --master spark://Spark:7077 --class com.baizhi.WordCountApplication --total-executor-cores 4 /root/spark-day1-1.0-SNAPSHOT.jar
[root@Spark spark-2.4.4]# hdfs dfs -cat /result2/*
(Hello,4)
(Kafka,2)
(World,1)
(Scala,1)
(Spark,1)
(Good,1)
Local Simulation for Development and Testing
package com.baizhi

import org.apache.spark.{SparkConf, SparkContext}

/**
 * A Spark application (for development and testing)
 */
object WordCountApplication2 {
  def main(args: Array[String]): Unit = {
    // 1. Initialize SparkConf and SparkContext
    val conf = new SparkConf().setAppName("wordCount").setMaster("local[*]") // local simulation; * uses all available CPU cores
    val sc = new SparkContext(conf)

    // 2. Batch-process the dataset
    sc
      .textFile("hdfs://Spark:9000/test.txt")
      .flatMap(_.split(" "))      // split each line into words
      .map((_, 1))
      .groupBy(_._1)
      .map(t => (t._1, t._2.size))
      .sortBy(_._2, false, 1)     // sort by word count, descending
      .saveAsTextFile("hdfs://Spark:9000/result3")

    // 3. Release resources
    sc.stop()
  }
}
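For a quick smoke test that needs neither HDFS nor the cluster, the same pipeline can read a local file and print the result to the console instead of saving it. A minimal sketch; the input path data/test.txt is only an illustrative assumption:

package com.baizhi

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Purely local smoke test: reads a local file and prints the counts,
 * so no HDFS or standalone cluster is required.
 */
object WordCountLocalTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    sc.textFile("data/test.txt")   // hypothetical local input path
      .flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(_._1)
      .map(t => (t._1, t._2.size))
      .sortBy(_._2, false, 1)
      .collect()                   // test data is small, safe to bring to the driver
      .foreach(println)

    sc.stop()
  }
}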
Note:
- Configure the IP mapping for the Spark host in the Windows hosts file
- To resolve the HDFS write permission problem, add the following JVM option:
-DHADOOP_USER_NAME=root
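Instead of adding the JVM option to the IDE run configuration, the Hadoop user can also be set programmatically before the SparkContext (and hence the HDFS client) is created. A minimal sketch of that alternative:

// Equivalent to -DHADOOP_USER_NAME=root; must run before new SparkContext(conf)
System.setProperty("HADOOP_USER_NAME", "root")

val conf = new SparkConf().setAppName("wordCount").setMaster("local[*]")
val sc = new SparkContext(conf)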