Create a new Maven project in IDEA and configure the Scala environment
File -> Project Structure -> Modules, and add the Scala SDK as a dependency library.
Configure pom.xml:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.3.0</version>
</dependency>
<!-- Required only if you need to access files on HDFS -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.3.1</version>
</dependency>
WordCount with Spark
package com.hjt.yxh.hw

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf and set the master and application name
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("WordCount")

    // 2. Create the SparkContext
    val sparkContext = new SparkContext(conf)

    // Read the input file (use an HDFS path such as the commented one to read from HDFS)
    // val inpath: String = "hdfs://k8s-node8:8020/root/data/wordcount.txt"
    val dataRdd1: RDD[String] = sparkContext.textFile("D:\\java_workspace\\BigData\\Spark\\SparkApp\\SparkLearn\\src\\main\\resources\\test.txt")

    // Split each line into words and drop empty strings
    val dataRdd2: RDD[String] = dataRdd1.flatMap(data => {
      data.split(" ")
    }).filter(_.nonEmpty)

    // Map each word to a (word, 1) pair
    val rdd3: RDD[(String, Int)] = dataRdd2.map(data => {
      (data, 1)
    })

    // Aggregate the counts per word (reduceByKey would be the more efficient alternative)
    // val result = rdd3.reduceByKey((val1: Int, val2: Int) => {
    //   val1 + val2
    // })
    val result = rdd3.groupByKey().map(data => {
      (data._1, data._2.sum)
    })

    result.foreach(println)
    sparkContext.stop()
  }
}
Once the code is written, you can run it directly in IDEA and see the word counts in the console.
Tip: to submit the job to a Spark cluster, first package it into a jar and submit it with spark-submit. In that case conf.setMaster() must no longer be hard-coded to "local".
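The packaging step itself is just a Maven build (a sketch assuming the standard Maven project layout and the build plugins shown at the end of this post; the jar is written to target/):
mvn clean package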
-- Submit the job in client mode
spark-submit --master spark://k8s-node3:6000 --class com.hjt.yxh.hw.transmate.WordCount ./SparkApp-1.0-SNAPSHOT.jar
-- Submit the job in cluster mode
spark-submit --master spark://k8s-node3:6000 --deploy-mode cluster --class com.hjt.yxh.hw.transmate.WordCount ./SparkApp-1.0-SNAPSHOT.jar
Summary
The workflow for developing a Spark application:
- 1. Create a SparkConf.
- 2. Set the Spark master. There are two options:
- run locally in IDEA: conf.setMaster("local")
- submit to a cluster: conf.setMaster("spark://k8s-node3:6000") (a minimal sketch follows this list)
- 3. Set the AppName; this is mandatory, otherwise the application will not start.
- 4. Create a SparkContext.
- 5. Apply the operator operations (note that many operators are lazily evaluated, so an application needs at least one action operator).
- 6. Stop the SparkContext.
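A minimal sketch of the two master settings, reusing the example cluster address from above (the object name SubmitModeSketch is just illustrative; adjust the URL and app name to your environment):
import org.apache.spark.{SparkConf, SparkContext}

object SubmitModeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // .setMaster("local")                 // option 1: run locally inside IDEA
      .setMaster("spark://k8s-node3:6000")   // option 2: submit to the standalone cluster
      .setAppName("WordCount")               // mandatory, the app will not start without it
    val sc = new SparkContext(conf)
    // ... transformations plus at least one action ...
    sc.stop()
  }
}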
Problems you may run into with Scala
If you get a "Failed to load class" error:
22/07/11 19:18:49 WARN DependencyUtils: Local jar /home/software/spark-3.3.0-bin-hadoop3/SparkApp-1.0-SNAPSHOT2.jar does not exist, skipping.
Error: Failed to load class com.hjt.yxh.hw.transmate.WordCount.
22/07/11 19:18:49 INFO ShutdownHookManager: Shutdown hook called
22/07/11 19:18:49 INFO ShutdownHookManager: Deleting directory /tmp/spark-ac3b6c16-f668-493c-ae7b-4042bcf55e02
[root@k8s-node8 spark-3.3.0-bin-hadoop3]#
This is usually because no class files were generated from the Scala sources when the jar was built. In that case, add the compiler plugin to pom.xml.
pom.xml:
<build>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.4.6</version>
            <executions>
                <execution>
                    <!-- bind the Scala compiler to Maven's compile phase -->
                    <goals>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
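After adding the plugins, rebuild and optionally check that the compiled class really ended up in the jar (the jar name below assumes the SparkApp artifactId used earlier; the assembly plugin appends the jar-with-dependencies suffix):
mvn clean package
jar tf target/SparkApp-1.0-SNAPSHOT-jar-with-dependencies.jar | grep WordCount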