Spark job submission and deployment
Maven development environment in IDEA
Just create a normal Maven project; the Maven quick-start archetype is fine.
The POM then needs the spark-core dependency, plus a Maven plugin that compiles and packages the Scala code.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>xcewkk</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>xcewkk</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <!-- Spark core, built for Scala 2.11 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.3.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- Java compiler settings -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- Compiles the Scala sources so they end up in the packaged jar -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Scala plugin
First install the Scala plugin from the IDEA plugin marketplace, then restart the IDE.
IDE project structure
Under Project Structure > Global Libraries, add a Scala SDK (it can also be downloaded from there).
Also add Scala under Project Settings > Libraries.
Writing the Scala code
package test

import org.apache.spark.{SparkConf, SparkContext}

/**
 * @author: xuanchenwei
 * @create: 2022-11-01 14:53
 * @Description:
 */
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "/xcw/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    // Count the lines containing the letters "a" and "b"
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    sc.stop()
  }
}
In Scala the entry point lives in an object (a singleton), not in a class: the object defines the main method. (The Spark documentation recommends defining main explicitly rather than extending scala.App, which may not work correctly.)
The path resolves to the configured default filesystem, here HDFS, so the file has to be put onto HDFS beforehand (for example with hdfs dfs -put).
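If you prefer not to depend on the default-filesystem setting, the scheme can be written into the path itself; a small sketch (the paths below are examples only):

// Explicit URI schemes remove any ambiguity about which filesystem is meant
val hdfsData  = sc.textFile("hdfs:///xcw/README.md")      // always read from HDFS
val localData = sc.textFile("file:///data/xcw/README.md") // always read from the local filesystem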
If the master is missing from the SparkConf, the error message asks you to set a master URL. That URL is where the run mode goes: local (or local[*]) for local mode, spark://<host>:7077 for a standalone cluster, or yarn for YARN; client versus cluster is the deploy mode and is chosen separately at submit time.
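Because a value set directly on the SparkConf overrides the --master flag of spark-submit, a common pattern is to leave setMaster out of the code and decide the mode at submit time. A minimal sketch of that variant (SimpleApp2 and the args handling are illustrative, not part of the original project):

package test

import org.apache.spark.{SparkConf, SparkContext}

// Same example, but the master comes from spark-submit (--master local,
// --master spark://<host>:7077, ...) and the input path from args(0).
object SimpleApp2 {
  def main(args: Array[String]): Unit = {
    val logFile = if (args.nonEmpty) args(0) else "/xcw/README.md"
    val conf = new SparkConf().setAppName("Simple Application") // no setMaster here
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(_.contains("a")).count()
    val numBs = logData.filter(_.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    sc.stop()
  }
}

With this form the same jar can be submitted unchanged in local mode or against a cluster.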
Packaging the Scala code
Make sure the Scala build plugin (the scala-maven-plugin above) is configured, otherwise the submitted jar fails with a class-not-found error even though the package.class name is correct, because the Scala classes were never compiled into the jar.
In the Maven tool window run clean first and then package (the same as mvn clean package on the command line). The built jar appears under the target folder in the project tree; upload that jar to the Linux machine.
spark-submit
./spark-submit --class test.SimpleApp --master local /data/xcw/xcewkk-1.0-SNAPSHOT.jar
Whether the jar was written in Scala or Java, it is submitted with this same command; PySpark applications also go through spark-submit, just with the .py file in place of the jar.
--master selects where the application runs: local means local single-machine mode (local[*] uses all local cores), spark://<host>:7077 points at a standalone cluster, where 7077 is the standalone master's default port, and yarn submits to a YARN cluster. Whether the driver itself runs on the cluster is controlled separately with --deploy-mode (client or cluster), not by the master URL.
--class names the entry class of the application, given as package name + class name (here test.SimpleApp).
The final argument is the path to the jar itself.