How to write a Spark program in IDEA and submit it to a cluster
1. Install the Scala SDK
1. Download the Scala installer
# Download page
https://www.scala-lang.org/download
# Windows installer
https://downloads.lightbend.com/scala/2.13.1/scala-2.13.1.msi
2. Configure the environment variables
- Add a new system variable SCALA_HOME pointing to the Scala install directory
- Append %SCALA_HOME%\bin; to the PATH variable
- Configure CLASSPATH
- Open cmd and run scala: if version output like the sketch below appears, the installation succeeded
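A minimal check, assuming scala is now on the PATH: start the REPL from cmd and print the version string shipped with the standard library.

// In a cmd window, run `scala` to open the REPL, then evaluate:
println(util.Properties.versionString)  // prints e.g. "version 2.13.1", matching the SDK installed above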
2. Configuration in IDEA
1. Install the Scala plugin
2. Create a Maven project
3. After the project is created, right-click the project name
3. Add Scala framework support
4. Import the Maven dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>cn.itcast</groupId>
    <artifactId>spark</artifactId>
    <version>0.1.0</version>

    <properties>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.2.0</spark.version>
        <slf4j.version>1.7.16</slf4j.version>
        <log4j.version>1.2.17</log4j.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>jcl-over-slf4j</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>${log4j.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.10</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass></mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
5. Select the Scala SDK
6. Create the Scala source directories (src/main/scala and src/test/scala)
Local debugging
- Create a file named wordcount.txt with the following content
spark hello
hadoop hello
java hello
golang hello
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocal {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkContext (local mode with 2 threads)
    val conf = new SparkConf().setMaster("local[2]").setAppName("test")
    val sc: SparkContext = new SparkContext(conf)
    // 2. Read the file and count the words
    val source: RDD[String] = sc.textFile("e:/wordcount.txt", 2)
    val words: RDD[String] = source.flatMap { line => line.split(" ") }
    val wordsTuple: RDD[(String, Int)] = words.map { word => (word, 1) }
    val wordsCount: RDD[(String, Int)] = wordsTuple.reduceByKey { (x, y) => x + y }
    // 3. Print the result
    wordsCount.foreach(println(_))
  }
}
Output:
(spark,1)
(golang,1)
(hadoop,1)
(hello,4)
(java,1)
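The order of the lines above is arbitrary because the RDD is partitioned. If you want the counts sorted, a small variation works; this is just a sketch reusing the wordsCount RDD from the code above:

// Sort by count in descending order, bring the (small) result back to the driver, then print
wordsCount
  .sortBy(pair => pair._2, ascending = false)
  .collect()
  .foreach(println)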
Running on the cluster
1. Upload wordcount.txt to HDFS
hdfs dfs -put wordcount.txt /data
2. Write the code
package scl

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MyRdd {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkContext (no master is set here; it is supplied by spark-submit)
    val conf = new SparkConf().setAppName("word_count")
    val sc = new SparkContext(conf)
    // 2. Load the file
    //    RDD characteristics:
    //    1. An RDD is a data set
    //    2. An RDD is a programming model
    //    3. RDDs depend on one another
    //    4. An RDD can be partitioned
    val rdd1: RDD[String] = sc.textFile("hdfs:///data/wordcount.txt")
    // 3. Process
    //    1. Split each line into words
    val rdd2: RDD[String] = rdd1.flatMap(item => item.split(" "))
    //    2. Map each word to a count of 1
    val rdd3: RDD[(String, Int)] = rdd2.map(item => (item, 1))
    //    3. Aggregate the counts by key
    val rdd4: RDD[(String, Int)] = rdd3.reduceByKey((curr, agg) => curr + agg)
    // 4. Collect and print the result
    val result: Array[(String, Int)] = rdd4.collect()
    result.foreach(item => println(item))
  }
}
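Since the JAR is launched with spark-submit, it is also common to take the input and output paths from the command line instead of hard-coding them. A minimal sketch of that variant (WordCountArgs is a hypothetical class name used only for this example; the paths arrive as the app options of spark-submit):

package scl

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical variant: spark-submit ... --class scl.WordCountArgs <jar> <input path> <output path>
object WordCountArgs {
  def main(args: Array[String]): Unit = {
    val Array(inputPath, outputPath) = args          // expects exactly two arguments
    val sc = new SparkContext(new SparkConf().setAppName("word_count"))
    sc.textFile(inputPath)
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(outputPath)                    // writes part files into the output directory
    sc.stop()
  }
}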
3. The spark-submit command
spark-submit [options] <app jar> <app options>
- app jar: the application JAR
- app options: the arguments passed to the application's main method
- options: options for submitting the application, which include the following

Available options:

Parameter | Description |
---|---|
`--master <url>` | Same as the Spark shell's master: a spark, yarn, mesos, kubernetes, etc. URL |
`--deploy-mode <client or cluster>` | Where the Driver runs: client (locally) or cluster (on a Worker) |
`--class <class full name>` | The class inside the JAR that is the program entry point |
`--jars <dependencies path>` | Location of dependency JARs |
`--driver-memory <memory size>` | Memory for the Driver process, default 1G (1024M) |
`--executor-memory <memory size>` | Memory for each Executor, default 1G |
4. Submit to a Spark Standalone cluster
1. Package the project (e.g. mvn package)
2. The generated JAR file (under the target directory)
3. Go to the bin folder of the Spark installation directory and run spark-submit
cd /usr/local/spark/bin
# --master : the master URL of the cluster to submit to
# --class  : the fully qualified name of the main class in the JAR
# the last argument is the JAR to run
spark-submit --master spark://zhen:7077 \
  --class scl.MyRdd \
  /home/spark/original-lmspark-1.0-SNAPSHOT.jar
4. On the Spark web UI you can see the job running
5. Results
Wrapping up
That's all for today. Feel free to add me on WeChat so we can improve together, and if this helped, a tip is appreciated.