1. The Scala program
A WordCount program written in IDEA (take the input and output paths as arguments instead of hard-coding them):
package com.atguigu

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. Create the configuration
    val conf = new SparkConf().setAppName("wordcount")
    // 2. Create the SparkContext
    val sc = new SparkContext(conf)
    // 3. Processing logic
    val lines = sc.textFile(args(0))        // args(0): input path
    val words = lines.flatMap(_.split(" "))
    val k2v = words.map((_, 1))
    val result = k2v.reduceByKey(_ + _)
    result.saveAsTextFile(args(1))          // args(1): output path
    // 4. Shut down the connection
    sc.stop()
  }
}
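For quick debugging inside IDEA before packaging, a minimal local-run sketch can set the master explicitly. The object name WordCountLocal and the use of collect are my own additions for illustration; on the cluster, leave the master to spark-submit as above:

package com.atguigu

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local variant for IDE debugging only; not the class that gets packaged.
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process with as many threads as there are cores
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()            // small test data only; pulls results to the driver
      .foreach(println)     // print to the console instead of writing to HDFS
    sc.stop()
  }
}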
2. Packaging the program
After running maven-package, use the jar without bundled dependencies (sparkcore-1.0-SNAPSHOT.jar). It is much smaller, and the Spark cluster already provides the dependency environment.
The first time you package, Maven has to download a number of artifacts, so it takes a while; later builds are much faster.
This environment configuration should have been done before writing the Scala program. Versions vary from one environment to another; here is mine:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.atguigu</groupId>
    <artifactId>sparkde</artifactId>
    <packaging>pom</packaging>
    <version>1.0-SNAPSHOT</version>
    <modules>
        <module>sparkcore</module>
    </modules>

    <!-- Everything from here down is what you need to configure yourself -->
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.0.0</version>
                <configuration>
                    <archive>
                        <manifest>
                            <!-- Change this to the main class of your own Scala program:
                                 in IDEA, right-click the WordCount object, choose
                                 Copy Reference, and paste it here. After packaging,
                                 you can unzip the jar to inspect its structure. -->
                            <mainClass>com.atguigu.WordCount</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
3. Testing on the Spark cluster
Submit command:
# The jar is the one packaged above; the last two arguments are the
# input (args(0)) and output (args(1)) of the Scala program.
# Note: the output directory must not already exist on HDFS, or the job fails.
# (Inline comments after a trailing backslash would break line continuation,
# so they go up here instead.)
bin/spark-submit \
--class com.atguigu.WordCount \
--master spark://mini3:7077 \
/mnt/hgfs/share_folder/sparkcore-1.0-SNAPSHOT.jar \
hdfs://mini3:9000/RELEASE \
hdfs://mini3:9000/out3
View the result:
hadoop fs -cat /out3/part*
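Each line of the part files is the string form of a (word, count) tuple, e.g. (hello,2). You can also read the result back from spark-shell; a minimal sketch, assuming the same HDFS path (sc is pre-created by the shell):

// Read the saved result back and print it to the console
sc.textFile("hdfs://mini3:9000/out3").collect().foreach(println)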