Spark[02] Submitting a WordCount Program from IDEA to the Spark Cluster
Preparing the Environment
Resource list
Software/Tool | Version |
---|---|
VMware | VMware® Workstation 16 Pro |
Xshell | 6 |
FileZilla | 3.7.3 |
IDEA | 2020.2 |
Key details of the virtual machines are as follows:
No. | Hostname | Domain Name | IP Address |
---|---|---|---|
① | Toozky | Toozky | 192.168.64.220 |
② | Toozky2 | Toozky2 | 192.168.64.221 |
③ | Toozky3 | Toozky3 | 192.168.64.222 |
Prepare the virtual machines and set up the Hadoop 2.0 environment: install Hadoop 2.0 and ZooKeeper.
See: Hadoop[03] Starting DFS and Zookeeper (Hadoop 2.0)
Set up time synchronization.
See: Linux (CentOS 6) Synchronizing the System Clock over the Network
Configure the Spark cluster.
See: Spark[01] Installing and Configuring a Spark Cluster
Connect to the virtual machines with Xshell and start zkServer, HDFS, and the Spark cluster.
Write several test-data .txt files (any file names will do); they will be uploaded to the /input/datas directory in HDFS.
We use 1.txt and 2.txt as examples.
Upload the edited text files to the /root directory of virtual machine ① with FileZilla.
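If you prefer to generate the sample files programmatically, here is a minimal sketch; the word contents below are invented for illustration, and any whitespace-separated text works just as well.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class MakeTestData {
    public static void main(String[] args) throws IOException {
        // Sample contents are made up; any space-separated words will do.
        Path p1 = Paths.get("1.txt");
        Path p2 = Paths.get("2.txt");
        Files.write(p1, Arrays.asList("hello spark", "hello hadoop"));
        Files.write(p2, Arrays.asList("spark on yarn", "hello world"));
        System.out.println("wrote " + p1 + " and " + p2);
    }
}
```

The files land in the working directory and can then be uploaded to virtual machine ① as described above.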
Virtual machine ①
Upload the test data to the /input/datas directory in HDFS:
cd
ln -sf /root/hadoop-2.6.5/bin/hadoop /root/hadoop-2.6.5/sbin/hadoop
hadoop dfs -mkdir /input
hadoop dfs -mkdir /input/datas
hadoop dfs -put 1.txt /input/datas
hadoop dfs -put 2.txt /input/datas
Create the result directory /output:
hadoop dfs -mkdir /output
Create a plain Maven project named WordCountDemo in IDEA
WordCountDemo
pom.xml
Add the following dependencies to pom.xml:
<properties>
    <spark.version>2.1.1</spark.version>
    <hadoop.version>2.6.5</hadoop.version>
    <scala.version>2.11.8</scala.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>
The build tag specifies the name of the generated jar (the code later references it as target/WordCount.jar via setJars):
<build>
    <finalName>WordCount</finalName>
</build>
log4j.properties
Create log4j.properties in the resources directory:
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
TestWordCount
How it works
① Read the files (textFile) → ② split lines into words (flatMap) → ③ map each word to a count of 1 (mapToPair) → ④ sum the counts per word (reduceByKey) → ⑤ collect the results (collect) / save them to files (saveAsTextFile)
The schematic below illustrates this.
In the diagram, the light blue and white blocks in the RDD operations represent different data files; if the final result is saved to files, the number of result files matches the number of source files (saveAsTextFile writes one part-file per partition).
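The five steps above can be mirrored with plain Java collections, no cluster required. A minimal sketch of the same flatMap → pair → reduce-by-key logic using java.util.stream (the input lines here are invented for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LocalWordCount {
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // ② split each line into words (flatMap)
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // ③+④ group identical words and sum their occurrences (reduceByKey)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // ① "read" two input files as in-memory lines
        List<String> lines = Arrays.asList("hello spark", "hello hadoop");
        // ⑤ collect and print the result
        count(lines).forEach((w, c) -> System.out.println(w + ": " + c));
    }
}
```

Spark distributes exactly this computation across partitions; the RDD program below follows the same shape with explicit function objects.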
Source code
Create TestWordCount under the java directory.
This example assumes virtual machine ① is the host whose HDFS and Spark roles are in the alive state.
hdfs://Toozky:8020
This address is shown on the web UI home page of the alive HDFS host.
spark://Toozky:7077
This address is shown on the web UI home page of the alive Spark host.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class TestWordCount {
    public static void main(String[] args) {
        System.setProperty("HADOOP_USER_NAME", "root");
        // Create the connection
        SparkConf conf = new SparkConf()
                .setMaster("spark://Toozky:7077")
                .setAppName("TestWordCount")
                .setJars(new String[]{"target/WordCount.jar"});
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Business logic
        // 1. Read the files
        JavaRDD<String> lines = sc.textFile("hdfs://Toozky:8020/input/datas");
        // 2. Split each line into words
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String s) throws Exception {
                List<String> list = new ArrayList<String>();
                String[] arr = s.split(" ");
                for (String s1 : arr) {
                    list.add(s1);
                }
                return list.iterator();
            }
        });
        // 3. Map each word to a (word, 1) pair
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });
        // 4. Sum the counts per word
        JavaPairRDD<String, Integer> wordToCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer integer, Integer integer2) throws Exception {
                return integer + integer2;
            }
        });
        // 5. Collect the results
        List<Tuple2<String, Integer>> arrays = wordToCount.collect();
        System.out.println();
        for (Tuple2<String, Integer> array : arrays) {
            System.out.println(array);
        }
        // 6. Save the output
        wordToCount.saveAsTextFile("hdfs://Toozky:8020/output/wordCount/java/JavaWordCount");
        // Release resources
        sc.close();
    }
}
Running the program
package
In IDEA, open the Maven panel on the right and click Maven → WordCountDemo → Lifecycle → package.
This generates the jar package.
Run the main method.
The console prints the computed results.
The output directory appears in HDFS.
Download the result files to view them.
On the alive Spark master's web UI,
a record of the computation appears under Completed Applications.
Its Name is the TestWordCount set in the code via setAppName("TestWordCount").
That's all for this post. I hope we can learn from each other and keep improving together!