This article and this one are companion pieces introducing Spark, and they come with source code. First, the problem domain they describe:
I have two datasets:
- User information (id, email, language, location)
- Transaction information (transaction-id, product-id, user-id, purchase-amount, item-description)
Given these datasets, I want to find the number of unique locations in which each product has been sold. To do that, I need to join the two datasets together.
In plain terms, there are two text files:
- The first holds user information; each line contains a user ID, email, language code, and country code.
- The second holds transaction information; each line contains a transaction ID, product ID, user ID, amount, and description.
Given these two files, we want to know in how many distinct locations each product has been sold.
After downloading the source code, I ran D:\Dump\SparkDB\hadoop-framework-examples\spark\src\test\java\com\matthewrathbone\sparktest\SparkJavaJoinsTest.java without thinking too hard, and it executed successfully. Then I looked at the core processing logic in D:\Dump\SparkDB\hadoop-framework-examples\spark\src\main\java\com\matthewrathbone\sparktest\ExampleJob.java and was a bit lost. In essence, it expresses Scala-style idioms in Java syntax, which makes it hard to follow at first.
Before digging into Spark RDDs proper, here is an even simpler demo than the example above, to build some intuition for JavaRDD and JavaPairRDD.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;
import java.util.Arrays;
/**
* Adapted from: https://blog.csdn.net/qq_37469055/article/details/86593803
*/
public class JavaRDDToJavaPairRDD {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local");
// This line is quite slow; by rough observation it takes about 10 seconds
JavaSparkContext sc = new JavaSparkContext(conf);
/**
* Imagine a text file with 4 lines, converted into a JavaRDD:
* 1 语文
* 2 数学
* 3 英语
* 4 政治
*/
JavaRDD<String> txtFile2JavaRDD = sc.parallelize(Arrays.asList("1 语文", "2 数学", "3 英语", "4 政治"));
// Iterate over the JavaRDD and print each element
txtFile2JavaRDD.foreach(new VoidFunction<String>() {
public void call(String num) throws Exception {
System.out.println("每行内容:" + num);
}
});
// Convert the JavaRDD into a JavaPairRDD
JavaPairRDD<String,String> javaPairRDD = txtFile2JavaRDD.mapToPair(new PairFunction<String, String, String>() {
public Tuple2<String, String> call(String s) throws Exception {
System.out.println("###### 将一行内容封装为一个元组对象");
return new Tuple2<String, String>(s.split(" ")[0],s.split(" ")[1]);
}
});
// At this point the log does not yet show "wrapping one line into a Tuple2";
// the call(...) above behaves like a callback and is only invoked later, when an action runs
System.out.println("JavaRDD converted into JavaPairRDD via mapToPair(...)");
// Iterate over the JavaPairRDD and print each tuple
javaPairRDD.foreach(new VoidFunction<Tuple2<String, String>>() {
public void call(Tuple2<String, String> t) throws Exception {
System.out.println(t);
}
});
/**
* Convert the JavaPairRDD back into a JavaRDD.
* Use t._1 to access the first element of the tuple, t._2 the second, and so on.
*/
JavaRDD<String> javaRdd = javaPairRDD.map(new Function<Tuple2<String, String>, String>() {
public String call(Tuple2<String, String> t) throws Exception {
System.out.println("###### 将元组对象拆开并拼接为一个字符串:" + t);
System.out.println("第一个参数是:" + t._1);
System.out.println("第二个参数是:" + t._2);
return t._1 + "拼接" + t._2;
}
});
// Again, the log does not yet show "unpacking the tuple and joining it into one string";
// the call(...) above is likewise only invoked once an action runs
// Iterate over the resulting JavaRDD and print it
javaRdd.foreach(new VoidFunction<String>(){
public void call(String num) throws Exception {
System.out.println("JavaRDD=" + num);
}
});
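// Shut down the Spark context when done
sc.close();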
}
}
It is quite noticeable here that plain Java syntax and functional-programming style are mixed together.
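If you are on Java 8 or later, the same pipeline can be written with lambda expressions, which reads much closer to the Scala style the original author had in mind. A minimal sketch of the demo above rewritten with lambdas (the class name is my own):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class JavaRDDToJavaPairRDDLambda {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("Simple Application").setMaster("local"));

        JavaRDD<String> lines = sc.parallelize(Arrays.asList("1 语文", "2 数学", "3 英语", "4 政治"));

        // JavaRDD -> JavaPairRDD: the lambda replaces the anonymous PairFunction
        JavaPairRDD<String, String> pairs =
                lines.mapToPair(s -> new Tuple2<String, String>(s.split(" ")[0], s.split(" ")[1]));

        // JavaPairRDD -> JavaRDD: the lambda replaces the anonymous Function
        JavaRDD<String> joined = pairs.map(t -> t._1 + " joined with " + t._2);

        // foreach is the action that finally triggers the two lazy transformations above
        joined.foreach(s -> System.out.println("JavaRDD=" + s));

        sc.close();
    }
}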
Back to the example from the beginning. First configure the program arguments for main() in IDEA (shown below) and run ExampleJob's main() directly; it executes normally and the result shows up in D:\sparkOut\part-00000. The D:\sparkOut folder has to be deleted before re-running.
D:\Dump\SparkDB\hadoop-framework-examples\spark\transactions.txt D:\Dump\SparkDB\hadoop-framework-examples\spark\users.txt D:\sparkOut
But when I followed the approach from my Spark study notes (part 1) -- build a jar first, then run spark-submit from the local command line -- it kept failing with java.lang.ClassNotFoundException, which puzzled me for quite a while.
D:\Dump\SparkDB\hadoop-framework-examples\spark\classes\artifacts\spark_example_jar>dir
2019/11/28 17:37 <DIR> .
2019/11/28 17:37 <DIR> ..
2019/11/28 17:32 74,672,147 spark-example.jar
2019/09/26 19:53 108 transactions.txt
2019/09/26 19:53 80 users.txt
D:\Dump\SparkDB\hadoop-framework-examples\spark\classes\artifacts\spark_example_jar>spark-submit --class com.matthewrathbone.sparktest.ExampleJob --master local ./spark-example.jar ./transactions.txt ./users.txt ./
java.lang.ClassNotFoundException: com.matthewrathbone.sparktest.ExampleJob
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:712)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
D:\Dump\SparkDB\hadoop-framework-examples\spark\classes\artifacts\spark_example_jar>
It was finally solved. The changes are as follows:
- pom.xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
</dependency>

<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <mainClass>com.matthewrathbone.sparktest.ExampleJob</mainClass>
            </manifest>
        </archive>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
</plugin>
- Because the spark-core version changed, the code has to change as well (com.google.common.base.Optional is replaced by org.apache.spark.api.java.Optional):
package com.matthewrathbone.sparktest;
//import com.google.common.base.Optional;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
public class ExampleJob {
private static JavaSparkContext sc;
public ExampleJob(JavaSparkContext sc){
this.sc = sc;
}
public static final PairFunction<Tuple2<Integer, Optional<String>>, Integer, String> KEY_VALUE_PAIRER =
new PairFunction<Tuple2<Integer, Optional<String>>, Integer, String>() {
public Tuple2<Integer, String> call(
Tuple2<Integer, Optional<String>> a) throws Exception {
// a._2.isPresent()
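// Note: leftOuterJoin can yield an absent Optional when a user id from the
// transactions file has no match in the users file; a more defensive version
// would check a._2.isPresent() before calling get(). With the repo's sample
// data every user id matches, so get() is safe here.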
return new Tuple2<Integer, String>(a._1, a._2.get());
}
};
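// Join (userId -> productId) with (userId -> location) on user id, keep only the
// (productId, Optional<location>) value pairs, and de-duplicate them.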
public static JavaRDD<Tuple2<Integer,Optional<String>>> joinData(JavaPairRDD<Integer, Integer> t, JavaPairRDD<Integer, String> u){
JavaRDD<Tuple2<Integer,Optional<String>>> leftJoinOutput = t.leftOuterJoin(u).values().distinct();
return leftJoinOutput;
}
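// Unwrap the Optional: (productId, Optional<location>) becomes (productId, location).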
public static JavaPairRDD<Integer, String> modifyData(JavaRDD<Tuple2<Integer,Optional<String>>> d){
return d.mapToPair(KEY_VALUE_PAIRER);
}
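// Count, for each product id, how many (already de-duplicated) locations remain.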
public static Map<Integer, Long> countData(JavaPairRDD<Integer, String> d){
Map<Integer, Long> result = d.countByKey();
return result;
}
public static JavaPairRDD<String, String> run(String t, String u){
JavaRDD<String> transactionInputFile = sc.textFile(t);
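// Each transactions.txt line is tab-separated: field [2] is the user id, field [1] the product id.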
JavaPairRDD<Integer, Integer> transactionPairs = transactionInputFile.mapToPair(new PairFunction<String, Integer, Integer>() {
public Tuple2<Integer, Integer> call(String s) {
String[] transactionSplit = s.split("\t");
return new Tuple2<Integer, Integer>(Integer.valueOf(transactionSplit[2]), Integer.valueOf(transactionSplit[1]));
}
});
JavaRDD<String> customerInputFile = sc.textFile(u);
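// Each users.txt line is tab-separated: field [0] is the user id, field [3] the location.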
JavaPairRDD<Integer, String> customerPairs = customerInputFile.mapToPair(new PairFunction<String, Integer, String>() {
public Tuple2<Integer, String> call(String s) {
String[] customerSplit = s.split("\t");
return new Tuple2<Integer, String>(Integer.valueOf(customerSplit[0]), customerSplit[3]);
}
});
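// Full pipeline: join on user id, unwrap the Optionals, then count distinct locations per product.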
Map<Integer, Long> result = countData(modifyData(joinData(transactionPairs, customerPairs)));
List<Tuple2<String, String>> output = new ArrayList<>();
for (Entry<Integer, Long> entry : result.entrySet()){
output.add(new Tuple2<>(entry.getKey().toString(), String.valueOf((long)entry.getValue())));
}
JavaPairRDD<String, String> output_rdd = sc.parallelizePairs(output);
return output_rdd;
}
public static void main(String[] args) throws Exception {
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJoins").setMaster("local"));
ExampleJob job = new ExampleJob(sc);
JavaPairRDD<String, String> output_rdd = job.run(args[0], args[1]);
output_rdd.saveAsHadoopFile(args[2], String.class, String.class, TextOutputFormat.class);
sc.close();
}
}
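To see the three static steps in isolation, here is a minimal local sketch with illustrative inline data chosen to mirror the final result below (product 1 sold in US/GB/FR, product 2 in FR). The class name and data are my own; the repo's real test is SparkJavaJoinsTest.java:

package com.matthewrathbone.sparktest;  // hypothetical helper, placed next to ExampleJob for illustration

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;
import java.util.Map;

public class ExampleJobSmokeTest {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("ExampleJobSmokeTest").setMaster("local"));

        // (userId -> productId) pairs, i.e. what run() builds from transactions.txt
        JavaPairRDD<Integer, Integer> transactions = sc.parallelizePairs(Arrays.asList(
                new Tuple2<Integer, Integer>(1, 1),
                new Tuple2<Integer, Integer>(2, 1),
                new Tuple2<Integer, Integer>(3, 1),
                new Tuple2<Integer, Integer>(3, 2)));

        // (userId -> location) pairs, i.e. what run() builds from users.txt
        JavaPairRDD<Integer, String> users = sc.parallelizePairs(Arrays.asList(
                new Tuple2<Integer, String>(1, "US"),
                new Tuple2<Integer, String>(2, "GB"),
                new Tuple2<Integer, String>(3, "FR")));

        // Same three-step pipeline as ExampleJob.run()
        Map<Integer, Long> counts = ExampleJob.countData(
                ExampleJob.modifyData(ExampleJob.joinData(transactions, users)));

        System.out.println(counts);  // expected: {1=3, 2=1}
        sc.close();
    }
}

Run locally, this should print {1=3, 2=1}, matching the part-00000 output at the end of this post.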
- Do not build the jar the way my Spark study notes (part 1) describes. The original author already configured maven-assembly-plugin, so set up a Maven run configuration and execute it; spark-example-1.0-SNAPSHOT-jar-with-dependencies.jar will be generated under target:
mvn clean compile assembly:single
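A quick sanity check before submitting (the usual cause of the earlier java.lang.ClassNotFoundException is that the class never made it into the jar) is to list the jar's contents, for example from the module directory:
jar tf target\spark-example-1.0-SNAPSHOT-jar-with-dependencies.jar | findstr ExampleJob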
- Run it from the command line:
D:\Dump\SparkDB\hadoop-framework-examples\spark\target>spark-submit --class com.matthewrathbone.sparktest.ExampleJob --master local ./spark-example-1.0-SNAPSHOT-jar-with-dependencies.jar ./transactions.txt ./users.txt ./log
It will complain that a temporary file cannot be deleted; that is harmless and does not affect the output. The result can be found in ./log/part-00000. Delete the log folder before re-running.
- The long-awaited result: product 1 was sold in 3 unique locations (US, GB, FR), and product 2 in 1 unique location (FR).
1 3
2 1