1. Official Documentation
1.1 flatMap
<U> JavaRDD<U> flatMap(FlatMapFunction<T,U> f)
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
In other words: the input is an RDD, the output is a new RDD, and the function applied to each element must return a collection of results (an Iterator).
The difference from map: map emits exactly one output element per input element, while flatMap may emit zero or more per input element and flattens them all into a single RDD (this is my own understanding; corrections are welcome).
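The same map/flatMap distinction exists in plain Java Streams, which makes for a dependency-free illustration (this is an analogy in standard Java, not Spark code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapVsFlatMap {
    // map: exactly one output element per input element (here, one list per line)
    static List<List<String>> withMap(List<String> lines) {
        return lines.stream()
                .map(s -> Arrays.asList(s.split(" ")))
                .collect(Collectors.toList());
    }

    // flatMap: each input element expands to zero or more elements,
    // all flattened into one stream
    static List<String> withFlatMap(List<String> lines) {
        return lines.stream()
                .flatMap(s -> Arrays.stream(s.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("i i", "lo lp");
        System.out.println(withMap(lines));     // [[i, i], [lo, lp]] -- still nested
        System.out.println(withFlatMap(lines)); // [i, i, lo, lp] -- flattened
    }
}
```

With map the nesting survives (a list of lists); with flatMap the word lists are merged into one flat sequence, which is exactly why WordCount below uses flatMap to split lines into words.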
1.2 mapToPair
<K2,V2> JavaPairRDD<K2,V2> mapToPair(PairFunction<T,K2,V2> f)
Return a new RDD by applying a function to all elements of this RDD.
In other words: f is called on every element of the RDD; each element of the original RDD has type T, and f transforms it into a key-value pair of type Tuple2<K2,V2>.
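As a dependency-free sketch of the same idea (plain Java, not Spark; Map.Entry stands in for Spark's Tuple2), each input element becomes exactly one key-value pair:

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapToPairSketch {
    // Analogous to mapToPair: each element of type T becomes one (K2, V2) pair.
    // Map.Entry plays the role of Spark's Tuple2 here.
    static List<Map.Entry<String, Integer>> toPairs(List<String> words) {
        return words.stream()
                .<Map.Entry<String, Integer>>map(w -> new AbstractMap.SimpleEntry<>(w, 1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // every word is tagged with the value 1, ready for a per-key reduce
        System.out.println(toPairs(Arrays.asList("lo", "lp", "lo"))); // [lo=1, lp=1, lo=1]
    }
}
```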
1.3 reduceByKey
public JavaPairRDD<K,V> reduceByKey(Partitioner partitioner, Function2<V,V,V> func)
Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
In other words: with reduceByKey, Spark can combine values that share the same key within each partition before any data is shuffled. Note that before pairs are moved across the network, values with the same key on the same machine are already merged (by the lambda passed to reduceByKey); the same lambda is then applied again across partitions to reduce everything down to one final value per key.
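To make the local-merge behaviour concrete, here is a plain-Java simulation (an analogy, not Spark internals): each "partition" is merged locally first, and the per-partition partial results are then merged with the same associative function.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceByKeySketch {
    // Local, combiner-style merge within one "partition":
    // values sharing a key are summed before any data would be shuffled.
    static Map<String, Integer> mergeLocally(List<String[]> pairs) {
        Map<String, Integer> acc = new HashMap<>();
        for (String[] kv : pairs) {
            acc.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return acc;
    }

    // Final reduce: the per-partition partial results are merged
    // with the same associative, commutative function.
    static Map<String, Integer> mergeAll(List<Map<String, Integer>> partials) {
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> part : partials) {
            part.forEach((k, v) -> result.merge(k, v, Integer::sum));
        }
        return result;
    }

    public static void main(String[] args) {
        // two "partitions" of (word, count) pairs
        List<String[]> p1 = Arrays.asList(new String[]{"lo", "1"}, new String[]{"lo", "1"});
        List<String[]> p2 = Arrays.asList(new String[]{"lo", "1"}, new String[]{"i", "1"});
        Map<String, Integer> local1 = mergeLocally(p1); // "lo" is already combined to 2 locally
        Map<String, Integer> local2 = mergeLocally(p2);
        System.out.println(mergeAll(Arrays.asList(local1, local2))); // totals: lo -> 3, i -> 1
    }
}
```

Because the function is associative and commutative, merging locally first and then globally gives the same totals as merging everything at once, while far fewer pairs cross the network.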
2. Hands-on Examples
All the examples below were run in local mode on Windows. My local Hadoop is version 2.7.3, so when configuring the Hadoop environment you need the hadoop.dll that matches hadoop-2.7.3; otherwise you will get an error like:
NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V
2.1 WordCount
Sample code:
static {
    try {
        // use an absolute path to the hadoop.dll file under the Hadoop bin directory
        System.load("D:\\hadoop-2.7.3\\bin\\hadoop.dll");
    } catch (UnsatisfiedLinkError e) {
        System.err.println("Native code library failed to load.\n " + e);
        System.exit(1);
    }
}
public static void main(String[] args) throws Exception {
    if (args.length < 1) {
        System.err.println("Usage: JavaWordCount <file>");
        System.exit(1);
    }
    System.setProperty("HADOOP_USER_NAME", "admin");
    SparkConf conf = new SparkConf().setAppName("Java-Test-WordCount").setMaster("local[*]");
    SparkContext sc = new SparkContext(conf);
    // WordCount demo
    JavaRDD<String> rdd = sc.textFile("D:\\words.txt", 2).toJavaRDD();
    // split each line on spaces and flatten the pieces into one RDD of words
    JavaRDD<String> words = rdd.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
    // tag each word as a key-value pair (word, 1), stored in a Tuple2
    JavaPairRDD<String, Integer> stringIntegerJavaPairRDD = words.mapToPair((PairFunction<String, String, Integer>) t -> new Tuple2<>(t, 1));
    // reduceByKey groups by the Tuple2's key (the word) and sums the values; the result is written with 3 partitions
    stringIntegerJavaPairRDD.reduceByKey((Function2<Integer, Integer, Integer>) (i1, i2) -> i1 + i2, 3).saveAsTextFile("D://result");
    sc.stop();
}
The input file "D:\\words.txt" contains:
i i
lo lp
lo
k m
Because I used numPartitions = 3 for the output, three separate files were produced, with the following contents.
"part-00000":
(i,2)
(lo,2)
"part-00001":
(lp,1)
(m,1)
"part-00002":
(k,1)
2.2 sum
2.2.1 Sum by a single key
Code:
static {
    try {
        // use an absolute path to the hadoop.dll file under the Hadoop bin directory
        System.load("D:\\hadoop-2.7.3\\bin\\hadoop.dll");
    } catch (UnsatisfiedLinkError e) {
        System.err.println("Native code library failed to load.\n " + e);
        System.exit(1);
    }
}
public static void main(String[] args) throws Exception {
    if (args.length < 1) {
        System.err.println("Usage: JavaWordCount <file>");
        System.exit(1);
    }
    System.setProperty("HADOOP_USER_NAME", "admin");
    SparkConf conf = new SparkConf().setAppName("Java-Test-WordCount").setMaster("local[*]");
    SparkContext sc = new SparkContext(conf);
    // sum demo: parse each "key value" line into a Tuple2
    // (mapToPair would also work here, since each line yields exactly one pair)
    JavaPairRDD<String, Integer> stringIntegerJavaPairRDD1 = sc.textFile("D:\\sum.txt", 3).toJavaRDD().flatMapToPair((PairFlatMapFunction<String, String, Integer>) s -> {
        Tuple2<String, Integer> tuple2 = new Tuple2<>(s.split(" ")[0], Integer.parseInt(s.split(" ")[1]));
        return Arrays.asList(tuple2).iterator();
    });
    stringIntegerJavaPairRDD1.reduceByKey((Function2<Integer, Integer, Integer>) (v1, v2) -> v1 + v2, 1).saveAsTextFile("D:\\sumresult");
    sc.stop();
}
The input file "D:\\sum.txt" contains:
word 3
word 4
count 1
count 2
sum 4
sum 5
group 7
by 1
Because I used numPartitions = 1 for the output, there is a single output file.
"part-00000":
(sum,9)
(word,7)
(group,7)
(by,1)
(count,3)
2.2.2 Sum by multiple keys (a composite key)
Code:
MoreDimension.java
import java.io.Serializable;
import java.util.Objects;

// Composite key class: it must be Serializable, and it must override
// equals() and hashCode() so that Spark can hash and compare keys when grouping.
public class MoreDimension implements Serializable {
    private String cityName;
    private String areaName;
    private String schoolName;

    public MoreDimension(String cityName, String areaName, String schoolName) {
        this.cityName = cityName;
        this.areaName = areaName;
        this.schoolName = schoolName;
    }

    public String getCityName() { return cityName; }
    public void setCityName(String cityName) { this.cityName = cityName; }

    public String getAreaName() { return areaName; }
    public void setAreaName(String areaName) { this.areaName = areaName; }

    public String getSchoolName() { return schoolName; }
    public void setSchoolName(String schoolName) { this.schoolName = schoolName; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MoreDimension)) return false;
        MoreDimension that = (MoreDimension) o;
        return getCityName().equals(that.getCityName()) &&
                getAreaName().equals(that.getAreaName()) &&
                getSchoolName().equals(that.getSchoolName());
    }

    @Override
    public int hashCode() {
        return Objects.hash(getCityName(), getAreaName(), getSchoolName());
    }

    @Override
    public String toString() {
        return cityName + " " + areaName + " " + schoolName + " ";
    }
}
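The reason equals() and hashCode() matter: reduceByKey hashes and compares keys, so two MoreDimension instances built from the same fields must be treated as the same key. Here is a dependency-free sketch of that behaviour using a smaller, hypothetical two-field key and a HashMap (plain Java, not Spark):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class CompositeKeySketch {
    // Hypothetical two-field key. Like MoreDimension, it overrides equals()
    // and hashCode(); without them, HashMap (and Spark's key hashing) would
    // treat two equal-looking keys as different and never merge their values.
    static final class CityArea {
        final String city;
        final String area;

        CityArea(String city, String area) {
            this.city = city;
            this.area = area;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CityArea)) return false;
            CityArea that = (CityArea) o;
            return city.equals(that.city) && area.equals(that.area);
        }

        @Override
        public int hashCode() {
            return Objects.hash(city, area);
        }

        @Override
        public String toString() {
            return city + " " + area;
        }
    }

    // Sum the counts per composite key, the way reduceByKey sums per MoreDimension.
    // Each line is "city area count".
    static Map<CityArea, Integer> sumByKey(List<String> lines) {
        Map<CityArea, Integer> acc = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split(" ");
            acc.merge(new CityArea(f[0], f[1]), Integer.parseInt(f[2]), Integer::sum);
        }
        return acc;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("NYC A 2", "NYC A 10", "NYC B 3");
        System.out.println(sumByKey(lines)); // totals: NYC A -> 12, NYC B -> 3
    }
}
```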
static {
    try {
        // use an absolute path to the hadoop.dll file under the Hadoop bin directory
        System.load("D:\\hadoop-2.7.3\\bin\\hadoop.dll");
    } catch (UnsatisfiedLinkError e) {
        System.err.println("Native code library failed to load.\n " + e);
        System.exit(1);
    }
}
public static void main(String[] args) throws Exception {
    if (args.length < 1) {
        System.err.println("Usage: JavaWordCount <file>");
        System.exit(1);
    }
    System.setProperty("HADOOP_USER_NAME", "admin");
    SparkConf conf = new SparkConf().setAppName("Java-Test-WordCount").setMaster("local[*]");
    SparkContext sc = new SparkContext(conf);
    // multi-dimension group-by sum demo: each line is "city district school class count";
    // the key keeps city/district/school and deliberately drops the class name,
    // and the value is the count (field index 4)
    JavaPairRDD<MoreDimension, Integer> listIntegerJavaPairRDD = sc.textFile("D:\\summ.txt", 1).toJavaRDD().flatMapToPair((PairFlatMapFunction<String, MoreDimension, Integer>) s -> {
        MoreDimension moreDimension = new MoreDimension(s.split(" ")[0], s.split(" ")[1], s.split(" ")[2]);
        Tuple2<MoreDimension, Integer> moreDimensionIntegerTuple2 = new Tuple2<>(moreDimension, Integer.parseInt(s.split(" ")[4]));
        return Arrays.asList(moreDimensionIntegerTuple2).iterator();
    });
    listIntegerJavaPairRDD.reduceByKey((Function2<Integer, Integer, Integer>) (v1, v2) -> v1 + v2, 1).saveAsTextFile("D:\\summresult");
    sc.stop();
}
The input file "D:\\summ.txt" contains:
南京市 雨花台区 花小 一班 2
南京市 雨花台区 花小 二班 10
南京市 雨花台区 实小 一班 3
南京市 雨花台区 实小 二班 1
Because I used numPartitions = 1, there is a single output file.
"part-00000":
(南京市 雨花台区 实小 ,4)
(南京市 雨花台区 花小 ,12)
Feel free to reach out any time if you have questions or want to discuss.