Spark has a large number of operators. You are not expected to master every one of them, but the common ones should be second nature so you can apply them flexibly, and for kv-format RDDs (Tuple2) you do need to know the full set, otherwise you simply cannot get data processing done. A good way to get there is to find some Spark practice problems online and implement them in Java: write the solution yourself if you can, and if you don't know how, type out someone else's solution line by line until you understand every statement. I'll take this opportunity to share the practice problems I worked through myself.
1 Question 1, as follows:
Given three files, sort their contents (the contents are numbers).
Analysis: the number of files doesn't matter; the key point is the sorting. The operators that come to mind are sortByKey and sortBy. How to choose? You need to know the difference: sortByKey sorts a kv-format RDD by its key, while sortBy takes a function that extracts whatever sort key you want, so it works on a plain RDD as well. Here sortBy is clearly enough; for contrast, a short sortByKey sketch follows the solution.
package com.debug;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

public class UseSortBy {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("number sort");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // textFile accepts a directory, so all files under it are read together
        JavaRDD<String> rdd1 = sc.textFile("/home/cry/myStudyData/numbersort");

        // sortBy extracts the numeric value of each line as the sort key;
        // true = ascending, 1 = number of result partitions
        JavaRDD<String> rdd2 = rdd1.sortBy(new Function<String, Integer>() {
            public Integer call(String num) throws Exception {
                return Integer.parseInt(num);
            }
        }, true, 1);

        rdd2.foreach(new VoidFunction<String>() {
            public void call(String line) throws Exception {
                System.out.println(line);
            }
        });
        sc.stop();
    }
}
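For contrast, here is a minimal sketch of what the sortByKey route would look like. The class name UseSortByKey and the in-memory sample numbers are my own stand-ins for the exercise files; the point is that sortByKey only works on a kv-format RDD, so each plain line first has to be lifted into a Tuple2 with the number as the key.

package com.debug;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class UseSortByKey {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("sortByKey contrast");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // In-memory stand-in for the three input files
        JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList("7", "3", "15", "1"));

        // sortByKey needs a kv-format RDD, so lift each number into the key position
        JavaPairRDD<Integer, String> pairs = rdd1.mapToPair(new PairFunction<String, Integer, String>() {
            public Tuple2<Integer, String> call(String num) throws Exception {
                return new Tuple2<Integer, String>(Integer.parseInt(num), num);
            }
        });

        JavaPairRDD<Integer, String> sorted = pairs.sortByKey(true); // ascending by the numeric key

        for (Tuple2<Integer, String> t : sorted.collect()) {
            System.out.println(t._2);
        }
        sc.stop();
    }
}

Since that extra mapToPair step buys nothing here, sortBy on the plain RDD is the simpler choice.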
2 Question 2, as follows:
Sort first by account, then by amount. The data:
hadoop@apache 200
hive@apache 550
yarn@apache 580
hive@apache 159
hadoop@apache 300
hive@apache 258
hadoop@apache 150
yarn@apache 560
yarn@apache 260
Result: (hadoop@apache,[150,200,300]), (hive@apache,[159,258,550]), ....
Analysis: the data above contains duplicate keys, and the required result is a Tuple2 whose value per key is an array or collection, so grouping by key is the natural first step. The values inside each group come back unordered, so they also have to be sorted (sorting the collection with plain Java is enough). The question further asks for the accounts, i.e. the keys, to be sorted, which is exactly what sortByKey does. To make the file easier to split I replaced the whitespace in my practice data with commas.
package com.debug;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class SortSecond {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("secondary sort");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> rdd1 = sc.textFile("/home/cry/myStudyData/secondsort");

        // Turn each "account,amount" line into an (account, amount) pair
        JavaPairRDD<String, Integer> rdd2 = rdd1.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String line) throws Exception {
                String[] sp = line.split(",");
                return new Tuple2<String, Integer>(sp[0], Integer.parseInt(sp[1]));
            }
        });

        // Group the amounts under each account
        JavaPairRDD<String, Iterable<Integer>> rdd3 = rdd2.groupByKey();

        // The grouped values come back unordered: copy them into a list and sort it
        JavaPairRDD<String, List<Integer>> rdd4 = rdd3.mapValues(new Function<Iterable<Integer>, List<Integer>>() {
            public List<Integer> call(Iterable<Integer> it) throws Exception {
                List<Integer> ls = new ArrayList<Integer>();
                for (Integer v : it) {
                    ls.add(v);
                }
                Collections.sort(ls);
                //Collections.reverse(ls); // uncomment for descending amounts
                return ls;
            }
        });

        // Finally sort by the account, i.e. the key
        JavaPairRDD<String, List<Integer>> rdd5 = rdd4.sortByKey(true);

        rdd5.foreach(new VoidFunction<Tuple2<String, List<Integer>>>() {
            public void call(Tuple2<String, List<Integer>> arg0) throws Exception {
                System.out.println(arg0);
            }
        });
        sc.stop();
    }
}
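A side note rather than part of the original exercise: groupByKey ships every value of a key to one place before anything is sorted. The same result can be produced with aggregateByKey, which builds each key's value list already sorted while aggregating, so the separate sorting mapValues pass disappears. The following is only a sketch under that assumption; the class name SortSecondAgg and the in-memory data are my own stand-ins.

package com.debug;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class SortSecondAgg {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("secondary sort via aggregateByKey");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // In-memory stand-in for the comma-separated input file
        JavaPairRDD<String, Integer> rdd2 = sc.parallelizePairs(Arrays.asList(
                new Tuple2<String, Integer>("hadoop@apache", 200),
                new Tuple2<String, Integer>("hive@apache", 550),
                new Tuple2<String, Integer>("hadoop@apache", 150),
                new Tuple2<String, Integer>("hive@apache", 159)));

        // Keep each key's value list sorted while aggregating,
        // instead of groupByKey followed by a sorting mapValues pass
        JavaPairRDD<String, ArrayList<Integer>> agg = rdd2.aggregateByKey(
                new ArrayList<Integer>(),
                new Function2<ArrayList<Integer>, Integer, ArrayList<Integer>>() {
                    public ArrayList<Integer> call(ArrayList<Integer> acc, Integer v) throws Exception {
                        int pos = Collections.binarySearch(acc, v);
                        acc.add(pos < 0 ? -pos - 1 : pos, v); // insert at the sorted position
                        return acc;
                    }
                },
                new Function2<ArrayList<Integer>, ArrayList<Integer>, ArrayList<Integer>>() {
                    public ArrayList<Integer> call(ArrayList<Integer> a, ArrayList<Integer> b) throws Exception {
                        a.addAll(b);
                        Collections.sort(a); // merge two partition-local sorted lists
                        return a;
                    }
                });

        agg.sortByKey(true).foreach(new VoidFunction<Tuple2<String, ArrayList<Integer>>>() {
            public void call(Tuple2<String, ArrayList<Integer>> t) throws Exception {
                System.out.println(t);
            }
        });
        sc.stop();
    }
}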
3 Question 3
Given the key-value pairs ("spark",2), ("hadoop",6), ("hadoop",4), ("spark",6), where the key is a book title and the value is that book's sales on some day, compute the average value for each key, i.e. the average daily sales of each book.
Analysis: to average each group we need the per-key sum of the values plus the number of occurrences of the key (which equals the number of values), and from those two the average follows. Both running totals can be carried at once by turning each value into a (value, 1) Tuple2 with mapValues and then summing the tuples component-wise with reduceByKey.
package com.debug;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class UseRDDAvg {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("per-key average");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The pairs from the question: (book title, sales on one day)
        List<Tuple2<String, Integer>> arr = Arrays.asList(
                new Tuple2<String, Integer>("spark", 2),
                new Tuple2<String, Integer>("hadoop", 6),
                new Tuple2<String, Integer>("hadoop", 4),
                new Tuple2<String, Integer>("spark", 6)
        );
        JavaPairRDD<String, Integer> rdd = sc.parallelizePairs(arr);

        // Turn each value into (value, 1): the running (sum, count) pair
        JavaPairRDD<String, Tuple2<Integer, Integer>> rdd1 = rdd.mapValues(new Function<Integer, Tuple2<Integer, Integer>>() {
            public Tuple2<Integer, Integer> call(Integer num) throws Exception {
                return new Tuple2<Integer, Integer>(num, 1);
            }
        });

        // Sum the (sum, count) pairs component-wise per key
        JavaPairRDD<String, Tuple2<Integer, Integer>> rdd2 = rdd1.reduceByKey(new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
            public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> tup1, Tuple2<Integer, Integer> tup2)
                    throws Exception {
                return new Tuple2<Integer, Integer>(tup1._1 + tup2._1, tup1._2 + tup2._2);
            }
        });

        // average = sum / count; divide as double so nothing is truncated
        JavaPairRDD<String, Double> rdd3 = rdd2.mapValues(new Function<Tuple2<Integer, Integer>, Double>() {
            public Double call(Tuple2<Integer, Integer> tp) throws Exception {
                return tp._1 / (double) tp._2;
            }
        });

        rdd3.foreach(new VoidFunction<Tuple2<String, Double>>() {
            public void call(Tuple2<String, Double> res) throws Exception {
                System.out.println(res);
            }
        });
        sc.stop();
    }
}
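The mapValues-plus-reduceByKey combination above is one way to carry the (sum, count) pair; combineByKey can express the same idea without the separate mapValues step, because its createCombiner function seeds the accumulator from the first value directly. A minimal sketch under that assumption follows; the class name UseCombineAvg is mine.

package com.debug;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class UseCombineAvg {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("average via combineByKey");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaPairRDD<String, Integer> rdd = sc.parallelizePairs(Arrays.asList(
                new Tuple2<String, Integer>("spark", 2),
                new Tuple2<String, Integer>("hadoop", 6),
                new Tuple2<String, Integer>("hadoop", 4),
                new Tuple2<String, Integer>("spark", 6)));

        // combineByKey carries a (sum, count) accumulator directly:
        // createCombiner seeds it from a key's first value in a partition,
        // mergeValue folds further values in, mergeCombiners joins partitions
        JavaPairRDD<String, Tuple2<Integer, Integer>> sumCount = rdd.combineByKey(
                new Function<Integer, Tuple2<Integer, Integer>>() {
                    public Tuple2<Integer, Integer> call(Integer v) throws Exception {
                        return new Tuple2<Integer, Integer>(v, 1);
                    }
                },
                new Function2<Tuple2<Integer, Integer>, Integer, Tuple2<Integer, Integer>>() {
                    public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> acc, Integer v) throws Exception {
                        return new Tuple2<Integer, Integer>(acc._1 + v, acc._2 + 1);
                    }
                },
                new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
                    public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> a, Tuple2<Integer, Integer> b) throws Exception {
                        return new Tuple2<Integer, Integer>(a._1 + b._1, a._2 + b._2);
                    }
                });

        sumCount.mapValues(new Function<Tuple2<Integer, Integer>, Double>() {
            public Double call(Tuple2<Integer, Integer> tp) throws Exception {
                return tp._1 / (double) tp._2; // sum / count
            }
        }).foreach(new VoidFunction<Tuple2<String, Double>>() {
            public void call(Tuple2<String, Double> res) throws Exception {
                System.out.println(res);
            }
        });
        sc.stop();
    }
}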
4 Question 4
Simulate order statistics: compute the total amount spent by each user and sort by that total in descending order (make up the data yourself).
Here is my data; the columns are user nickname, order time, and order amount:
u1,2018-10-01,20.01
u2,2019-10-05,50.88
u3,2019-04-05,10.65
u1,2019-05-06,5.64
u4,2019-09-10,10.49
u1,2020-02-15,80.45
u2,2019-06-24,30.45
u2,2019-06-28,130.45
Analysis: this is close to the classic word count, with sorting added on top. Since there is no operator that sorts directly by value, first use mapToPair to swap each pair's key and value, sort with sortByKey, and if needed apply mapToPair once more to swap them back; the solution below stops at the swapped form, and a sketch of the swap-back step follows the code.
package com.debug;

import java.math.BigDecimal;
import java.math.RoundingMode;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class RDDExercise1 {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("order statistics");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the order records from the txt file
        JavaRDD<String> lines = sc.textFile("/home/cry/myStudyData/orderTable.txt");

        // Map phase: turn each "user,date,amount" line into a (user, amount) pair
        JavaPairRDD<String, Double> pairWords = lines.mapToPair(new PairFunction<String, String, Double>() {
            public Tuple2<String, Double> call(String line) throws Exception {
                String[] sp = line.split(",");
                return new Tuple2<String, Double>(sp[0], Double.parseDouble(sp[2]));
            }
        });

        // Reduce phase: works like Hadoop's reduce; sum the amounts per user,
        // rounding each running total down to two decimal places via BigDecimal
        JavaPairRDD<String, Double> result = pairWords.reduceByKey(new Function2<Double, Double, Double>() {
            public Double call(Double num1, Double num2) throws Exception {
                BigDecimal b = new BigDecimal(num1 + num2);
                return b.setScale(2, RoundingMode.FLOOR).doubleValue();
            }
        });

        // There is no sort-by-value operator, so swap key and value first
        JavaPairRDD<Double, String> res3 = result.mapToPair(new PairFunction<Tuple2<String, Double>, Double, String>() {
            public Tuple2<Double, String> call(Tuple2<String, Double> tup) throws Exception {
                return new Tuple2<Double, String>(tup._2, tup._1);
            }
        });

        // false = descending by the (amount) key
        JavaPairRDD<Double, String> res4 = res3.sortByKey(false);

        res4.foreach(new VoidFunction<Tuple2<Double, String>>() {
            public void call(Tuple2<Double, String> tuple) throws Exception {
                System.out.println(tuple);
            }
        });
        sc.stop();
    }
}
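The analysis mentioned swapping the key and value back after sorting, but the solution above stops at the swapped (amount, user) form. If the output should read (user, amount) again, the final step could look like the following sketch (the name res5 is mine); it would slot into RDDExercise1 right after res4 is computed:

// Swap (amount, user) back to (user, amount); mapToPair does not reorder
// the data, so the descending-by-amount order of res4 is preserved
JavaPairRDD<String, Double> res5 = res4.mapToPair(new PairFunction<Tuple2<Double, String>, String, Double>() {
    public Tuple2<String, Double> call(Tuple2<Double, String> tup) throws Exception {
        return new Tuple2<String, Double>(tup._2, tup._1);
    }
});

Printing res5 instead of res4 would then produce tuples like (u2,211.78) with the largest totals first.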
When I first started writing this code my head was spinning. The lesson is to learn to decompose: split one big problem into several small ones and deal with them one at a time.