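Spark Hands-On Case Study
Requirement: for each region, compute the top N hosts by gold income for the day. Here N is 5, i.e. Top 5.
1. Data preparation
All data is stored in JSON format, in the following two files:
- video_info.log (host live-stream records)
uid is the host's id, vid is the id of the live room, and area is the region to aggregate by.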
{"uid":"8407173251001","vid":"14943445328940001","area":"US","status":"1"}
{"uid":"8407173251002","vid":"14943445328940002","area":"ID","status":"1"}
{"uid":"8407173251003","vid":"14943445328940003","area":"CN","status":"1"}
{"uid":"8407173251004","vid":"14943445328940004","area":"US","status":"1"}
{"uid":"8407173251005","vid":"14943445328940005","area":"ID","status":"1"}
{"uid":"8407173251006","vid":"14943445328940006","area":"CN","status":"1"}
{"uid":"8407173251007","vid":"14943445328940007","area":"ID","status":"1"}
{"uid":"8407173251008","vid":"14943445328940008","area":"CN","status":"1"}
{"uid":"8407173251009","vid":"14943445328940009","area":"US","status":"1"}
{"uid":"8407173251010","vid":"14943445328940010","area":"ID","status":"1"}
{"uid":"8407173251011","vid":"14943445328940011","area":"CN","status":"1"}
{"uid":"8407173251012","vid":"14943445328940012","area":"US","status":"1"}
{"uid":"8407173251013","vid":"14943445328940013","area":"ID","status":"1"}
{"uid":"8407173251014","vid":"14943445328940014","area":"CN","status":"1"}
{"uid":"8407173251015","vid":"14943445328940015","area":"US","status":"1"}
{"uid":"8407173251001","vid":"14943445328940016","area":"US","status":"1"}
{"uid":"8407173251005","vid":"14943445328940017","area":"ID","status":"1"}
{"uid":"8407173251008","vid":"14943445328940018","area":"CN","status":"1"}
{"uid":"8407173251010","vid":"14943445328940019","area":"ID","status":"1"}
{"uid":"8407173251015","vid":"14943445328940020","area":"US","status":"1"}
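......more fields
- gift_record.log (user gift records)
uid is the id of the user who sent the gift, vid is the live room id, good_id is the gift id, and gold is the number of gold coins the gift is worth.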
{"uid":"7201232141001","vid":"14943445328940001","good_id":"223","gold":"10"}
{"uid":"7201232141002","vid":"14943445328940001","good_id":"223","gold":"20"}
{"uid":"7201232141003","vid":"14943445328940002","good_id":"223","gold":"30"}
{"uid":"7201232141004","vid":"14943445328940002","good_id":"223","gold":"40"}
{"uid":"7201232141005","vid":"14943445328940003","good_id":"223","gold":"50"}
{"uid":"7201232141006","vid":"14943445328940003","good_id":"223","gold":"10"}
{"uid":"7201232141007","vid":"14943445328940004","good_id":"223","gold":"20"}
{"uid":"7201232141008","vid":"14943445328940004","good_id":"223","gold":"30"}
{"uid":"7201232141009","vid":"14943445328940005","good_id":"223","gold":"40"}
{"uid":"7201232141010","vid":"14943445328940005","good_id":"223","gold":"50"}
{"uid":"7201232141011","vid":"14943445328940006","good_id":"223","gold":"10"}
{"uid":"7201232141012","vid":"14943445328940006","good_id":"223","gold":"20"}
{"uid":"7201232141013","vid":"14943445328940007","good_id":"223","gold":"30"}
{"uid":"7201232141014","vid":"14943445328940007","good_id":"223","gold":"40"}
{"uid":"7201232141015","vid":"14943445328940008","good_id":"223","gold":"50"}
{"uid":"7201232141016","vid":"14943445328940008","good_id":"223","gold":"10"}
{"uid":"7201232141017","vid":"14943445328940009","good_id":"223","gold":"20"}
{"uid":"7201232141018","vid":"14943445328940009","good_id":"223","gold":"30"}
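2. Implementation approach
- Use the fastjson library to parse each JSON line of video_info.log and extract the uid, vid, and area fields, arranged as (vid,(uid,area))
- Extract the vid and gold fields from gift_record.log, arranged as (vid,gold)
- Aggregate the gift records, because one live room can receive many gifts and the total per room is needed; the format becomes **(vid,gold_sum)**
- Join the two datasets on vid; the format becomes (vid,((uid,area),gold_sum))
- Use map over the joined data to keep the uid, area, and gold_sum fields
- A host belongs to exactly one region but can stream many times, so aggregate gold_sum with (uid,area) as the key; the format becomes ((uid,area),gold_sum_all)
- Regroup by area; the format becomes (area,(uid,gold_sum_all))
- Use map over each group: sort by gold in descending order, take the first 5, and output area and topN. The topN value is simply the ids and gold totals of the top hosts concatenated into one string
- Use foreach to print the results to the console, with area and topN separated by a colon.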
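To make the data flow of these steps easier to follow, here is a minimal sketch of the same pipeline in plain Java collections, with no Spark and no JSON parsing. It is not the tutorial's implementation: the class `TopNPipelineSketch`, the method `topNByArea`, and the array-based row layout are hypothetical names chosen for illustration, and the tie-break on uid is an added assumption so that the output is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TopNPipelineSketch {

    /**
     * videos rows are {uid, vid, area}; gifts rows are {vid, gold}.
     * Returns area -> "uid:gold,uid:gold,..." with at most n hosts per area.
     */
    static Map<String, String> topNByArea(String[][] videos, String[][] gifts, int n) {
        // (vid, gold) -> (vid, gold_sum)
        Map<String, Integer> goldByVid = new HashMap<>();
        for (String[] g : gifts) {
            goldByVid.merge(g[0], Integer.parseInt(g[1]), Integer::sum);
        }

        // Join on vid, then total per (uid, area), stored as area -> (uid -> gold_sum_all)
        Map<String, Map<String, Integer>> goldByAreaAndUid = new TreeMap<>();
        for (String[] v : videos) {
            Integer goldSum = goldByVid.get(v[1]);
            if (goldSum == null) {
                continue; // a room with no gift records has no join partner
            }
            goldByAreaAndUid.computeIfAbsent(v[2], area -> new HashMap<>())
                    .merge(v[0], goldSum, Integer::sum);
        }

        // Per area: sort by gold descending, keep the first n, build the topN string
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> areaEntry : goldByAreaAndUid.entrySet()) {
            List<Map.Entry<String, Integer>> hosts = new ArrayList<>(areaEntry.getValue().entrySet());
            hosts.sort((a, b) -> {
                int byGold = Integer.compare(b.getValue(), a.getValue());
                // Tie-break on uid: an assumption of this sketch, not part of the requirement
                return byGold != 0 ? byGold : a.getKey().compareTo(b.getKey());
            });
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < Math.min(n, hosts.size()); i++) {
                if (i != 0) {
                    sb.append(",");
                }
                sb.append(hosts.get(i).getKey()).append(":").append(hosts.get(i).getValue());
            }
            result.put(areaEntry.getKey(), sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // A few rows taken from the sample files above
        String[][] videos = {
            {"8407173251001", "14943445328940001", "US"},
            {"8407173251002", "14943445328940002", "ID"},
            {"8407173251004", "14943445328940004", "US"},
        };
        String[][] gifts = {
            {"14943445328940001", "10"}, {"14943445328940001", "20"},
            {"14943445328940002", "30"}, {"14943445328940002", "40"},
            {"14943445328940004", "20"}, {"14943445328940004", "30"},
        };
        System.out.println(topNByArea(videos, gifts, 5));
    }
}
```

Each loop corresponds to one or more of the steps listed above; in the Spark version the same work is expressed as mapToPair, reduceByKey, join, and groupByKey calls on RDDs.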
3. Preparation
- pom file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>db_spark</groupId>
<artifactId>db_spark</artifactId>
<version>1.0-SNAPSHOT</version>
<!-- Spark dependencies -->
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<!-- fastjson dependency -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.68</version>
</dependency>
</dependencies>
<build>
<plugins>
<!-- Java compiler plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<!-- Packaging plugin -->
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass></mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
- Setting up the SparkContext for RDD creation
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("TopNClass").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
4. Java implementation
- Extract the fields and reshape them into Tuple2
//Load the raw data
JavaRDD<String> rdd_video = sc.textFile("F:\\video_info.log");
JavaRDD<String> rdd_gift = sc.textFile("F:\\gift_record.log");
//Reshape rdd_video into the format (vid,(uid,area))
//mapToPair parses the fields of each line and wraps them in a Tuple2
JavaPairRDD<String,Tuple2<String, String>> rdd_video_format = rdd_video.mapToPair(new PairFunction<String, String, Tuple2<String, String>>() {
@Override
public Tuple2<String, Tuple2<String, String>> call(String line) throws Exception {
//Extract the fields with fastjson
JSONObject jsonObj = JSON.parseObject(line);
String vid = jsonObj.getString("vid");
String uid = jsonObj.getString("uid");
String area = jsonObj.getString("area");
return new Tuple2<String, Tuple2<String, String>>(vid,new Tuple2<String, String>(uid,area));
}
});
//Reshape rdd_gift into the format (vid,gold)
JavaPairRDD<String, Integer> rdd_gold = rdd_gift.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String line) throws Exception {
JSONObject jsonObj = JSON.parseObject(line);
String vid =jsonObj.getString("vid");
//gold is the number of gold coins
Integer gold = Integer.parseInt(jsonObj.getString("gold"));
return new Tuple2<String, Integer>(vid,gold);
}
});
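- Aggregate the gift records with reduceByKey, because one live room can receive gifts many times
//Aggregate (vid,gold) by key: gold values that share a vid are summed, giving the format (vid,gold_sum)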
JavaPairRDD<String,Integer> rdd_gold_sum = rdd_gold.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer i1, Integer i2) throws Exception {
return i1+i2;
}
});
- The extracted data is now in two formats, (vid,(uid,area)) and (vid,gold_sum); join them on vid
//rdd_video_format holds the first format, (vid,(uid,area))
//rdd_gold_sum holds the second format, (vid,gold_sum)
//The join result has the format (vid,((uid,area),gold_sum))
rdd_video_format.join(rdd_gold_sum)
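Because `join` on a pair RDD is an inner join, a live room that has no matching gift records does not appear in the result. In the sample rows shown earlier, the rooms with vid ending in 0010 through 0020 have no gift records, so they would drop out at this step. The sketch below illustrates that behaviour with plain Java maps; `InnerJoinSketch` and `join` are hypothetical names, and the string rendering of the joined value is only there to show the shape (vid,((uid,area),gold_sum)).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class InnerJoinSketch {

    /**
     * Joins (vid -> {uid, area}) with (vid -> gold_sum).
     * Only vids present on both sides are kept, as in an inner join.
     */
    static Map<String, String> join(Map<String, String[]> videoByVid,
                                    Map<String, Integer> goldSumByVid) {
        Map<String, String> joined = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> e : videoByVid.entrySet()) {
            Integer goldSum = goldSumByVid.get(e.getKey());
            if (goldSum != null) {
                String[] uidArea = e.getValue();
                // Rendered as ((uid,area),gold_sum) to mirror the joined tuple's shape
                joined.put(e.getKey(), "((" + uidArea[0] + "," + uidArea[1] + ")," + goldSum + ")");
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String[]> videos = new LinkedHashMap<>();
        videos.put("14943445328940001", new String[] {"8407173251001", "US"});
        videos.put("14943445328940016", new String[] {"8407173251001", "US"}); // no gifts in the sample
        Map<String, Integer> goldSums = new LinkedHashMap<>();
        goldSums.put("14943445328940001", 30);
        System.out.println(join(videos, goldSums));
    }
}
```

If hosts without any gifts should still be listed with a total of 0, `leftOuterJoin` on the pair RDD would be the usual alternative; that is a design choice outside what this tutorial implements.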
- From the joined data, take the uid, area, and gold_sum fields and reshape them
//Join the two datasets on vid
//Original format: (vid,((uid,area),gold_sum))
//Converted to: ((uid,area),gold_sum)
//mapToPair is used because the result is a new key-value tuple
JavaPairRDD<Tuple2<String, String>, Integer> rdd_groupPre = rdd_video_format.join(rdd_gold_sum).mapToPair(new PairFunction<Tuple2<String, Tuple2<Tuple2<String, String>, Integer>>, Tuple2<String, String>, Integer>() {
@Override
public Tuple2<Tuple2<String, String>, Integer> call(Tuple2<String, Tuple2<Tuple2<String, String>, Integer>> tup) throws Exception {
//Read uid, area, and gold_sum from the joined tuple
//uid is the first element of the inner (uid,area) pair, which is the first element of the tuple's second element, hence tup._2._1._1
String uid = tup._2._1._1;
String area = tup._2._1._2;
Integer gold_sum = tup._2._2;
return new Tuple2<Tuple2<String, String>, Integer>(new Tuple2<String, String>(uid,area),gold_sum);
}
});
- The format is now ((uid,area),gold_sum). A host belongs to exactly one region, so **(uid,area)** identifies a host uniquely. Because a host can stream several times, use reduceByKey on that key to add up the gold from each stream
//Sum the values with reduceByKey
//After aggregation the format is ((uid,area),gold_sum_all)
JavaPairRDD<Tuple2<String, String>, Integer> rdd_groupby = rdd_groupPre.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer i1, Integer i2) throws Exception {
return i1+i2;
}
});
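Aggregating on a composite key works because two `Tuple2` values with equal elements compare as equal and hash identically. The sketch below shows the same idea in plain Java, using `AbstractMap.SimpleImmutableEntry` as a stand-in for `Tuple2`. The class name `CompositeKeySketch`, the method `sumByUidAndArea`, and the gold values in `main` are made up for illustration.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashMap;
import java.util.Map;

final class CompositeKeySketch {

    /** Sums gold per (uid, area); each row is {uid, area, gold_sum}. */
    static Map<SimpleImmutableEntry<String, String>, Integer> sumByUidAndArea(String[][] rows) {
        Map<SimpleImmutableEntry<String, String>, Integer> totals = new HashMap<>();
        for (String[] r : rows) {
            // Entries with the same uid and area are equal keys, so their gold is added together
            totals.merge(new SimpleImmutableEntry<>(r[0], r[1]), Integer.parseInt(r[2]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical rows: the same host appears twice because of two live sessions
        String[][] rows = {
            {"8407173251001", "US", "30"},
            {"8407173251001", "US", "5"},
            {"8407173251004", "US", "50"},
        };
        System.out.println(sumByUidAndArea(rows));
    }
}
```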
- The format is now ((uid,area),gold_sum_all); re-key the data by area
//Convert ((uid,area),gold_sum_all) ----> (area,(uid,gold_sum_all))
JavaPairRDD<String, Tuple2<String, Integer>> rdd_group = rdd_groupby.mapToPair(new PairFunction<Tuple2<Tuple2<String, String>, Integer>, String, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Tuple2<String, Integer>> call(Tuple2<Tuple2<String, String>, Integer> tup) throws Exception {
String area = tup._1._2;
String uid = tup._1._1;
Integer gold_sum_all = tup._2;
return new Tuple2<String, Tuple2<String, Integer>>(area,new Tuple2<String, Integer>(uid,gold_sum_all));
}
});
- Group the current format (area,(uid,gold_sum_all)) by area
//Group by area
JavaPairRDD<String, Iterable<Tuple2<String, Integer>>> rdd = rdd_group.groupByKey();
- Sort each group and extract the top entries
//Current form: <area,(uid,gold_sum_all)>
//Sort each group
//Result format: (area,topN)
JavaRDD<Tuple2<String, String>> topN = rdd.map(new Function<Tuple2<String, Iterable<Tuple2<String, Integer>>>, Tuple2<String, String>>() {
@Override
public Tuple2<String, String> call(Tuple2<String, Iterable<Tuple2<String, Integer>>> tup) throws Exception {
String area = tup._1;
//Copy the group into a list so that it can be sorted
ArrayList<Tuple2<String, Integer>> array = Lists.newArrayList(tup._2);
Collections.sort(array, new Comparator<Tuple2<String, Integer>>() {
//Descending order by gold
@Override
public int compare(Tuple2<String, Integer> o1, Tuple2<String, Integer> o2) {
return Integer.compare(o2._2, o1._2);
}
});
StringBuffer stringBuffer = new StringBuffer();
for (int i = 0; i < array.size(); i++) {
if (i < 5) {
Tuple2<String, Integer> tup1 = array.get(i);
if (i != 0) {
stringBuffer.append(",");
}
stringBuffer.append(tup1._1 + ":" + tup1._2);
}
}
return new Tuple2<String, String>(area, stringBuffer.toString());
}
});
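The sort-and-truncate logic can be tried on its own, outside Spark. The sketch below is one way to write it with plain Java collections; `TopNFormatSketch` and `topN` are hypothetical names, and `Map.Entry` stands in for `Tuple2`. It compares with `Integer.compare` because a comparator written as a subtraction can overflow when the two values are far apart.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TopNFormatSketch {

    /** Sorts (uid, gold) pairs by gold descending and joins the first n as "uid:gold,uid:gold". */
    static String topN(List<Map.Entry<String, Integer>> hosts, int n) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(hosts); // leave the input untouched
        // Integer.compare avoids the overflow a subtraction-based comparator can hit
        sorted.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(n, sorted.size()); i++) {
            if (i != 0) {
                sb.append(",");
            }
            sb.append(sorted.get(i).getKey()).append(":").append(sorted.get(i).getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> hosts = new ArrayList<>();
        hosts.add(new SimpleImmutableEntry<>("8407173251001", 30));
        hosts.add(new SimpleImmutableEntry<>("8407173251004", 50));
        System.out.println(topN(hosts, 5));
    }
}
```

Bounding the loop with `Math.min(n, sorted.size())` has the same effect as checking `i < 5` inside a loop over the whole list, but stops as soon as enough entries have been appended.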
- Print the result with foreach
//(area ,topN)
topN.foreach(new VoidFunction<Tuple2<String, String>>() {
@Override
public void call(Tuple2<String, String> tup) throws Exception {
System.out.println(tup._1+":"+tup._2);
}
});
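Run result
In each output line, the leading code (for example CN) is the area; the data after it is the ranking of host ids and the gold each host received.
Complete code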
package com.dang.java;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
/**
* @author dang
* @version 3.0
* @description ()
* @date 2022/7/9 10:55
*/
public class TopN {
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("TopN").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
//Load the raw data
JavaRDD<String> rdd_video = sc.textFile("F:\\video_info.log");
JavaRDD<String> rdd_gift = sc.textFile("F:\\gift_record.log");
//Parse the JSON data
//(vid,(uid,area))
JavaPairRDD<String,Tuple2<String, String>> rdd_video_geshi = rdd_video.mapToPair(new PairFunction<String, String, Tuple2<String, String>>() {
@Override
public Tuple2<String, Tuple2<String, String>> call(String line) throws Exception {
JSONObject jsonObj = JSON.parseObject(line);
String vid = jsonObj.getString("vid");
String uid = jsonObj.getString("uid");
String area = jsonObj.getString("area");
return new Tuple2<String, Tuple2<String, String>>(vid,new Tuple2<String, String>(uid,area));
}
});
//(vid,gold)
//mapToPair returns the Tuple2 pairs directly as a JavaPairRDD
JavaPairRDD<String, Integer> rdd_gold = rdd_gift.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String line) throws Exception {
JSONObject jsonObj = JSON.parseObject(line);
String vid =jsonObj.getString("vid");
Integer gold = Integer.parseInt(jsonObj.getString("gold"));
return new Tuple2<String, Integer>(vid,gold);
}
});
//Aggregate the gift records: sum the gold of records that share a vid
//(vid,gold_num)
JavaPairRDD<String,Integer> rdd_gold_num = rdd_gold.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer i1, Integer i2) throws Exception {
return i1+i2;
}
});
//Join the two datasets on vid
//mapToPair is used because the result is a new key-value tuple
JavaPairRDD<Tuple2<String, String>, Integer> rdd_groupbysq = rdd_video_geshi.join(rdd_gold_num).mapToPair(new PairFunction<Tuple2<String, Tuple2<Tuple2<String, String>, Integer>>, Tuple2<String, String>, Integer>() {
@Override
public Tuple2<Tuple2<String, String>, Integer> call(Tuple2<String, Tuple2<Tuple2<String, String>, Integer>> tup) throws Exception {
String uid = tup._2._1._1;
String area = tup._2._1._2;
Integer gooldnum = tup._2._2;
return new Tuple2<Tuple2<String, String>, Integer>(new Tuple2<String, String>(uid,area),gooldnum);
}
});
//Aggregate the data with reduceByKey
JavaPairRDD<Tuple2<String, String>, Integer> rdd_groupby = rdd_groupbysq.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer i1, Integer i2) throws Exception {
return i1+i2;
}
});
JavaPairRDD<String, Tuple2<String, Integer>> rdd_group = rdd_groupby.mapToPair(new PairFunction<Tuple2<Tuple2<String, String>, Integer>, String, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Tuple2<String, Integer>> call(Tuple2<Tuple2<String, String>, Integer> tup) throws Exception {
String area = tup._1._2;
String uid = tup._1._1;
Integer goodnum_all = tup._2;
return new Tuple2<String, Tuple2<String, Integer>>(area,new Tuple2<String, Integer>(uid,goodnum_all));
}
});
//Group by area
JavaPairRDD<String, Iterable<Tuple2<String, Integer>>> rdd = rdd_group.groupByKey();
//Current form: <area,(uid,goodnum_all)>
//Sort each group
//Result format: (area,topN)
JavaRDD<Tuple2<String, String>> topN = rdd.map(new Function<Tuple2<String, Iterable<Tuple2<String, Integer>>>, Tuple2<String, String>>() {
@Override
public Tuple2<String, String> call(Tuple2<String, Iterable<Tuple2<String, Integer>>> tup) throws Exception {
//Sorting logic
String area = tup._1;
ArrayList<Tuple2<String, Integer>> array = Lists.newArrayList(tup._2);
Collections.sort(array, new Comparator<Tuple2<String, Integer>>() {
//Descending order by gold
@Override
public int compare(Tuple2<String, Integer> o1, Tuple2<String, Integer> o2) {
return Integer.compare(o2._2, o1._2);
}
});
StringBuffer stringBuffer = new StringBuffer();
for (int i = 0; i < array.size(); i++) {
if (i < 5) {
Tuple2<String, Integer> tup1 = array.get(i);
if (i != 0) {
stringBuffer.append(",");
}
stringBuffer.append(tup1._1 + ":" + tup1._2);
}
}
return new Tuple2<String, String>(area, stringBuffer.toString());
}
});
//(area ,topN)
topN.foreach(new VoidFunction<Tuple2<String, String>>() {
@Override
public void call(Tuple2<String, String> tup) throws Exception {
System.out.println(tup._1+":"+tup._2);
}
});
}
}