Spark in Practice: Top-N Streamer Statistics

Requirement: for each region, find the Top-N streamers by gold income for the day. This walkthrough keeps the top 5, i.e. Top5.

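1. Data preparation
The data is stored as JSON, one record per line, in the following two files.

  1. video_info.log (streamer live-stream records)
uid is the streamer's id, vid is the id of the live room for that session, and area is the region the statistics are grouped by.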
{"uid":"8407173251001","vid":"14943445328940001","area":"US","status":"1"}
{"uid":"8407173251002","vid":"14943445328940002","area":"ID","status":"1"}
{"uid":"8407173251003","vid":"14943445328940003","area":"CN","status":"1"}
{"uid":"8407173251004","vid":"14943445328940004","area":"US","status":"1"}
{"uid":"8407173251005","vid":"14943445328940005","area":"ID","status":"1"}
{"uid":"8407173251006","vid":"14943445328940006","area":"CN","status":"1"}
{"uid":"8407173251007","vid":"14943445328940007","area":"ID","status":"1"}
{"uid":"8407173251008","vid":"14943445328940008","area":"CN","status":"1"}
{"uid":"8407173251009","vid":"14943445328940009","area":"US","status":"1"}
{"uid":"8407173251010","vid":"14943445328940010","area":"ID","status":"1"}
{"uid":"8407173251011","vid":"14943445328940011","area":"CN","status":"1"}
{"uid":"8407173251012","vid":"14943445328940012","area":"US","status":"1"}
{"uid":"8407173251013","vid":"14943445328940013","area":"ID","status":"1"}
{"uid":"8407173251014","vid":"14943445328940014","area":"CN","status":"1"}
{"uid":"8407173251015","vid":"14943445328940015","area":"US","status":"1"}
{"uid":"8407173251001","vid":"14943445328940016","area":"US","status":"1"}
{"uid":"8407173251005","vid":"14943445328940017","area":"ID","status":"1"}
{"uid":"8407173251008","vid":"14943445328940018","area":"CN","status":"1"}
{"uid":"8407173251010","vid":"14943445328940019","area":"ID","status":"1"}
{"uid":"8407173251015","vid":"14943445328940020","area":"US","status":"1"}
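...... (more fields omitted)
  2. gift_record.log (user gift records)
uid is the id of the user who sent the gift (not the streamer), vid is the live room id, good_id is the gift id, and gold is the number of gold coins the gift is worth.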
{"uid":"7201232141001","vid":"14943445328940001","good_id":"223","gold":"10"}
{"uid":"7201232141002","vid":"14943445328940001","good_id":"223","gold":"20"}
{"uid":"7201232141003","vid":"14943445328940002","good_id":"223","gold":"30"}
{"uid":"7201232141004","vid":"14943445328940002","good_id":"223","gold":"40"}
{"uid":"7201232141005","vid":"14943445328940003","good_id":"223","gold":"50"}
{"uid":"7201232141006","vid":"14943445328940003","good_id":"223","gold":"10"}
{"uid":"7201232141007","vid":"14943445328940004","good_id":"223","gold":"20"}
{"uid":"7201232141008","vid":"14943445328940004","good_id":"223","gold":"30"}
{"uid":"7201232141009","vid":"14943445328940005","good_id":"223","gold":"40"}
{"uid":"7201232141010","vid":"14943445328940005","good_id":"223","gold":"50"}
{"uid":"7201232141011","vid":"14943445328940006","good_id":"223","gold":"10"}
{"uid":"7201232141012","vid":"14943445328940006","good_id":"223","gold":"20"}
{"uid":"7201232141013","vid":"14943445328940007","good_id":"223","gold":"30"}
{"uid":"7201232141014","vid":"14943445328940007","good_id":"223","gold":"40"}
{"uid":"7201232141015","vid":"14943445328940008","good_id":"223","gold":"50"}
{"uid":"7201232141016","vid":"14943445328940008","good_id":"223","gold":"10"}
{"uid":"7201232141017","vid":"14943445328940009","good_id":"223","gold":"20"}
{"uid":"7201232141018","vid":"14943445328940009","good_id":"223","gold":"30"}

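2. Approach

  1. Use the fastjson library to parse each JSON line of video_info.log and extract the uid, vid and area fields, shaped as (vid,(uid,area))
  2. Extract the vid and gold fields from gift_record.log, shaped as (vid,gold)
  3. Aggregate the gift data by vid. A live room can receive many gifts, so the gold of all of them has to be summed; the result is shaped as **(vid,gold_sum)**
  4. Join the two datasets on vid; the result is shaped as (vid,((uid,area),gold_sum))
  5. Map over the joined data and keep only uid, area and gold_sum, shaped as ((uid,area),gold_sum)
  6. A streamer belongs to exactly one region but can go live several times, so aggregate gold_sum with (uid,area) as the key; the result is shaped as ((uid,area),gold_sum_all)
  7. Regroup by region, shaped as (area,(uid,gold_sum_all))
  8. Map over each group, sort its entries by gold in descending order and take the first 5; the final output is area and topN
    Here topN is simply a string built by concatenating the ids and gold totals of the top streamers
  9. Print the result to the console with foreach, separating fields with a tab.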
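
The nine steps above can be traced on a handful of records without a cluster. The following is a minimal plain-Java sketch, not the Spark job itself: the `Video` and `Gift` classes, the `topN` method and the sample values are hypothetical stand-ins chosen for illustration, and ordinary collections take the place of RDDs. Ties are broken by uid here, which is an added assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TopNSketch {

    // Hypothetical stand-in for one line of video_info.log.
    static class Video {
        final String uid, vid, area;
        Video(String uid, String vid, String area) { this.uid = uid; this.vid = vid; this.area = area; }
    }

    // Hypothetical stand-in for one line of gift_record.log.
    static class Gift {
        final String vid;
        final int gold;
        Gift(String vid, int gold) { this.vid = vid; this.gold = gold; }
    }

    // Returns area -> "uid:gold,uid:gold,..." with at most n streamers per area.
    static Map<String, String> topN(List<Video> videos, List<Gift> gifts, int n) {
        // Steps 2-3: total gold per live room, (vid, gold_sum).
        Map<String, Integer> goldByVid = new HashMap<>();
        for (Gift g : gifts) {
            goldByVid.merge(g.vid, g.gold, Integer::sum);
        }
        // Steps 4-6: inner join on vid, then total gold per streamer within an area.
        Map<String, Map<String, Integer>> goldByAreaAndUid = new TreeMap<>();
        for (Video v : videos) {
            Integer gold = goldByVid.get(v.vid);
            if (gold == null) {
                continue; // rooms without gifts drop out, as in an inner join
            }
            goldByAreaAndUid.computeIfAbsent(v.area, a -> new HashMap<>())
                    .merge(v.uid, gold, Integer::sum);
        }
        // Steps 7-8: sort each area by gold descending and keep the first n.
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : goldByAreaAndUid.entrySet()) {
            List<Map.Entry<String, Integer>> ranked = new ArrayList<>(e.getValue().entrySet());
            ranked.sort((a, b) -> {
                int byGold = Integer.compare(b.getValue(), a.getValue());
                return byGold != 0 ? byGold : a.getKey().compareTo(b.getKey()); // ties: uid ascending
            });
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < Math.min(n, ranked.size()); i++) {
                if (i > 0) {
                    sb.append(",");
                }
                sb.append(ranked.get(i).getKey()).append(":").append(ranked.get(i).getValue());
            }
            result.put(e.getKey(), sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Video> videos = Arrays.asList(
                new Video("u1", "v1", "US"), new Video("u2", "v2", "US"),
                new Video("u1", "v3", "US"), new Video("u3", "v4", "CN"));
        List<Gift> gifts = Arrays.asList(
                new Gift("v1", 10), new Gift("v1", 20), new Gift("v2", 50), new Gift("v3", 40));
        // Step 9: area and ranking separated by a tab.
        topN(videos, gifts, 5).forEach((area, top) -> System.out.println(area + "\t" + top));
    }
}
```

With these four sample rooms only a US line is printed, because the single CN room received no gifts and therefore does not survive the join.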

3. Preparation

  1. pom file (pom.xml)
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>db_spark</groupId>
    <artifactId>db_spark</artifactId>
    <version>1.0-SNAPSHOT</version>
    <!-- Spark dependency -->
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.3</version>
        </dependency>
        <!-- fastjson dependency -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.68</version>
        </dependency>
    </dependencies>
    
    <build>
    <plugins>
        <!-- Java compiler plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <!-- Packaging plugin: builds a jar that bundles the dependencies -->
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass></mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
    </build>
</project>
  2. Creating the Spark context
        // "local" runs Spark in-process with a single worker thread
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("TopNClass").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

4. Java implementation

  1. Extract the fields and reshape each record into a Tuple2
        // Read the two input files, one JSON record per line
        JavaRDD<String> rdd_video = sc.textFile("F:\\video_info.log");
        JavaRDD<String> rdd_gift = sc.textFile("F:\\gift_record.log");

        // Reshape rdd_video into (vid,(uid,area))
        // mapToPair parses each line and wraps the fields in a Tuple2
        JavaPairRDD<String, Tuple2<String, String>> rdd_video_format = rdd_video.mapToPair(new PairFunction<String, String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, Tuple2<String, String>> call(String line) throws Exception {
                // Extract the fields with fastjson
                JSONObject jsonObj = JSON.parseObject(line);
                String vid = jsonObj.getString("vid");
                String uid = jsonObj.getString("uid");
                String area = jsonObj.getString("area");
                return new Tuple2<String, Tuple2<String, String>>(vid, new Tuple2<String, String>(uid, area));
            }
        });

        // Reshape rdd_gift into (vid,gold)
        JavaPairRDD<String, Integer> rdd_gold = rdd_gift.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String line) throws Exception {
                JSONObject jsonObj = JSON.parseObject(line);
                String vid = jsonObj.getString("vid");
                // gold is stored as a string in the JSON, so parse it to an integer
                Integer gold = Integer.parseInt(jsonObj.getString("gold"));
                return new Tuple2<String, Integer>(vid, gold);
            }
        });


  2. Aggregate the gift records with reduceByKey, because a live room can receive many gifts
        // Sum the gold of all pairs that share a vid: (vid,gold) becomes (vid,gold_sum)
        JavaPairRDD<String, Integer> rdd_gold_sum = rdd_gold.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) throws Exception {
                return i1 + i2;
            }
        });
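
What `reduceByKey` computes here can be pictured with an ordinary map: values that share a key are folded together pairwise with the supplied function. The sketch below is a single-JVM analogy under that assumption, not how Spark executes the operation across partitions; the class and method names are made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class ReduceByKeySketch {

    // Folds all values that share a key with the given associative function.
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> fn) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            // First value for a key is stored as-is; later ones are combined with fn.
            out.merge(p.getKey(), p.getValue(), fn);
        }
        return out;
    }

    public static void main(String[] args) {
        // (vid, gold) pairs taken from the first four records of gift_record.log.
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                new SimpleEntry<String, Integer>("14943445328940001", 10),
                new SimpleEntry<String, Integer>("14943445328940001", 20),
                new SimpleEntry<String, Integer>("14943445328940002", 30),
                new SimpleEntry<String, Integer>("14943445328940002", 40));
        // prints {14943445328940001=30, 14943445328940002=70}
        System.out.println(reduceByKey(pairs, Integer::sum));
    }
}
```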
  3. The two datasets are now shaped as (vid,(uid,area)) and (vid,gold_sum); join them on vid
        // rdd_video_format holds the first shape, (vid,(uid,area))
        // rdd_gold_sum holds the second shape, (vid,gold_sum)
        // The joined result is shaped as (vid,((uid,area),gold_sum))
        rdd_video_format.join(rdd_gold_sum);
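
`join` on pair RDDs is an inner join: a key appears in the result only if it occurs on both sides, so a live room with no gift records contributes nothing to the ranking. The following is a minimal single-JVM sketch of that behaviour, assuming each vid occurs at most once per side (which holds for the gift side after the aggregation above). The `innerJoin` helper and the sample keys are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class InnerJoinSketch {

    // Keeps only keys present in both maps; each value is the pair (left, right).
    static <K, L, R> Map<K, Map.Entry<L, R>> innerJoin(Map<K, L> left, Map<K, R> right) {
        Map<K, Map.Entry<L, R>> out = new LinkedHashMap<>();
        for (Map.Entry<K, L> e : left.entrySet()) {
            R r = right.get(e.getKey());
            if (r != null) {
                out.put(e.getKey(), new SimpleEntry<L, R>(e.getValue(), r));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // vid -> "uid@area"; a string stands in for the (uid, area) tuple.
        Map<String, String> videos = new LinkedHashMap<>();
        videos.put("v1", "u1@US");
        videos.put("v2", "u2@ID"); // no gifts for this room
        // vid -> gold_sum
        Map<String, Integer> goldSum = new HashMap<>();
        goldSum.put("v1", 30);

        // Only v1 survives the join.
        System.out.println(innerJoin(videos, goldSum).keySet());
    }
}
```

If streamers without any gifts should still appear with a total of 0, a left outer join would be needed instead; the article's job does not do that.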
  4. Take uid, area and gold_sum from the joined data and reshape them
        // Join the two datasets on vid
        // Input shape:  (vid,((uid,area),gold_sum))
        // Output shape: ((uid,area),gold_sum)
        // mapToPair is used because each record is rebuilt as a new key/value Tuple2
        JavaPairRDD<Tuple2<String, String>, Integer> rdd_groupPre = rdd_video_format.join(rdd_gold_sum).mapToPair(new PairFunction<Tuple2<String, Tuple2<Tuple2<String, String>, Integer>>, Tuple2<String, String>, Integer>() {
            @Override
            public Tuple2<Tuple2<String, String>, Integer> call(Tuple2<String, Tuple2<Tuple2<String, String>, Integer>> tup) throws Exception {
                // tup._1 is vid and tup._2 is ((uid,area),gold_sum),
                // so uid is tup._2._1._1, area is tup._2._1._2 and gold_sum is tup._2._2
                String uid = tup._2._1._1;
                String area = tup._2._1._2;
                Integer gold_sum = tup._2._2;
                return new Tuple2<Tuple2<String, String>, Integer>(new Tuple2<String, String>(uid, area), gold_sum);
            }
        });
  5. The data is now shaped as ((uid,area),gold_sum). A streamer belongs to exactly one region, so **(uid,area)** identifies a streamer; because a streamer can go live several times, sum the gold of all sessions per key with reduceByKey
        // Sum the values per key with reduceByKey
        // The result is shaped as ((uid,area),gold_sum_all)
        JavaPairRDD<Tuple2<String, String>, Integer> rdd_groupby = rdd_groupPre.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) throws Exception {
                return i1 + i2;
            }
        });
  6. The data is now shaped as ((uid,area),gold_sum_all); move area into the key position
        // ((uid,area),gold_sum_all) becomes (area,(uid,gold_sum_all))
        JavaPairRDD<String, Tuple2<String, Integer>> rdd_group = rdd_groupby.mapToPair(new PairFunction<Tuple2<Tuple2<String, String>, Integer>, String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Tuple2<String, Integer>> call(Tuple2<Tuple2<String, String>, Integer> tup) throws Exception {
                String area = tup._1._2;
                String uid = tup._1._1;
                Integer gold_sum_all = tup._2;
                return new Tuple2<String, Tuple2<String, Integer>>(area, new Tuple2<String, Integer>(uid, gold_sum_all));
            }
        });
  7. Group the (area,(uid,gold_sum_all)) pairs by area
        // Group by area
        JavaPairRDD<String, Iterable<Tuple2<String, Integer>>> rdd = rdd_group.groupByKey();
  8. Sort each group and extract the top entries
        // Each element is now (area, Iterable of (uid,gold_sum_all))
        // Sort each group and build the ranking string
        // Output shape: (area,topN)
        JavaRDD<Tuple2<String, String>> topN = rdd.map(new Function<Tuple2<String, Iterable<Tuple2<String, Integer>>>, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> call(Tuple2<String, Iterable<Tuple2<String, Integer>>> tup) throws Exception {
                String area = tup._1;
                // Copy the group into a list so that it can be sorted
                ArrayList<Tuple2<String, Integer>> array = Lists.newArrayList(tup._2);
                Collections.sort(array, new Comparator<Tuple2<String, Integer>>() {
                    // Descending by gold; Integer.compare avoids the overflow that subtraction can cause
                    @Override
                    public int compare(Tuple2<String, Integer> o1, Tuple2<String, Integer> o2) {
                        return Integer.compare(o2._2, o1._2);
                    }
                });
                StringBuffer stringBuffer = new StringBuffer();
                for (int i = 0; i < array.size(); i++) {
                    // Keep only the first 5 entries
                    if (i < 5) {
                        Tuple2<String, Integer> tup1 = array.get(i);
                        if (i != 0) {
                            stringBuffer.append(",");
                        }
                        stringBuffer.append(tup1._1 + ":" + tup1._2);
                    }
                }
                return new Tuple2<String, String>(area, stringBuffer.toString());
            }
        });
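
Sorting a whole group just to keep five entries is fine for small groups. For large ones, a bounded min-heap holds at most N candidates and does O(log N) work per element. The sketch below shows that alternative; it is a swapped-in technique rather than what the code above does, and `BoundedTopN` and its `topN` method are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class BoundedTopN {

    // Returns the n entries with the largest values, largest first.
    static List<Map.Entry<String, Integer>> topN(Iterable<Map.Entry<String, Integer>> items, int n) {
        Comparator<Map.Entry<String, Integer>> byGold =
                (a, b) -> Integer.compare(a.getValue(), b.getValue());
        // Min-heap: the smallest of the current candidates sits at the head.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byGold);
        for (Map.Entry<String, Integer> item : items) {
            heap.offer(item);
            if (heap.size() > n) {
                heap.poll(); // evict the current smallest candidate
            }
        }
        // The heap's iteration order is unspecified, so sort the survivors explicitly.
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(byGold.reversed());
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical (uid, gold_sum_all) pairs for one area.
        List<Map.Entry<String, Integer>> group = Arrays.asList(
                new SimpleEntry<String, Integer>("a", 30),
                new SimpleEntry<String, Integer>("b", 90),
                new SimpleEntry<String, Integer>("c", 70),
                new SimpleEntry<String, Integer>("d", 10));
        for (Map.Entry<String, Integer> e : topN(group, 2)) {
            System.out.println(e.getKey() + ":" + e.getValue());
        }
    }
}
```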
  9. Print the result with foreach
        // (area,topN)
        topN.foreach(new VoidFunction<Tuple2<String, String>>() {
            @Override
            public void call(Tuple2<String, String> tup) throws Exception {
                // Separate the fields with a tab
                System.out.println(tup._1 + "\t" + tup._2);
            }
        });

Run result

Each output line starts with the region (for example CN), followed by the ranking of streamer ids and the gold each one received.

Complete code

package com.dang.java;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;

/**
 * @author dang
 * @version 3.0
 * @description ()
 * @date 2022/7/9 10:55
 */
public class TopN {

    public static void main(String[] args) {

        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("TopN").setMaster("local");

        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // Read the two input files, one JSON record per line
        JavaRDD<String> rdd_video = sc.textFile("F:\\video_info.log");
        JavaRDD<String> rdd_gift = sc.textFile("F:\\gift_record.log");

        // Parse the JSON records
        // (vid,(uid,area))
         JavaPairRDD<String,Tuple2<String, String>> rdd_video_geshi = rdd_video.mapToPair(new PairFunction<String, String, Tuple2<String, String>>() {
             @Override
             public Tuple2<String, Tuple2<String, String>> call(String line) throws Exception {

                 JSONObject jsonObj = JSON.parseObject(line);
                 String vid  =  jsonObj.getString("vid");
                 String uid = jsonObj.getString("uid");
                 String area = jsonObj.getString("area");
                 return new Tuple2<String, Tuple2<String, String>>(vid,new Tuple2<String, String>(uid,area));
             }
         });

        // (vid,gold)
        // mapToPair yields a JavaPairRDD directly because call() returns a Tuple2
        JavaPairRDD<String, Integer> rdd_gold =  rdd_gift.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String line) throws Exception {
                JSONObject jsonObj = JSON.parseObject(line);
                String vid  =jsonObj.getString("vid");
                Integer gold = Integer.parseInt(jsonObj.getString("gold"));
                return new Tuple2<String, Integer>(vid,gold);
            }
        });

        // Aggregate the gift records: sum gold over all records with the same vid
        // (vid,gold_num)
          JavaPairRDD<String,Integer> rdd_gold_num =  rdd_gold.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) throws Exception {
                return i1+i2;
            }
        });


        // Join the two datasets on vid
        // mapToPair is used because each record is rebuilt as a new key/value Tuple2
        JavaPairRDD<Tuple2<String, String>, Integer>  rdd_groupbysq = rdd_video_geshi.join(rdd_gold_num).mapToPair(new PairFunction<Tuple2<String, Tuple2<Tuple2<String, String>, Integer>>, Tuple2<String, String>, Integer>() {
            @Override
            public Tuple2<Tuple2<String, String>, Integer> call(Tuple2<String, Tuple2<Tuple2<String, String>, Integer>> tup) throws Exception {

                String uid = tup._2._1._1;
                String area = tup._2._1._2;
                Integer gooldnum = tup._2._2;

                return new Tuple2<Tuple2<String, String>, Integer>(new Tuple2<String, String>(uid,area),gooldnum);
            }
        });
        // Sum the gold per (uid,area) key with reduceByKey
        JavaPairRDD<Tuple2<String, String>, Integer>  rdd_groupby =  rdd_groupbysq.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) throws Exception {
                return i1+i2;
            }
        });
       
       
        JavaPairRDD<String, Tuple2<String, Integer>> rdd_group =  rdd_groupby.mapToPair(new PairFunction<Tuple2<Tuple2<String, String>, Integer>, String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Tuple2<String, Integer>> call(Tuple2<Tuple2<String, String>, Integer> tup) throws Exception {
                String area = tup._1._2;
                String uid = tup._1._1;
                Integer goodnum_all = tup._2;
                return new Tuple2<String, Tuple2<String, Integer>>(area,new Tuple2<String, Integer>(uid,goodnum_all));
            }
        });

        // Group by area
        JavaPairRDD<String, Iterable<Tuple2<String, Integer>>> rdd =  rdd_group.groupByKey();
        // Each element is now (area, Iterable of (uid,goodnum_all))
        // Sort each group
        // Output shape: (area,topN)
        JavaRDD<Tuple2<String, String>> topN = rdd.map(new Function<Tuple2<String, Iterable<Tuple2<String, Integer>>>, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> call(Tuple2<String, Iterable<Tuple2<String, Integer>>> tup) throws Exception {
                // Sorting logic
                String area = tup._1;
                ArrayList<Tuple2<String, Integer>> array = Lists.newArrayList(tup._2);
                Collections.sort(array, new Comparator<Tuple2<String, Integer>>() {
                    // Descending by gold; Integer.compare avoids the overflow that subtraction can cause
                    @Override
                    public int compare(Tuple2<String, Integer> o1, Tuple2<String, Integer> o2) {
                        return Integer.compare(o2._2, o1._2);
                    }
                });
                StringBuffer stringBuffer = new StringBuffer();

                for (int i = 0; i < array.size(); i++) {
                    // Keep the top 5, as the requirement states
                    if (i < 5) {
                        Tuple2<String, Integer> tup1 = array.get(i);
                        if (i != 0) {
                            stringBuffer.append(",");
                        }
                        stringBuffer.append(tup1._1 + ":" + tup1._2);
                    }
                }


                return new Tuple2<String, String>(area, stringBuffer.toString());
            }
        });

        //(area ,topN)
        topN.foreach(new VoidFunction<Tuple2<String, String>>() {
            @Override
            public void call(Tuple2<String, String> tup) throws Exception {
                System.out.println(tup._1 + "\t" + tup._2);
            }
        });
    }

}
