1. Finding the maximum per group
Group the records in a text file by key, compute the maximum value for each key, and print the result.
2. Maven setup
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.mk</groupId>
    <artifactId>spark-test</artifactId>
    <version>1.0</version>
    <name>spark-test</name>
    <url>http://spark.mk.com</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.11.1</scala.version>
        <spark.version>2.4.4</spark.version>
        <hadoop.version>2.6.0</hadoop.version>
    </properties>
    <dependencies>
        <!-- Scala dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- Spark dependencies -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <artifactId>maven-clean-plugin</artifactId>
                    <version>3.1.0</version>
                </plugin>
                <plugin>
                    <artifactId>maven-resources-plugin</artifactId>
                    <version>3.0.2</version>
                </plugin>
                <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.8.0</version>
                </plugin>
                <plugin>
                    <artifactId>maven-surefire-plugin</artifactId>
                    <version>2.22.1</version>
                </plugin>
                <plugin>
                    <artifactId>maven-jar-plugin</artifactId>
                    <version>3.0.2</version>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>
3. Application code
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class GroupByMaxApp implements SparkConfInfo {
    public static void main(String[] args) {
        String filePath = "E:\\spark\\groubByNumber.txt";
        SparkSession sparkSession = new GroupByMaxApp().getSparkConf("groubByNumber");
        JavaPairRDD<String, Integer> numbers = sparkSession.sparkContext()
                .textFile(filePath, 4)
                .toJavaRDD()
                .mapToPair(v -> {
                    String[] data = v.split("\\s+");
                    if (data.length != 2) {
                        return null;
                    }
                    // Only whole numbers are accepted; Integer.valueOf cannot parse decimals.
                    if (!data[1].matches("-?[0-9]+")) {
                        return null;
                    }
                    return new Tuple2<>(data[0], Integer.valueOf(data[1]));
                })
                .filter(v -> v != null)
                .cache();

        // groupByKey materializes all values of a key in memory at once,
        // so a large data set can overflow memory and fail.
        // numbers.groupByKey()
        //         .sortByKey(true)
        //         .mapValues(v -> {
        //             Integer max = null;
        //             for (Integer val : v) {
        //                 if (max == null || max < val) {
        //                     max = val;
        //                 }
        //             }
        //             return max;
        //         })
        //         .collect()
        //         .forEach(v -> System.out.println(v._1 + ":" + v._2));

        // combineByKey aggregates inside each partition first, then merges the partial results.
        numbers.combineByKey(
                max -> max, // createCombiner: the first value of a key becomes the initial in-partition maximum
                (max, val) -> {
                    if (max < val) {
                        max = val;
                    }
                    return max;
                }, // mergeValue: within-partition aggregation
                (a, b) -> Math.max(a, b)) // mergeCombiners: cross-partition aggregation
                .sortByKey(true)
                .collect()
                .forEach(v -> System.out.println(v._1 + ":" + v._2));
        sparkSession.stop();
    }
}
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public interface SparkConfInfo {
    default SparkSession getSparkConf(String appName) {
        SparkConf sparkConf = new SparkConf();
        if (System.getProperty("os.name").toLowerCase().contains("win")) {
            // Simulate Spark locally on Windows.
            sparkConf.setMaster("local[4]");
            System.out.println("Running Spark in local mode");
        } else {
            sparkConf.setMaster("spark://hadoop01:7077,hadoop02:7077,hadoop03:7077");
            // Local IP of the driver; it must be mutually reachable with the Spark
            // cluster, e.g. on the same LAN.
            sparkConf.set("spark.driver.host", "192.168.150.1");
            // Path of the jar produced by building the project.
            sparkConf.setJars(new String[]{".\\out\\artifacts\\spark_test\\spark-test.jar"});
        }
        return SparkSession.builder().appName(appName).config(sparkConf).getOrCreate();
    }
}
Contents of groubByNumber.txt:
A 100
A 24
B 43
C 774
D 43
D 37
D 78
E 42
C 68
F 89
G 49
F 543
H 36
E 888
A 258
A 538
B 79
B 6
H 67
C 99
Output:
A:538
B:79
C:774
D:78
E:888
F:543
G:49
H:67
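As an aside, when the combiner type C is the same as the value type V, as it is here, the simpler reduceByKey gives the same result; Spark implements reduceByKey on top of combineByKey with an identity createCombiner. A minimal sketch, reusing the numbers RDD from section 3:
// reduceByKey, like combineByKey, merges values inside each partition before
// shuffling, so it avoids groupByKey's memory problem in the same way.
numbers.reduceByKey((a, b) -> Math.max(a, b))
        .sortByKey(true)
        .collect()
        .forEach(v -> System.out.println(v._1 + ":" + v._2));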
4. The combineByKey method
<C> JavaPairRDD<K, C> combineByKey(Function<V, C> createCombiner,
                                   Function2<C, V, C> mergeValue,
                                   Function2<C, C, C> mergeCombiners);
First, a look at the three parameters (the English lines are quoted from the Spark javadoc):
* Users provide three functions:
* - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
This function takes the current value as its parameter; we can perform additional operations on it (such as a type conversion) and return the result. This step is similar to an initialization.
* - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
This function merges a value V into an existing combiner C (the output of createCombiner). This operation runs within each partition.
* - `mergeCombiners`, to combine two C's into a single one.
This function merges two combiners C into one. This operation runs across partitions.
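The point of a separate combiner type C shows up when C genuinely differs from V. A common illustration, sketched here against the numbers RDD from section 3 (not part of the original program), is the per-key average, where C is a (sum, count) pair:
// C is a (sum, count) pair: Tuple2<Integer, Integer>, distinct from V (Integer).
JavaPairRDD<String, Tuple2<Integer, Integer>> sumCount = numbers.combineByKey(
        v -> new Tuple2<>(v, 1),                           // createCombiner: V -> C
        (acc, v) -> new Tuple2<>(acc._1 + v, acc._2 + 1),  // mergeValue: fold a V into a C (per partition)
        (a, b) -> new Tuple2<>(a._1 + b._1, a._2 + b._2)); // mergeCombiners: merge two C's (across partitions)
sumCount.mapValues(acc -> (double) acc._1 / acc._2) // average = sum / count
        .sortByKey(true)
        .collect()
        .forEach(v -> System.out.println(v._1 + ":" + v._2));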