Spark 2.4: A Summary of Basic RDD Transform Operations

All the code covered in this post has been uploaded to GitHub and can be downloaded there:

Project GitHub link

    The main contents of this article are as follows:

I. Environment Setup

1. Spark 2.4.2 (the latest version as of April 23, 2019)

2. Scala 2.11+

3. Add the required dependencies to pom.xml, as follows:

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.2</version>
        </dependency>
    </dependencies>
II. Required Initialization
1. Initialize the SparkContext and the RDDs that the subsequent transform operations will use
  • java
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("transform rdd test");
        sparkConf.setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        JavaRDD<Integer> javaRDD1 = jsc.parallelize(Arrays.asList(1,2,2,3,4,5,6,7,8,9));
        JavaRDD<Integer> javaRDD2 = jsc.parallelize(Arrays.asList(2,4,6,8,10));
        JavaRDD<String> javaRDD3 = jsc.parallelize(Arrays.asList("1,3,5,7,9","2,4,6,8,10"));
  • scala
    val sparkConf = new SparkConf().setMaster("local").setAppName("transform rdd test scala")
    val sparkContext = new SparkContext(sparkConf)
    val scalaRDD1 = sparkContext.parallelize(List(1, 2, 2, 3, 4, 5, 6, 7, 8, 9))
    val scalaRDD2 = sparkContext.parallelize(List(2, 4, 6, 8, 10))
    val scalaRDD3 = sparkContext.parallelize(List("1,3,5,7,9", "2,4,6,8,10"))
2. Provide a helper method to call on each transformed RDD, whose purpose is to print the RDD's values so we can verify the results (a generic variant is sketched after the two versions below)
  • java
    private static void resultCollect(JavaRDD filter) {
        List collect = filter.collect();
        for (int i = 0; i < collect.size(); i++) {
            System.out.print(collect.get(i) + " ");
        }
        System.out.println();
    }
  • scala
  def resultCollect(rdd: RDD[Int]) = {
    val results = rdd.collect()
    for (i <- 0 to results.size - 1) {
        print(results(i) +" ")
    }
  }
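
  A single generic helper could replace the per-type Scala versions (this one and the resultCollectStr variant in the full code below). The following is a minimal sketch, not part of the original code; the name resultCollectAny is just for illustration, and org.apache.spark.rdd.RDD is assumed to be imported:

scala
  // Generic variant: collect() already returns Array[T], so one method covers all element types.
  def resultCollectAny[T](rdd: RDD[T]): Unit = {
    rdd.collect().foreach(x => print(s"$x "))
    println()
  }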
III. Summary of Basic RDD Transform Operations
  • Overview of transform operations

    A transform operation is one that returns a new RDD, such as map() or filter(). Calling a transformation does not execute it immediately: Spark internally records the requested operation and only performs the actual computation when an action is triggered (a small sketch illustrating this laziness follows).
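
    For example, in a minimal sketch using the scalaRDD1 defined above, the map call below only records the transformation; the job actually runs when the collect() action is invoked:

scala
    // Nothing is computed here: map is lazy and only records the transformation.
    val doubled = scalaRDD1.map(i => i * 2)
    // The job runs only when an action such as collect() is triggered.
    val materialized = doubled.collect()   // Array(2, 4, 4, 6, 8, 10, 12, 14, 16, 18)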

Below, each transform operation is introduced in both Scala and Java. All of the code involved has been tested, so feel free to use it.

  • map

    map: applies a function to every element of the RDD; each element yields exactly one result, and the returned values make up the new RDD
    Example: multiply each RDD element by 2 and print the result
    
java
        JavaRDD<Integer> map = javaRDD.map(new Function<Integer, Integer>() {
            public Integer call(Integer i) throws Exception {
                return i * 2;
            }
        });
scala
val rdd = rddScala.map(i => i * 2)

Output: 2 4 4 6 8 10 12 14 16 18

  • filter

    filter: filters the RDD's data and returns an RDD made up of the values that satisfy the predicate
    Example: keep only the odd numbers in the RDD
    
java
        JavaRDD<Integer> filter = javaRDD1.filter(new Function<Integer, Boolean>() {
            public Boolean call(Integer integer) throws Exception {
                return integer % 2 != 0;
            }
        });

scala
val filter = rddScala.filter(i => i % 2 != 0)

Output: 1 3 5 7 9

  • flatMap

    flatMap: each element of the original RDD yields an iterator of results, which are flattened into the new RDD
    Example: split each string in the RDD on commas and print the result
    
java
        JavaRDD<String> javaRDD = javaRDD3.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String s) throws Exception {
                String[] strings = s.split(",");
                List<String> list = new ArrayList<String>(Arrays.asList(strings));
                return list.iterator();
            }
        });

scala
val flatMapRdd = rddScala3.flatMap(str => str.split(","))

Output: 1 3 5 7 9 2 4 6 8 10

Note: this is what my program prints. The flatMap function produces one iterator per input element here, one for 1 3 5 7 9 and one for 2 4 6 8 10, and the resulting RDD contains the flattened elements of both (a short sketch contrasting map and flatMap follows).
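
To make the contrast with map concrete, here is a minimal sketch using the same scalaRDD3: map produces exactly one output element per input element, while flatMap flattens the per-element collections:

scala
    // map keeps one output per input -> RDD[Array[String]] with 2 elements
    val mapped = scalaRDD3.map(str => str.split(","))
    // flatMap flattens each element's results -> RDD[String] with 10 elements
    val flattened = scalaRDD3.flatMap(str => str.split(","))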

  • distinct

    distinct: removes duplicate elements from the RDD
    
java
 JavaRDD<Integer> distinct = javaRDD1.distinct();
scala
val distinctRDD = scalaRDD1.distinct()

Output: 4 6 8 2 1 3 7 9 5 (note that deduplication does not preserve the original order; see the sketch below for producing a deterministic order)
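
If a deterministic order is needed for display, one option (a sketch, not part of the original code) is to sort after deduplicating:

scala
    // distinct() involves a shuffle, so the order of the output is not guaranteed;
    // sortBy imposes a deterministic order when one is needed.
    val sortedDistinct = scalaRDD1.distinct().sortBy(identity)
    // collect() then returns Array(1, 2, 3, 4, 5, 6, 7, 8, 9)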

  • union

    union: produces an RDD containing all the elements of both RDDs (both must be RDDs of the same type)
    
java
  JavaRDD<Integer> union = javaRDD1.union(javaRDD2);
scala
   val unionRDD = scalaRDD1.union(scalaRDD2)

Output: 1 2 2 3 4 5 6 7 8 9 2 4 6 8 10

  • intersection

    intersection: returns the elements the two RDDs have in common (both must be RDDs of the same type)
    
java
JavaRDD<Integer> intersection = javaRDD1.intersection(javaRDD2);
scala
 val intersectionRDD = scalaRDD1.intersection(scalaRDD2)

Output: 4 6 8 2

  • cartesian

    cartesian: the Cartesian product of the two RDDs (very expensive when either RDD is large, so use with caution!)
    
java
 JavaPairRDD<Integer, Integer> cartesian = javaRDD1.cartesian(javaRDD2);
        List<Tuple2<Integer, Integer>> collects = cartesian.collect();
        for (Tuple2<Integer, Integer> collect: collects) {
            System.out.print(collect._1 + "," + collect._2 +" ");
        }
        System.out.println();
scala
    val cartesianRdd = scalaRDD1.cartesian(scalaRDD2)
    val results = cartesianRdd.collect();
    for (result<- results) {
      print(result._1 + "," + result._2 + " ")
    }

Output: 1,2 1,4 2,2 2,4 2,2 2,4 3,2 3,4 4,2 4,4 1,6 1,8 1,10 2,6 2,8 2,10 2,6 2,8 2,10 3,6 3,8 3,10 4,6 4,8 4,10 5,2 5,4 6,2 6,4 7,2 7,4 8,2 8,4 9,2 9,4 5,6 5,8 5,10 6,6 6,8 6,10 7,6 7,8 7,10 8,6 8,8 8,10 9,6 9,8 9,10 (with only 10 × 5 elements there are already 50 result pairs; imagine what happens with a large data set, so use this with caution!)

  • subtract

    subtract: returns the elements of the first RDD that are not present in the second RDD
    
java
  JavaRDD<Integer> subtract = javaRDD1.subtract(javaRDD2);
scala
 val substractRDD = scalaRDD1.subtract(scalaRDD2)

Output: 1 3 5 7 9

  • sample

    sample: samples the RDD; the first argument is whether to sample with replacement, the second is the sampling fraction, i.e. the expected proportion of elements to keep (the result is random)
    Example: without replacement, sample roughly 10% of the RDD
    
java
 JavaRDD<Integer> sample = javaRDD1.sample(false, 0.1);
scala
val sampleRDD = scalaRDD1.sample(false, 0.1)

Output: 2 5 (this is random; each run may produce a different result; see the sketch below for fixing the seed)
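
If a reproducible sample is needed, sample also accepts an explicit seed as its third argument; a minimal sketch:

scala
    // With a fixed seed, the same elements are sampled on every run.
    val seededSample = scalaRDD1.sample(withReplacement = false, fraction = 0.1, seed = 42L)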

IV. Complete Program Code
java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/**
 * @author xmr
 * @date 2019/4/29 10:02
 * @description Spark RDD transform operations
 */
public class SparkTransformRdd {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("transform rdd test");
        sparkConf.setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        JavaRDD<Integer> javaRDD1 = jsc.parallelize(Arrays.asList(1,2,2,3,4,5,6,7,8,9));
        JavaRDD<Integer> javaRDD2 = jsc.parallelize(Arrays.asList(2,4,6,8,10));
        JavaRDD<String> javaRDD3 = jsc.parallelize(Arrays.asList("1,3,5,7,9","2,4,6,8,10"));
        mapTest(javaRDD1);
        filterTest(javaRDD1);
        flatMapTest(javaRDD3);
        distinctTest(javaRDD1);
        unionTest(javaRDD1, javaRDD2);
        cartesionTest(javaRDD1, javaRDD2);
        intersectionTest(javaRDD1, javaRDD2);
        mapToPairTest(javaRDD3);
        substartTest(javaRDD1, javaRDD2);
        sampleTest(javaRDD1);
    }

    private static void sampleTest(JavaRDD<Integer> javaRDD1) {
        JavaRDD<Integer> sample = javaRDD1.sample(false, 0.1);
        resultCollect(sample);
    }

    private static void substartTest(JavaRDD<Integer> javaRDD1, JavaRDD<Integer> javaRDD2) {
        JavaRDD<Integer> subtract = javaRDD1.subtract(javaRDD2);
        resultCollect(subtract);
    }

    private static void mapToPairTest(JavaRDD<String> javaRDD3) {
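        // NOTE: left empty in the original post; pair-RDD operations such as mapToPair are not covered here.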
    }

    private static void intersectionTest(JavaRDD<Integer> javaRDD1, JavaRDD<Integer> javaRDD2) {
        JavaRDD<Integer> intersection = javaRDD1.intersection(javaRDD2);
        resultCollect(intersection);
    }

    private static void cartesionTest(JavaRDD<Integer> javaRDD1, JavaRDD<Integer> javaRDD2) {
        JavaPairRDD<Integer, Integer> cartesian = javaRDD1.cartesian(javaRDD2);
        List<Tuple2<Integer, Integer>> collects = cartesian.collect();
        for (Tuple2<Integer, Integer> collect: collects) {
            System.out.print(collect._1 + "," + collect._2 +" ");
        }
        System.out.println();
    }

    private static void unionTest(JavaRDD<Integer> javaRDD1, JavaRDD<Integer> javaRDD2) {
        JavaRDD<Integer> union = javaRDD1.union(javaRDD2);
        resultCollect(union);
    }

    private static void distinctTest(JavaRDD<Integer> javaRDD1) {
        JavaRDD<Integer> distinct = javaRDD1.distinct(1);
        resultCollect(distinct);
    }

    private static void flatMapTest(JavaRDD<String> javaRDD3) {
        JavaRDD<String> javaRDD = javaRDD3.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String s) throws Exception {
                String[] strings = s.split(",");
                List<String> list = new ArrayList<String>(Arrays.asList(strings));
                return list.iterator();
            }
        });
        resultCollect(javaRDD);
    }

    private static void filterTest(JavaRDD<Integer> javaRDD1) {
        JavaRDD<Integer> filter = javaRDD1.filter(new Function<Integer, Boolean>() {
            public Boolean call(Integer integer) throws Exception {
                return integer % 2 != 0;
            }
        });
        resultCollect(filter);
    }

    private static void resultCollect(JavaRDD filter) {
        List collect = filter.collect();
        for (int i = 0; i < collect.size(); i++) {
            System.out.print(collect.get(i) + " ");
        }
        System.out.println();
    }

    private static void mapTest(JavaRDD<Integer> javaRDD) {
        JavaRDD<Integer> map = javaRDD.map(new Function<Integer, Integer>() {
            public Integer call(Integer i) throws Exception {
                return i * 2;
            }
        });
        resultCollect(map);
    }
}

scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
  * @author xmr
  * @date 2019/4/29 10:32
  * @description Spark RDD transform operations (Scala version)
  */
object SparkTransformRddScala {

  def resultCollect(rdd: RDD[Int]) = {
    val results = rdd.collect()
    for (i <- 0 to results.size - 1) {
        print(results(i) +" ")
    }
  }
  def resultCollectStr(rdd: RDD[String]) = {
    val results = rdd.collect()
    for (i <- 0 to results.size - 1) {
      print(results(i) +" ")
    }
  }

  def mapTest(rddScala: RDD[Int]): Unit = {
    val rdd = rddScala.map(i => i * 2)
    resultCollect(rdd)
  }

  def filter(rddScala: RDD[Int]): Unit = {
    val filter = rddScala.filter(i => i % 2 != 0)
    resultCollect(filter)
  }

  def flatMapTest(rddScala3: RDD[String]): Unit = {
    val flatMapRdd = rddScala3.flatMap(str => str.split(","))
    resultCollectStr(flatMapRdd)
  }

  def unionTest(scalaRDD1: RDD[Int], scalaRDD2: RDD[Int]): Unit = {
    val unionRDD = scalaRDD1.union(scalaRDD2)
    resultCollect(unionRDD)
  }

  def cartesionTest(scalaRDD1: RDD[Int], scalaRDD2: RDD[Int]): Unit = {
    val cartesianRdd = scalaRDD1.cartesian(scalaRDD2)
    val results = cartesianRdd.collect();
    for (result<- results) {
      print(result._1 + "," + result._2 + " ")
    }
  }

  def intersectionTest(scalaRDD1: RDD[Int], scalaRDD2: RDD[Int]): Unit = {
    val intersectionRDD = scalaRDD1.intersection(scalaRDD2)
    resultCollect(intersectionRDD)
  }

  def substractTest(scalaRDD1: RDD[Int], scalaRDD2: RDD[Int]): Unit = {
    val substractRDD = scalaRDD1.subtract(scalaRDD2)
    resultCollect(substractRDD)
  }

  def sampleTest(scalaRDD1: RDD[Int]): Unit = {
    val sampleRDD = scalaRDD1.sample(false, 0.1)
    resultCollect(sampleRDD)
  }

  def distinctTest(scalaRDD1: RDD[Int]) = {
    val distinctRDD = scalaRDD1.distinct()
    resultCollect(distinctRDD)
  }

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local").setAppName("transform rdd test scala")
    val sparkContext = new SparkContext(sparkConf)
    val scalaRDD1 = sparkContext.parallelize(List(1, 2, 2, 3, 4, 5, 6, 7, 8, 9))
    val scalaRDD2 = sparkContext.parallelize(List(2, 4, 6, 8, 10))
    val scalaRDD3 = sparkContext.parallelize(List("1,3,5,7,9", "2,4,6,8,10"))
    mapTest(scalaRDD1)
    filter( scalaRDD1)
    flatMapTest(scalaRDD3)
    distinctTest(scalaRDD1)
    unionTest(scalaRDD1, scalaRDD2)
    cartesionTest(scalaRDD1, scalaRDD2)
    intersectionTest(scalaRDD1, scalaRDD2)
    substractTest(scalaRDD1, scalaRDD2)
    sampleTest(scalaRDD1)
  }
}
