All of the code in this post has been uploaded to GitHub; download it yourself if needed:
The main contents of this post are as follows:
I. Environment Setup
1. Spark 2.4.2 (the latest version as of April 23, 2019)
2. Scala 2.11+
3. Add the required dependencies to pom.xml, as follows:
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.11.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.2</version>
    </dependency>
</dependencies>
II. Necessary Initialization
1. Initialize the SparkContext and the RDDs that the subsequent transform operations will use
- java
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("transform rdd test");
sparkConf.setMaster("local[2]");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaRDD<Integer> javaRDD1 = jsc.parallelize(Arrays.asList(1,2,2,3,4,5,6,7,8,9));
JavaRDD<Integer> javaRDD2 = jsc.parallelize(Arrays.asList(2,4,6,8,10));
JavaRDD<String> javaRDD3 = jsc.parallelize(Arrays.asList("1,3,5,7,9","2,4,6,8,10"));
- scala
val sparkConf = new SparkConf().setMaster("local").setAppName("transform rdd test scala")
val sparkContext = new SparkContext(sparkConf)
val scalaRDD1 = sparkContext.parallelize(List(1, 2, 2, 3, 4, 5, 6, 7, 8, 9))
val scalaRDD2 = sparkContext.parallelize(List(2, 4, 6, 8, 10))
val scalaRDD3 = sparkContext.parallelize(List("1,3,5,7,9", "2,4,6,8,10"))
2. Define a helper method to call whenever a transform produces a new RDD, so we can print the RDD's contents and verify the result (collect() pulls the entire RDD back to the driver, which is fine for these small test datasets)
- java
private static <T> void resultCollect(JavaRDD<T> rdd) {
    List<T> collect = rdd.collect();
    for (int i = 0; i < collect.size(); i++) {
        System.out.print(collect.get(i) + " ");
    }
    System.out.println();
}
- scala
def resultCollect[T](rdd: RDD[T]): Unit = {
  val results = rdd.collect()
  for (result <- results) {
    print(result + " ")
  }
  println()
}
III. Summary of Basic RDD Transform Operations
- Overview of transform operations
A transform operation is one that returns a new RDD, such as map() and filter(). Calling a transform does not execute it immediately: Spark internally records the requested operations, and the actual computation only runs when an action is triggered (see the sketch below).
Below, each transform operation is introduced in both Scala and Java. All of the code involved has been tested, so feel free to use it~
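As a minimal sketch of this lazy evaluation (using the scalaRDD1 defined above; the variable name lazyRdd is just for illustration), the side effect inside map only fires once an action such as collect() runs:
scala
// Defining the transform prints nothing: Spark only records the operation
val lazyRdd = scalaRDD1.map { i =>
  println("computing " + i) // side effect that reveals when the work actually runs
  i * 2
}
// The "computing" lines appear only now, when the collect() action triggers execution
lazyRdd.collect()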
- map
map: applies a function to every element of the RDD; each element produces exactly one result, and the returned values make up the new RDD. Example: multiply each element of the RDD by 2 and print the result.
java
JavaRDD<Integer> map = javaRDD1.map(new Function<Integer, Integer>() {
    public Integer call(Integer i) throws Exception {
        return i * 2;
    }
});
scala
val rdd = scalaRDD1.map(i => i * 2)
Output: 2 4 4 6 8 10 12 14 16 18
- filter
filter: filters the RDD's data and returns an RDD made up of the elements that satisfy the predicate. Example: keep only the odd numbers in the RDD.
java
JavaRDD<Integer> filter = javaRDD1.filter(new Function<Integer, Boolean>() {
    public Boolean call(Integer integer) throws Exception {
        return integer % 2 != 0;
    }
});
scala
val filter = scalaRDD1.filter(i => i % 2 != 0)
Output: 1 3 5 7 9
- flatMap
flatMap: applies a function that returns an iterator for each element of the original RDD, and the contents of all those iterators are flattened into the new RDD. Example: split each string in the RDD on commas and print the result.
java
JavaRDD<String> javaRDD = javaRDD3.flatMap(new FlatMapFunction<String, String>() {
    public Iterator<String> call(String s) throws Exception {
        String[] strings = s.split(",");
        List<String> list = new ArrayList<String>(Arrays.asList(strings));
        return list.iterator();
    }
});
scala
val flatMapRdd = scalaRDD3.flatMap(str => str.split(","))
Output: 1 3 5 7 9 2 4 6 8 10
Note: this is what my program prints. Internally, flatMap produces one iterator per input element (1 3 5 7 9 for the first string, 2 4 6 8 10 for the second) and flattens their contents into a single RDD, as the comparison below shows.
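To make the contrast with map concrete, here is a small sketch using the scalaRDD3 defined above (the variable names are just for illustration): map keeps one output element per input element, so splitting yields an RDD of arrays, while flatMap flattens the split results into individual strings.
scala
val mapped = scalaRDD3.map(str => str.split(","))         // RDD[Array[String]]: 2 elements, one array per input string
val flattened = scalaRDD3.flatMap(str => str.split(","))  // RDD[String]: 10 elements, the arrays' contents flattened
println(mapped.count())    // 2
println(flattened.count()) // 10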
- distinct
distinct: removes duplicate elements from the RDD
java
JavaRDD<Integer> distinct = javaRDD1.distinct();
scala
val distinctRDD = scalaRDD1.distinct()
Output: 4 6 8 2 1 3 7 9 5 (this one is also a bit special: after deduplication the elements do not come back in the order you might expect, because distinct is implemented with a shuffle, which does not preserve ordering)
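If a stable order is needed for verification, one option (a sketch, not part of the original program) is to sort after deduplicating:
scala
// Sorting after distinct gives a deterministic result for checking
val sortedDistinct = scalaRDD1.distinct().sortBy(x => x)
// collect() now returns 1 2 3 4 5 6 7 8 9 in ascending order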
- union
union: produces an RDD containing all the elements of both RDDs (the two RDDs must have the same element type). Note that, unlike a set union, duplicates are not removed, as the output below shows; see the sketch after it for set semantics.
java
JavaRDD<Integer> union = javaRDD1.union(javaRDD2);
scala
val unionRDD = scalaRDD1.union(scalaRDD2)
Output: 1 2 2 3 4 5 6 7 8 9 2 4 6 8 10
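If true set-union semantics are wanted, a common pattern (again a sketch, not part of the original code) is to chain distinct after union:
scala
// Removes the duplicates that union keeps
val setUnion = scalaRDD1.union(scalaRDD2).distinct()
// each value appears exactly once: 1 through 10 (order not guaranteed)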
- intersection
intersection: returns the elements common to both RDDs (again, the two RDDs must have the same element type). Unlike union, intersection removes duplicates, and it requires a shuffle, so it is more expensive.
java
JavaRDD<Integer> intersection = javaRDD1.intersection(javaRDD2);
scala
val intersectionRDD = scalaRDD1.intersection(scalaRDD2)
Output: 4 6 8 2
- cartesian
cartesian: the Cartesian product of two RDDs (when the two RDDs are large this is extremely expensive; use with caution!)
java
JavaPairRDD<Integer, Integer> cartesian = javaRDD1.cartesian(javaRDD2);
List<Tuple2<Integer, Integer>> collects = cartesian.collect();
for (Tuple2<Integer, Integer> collect : collects) {
    System.out.print(collect._1 + "," + collect._2 + " ");
}
System.out.println();
scala
val cartesianRdd = scalaRDD1.cartesian(scalaRDD2)
val results = cartesianRdd.collect()
for (result <- results) {
  print(result._1 + "," + result._2 + " ")
}
Output: 1,2 1,4 2,2 2,4 2,2 2,4 3,2 3,4 4,2 4,4 1,6 1,8 1,10 2,6 2,8 2,10 2,6 2,8 2,10 3,6 3,8 3,10 4,6 4,8 4,10 5,2 5,4 6,2 6,4 7,2 7,4 8,2 8,4 9,2 9,4 5,6 5,8 5,10 6,6 6,8 6,10 7,6 7,8 7,10 8,6 8,8 8,10 9,6 9,8 9,10 (this many results from just a handful of elements; imagine what happens with a large dataset. Use with caution!)
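The output size is always the product of the two input sizes, which is why it blows up so quickly. A quick sketch of the check, using the RDDs defined above:
scala
// 10 elements x 5 elements = 50 output pairs
println(scalaRDD1.cartesian(scalaRDD2).count()) // 50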
- subtract
subtract: returns the elements of the first RDD that do not appear in the second RDD
java
JavaRDD<Integer> subtract = javaRDD1.subtract(javaRDD2);
scala
val subtractRDD = scalaRDD1.subtract(scalaRDD2)
Output: 1 3 5 7 9
- sample
sample: samples the RDD. The first argument is whether to sample with replacement, and the second is the fraction of the data to sample (the result is random, and the fraction is an expected proportion, not an exact count). Example: sample roughly 10% of the RDD's data without replacement.
java
JavaRDD<Integer> sample = javaRDD1.sample(false, 0.1);
scala
val sampleRDD = scalaRDD1.sample(false, 0.1)
Output: 2 5 (this is random; the result may differ on every run)
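If reproducible sampling is needed, for example in tests, sample also accepts an optional seed argument; a small sketch (the variable name is just for illustration):
scala
// Fixing the seed makes the sample deterministic across runs
val seededSample = scalaRDD1.sample(withReplacement = false, fraction = 0.1, seed = 42L)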
IV. Complete Program Code
java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
/**
 * @author xmr
 * @date 2019/4/29 10:02
 * @description Spark RDD transform operations
 */
public class SparkTransformRdd {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("transform rdd test");
        sparkConf.setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        JavaRDD<Integer> javaRDD1 = jsc.parallelize(Arrays.asList(1, 2, 2, 3, 4, 5, 6, 7, 8, 9));
        JavaRDD<Integer> javaRDD2 = jsc.parallelize(Arrays.asList(2, 4, 6, 8, 10));
        JavaRDD<String> javaRDD3 = jsc.parallelize(Arrays.asList("1,3,5,7,9", "2,4,6,8,10"));
        mapTest(javaRDD1);
        filterTest(javaRDD1);
        flatMapTest(javaRDD3);
        distinctTest(javaRDD1);
        unionTest(javaRDD1, javaRDD2);
        cartesianTest(javaRDD1, javaRDD2);
        intersectionTest(javaRDD1, javaRDD2);
        mapToPairTest(javaRDD3);
        subtractTest(javaRDD1, javaRDD2);
        sampleTest(javaRDD1);
        jsc.stop(); // release Spark resources when done
    }
    private static void sampleTest(JavaRDD<Integer> javaRDD1) {
        JavaRDD<Integer> sample = javaRDD1.sample(false, 0.1);
        resultCollect(sample);
    }
    private static void subtractTest(JavaRDD<Integer> javaRDD1, JavaRDD<Integer> javaRDD2) {
        JavaRDD<Integer> subtract = javaRDD1.subtract(javaRDD2);
        resultCollect(subtract);
    }
    private static void mapToPairTest(JavaRDD<String> javaRDD3) {
        // intentionally left empty: pair RDD operations such as mapToPair are not covered in this post
    }
    private static void intersectionTest(JavaRDD<Integer> javaRDD1, JavaRDD<Integer> javaRDD2) {
        JavaRDD<Integer> intersection = javaRDD1.intersection(javaRDD2);
        resultCollect(intersection);
    }
    private static void cartesianTest(JavaRDD<Integer> javaRDD1, JavaRDD<Integer> javaRDD2) {
        JavaPairRDD<Integer, Integer> cartesian = javaRDD1.cartesian(javaRDD2);
        List<Tuple2<Integer, Integer>> collects = cartesian.collect();
        for (Tuple2<Integer, Integer> collect : collects) {
            System.out.print(collect._1 + "," + collect._2 + " ");
        }
        System.out.println();
    }
    private static void unionTest(JavaRDD<Integer> javaRDD1, JavaRDD<Integer> javaRDD2) {
        JavaRDD<Integer> union = javaRDD1.union(javaRDD2);
        resultCollect(union);
    }
    private static void distinctTest(JavaRDD<Integer> javaRDD1) {
        JavaRDD<Integer> distinct = javaRDD1.distinct(1);
        resultCollect(distinct);
    }
    private static void flatMapTest(JavaRDD<String> javaRDD3) {
        JavaRDD<String> javaRDD = javaRDD3.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String s) throws Exception {
                String[] strings = s.split(",");
                List<String> list = new ArrayList<String>(Arrays.asList(strings));
                return list.iterator();
            }
        });
        resultCollect(javaRDD);
    }
    private static void filterTest(JavaRDD<Integer> javaRDD1) {
        JavaRDD<Integer> filter = javaRDD1.filter(new Function<Integer, Boolean>() {
            public Boolean call(Integer integer) throws Exception {
                return integer % 2 != 0;
            }
        });
        resultCollect(filter);
    }
    private static <T> void resultCollect(JavaRDD<T> rdd) {
        List<T> collect = rdd.collect();
        for (int i = 0; i < collect.size(); i++) {
            System.out.print(collect.get(i) + " ");
        }
        System.out.println();
    }
    private static void mapTest(JavaRDD<Integer> javaRDD) {
        JavaRDD<Integer> map = javaRDD.map(new Function<Integer, Integer>() {
            public Integer call(Integer i) throws Exception {
                return i * 2;
            }
        });
        resultCollect(map);
    }
}
scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
 * @author xmr
 * @date 2019/4/29 10:32
 * @description Spark RDD transform operations (Scala version)
 */
object SparkTransformRddScala {
  // Generic over the element type, so one helper covers both RDD[Int] and RDD[String]
  def resultCollect[T](rdd: RDD[T]): Unit = {
    val results = rdd.collect()
    for (result <- results) {
      print(result + " ")
    }
    println()
  }
  def mapTest(rddScala: RDD[Int]): Unit = {
    val rdd = rddScala.map(i => i * 2)
    resultCollect(rdd)
  }
  def filterTest(rddScala: RDD[Int]): Unit = {
    val filter = rddScala.filter(i => i % 2 != 0)
    resultCollect(filter)
  }
  def flatMapTest(rddScala3: RDD[String]): Unit = {
    val flatMapRdd = rddScala3.flatMap(str => str.split(","))
    resultCollect(flatMapRdd)
  }
  def unionTest(scalaRDD1: RDD[Int], scalaRDD2: RDD[Int]): Unit = {
    val unionRDD = scalaRDD1.union(scalaRDD2)
    resultCollect(unionRDD)
  }
  def cartesianTest(scalaRDD1: RDD[Int], scalaRDD2: RDD[Int]): Unit = {
    val cartesianRdd = scalaRDD1.cartesian(scalaRDD2)
    val results = cartesianRdd.collect()
    for (result <- results) {
      print(result._1 + "," + result._2 + " ")
    }
    println()
  }
  def intersectionTest(scalaRDD1: RDD[Int], scalaRDD2: RDD[Int]): Unit = {
    val intersectionRDD = scalaRDD1.intersection(scalaRDD2)
    resultCollect(intersectionRDD)
  }
  def subtractTest(scalaRDD1: RDD[Int], scalaRDD2: RDD[Int]): Unit = {
    val subtractRDD = scalaRDD1.subtract(scalaRDD2)
    resultCollect(subtractRDD)
  }
  def sampleTest(scalaRDD1: RDD[Int]): Unit = {
    val sampleRDD = scalaRDD1.sample(false, 0.1)
    resultCollect(sampleRDD)
  }
  def distinctTest(scalaRDD1: RDD[Int]): Unit = {
    val distinctRDD = scalaRDD1.distinct()
    resultCollect(distinctRDD)
  }
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local").setAppName("transform rdd test scala")
    val sparkContext = new SparkContext(sparkConf)
    val scalaRDD1 = sparkContext.parallelize(List(1, 2, 2, 3, 4, 5, 6, 7, 8, 9))
    val scalaRDD2 = sparkContext.parallelize(List(2, 4, 6, 8, 10))
    val scalaRDD3 = sparkContext.parallelize(List("1,3,5,7,9", "2,4,6,8,10"))
    mapTest(scalaRDD1)
    filterTest(scalaRDD1)
    flatMapTest(scalaRDD3)
    distinctTest(scalaRDD1)
    unionTest(scalaRDD1, scalaRDD2)
    cartesianTest(scalaRDD1, scalaRDD2)
    intersectionTest(scalaRDD1, scalaRDD2)
    subtractTest(scalaRDD1, scalaRDD2)
    sampleTest(scalaRDD1)
    sparkContext.stop() // release Spark resources when done
  }
}