Common Spark RDD Operators (Scala and Java Versions Compared)

parallelize

Call parallelize() on the SparkContext to turn an existing collection into an RDD. This approach is suited to learning Spark and writing small Spark tests.

Scala version

def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]

- The first parameter is a Seq collection

- The second parameter is the number of partitions

- The return value is an RDD[T]

val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7))

Java version

def parallelize[T](list : java.util.List[T], numSlices : scala.Int) : org.apache.spark.api.java.JavaRDD[T] = { /* compiled code */ }

- The first parameter is a List collection

- The second parameter is the number of partitions, which can be left at the default

- The return value is a JavaRDD[T]

The Java version only accepts a List collection.

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

makeRDD

Only the Scala API has makeRDD.

def makeRDD[T](seq : scala.Seq[T], numSlices : scala.Int = { /* compiled code */ })

It behaves like parallelize.

val rdd: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,7))

textFile

Call the SparkContext.textFile() method to create an RDD by reading data from external storage.

For example, I have a word.txt file under my local input directory with some arbitrary content, and I want to read that content into an RDD.

Scala version

val rdd: RDD[String] = sc.textFile("in/word.txt")

Java version

JavaRDD<String> stringJavaRDD = sc.textFile("in/word.txt");

Note: textFile supports setting the number of partitions and supports path patterns, e.g. turning all .txt files under the in directory into an RDD:

var lines = sc.textFile("in/*.txt")

Multiple paths can be separated with commas, for example:

var lines = sc.textFile("dir1,dir2",3)

filter

For example, the contents of the sample.txt file are as follows:

aa bb cc aa aa aa dd dd ee ee ee ee 
ff aa bb zks
ee kks
ee  zz zks

I want to find the lines that contain zks.

Scala version

val rdd: RDD[String] = sc.textFile("in/sample.txt")
val rdd2: RDD[String] = rdd.filter(x=>x.contains("zks"))
rdd2.foreach(println)

Java version

JavaRDD<String> rdd2 = sc.textFile("in/sample.txt");
JavaRDD<String> filterRdd = rdd2.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String v1) throws Exception {
        return v1.contains("zks");
    }
});
List<String> collect3 = filterRdd.collect();
for (String s : collect3) {
    System.out.println(s);
}

map

map() takes a function, applies it to every element of the RDD, and uses the function's return value as the corresponding element of the result RDD.

map is a one-to-one relationship: one input element produces exactly one output element.

Scala version

//read the data
scala> val lines = sc.textFile("F:\\sparktest\\sample.txt")
//use map: split each line on whitespace into an array; the mapping is one-to-one
scala> var mapRDD = lines.map(line => line.split("\\s+"))
scala> mapRDD.collect
---------------output-----------
res0: Array[Array[String]] = Array(Array(aa, bb, cc, aa, aa, aa, dd, dd, ee, ee, ee, ee), Array(ff, aa, bb, zks), Array(ee, kks), Array(ee, zz, zks))

//read the first element
scala> mapRDD.first
---output----
res1: Array[String] = Array(aa, bb, cc, aa, aa, aa, dd, dd, ee, ee, ee, ee)

Java version

JavaRDD<String> stringJavaRDD = sc.textFile("in/sample.txt");
JavaRDD<Iterable<String>> mapRdd = stringJavaRDD.map(new Function<String, Iterable<String>>() {
    @Override
    public Iterable<String> call(String v1) throws Exception {
        String[] split = v1.split(" ");
        return Arrays.asList(split);
    }
});
List<Iterable<String>> collect = mapRdd.collect();
for (Iterable<String> iterable : collect) {
    for (String s : iterable) {
        System.out.println(s);
    }
}

System.out.println(mapRdd.first());

flatMap

Sometimes we want to produce multiple output elements from a single input element; the operation that does this is called flatMap().

The function passed to flatMap is applied to every element and, for each element, returns an iterator over multiple elements (for more details, see the usage of Scala's flatMap and map).

For example, we split the data into words.

Scala version

val rdd: RDD[String] = sc.textFile("in/sample.txt")
rdd.flatMap(x=>x.split(" ")).foreach(println)

Java version, Spark below 2.0

    JavaRDD<String> lines = sc.textFile("/in/sample.txt");
    JavaRDD<String> flatMapRDD = lines.flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String s) throws Exception {
            String[] split = s.split("\\s+");
            return Arrays.asList(split);
        }
    });
    //print the first element
    System.out.println(flatMapRDD.first());
------------output----------
aa

Java version, Spark 2.0 and above

In Spark 2.0 and above the flatMap method changed slightly: the function now returns an Iterator instead of an Iterable.

JavaRDD<String> flatMapRdd = stringJavaRDD.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterator<String> call(String s) throws Exception {
        String[] split = s.split("\\s+");
        return Arrays.asList(split).iterator();
    }
});
List<String> collect = flatMapRdd.collect();
for (String s : collect) {
    System.out.println(s);
}

distinct

distinct removes duplicates. An RDD we build may contain repeated elements, and distinct drops them. Note that this operation involves a shuffle, so it is expensive.

Scala version

val rdd: RDD[Int] = sc.parallelize(List(1,1,1,2,3,4,5,6))
val rdd2: RDD[Int] = rdd.distinct()
rdd2.collect.foreach(println)
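
Since distinct involves a shuffle, the number of result partitions can also be passed explicitly. A minimal sketch (reusing the rdd defined above; the partition count 2 is arbitrary):

// distinct with an explicit number of output partitions, which also sets the shuffle parallelism
val rdd3: RDD[Int] = rdd.distinct(2)
rdd3.collect.foreach(println)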

Java version

JavaRDD<String> javaRDD = sc.parallelize(Arrays.asList("aa", "aa", "cc", "dd"));
JavaRDD<String> distinctRdd = javaRDD.distinct();
List<String> collect = distinctRdd.collect();
for (String s : collect) {
    System.out.println(s);
}

union

Merges two RDDs into one; duplicates are kept.

Scala version

val rdd1: RDD[Int] = sc.parallelize(List(1,1,1,1))
val rdd2: RDD[Int] = sc.parallelize(List(2,2,2,2))
val rdd3: RDD[Int] = rdd1.union(rdd2)
rdd3.collect.foreach(println)

Java version

JavaRDD<String> javaRDD = sc.parallelize(Arrays.asList("aa", "aa", "cc", "dd"));
JavaRDD<String> javaRDD2 = sc.parallelize(Arrays.asList("aa", "aa", "cc", "dd"));
JavaRDD<String> unionRdd = javaRDD.union(javaRDD2);
List<String> collect = unionRdd.collect();
for (String s : collect) {
    System.out.print(s+",");
}

intersection

RDD1.intersection(RDD2) returns the intersection of the two RDDs, with duplicates removed.

intersection needs to shuffle the data, so it is relatively expensive.

Scala version

val rdd1: RDD[String] = sc.parallelize(List("aa","aa","bb","cc"))
val rdd2: RDD[String] = sc.parallelize(List("aa","aa","bb","ff"))
val intersectionRdd: RDD[String] = rdd1.intersection(rdd2)
intersectionRdd.collect.foreach(println)

Java version

JavaRDD<String> javaRDD = sc.parallelize(Arrays.asList("aa", "aa", "cc", "dd"));
JavaRDD<String> javaRDD2 = sc.parallelize(Arrays.asList("aa", "aa", "cc", "ff"));
List<String> collect = javaRDD.intersection(javaRDD2).collect();
for (String s : collect) {
    System.out.print(s+",");
}

subtract

RDD1.subtract(RDD2) returns the elements that appear in RDD1 but not in RDD2, without deduplication.

Scala version

val rdd1: RDD[String] = sc.parallelize(List("aa","aa","bb","cc"))
val rdd2: RDD[String] = sc.parallelize(List("aa","aa","bb","ff"))
val subtractRdd: RDD[String] = rdd1.subtract(rdd2)
subtractRdd.collect.foreach(println)

Java version

JavaRDD<String> javaRDD = sc.parallelize(Arrays.asList("aa", "aa", "cc", "dd"));
JavaRDD<String> javaRDD2 = sc.parallelize(Arrays.asList("aa", "aa", "cc", "ff"));
List<String> collect = javaRDD.subtract(javaRDD2).collect();
for (String s : collect) {
    System.out.print(s+",");
}

cartesian

RDD1.cartesian(RDD2) returns the Cartesian product of RDD1 and RDD2; this is very expensive.

Scala version

val rdd1: RDD[String] = sc.parallelize(List("aa","aa","bb","cc"))
val rdd2: RDD[String] = sc.parallelize(List("aa","aa","bb","ff"))
val rdd3: RDD[(String, String)] = rdd1.cartesian(rdd2)
rdd3.collect.foreach(println)

Java version

JavaRDD<String> javaRDD = sc.parallelize(Arrays.asList("1","2","3"));
JavaRDD<String> javaRDD2 = sc.parallelize(Arrays.asList("aa", "aa", "cc", "ff"));
List<Tuple2<String, String>> collect = javaRDD.cartesian(javaRDD2).collect();
for (Tuple2<String, String> tuple2 : collect) {
    System.out.println(tuple2);
}

mapToPair

Create a pair RDD using the first word of each line as the key and 1 as the value.

Scala version

Scala has no mapToPair function; in Scala a plain map is enough.

val rdd: RDD[String] = sc.textFile("in/sample.txt")
val rdd2: RDD[(String, Int)] = rdd.map(x=>(x.split(" ")(0),1))
rdd2.collect.foreach(println)

Java version

JavaRDD<String> javaRDD = sc.textFile("in/sample.txt");
JavaPairRDD<String, Integer> mapToPair = javaRDD.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) throws Exception {
        String key = s.split(" ")[0];
        return new Tuple2<>(key, 1);
    }
});
List<Tuple2<String, Integer>> collect = mapToPair.collect();
for (Tuple2<String, Integer> tuple2 : collect) {
    System.out.println(tuple2);
}

flatMapToPair

mapToPair is one-to-one: one element in, one element out. flatMapToPair can return multiple elements per input element; it is equivalent to doing a flatMap and then a mapToPair.

Example: turn every word into a (word, 1) pair.

Scala version

val rdd1: RDD[String] = sc.textFile("in/sample.txt")
val flatRdd: RDD[String] = rdd1.flatMap(x=>x.split(" "))
val pairs: RDD[(String, Int)] = flatRdd.map(x=>(x,1))
pairs.collect.foreach(println)

Java version, Spark below 2.0

JavaPairRDD<String, Integer> wordPairRDD = lines.flatMapToPair(new PairFlatMapFunction<String, String, Integer>() {
    @Override
    public Iterable<Tuple2<String, Integer>> call(String s) throws Exception {
        ArrayList<Tuple2<String, Integer>> tpLists = new ArrayList<Tuple2<String, Integer>>();
        String[] split = s.split("\\s+");
        for (int i = 0; i < split.length; i++) {
            Tuple2 tp = new Tuple2<String, Integer>(split[i], 1);
            tpLists.add(tp);
        }
        return tpLists;
    }
});

Java version, Spark 2.0 and above

Again the main difference is Iterator vs. Iterable.

JavaRDD<String> javaRDD = sc.textFile("in/sample.txt");
JavaPairRDD<String, Integer> flatMapToPair = javaRDD.flatMapToPair(new PairFlatMapFunction<String, String, Integer>() {
    @Override
    public Iterator<Tuple2<String, Integer>> call(String s) throws Exception {
        ArrayList<Tuple2<String, Integer>> list = new ArrayList<>();
        String[] split = s.split(" ");
        for (int i = 0; i < split.length; i++) {
            String key = split[i];
            Tuple2<String, Integer> tuple2 = new Tuple2<>(key, 1);
            list.add(tuple2);
        }
        return list.iterator();
    }
});
List<Tuple2<String, Integer>> collect = flatMapToPair.collect();
for (Tuple2<String, Integer> tuple2 : collect) {
    System.out.println("key "+tuple2._1+" value "+tuple2._2);
}

combineByKey

Aggregating data is easy when it all lives in one place, but how do we do it on a distributed dataset? This section introduces combineByKey, the ancestor of the various aggregation operators; it is worth understanding well (see the Scala API).

Brief overview

def combineByKey[C](createCombiner: (V) => C,
                    mergeValue: (C, V) => C,
                    mergeCombiners: (C, C) => C): RDD[(K, C)]
  • createCombiner: combineByKey() walks through all the elements of a partition, so each element's key has either not been seen yet or is the same as some earlier element's key. If the key is new, combineByKey() calls the createCombiner() function to create the initial value of the accumulator for that key.
  • mergeValue: if the key has already been seen while processing the current partition, the mergeValue() function merges the accumulator's current value for that key with the new value.
  • mergeCombiners: since each partition is processed independently, there can be multiple accumulators for the same key. If two or more partitions have an accumulator for the same key, the user-supplied mergeCombiners() function merges the per-partition results.
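
As a minimal sketch (not part of the original example) of how the three functions line up, counting occurrences per key could be written like this:

val pairs: RDD[(String, Int)] = sc.parallelize(List(("a", 1), ("a", 1), ("b", 1)))
val counts: RDD[(String, Int)] = pairs.combineByKey(
  (v: Int) => v,                          // createCombiner: the first value seen for a key becomes the initial accumulator
  (acc: Int, v: Int) => acc + v,          // mergeValue: fold further values of the same key into the accumulator
  (acc1: Int, acc2: Int) => acc1 + acc2   // mergeCombiners: merge accumulators of the same key from different partitions
)
counts.collect.foreach(println)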

Example: computing students' average scores

Here is an example that computes students' average scores. It is adapted from https://www.edureka.co/blog/apache-spark-combinebykey-explained (GitHub source); I walk through it below.

Create a class describing a student's score:

case class ScoreDetail(studentName:String,subject:String,score:Float)

Below is some test data. Load it as a collection where key = student name and value = a ScoreDetail instance.

val scores = List(
  ScoreDetail("xiaoming", "Math", 98),
  ScoreDetail("xiaoming", "English", 88),
  ScoreDetail("wangwu", "Math", 75),
  ScoreDetail("wangwu", "English", 78),
  ScoreDetail("lihua", "Math", 90),
  ScoreDetail("lihua", "English", 80),
  ScoreDetail("zhangsan", "Math", 91),
  ScoreDetail("zhangsan", "English", 80))

Convert the collection into tuples (you can also think of it as turning it into a map), using a for/yield comprehension:

val scoresWithKey = for { i <- scores } yield (i.studentName, i)

Create the RDD and specify three partitions:

val scoresWithKeyRDD: RDD[(String, ScoreDetail)] = sc.parallelize(scoresWithKey).partitionBy(new HashPartitioner(3)).cache()

Print some of the data in each partition:

scoresWithKeyRDD.foreachPartition(partitions=>{
      partitions.foreach(x=>println(x._1,x._2.subject,x._2.score))
    })

Aggregate to compute the averages, then print them:

val avgScoresRdd: RDD[(String, Float)] = scoresWithKeyRDD.combineByKey(
  (x: ScoreDetail) => (x.score, 1),
  (acc: (Float, Int), x: ScoreDetail) => (acc._1 + x.score, acc._2 + 1),
  (acc1: (Float, Int), acc2: (Float, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map({ case (key, value) => (key, value._1 / value._2) })

avgScoresRdd.collect.foreach(println)

An explanation of scoresWithKeyRDD.combineByKey:

createCombiner: (x: ScoreDetail) => (x.score, 1)

This runs the first time a key such as zhangsan is seen: the function converts the value to another type, here turning (zhangsan, ScoreDetail) into (zhangsan, (91, 1)).

mergeValue: (acc: (Float, Int), x: ScoreDetail) => (acc._1 + x.score, acc._2 + 1). When zhangsan is met again the two are merged: a value of the form (zhangsan, (91, 1)) is combined with the new (zhangsan, ScoreDetail), giving (zhangsan, (171, 2)).

mergeCombiners: (acc1: (Float, Int), acc2: (Float, Int)) merges zhangsan's accumulators from multiple partitions; in this example zhangsan's records end up in the same partition, so this step is not exercised.

Java version

The ScoreDetail class

package nj.zb.CombineByKey;

import java.io.Serializable;


public class ScoreDetailsJava implements Serializable {
    public String stuName;
    public Integer score;
    public String subject;

    public ScoreDetailsJava(String stuName,  String subject,Integer score) {
        this.stuName = stuName;
        this.score = score;
        this.subject = subject;
    }
}

The CombineByKey test class

package nj.zb.CombineByKey;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Map;


public class CombineByKey {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cby").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        ArrayList<ScoreDetailsJava> scoreDetails = new ArrayList<>();
        scoreDetails.add(new ScoreDetailsJava("xiaoming", "Math", 98));
        scoreDetails.add(new ScoreDetailsJava("xiaoming", "English", 88));
        scoreDetails.add(new ScoreDetailsJava("wangwu", "Math", 75));
        scoreDetails.add(new ScoreDetailsJava("wangwu", "English", 78));
        scoreDetails.add(new ScoreDetailsJava("lihua", "Math", 90));
        scoreDetails.add(new ScoreDetailsJava("lihua", "English", 80));
        scoreDetails.add(new ScoreDetailsJava("zhangsan", "Math", 91));
        scoreDetails.add(new ScoreDetailsJava("zhangsan", "English", 80));

        JavaRDD<ScoreDetailsJava> scoreDetailsRDD = sc.parallelize(scoreDetails);
        JavaPairRDD<String, ScoreDetailsJava> pairRdd = scoreDetailsRDD.mapToPair(new PairFunction<ScoreDetailsJava, String, ScoreDetailsJava>() {
            @Override
            public Tuple2<String, ScoreDetailsJava> call(ScoreDetailsJava scoreDetailsJava) throws Exception {
                return new Tuple2<>(scoreDetailsJava.stuName, scoreDetailsJava);
            }
        });

        //createCombine
        Function<ScoreDetailsJava, Tuple2<Integer, Integer>> createCombine = new Function<ScoreDetailsJava, Tuple2<Integer, Integer>>() {
            @Override
            public Tuple2<Integer, Integer> call(ScoreDetailsJava v1) throws Exception {
                return new Tuple2<>(v1.score, 1);
            }
        };
        //mergeValue
        Function2<Tuple2<Integer, Integer>, ScoreDetailsJava, Tuple2<Integer, Integer>> mergeValue = new Function2<Tuple2<Integer, Integer>, ScoreDetailsJava, Tuple2<Integer, Integer>>() {
            @Override
            public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> v1, ScoreDetailsJava v2) throws Exception {
                return new Tuple2<>(v1._1 + v2.score, v1._2 + 1);
            }
        };
        //mergeCombiners
        Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>> mergeCombiners = new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
            @Override
            public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> v1, Tuple2<Integer, Integer> v2) throws Exception {
                return new Tuple2<>(v1._1 + v2._1, v1._2 + v2._2);
            }
        };

        JavaPairRDD<String, Tuple2<Integer, Integer>> combineByRdd = pairRdd.combineByKey(createCombine, mergeValue, mergeCombiners);

        Map<String, Tuple2<Integer, Integer>> stringTuple2Map = combineByRdd.collectAsMap();

        for (String s : stringTuple2Map.keySet()) {
            System.out.println(s+":"+stringTuple2Map.get(s)._1/stringTuple2Map.get(s)._2);
        }

    }
}

reduceByKey

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

Takes a function and reduces the values that share the same key, similar to Scala's reduce.

For example, reducing the RDD {(1, 2), (3, 4), (3, 6)} by key gives {(1, 2), (3, 10)}:

val rdd: RDD[(Int, Int)] = sc.parallelize(List((1,2),(3,4),(3,6)))
val rdd2: RDD[(Int, Int)] = rdd.reduceByKey((x,y)=>{println("one:"+x+"two:"+y);x+y})
rdd2.collect.foreach(println)

Another example:

Word count

The contents of sample.txt are as follows:

aa bb cc aa aa aa dd dd ee ee ee ee 
ff aa bb zks
ee kks
ee  zz zks

Scala version

val rdd: RDD[String] = sc.textFile("in/sample.txt")
rdd.flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).foreach(println)

Java version

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.*;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;


public class RddJava {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddJava").setMaster("local[1]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> javaRDD = sc.textFile("in/sample.txt");

        PairFlatMapFunction<String, String, Integer> pairFlatMapFunction = new PairFlatMapFunction<String, String, Integer>(){
            @Override
            public Iterator<Tuple2<String, Integer>> call(String s) throws Exception {
                String[] split = s.split("\\s+");
                ArrayList<Tuple2<String, Integer>> list = new ArrayList<>();
                for (String str : split) {
                    Tuple2<String, Integer> tuple2 = new Tuple2<>(str, 1);
                    list.add(tuple2);
                }
                return list.iterator();
            }
        };

        Function2<Integer, Integer, Integer> function2 = new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        };

        JavaPairRDD<String, Integer> javaPairRDD = javaRDD.flatMapToPair(pairFlatMapFunction).reduceByKey(function2);

        List<Tuple2<String, Integer>> collect = javaPairRDD.collect();

        for (Tuple2<String, Integer> tuple2 : collect) {
            System.out.println(tuple2);
        }


    }
}

foldByKey

def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]

def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]

def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)]

This function folds and merges the V values of an RDD[K, V] by key. The zeroValue parameter means the folding function is first applied to zeroValue and a value to initialize it, and the folding function is then applied to the remaining values.
For foldByKey you can refer to my earlier introduction to Scala's fold.

Unlike reduceByKey, the first element that foldByKey starts folding from is not the first element of the collection but the supplied zeroValue.
See LXW's blog for reference. Scala example:

val rdd: RDD[(String, Int)] = sc.parallelize(List(("a",2),("a",3),("b",3)))
rdd.foldByKey(0)((x,y)=>{println("one:"+x+"two:"+y);x+y}).collect.foreach(println)
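
To make the role of zeroValue visible, here is a small sketch that uses 10 as the zero value and forces a single partition so the result is easy to predict:

// with one partition: "a" -> 10 + 2 + 3 = 15, "b" -> 10 + 3 = 13
val rdd2: RDD[(String, Int)] = sc.parallelize(List(("a", 2), ("a", 3), ("b", 3)), 1)
rdd2.foldByKey(10)(_ + _).collect.foreach(println)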

sortByKey

  def sortByKey(ascending : scala.Boolean = { /* compiled code */ }, numPartitions : scala.Int = { /* compiled code */ }) : org.apache.spark.rdd.RDD[scala.Tuple2[K, V]] = { /* compiled code */ }

sortByKey sorts a pair RDD by key. The first parameter can be set to true or false; the default is true (ascending).

Scala example

val rdd: RDD[(Int, Int)] = sc.parallelize(Array((3, 4),(1, 2),(4,4),(2,5), (6,5), (5, 6)))
rdd.sortByKey().collect.foreach(println)
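
Passing false as the first argument sorts in descending order; a minimal sketch with the same rdd:

// descending order; a second argument (numPartitions) could also be supplied
rdd.sortByKey(false).collect.foreach(println)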

groupByKey

def groupByKey(): RDD[(K, Iterable[V])]

def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

groupByKey groups an RDD[key, value] by key into the form RDD[key, Iterable[value]], somewhat like SQL's GROUP BY, e.g. similar to MySQL's group_concat.

In this example we group students' scores.

Scala version

val scoreDetail = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val scoreGroup: RDD[(String, Iterable[Int])] = scoreDetail.groupByKey()
scoreGroup.collect.foreach(x=>{
  x._2.foreach(y=>{
    println(x._1,y)
  })
})

Java version

JavaRDD<Tuple2<String, Integer>> scoreDetail = sc.parallelize(Arrays.asList(new Tuple2<>("xiaoming", 75),
        new Tuple2<>("xiaoming", 90),
        new Tuple2<>("lihua", 95),
        new Tuple2<>("lihua", 100),
        new Tuple2<>("xiaofeng", 85)));
JavaPairRDD<String, Integer> scoreMapRdd = JavaPairRDD.fromJavaRDD(scoreDetail);
Map<String, Iterable<Integer>> collect = scoreMapRdd.groupByKey().collectAsMap();
for (String s : collect.keySet()) {
    for (Integer score : collect.get(s)) {
        System.out.println(s+":"+score);
    }
}

cogroup

groupByKey groups the data within a single RDD; the cogroup() function goes further and groups multiple RDDs that share the same keys.
For example, RDD1.cogroup(RDD2) groups RDD1 and RDD2 by their common keys, producing results of the form (key, (Iterable[value1], Iterable[value2])).
cogroup can also group more than two RDDs: RDD1.cogroup(RDD2, RDD3, ..., RDDN) produces (key, (Iterable[value1], Iterable[value2], Iterable[value3], ..., Iterable[valueN])); a three-RDD sketch follows the Scala example below.
Case: scoreDetail holds the scores of each student's best subjects, scoreDetail2 the scores they barely passed, and scoreDetail3 the scores of the subjects they failed; we want to group each student's best, barely-passed and failed scores together (the code below uses two of the datasets).

Scala version

val rdd = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val rdd1 = sc.parallelize(List(("xiaoming",85),("xiaoming",95),("lisi",95),("lisi",100),("xiaofeng",90)))
val coRdd: RDD[(String, (Iterable[Int], Iterable[Int]))] = rdd.cogroup(rdd1)
coRdd.foreach(println)
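
As mentioned above, cogroup also accepts more than one other RDD. A minimal sketch with a third RDD (rdd2 below is made up for illustration):

val rdd2 = sc.parallelize(List(("xiaoming", 60), ("lihua", 55)))
val coRdd3: RDD[(String, (Iterable[Int], Iterable[Int], Iterable[Int]))] = rdd.cogroup(rdd1, rdd2)
coRdd3.foreach(println)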

Java version

JavaRDD<Tuple2<String, Integer>> scoreDetail = sc.parallelize(Arrays.asList(new Tuple2<>("xiaoming", 75),
        new Tuple2<>("xiaoming", 90),
        new Tuple2<>("lihua", 95),
        new Tuple2<>("lihua", 100),
        new Tuple2<>("xiaofeng", 85)));
JavaRDD<Tuple2<String, Integer>> scoreDetail2 = sc.parallelize(Arrays.asList(
        new Tuple2<>("lisi", 90),
        new Tuple2<>("lihua", 95),
        new Tuple2<>("lihua", 100),
        new Tuple2<>("xiaomi", 85)
));
JavaPairRDD<String, Integer> rdd1 = JavaPairRDD.fromJavaRDD(scoreDetail);
JavaPairRDD<String, Integer> rdd2 = JavaPairRDD.fromJavaRDD(scoreDetail2);
JavaPairRDD<String, Tuple2<Iterable<Integer>, Iterable<Integer>>> cogroupRdd = rdd1.cogroup(rdd2);
Map<String, Tuple2<Iterable<Integer>, Iterable<Integer>>> myMap = cogroupRdd.collectAsMap();
Set<String> keys = myMap.keySet();
for (String key : keys) {
    Tuple2<Iterable<Integer>, Iterable<Integer>> tuple2 = myMap.get(key);
    System.out.println(key+":"+tuple2);
}

subtractByKey

Function definitions

def subtractByKey[W](other: RDD[(K, W)])(implicit arg0: ClassTag[W]): RDD[(K, V)]

def subtractByKey[W](other: RDD[(K, W)], numPartitions: Int)(implicit arg0: ClassTag[W]): RDD[(K, V)]

def subtractByKey[W](other: RDD[(K, W)], p: Partitioner)(implicit arg0: ClassTag[W]): RDD[(K, V)]

Similar to subtract: removes those elements of the RDD whose keys also appear in the other RDD.

val rdd = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val rdd1 = sc.parallelize(List(("xiaoming",85),("xiaoming",95),("lisi",95),("lisi",100),("xiaofeng",90)))
rdd.subtractByKey(rdd1).collect.foreach(println)
(lihua,95)
(lihua,100)

join

Function definitions

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

RDD1.join(RDD2)

Joins RDD1 and RDD2 on their common keys, similar to a SQL join.

val rdd = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val rdd1 = sc.parallelize(List(("xiaoming",85),("xiaoming",95),("lisi",95),("lisi",100),("xiaofeng",90)))
rdd.join(rdd1).collect.foreach(println)
(xiaoming,(75,85))
(xiaoming,(75,95))
(xiaoming,(90,85))
(xiaoming,(90,95))
(xiaofeng,(85,90))

fullOuterJoin

Similar to join, but this is a full outer join.

val rdd = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val rdd1 = sc.parallelize(List(("xiaoming",85),("xiaoming",95),("lisi",95),("lisi",100),("xiaofeng",90)))
rdd.fullOuterJoin(rdd1).collect.foreach(println)
(lihua,(Some(95),None))
(lihua,(Some(100),None))
(xiaoming,(Some(75),Some(85)))
(xiaoming,(Some(75),Some(95)))
(xiaoming,(Some(90),Some(85)))
(xiaoming,(Some(90),Some(95)))
(lisi,(None,Some(95)))
(lisi,(None,Some(100)))
(xiaofeng,(Some(85),Some(90)))

leftOuterJoin

def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]

def leftOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, Option[W]))]

def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))]

Joins the two RDDs, similar to a SQL left outer join: every key from the left RDD is kept, and a key missing from the right RDD gets None as its right-hand value.

val rdd = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val rdd1 = sc.parallelize(List(("xiaoming",85),("xiaoming",95),("lisi",95),("lisi",100),("xiaofeng",90)))
rdd.leftOuterJoin(rdd1).collect.foreach(println)
(lihua,(95,None))
(lihua,(100,None))
(xiaoming,(75,Some(85)))
(xiaoming,(75,Some(95)))
(xiaoming,(90,Some(85)))
(xiaoming,(90,Some(95)))
(xiaofeng,(85,Some(90)))

rightOuterJoin

Joins the two RDDs, similar to a SQL right outer join: when a key exists on the left its value is wrapped in Some, otherwise None is used; see the code and output below.

val rdd = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val rdd1 = sc.parallelize(List(("xiaoming",85),("xiaoming",95),("lisi",95),("lisi",100),("xiaofeng",90)))
rdd.rightOuterJoin(rdd1).collect.foreach(println)
(xiaoming,(Some(75),85))
(xiaoming,(Some(75),95))
(xiaoming,(Some(90),85))
(xiaoming,(Some(90),95))
(lisi,(None,95))
(lisi,(None,100))
(xiaofeng,(Some(85),90))

Java version

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.Optional;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Map;


public class RddJava {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[3]").setAppName("join");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(
                new Tuple2<>(1, 2),
                new Tuple2<>(2, 9),
                new Tuple2<>(3, 8),
                new Tuple2<>(4, 10),
                new Tuple2<>(5, 20)
        ));
        JavaRDD<Tuple2<Integer, Integer>> javaRDD1 = sc.parallelize(Arrays.asList(
                new Tuple2<>(3, 15),
                new Tuple2<>(4, 19),
                new Tuple2<>(5, 20),
                new Tuple2<>(6, 2),
                new Tuple2<>(7, 23)
        ));
        //convert the JavaRDD into a JavaPairRDD
        JavaPairRDD<Integer, Integer> rdd = JavaPairRDD.fromJavaRDD(javaRDD);
        JavaPairRDD<Integer, Integer> other = JavaPairRDD.fromJavaRDD(javaRDD1);

        //subtractByKey
        JavaPairRDD<Integer, Integer> subtractByKey = rdd.subtractByKey(other);
        Map<Integer, Integer> subMap = subtractByKey.collectAsMap();
        System.out.println("----substractByKey----");
        for (Integer key : subMap.keySet()) {
            System.out.println(key+":"+subMap.get(key));
        }

        //join
        JavaPairRDD<Integer, Tuple2<Integer, Integer>> join = rdd.join(other);
        Map<Integer, Tuple2<Integer, Integer>> joinMap = join.collectAsMap();
        System.out.println("----join----");
        for (Integer key : joinMap.keySet()) {
            System.out.println(key+":"+joinMap.get(key));
        }

        //leftoutjoin
        JavaPairRDD<Integer, Tuple2<Integer, Optional<Integer>>> leftoutjoin = rdd.leftOuterJoin(other);
        Map<Integer, Tuple2<Integer, Optional<Integer>>> leftjoinMap = leftoutjoin.collectAsMap();
        System.out.println("----leftjoin----");
        for (Integer key : leftjoinMap.keySet()) {
            System.out.println(key+":"+leftjoinMap.get(key));
        }

        //rightoutjoin
        JavaPairRDD<Integer, Tuple2<Optional<Integer>, Integer>> rightOuterJoin = rdd.rightOuterJoin(other);
        Map<Integer, Tuple2<Optional<Integer>, Integer>> rightooutMap = rightOuterJoin.collectAsMap();
        System.out.println("----rightjoin----");
        for (Integer key : rightooutMap.keySet()) {
            System.out.println(key+":"+rightooutMap.get(key));
        }
    }
}
----substractByKey----
2:9
1:2
----join----
5:(20,20)
4:(10,19)
3:(8,15)
----leftjoin----
2:(9,Optional.empty)
5:(20,Optional[20])
4:(10,Optional[19])
1:(2,Optional.empty)
3:(8,Optional[15])
----rightjoin----
5:(Optional[20],20)
4:(Optional[10],19)
7:(Optional.empty,23)
3:(Optional[8],15)
6:(Optional.empty,2)

first

Returns the first element.

scala

val rdd: RDD[Int] = sc.parallelize(List(1,2,4,5,6,7,8,9))
println(rdd.first())

java

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
System.out.println(javaRDD.first());

take

rdd.take(n) returns the first n elements.

scala

val rdd: RDD[Int] = sc.parallelize(List(1,2,4,5,6,7,8,9))
println(rdd.take(3).toList)

java

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
System.out.println(javaRDD.take(3));

collect

rdd.collect() returns all the elements of the RDD.

scala

val rdd: RDD[Int] = sc.parallelize(List(1,2,4,5,6,7,8,9))
rdd.collect().foreach(println)

java

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
System.out.println(javaRDD.collect());

count

rdd.count() returns the number of elements in the RDD.

scala

val rdd: RDD[Int] = sc.parallelize(List(1,2,4,5,6,7,8,9))
println(rdd.count())

java

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
System.out.println(javaRDD.count());

countByValue

Returns how many times each element occurs in the RDD, as {(value1, count1), (value2, count2), ..., (valueN, countN)}.

scala

val rdd: RDD[Int] = sc.parallelize(List(1,2,2,5,6,7,8,9))
rdd.countByValue().foreach(println)

java

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
System.out.println(javaRDD.countByValue());

reduce

rdd.reduce(func)

Aggregates all the data in the RDD in parallel, similar to reduce on a Scala collection.

scala

val rdd: RDD[Int] = sc.parallelize(List(1,2,2,5,6,7,8,9))
println(rdd.reduce(_ + _))

java

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
Integer res = javaRDD.reduce(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
System.out.println(res);

aggregate

Similar to reduce(), but it can return a result of a different type than the RDD's elements; this function is not used very often. It takes a zero value plus a seqOp (folds an element into the accumulator within a partition) and a combOp (merges accumulators across partitions).

scala

val rdd: RDD[Int] = sc.parallelize(List(1,2,2,5,6,7,8,9))
val i: Int = rdd.aggregate(5)(_+_,_+_)
println(i)

java

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));


Function2<Integer, Integer, Integer> seqop = new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
};
Function2<Integer, Integer, Integer> comop = new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
};

Integer res = javaRDD.aggregate(5, seqop, comop);
System.out.println(res);
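
To illustrate the point that aggregate can return a different type than the element type, here is a small Scala sketch (not from the original article) that computes the average as a (sum, count) pair:

val nums: RDD[Int] = sc.parallelize(List(1, 2, 2, 5, 6, 7, 8, 9))
// the accumulator is a (sum, count) pair, a different type from the Int elements
val (sum, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),    // seqOp: fold one element into the partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)     // combOp: merge accumulators from different partitions
)
println(sum.toDouble / count)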

fold

rdd.fold(num)(func); this function is not used very often.

Like reduce(), but an initial value num is supplied, and the folding starts from that initial value. Note that the fold is done per partition first, and the partition results are then folded again, so the initial value is applied once per partition plus once in the final merge (for the example below the result is 28 + 5 × (numPartitions + 1)).

scala

val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7))
val i: Int = rdd.fold(5)(_+_)
println(i)

java

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
Integer res = javaRDD.fold(5, new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
System.out.println(res);

top

rdd.top(n)

Returns the first n elements in descending order, or according to a specified ordering.

scala

val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7))
val res: Array[Int] = rdd.top(3)
println(res.toList)
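
The "specified ordering" case mentioned above can be exercised by passing an Ordering explicitly; a minimal sketch (reversing the ordering makes top return the 3 smallest elements):

// top takes an implicit Ordering; reversing it turns "largest n" into "smallest n"
val smallest: Array[Int] = rdd.top(3)(Ordering[Int].reverse)
println(smallest.toList)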

java

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
List<Integer> res = javaRDD.top(5);
for (Integer re : res) {
    System.out.println(re);
}

takeOrdered

rdd.takeOrdered(n)

Sorts the RDD's elements in ascending order and returns the first n; a custom comparator can also be supplied (not covered here). It is roughly the opposite of top.

scala

val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7))
val res: Array[Int] = rdd.takeOrdered(2)
println(res.toList)

java

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
List<Integer> res = javaRDD.takeOrdered(2);
for (Integer re : res) {
    System.out.println(re);
}

foreach

Applies the given function to every element of the RDD.

scala

val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7))
rdd.foreach(println)

java

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
javaRDD.foreach(new VoidFunction<Integer>() {
    @Override
    public void call(Integer integer) throws Exception {
        System.out.println(integer);
    }
});

countByKey

def countByKey(): Map[K, Long]

Taking the RDD {(1, 2), (2, 4), (2, 5), (3, 4), (3, 5), (3, 6)} as an example, rdd.countByKey returns {(1, 1), (2, 2), (3, 3)}.

Scala example

val rdd: RDD[(Int, Int)] = sc.parallelize(Array((1, 2),(2,4),(2,5), (3, 4),(3,5), (3, 6)))
val rdd2: collection.Map[Int, Long] = rdd.countByKey()
rdd2.foreach(println)

Java example

JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(
        new Tuple2<>(1, 2),
        new Tuple2<>(2, 3),
        new Tuple2<>(2, 2),
        new Tuple2<>(3, 2),
        new Tuple2<>(3, 2)
));
JavaPairRDD<Integer, Integer> javaRDD1 = JavaPairRDD.fromJavaRDD(javaRDD);
Map<Integer, Long> count = javaRDD1.countByKey();
for (Integer key : count.keySet()) {
    System.out.println(key+":"+count.get(key));
}

collectAsMap

Converts a pair (key-value) RDD into a Map. Note that if a key occurs more than once, only one of its values is kept. Continuing with the example above:

Scala example

val rdd: RDD[(Int, Int)] = sc.parallelize(Array((1, 2),(2,4),(2,2), (3, 4),(3,5), (3, 6),(2,0),(1,0)))
val rdd2: collection.Map[Int, Int] = rdd.collectAsMap()
rdd2.foreach(println)

Java example

JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(
        new Tuple2<>(1, 2),
        new Tuple2<>(2, 3),
        new Tuple2<>(2, 2),
        new Tuple2<>(3, 2),
        new Tuple2<>(3, 2)
));
JavaPairRDD<Integer, Integer> pairRDD = javaRDD.mapToPair(new PairFunction<Tuple2<Integer, Integer>, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> tp) throws Exception {
        return new Tuple2<>(tp._1, tp._2);
    }
});
Map<Integer, Integer> collectAsMap = pairRDD.collectAsMap();
for (Integer key : collectAsMap.keySet()) {
    System.out.println(key+":"+collectAsMap.get(key));
}

saveAsTextFile

def saveAsTextFile(path: String): Unit

def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit

saveAsTextFile stores the RDD in the file system as text files.

The codec parameter specifies the compression codec class.

val rdd: RDD[(Int, Int)] = sc.parallelize(Array((1, 2),(2,4),(2,2), (3, 4),(3,5), (3, 6),(2,0),(1,0)))
rdd.saveAsTextFile("in/test")

Note: use something like rdd.saveAsTextFile("hdfs://ip:port/path") to save the files to HDFS.

Saving with a specified compression codec:

<dependency>
    <groupId>org.anarres.lzo</groupId>
    <artifactId>lzo-hadoop</artifactId>
    <version>1.0.0</version>
    <scope>compile</scope>
</dependency>

val rdd: RDD[(Int, Int)] = sc.parallelize(Array((1, 2),(2,4),(2,2), (3, 4),(3,5), (3, 6),(2,0),(1,0)))
rdd.saveAsTextFile("in/test",classOf[com.hadoop.compression.lzo.LzopCodec])

saveAsSequenceFile

saveAsSequenceFile saves the RDD to HDFS in the SequenceFile format.

Usage is the same as saveAsTextFile.
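
A minimal sketch (the output path is made up); saveAsSequenceFile is available on pair RDDs whose key and value types can be converted to Hadoop Writables:

val rdd: RDD[(String, Int)] = sc.parallelize(List(("A", 2), ("B", 6), ("C", 7)))
rdd.saveAsSequenceFile("in/testseq")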

saveAsObjectFile

def saveAsObjectFile(path: String): Unit

saveAsObjectFile serializes the elements of the RDD into objects and stores them in files.

On HDFS the SequenceFile format is used by default.

val rdd: RDD[Int] = sc.makeRDD(1 to 10)
rdd.saveAsObjectFile("in/testob")

saveAsHadoopFile

def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], codec: Class[_ <: CompressionCodec]): Unit

def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], conf: JobConf = …, codec: Option[Class[_ <: CompressionCodec]] = None): Unit

saveAsHadoopFile stores the RDD in files on HDFS, using the old Hadoop API.

You can specify the outputKeyClass, the outputValueClass and the compression codec.

Each partition is written to one output file.

var rdd1 = sc.makeRDD(Array(("A",2),("A",1),("B",6),("B",3),("B",7)))

import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable

rdd1.saveAsHadoopFile("/tmp/test/",classOf[Text],classOf[IntWritable],classOf[TextOutputFormat[Text,IntWritable]])

rdd1.saveAsHadoopFile("/tmp/.test/",classOf[Text],classOf[IntWritable],classOf[TextOutputFormat[Text,IntWritable]],classOf[com.hadoop.compression.lzo.LzopCodec])

saveAsHadoopDataset

def saveAsHadoopDataset(conf: JobConf): Unit

saveAsHadoopDataset saves the RDD to storage systems other than HDFS, such as HBase.

In the JobConf you usually need to pay attention to, or set, five things:

the output path, the key class, the value class, the RDD's output format (OutputFormat), and the compression-related parameters.

# Using saveAsHadoopDataset to save an RDD to HDFS

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import SparkContext._
import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.mapred.JobConf



var rdd1 = sc.makeRDD(Array(("A",2),("A",1),("B",6),("B",3),("B",7)))
var jobConf = new JobConf()
jobConf.setOutputFormat(classOf[TextOutputFormat[Text,IntWritable]])
jobConf.setOutputKeyClass(classOf[Text])
jobConf.setOutputValueClass(classOf[IntWritable])
jobConf.set("mapred.output.dir","/tmp/test/")
rdd1.saveAsHadoopDataset(jobConf)

# Saving data to HBase

Create the HBase table:

create 'test',{NAME => 'f1',VERSIONS => 1},{NAME => 'f2',VERSIONS => 1},{NAME => 'f3',VERSIONS => 1}

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import SparkContext._
import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.io.ImmutableBytesWritable

var conf = HBaseConfiguration.create()
    var jobConf = new JobConf(conf)
    jobConf.set("hbase.zookeeper.quorum","zkNode1,zkNode2,zkNode3")
    jobConf.set("zookeeper.znode.parent","/hbase")
    jobConf.set(TableOutputFormat.OUTPUT_TABLE,"test")
    jobConf.setOutputFormat(classOf[TableOutputFormat])

    var rdd1 = sc.makeRDD(Array(("A",2),("B",6),("C",7)))
    rdd1.map(x => 
      {
        var put = new Put(Bytes.toBytes(x._1))
        put.add(Bytes.toBytes("f1"), Bytes.toBytes("c1"), Bytes.toBytes(x._2))
        (new ImmutableBytesWritable,put)
      }
    ).saveAsHadoopDataset(jobConf)

Note: when saving to HBase, the HBase-related jars have to be added to SPARK_CLASSPATH at runtime.

See also: http://lxw1234.com/archives/2015/07/332.htm

saveAsNewAPIHadoopFile

def saveAsNewAPIHadoopFile[F <: OutputFormat[K, V]](path: String)(implicit fm: ClassTag[F]): Unit

def saveAsNewAPIHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], conf: Configuration = self.context.hadoopConfiguration): Unit

saveAsNewAPIHadoopFile saves RDD data to HDFS using the new Hadoop API.

Usage is basically the same as saveAsHadoopFile.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import SparkContext._
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable

var rdd1 = sc.makeRDD(Array(("A",2),("A",1),("B",6),("B",3),("B",7)))
rdd1.saveAsNewAPIHadoopFile("/tmp/lxw1234/",classOf[Text],classOf[IntWritable],classOf[TextOutputFormat[Text,IntWritable]])

saveAsNewAPIHadoopDataset

def saveAsNewAPIHadoopDataset(conf: Configuration): Unit

Same as saveAsHadoopDataset, except that it uses the new Hadoop API.

Taking writing to HBase as an example:

Create the HBase table:

create 'lxw1234',{NAME => 'f1',VERSIONS => 1},{NAME => 'f2',VERSIONS => 1},{NAME => 'f3',VERSIONS => 1}

The complete Spark application:

package com.lxw1234.test

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import SparkContext._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put

object Test {
  def main(args : Array[String]) {
   val sparkConf = new SparkConf().setMaster("spark://lxw1234.com:7077").setAppName("lxw1234.com")
   val sc = new SparkContext(sparkConf);
   var rdd1 = sc.makeRDD(Array(("A",2),("B",6),("C",7)))

    sc.hadoopConfiguration.set("hbase.zookeeper.quorum ","zkNode1,zkNode2,zkNode3")
    sc.hadoopConfiguration.set("zookeeper.znode.parent","/hbase")
    sc.hadoopConfiguration.set(TableOutputFormat.OUTPUT_TABLE,"lxw1234")
    var job = new Job(sc.hadoopConfiguration)
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setOutputValueClass(classOf[Result])
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    rdd1.map(
      x => {
        var put = new Put(Bytes.toBytes(x._1))
        put.add(Bytes.toBytes("f1"), Bytes.toBytes("c1"), Bytes.toBytes(x._2))
        (new ImmutableBytesWritable,put)
      }    
    ).saveAsNewAPIHadoopDataset(job.getConfiguration)

    sc.stop()   
  }
}

mapPartitions

mapPartitions can be understood in reverse: first the data is partitioned, then the map function is applied to each partition as a whole.

When to use it

If the mapping needs to create extra objects frequently, mapPartitions is much more efficient than map.

For example, to write all the data of an RDD to a database over JDBC: with map you might create one connection per element, which is very expensive; with mapPartitions only one connection per partition is needed.
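
A sketch of the one-connection-per-partition pattern just described (the JDBC URL, credentials and table are made up, and a suitable JDBC driver is assumed to be on the classpath):

import java.sql.DriverManager

val kvRdd = sc.parallelize(List((1, 1), (2, 4), (3, 9)))
val written = kvRdd.mapPartitions(iter => {
  // one connection and one prepared statement per partition, not per element
  val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "pass") // assumed URL/credentials
  val stmt = conn.prepareStatement("INSERT INTO kv (k, v) VALUES (?, ?)")                    // assumed table
  var n = 0
  iter.foreach { case (k, v) =>
    stmt.setInt(1, k)
    stmt.setInt(2, v)
    stmt.executeUpdate()
    n += 1
  }
  stmt.close()
  conn.close()
  Iterator(n)       // one element per partition: how many rows were written
})
println(written.sum())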
The example below squares every element.

Java: square every element

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
JavaRDD<Integer> javaRDD1 = javaRDD.mapPartitions(new FlatMapFunction<Iterator<Integer>, Integer>() {
    @Override
    public Iterator<Integer> call(Iterator<Integer> it) throws Exception {
        ArrayList<Integer> res = new ArrayList<>();
        while (it.hasNext()) {
            Integer i = it.next();
            res.add(i * i);
        }
        return res.iterator();
    }
});
for (Integer integer : javaRDD1.collect()) {
    System.out.println(integer);
}

Turning every number i into a pair (i, i*i):

Java: turn every element into (i, i*i)

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
JavaRDD<Tuple2<Integer, Integer>> tuple2JavaRDD = javaRDD.mapPartitions(new FlatMapFunction<Iterator<Integer>, Tuple2<Integer, Integer>>() {
    @Override
    public Iterator<Tuple2<Integer, Integer>> call(Iterator<Integer> it) throws Exception {
        ArrayList<Tuple2<Integer, Integer>> tuple2s = new ArrayList<>();
        while (it.hasNext()) {
            Integer i = it.next();
            tuple2s.add(new Tuple2<>(i, i * i));
        }
        return tuple2s.iterator();
    }
});
for (Tuple2<Integer, Integer> tuple2 : tuple2JavaRDD.collect()) {
    System.out.println(tuple2);
}

Scala: turn every element into (i, i*i)

val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6))
def mapPar(iter:Iterator[Int]): Iterator[(Int,Int)] ={
  var res: List[(Int, Int)] = List[(Int,Int)]()
  while (iter.hasNext){
    val i: Int = iter.next()
    res= res.+:(i,i*i)
  }
  res.iterator
}
val rdd1: RDD[(Int, Int)] = rdd.mapPartitions(mapPar)
rdd1.foreach(println)

mapPartitions on key-value pairs: turning (i, j) into (i, j*j)

Scala version

val rdd: RDD[(Int, Int)] = sc.parallelize(List((1,1),(1,2),(1,3)))
def mapPar(iter:Iterator[(Int,Int)]): Iterator[(Int,Int)] ={
  var res: List[(Int, Int)] = List[(Int,Int)]()
  while (iter.hasNext){
    val tuple: (Int, Int) = iter.next()
    res= res.+:(tuple._1,tuple._2*tuple._2)
  }
  res.iterator
}
val rdd1: RDD[(Int, Int)] = rdd.mapPartitions(mapPar)
rdd1.foreach(println)

Java version

JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(
        new Tuple2<>(1, 1),
        new Tuple2<>(1, 2),
        new Tuple2<>(1, 3)
));
JavaRDD<Tuple2<Integer, Integer>> tuple2JavaRDD = javaRDD.mapPartitions(new FlatMapFunction<Iterator<Tuple2<Integer, Integer>>, Tuple2<Integer, Integer>>() {
    @Override
    public Iterator<Tuple2<Integer, Integer>> call(Iterator<Tuple2<Integer, Integer>> tuple2Iterator) throws Exception {
        ArrayList<Tuple2<Integer, Integer>> tuple2s = new ArrayList<>();
        while (tuple2Iterator.hasNext()) {
            Tuple2<Integer, Integer> next = tuple2Iterator.next();
            tuple2s.add(new Tuple2<Integer, Integer>(next._1, next._2 * next._2));
        }
        return tuple2s.iterator();
    }
});
for (Tuple2<Integer, Integer> tuple2 : tuple2JavaRDD.collect()) {
    System.out.println(tuple2);
}

mapPartitionsWithIndex

Similar to mapPartitions, it also maps per partition, but mapPartitionsWithIndex additionally passes the partition index into the function. The example below lists the elements in each partition (with a small change it could count the number of elements per partition).

java

JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
JavaRDD<Tuple2<Integer, Integer>> tuple2JavaRDD = javaRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<Integer>, Iterator<Tuple2<Integer, Integer>>>() {
    @Override
    public Iterator<Tuple2<Integer, Integer>> call(Integer v1, Iterator<Integer> v2) throws Exception {
        ArrayList<Tuple2<Integer, Integer>> tuple2s = new ArrayList<>();
        while (v2.hasNext()) {
            Integer next = v2.next();
            tuple2s.add(new Tuple2<>(v1, next));
        }

        return tuple2s.iterator();
    }
}, false);

for (Tuple2<Integer, Integer> tuple2 : tuple2JavaRDD.collect()) {
    System.out.println(tuple2);
}

scala

val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5))
def mapPar(i:Int,iter:Iterator[Int]):Iterator[(Int,Int)]={
  var tuples: List[(Int, Int)] = List[(Int,Int)]()
  while (iter.hasNext){
    val x: Int = iter.next()
    tuples=tuples.+:(i,x)
  }
  tuples.iterator
}
val res: RDD[(Int, Int)] = rdd.mapPartitionsWithIndex(mapPar)
res.foreach(println)

mapPartitionsWithIndex: listing the elements of each partition of a pair RDD

Scala version

val rdd: RDD[(Int, Int)] = sc.parallelize(List((1,1),(1,2),(2,3),(2,4)))
def mapPar(i:Int,iter:Iterator[(Int,Int)]):Iterator[(Int,(Int,Int))]={
  var tuples: List[(Int, (Int, Int))] = List[(Int,(Int,Int))]()
  while (iter.hasNext){
    val tuple: (Int, Int) = iter.next()
    tuples=tuples.::(i,tuple)
  }
  tuples.iterator
}

val res: RDD[(Int, (Int, Int))] = rdd.mapPartitionsWithIndex(mapPar)
res.foreach(println)

Java version

JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(
        new Tuple2<>(1, 1),
        new Tuple2<>(1, 2),
        new Tuple2<>(2, 1),
        new Tuple2<>(2, 2),
        new Tuple2<>(3, 1)
));
JavaPairRDD<Integer, Integer> javaPairRDD = JavaPairRDD.fromJavaRDD(javaRDD);
JavaRDD<Tuple2<Integer, Tuple2<Integer, Integer>>> mapPartitionIndexRDD = javaPairRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<Tuple2<Integer, Integer>>, Iterator<Tuple2<Integer, Tuple2<Integer, Integer>>>>() {
    @Override
    public Iterator<Tuple2<Integer, Tuple2<Integer, Integer>>> call(Integer partIndex, Iterator<Tuple2<Integer, Integer>> tuple2Iterator) {
        ArrayList<Tuple2<Integer, Tuple2<Integer, Integer>>> tuple2s = new ArrayList<>();

        while (tuple2Iterator.hasNext()) {
            Tuple2<Integer, Integer> next = tuple2Iterator.next();
            tuple2s.add(new Tuple2<Integer, Tuple2<Integer, Integer>>(partIndex, next));
        }
        return tuple2s.iterator();
    }
}, false);
for (Tuple2<Integer, Tuple2<Integer, Integer>> tuple2 : mapPartitionIndexRDD.collect()) {
    System.out.println(tuple2);
}

Addendum: to print the contents of each partition you can also use the glom method.

JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(
        new Tuple2<>(1, 1),
        new Tuple2<>(1, 2),
        new Tuple2<>(2, 1),
        new Tuple2<>(2, 2),
        new Tuple2<>(3, 1)
));
JavaPairRDD<Integer, Integer> javaPairRDD = JavaPairRDD.fromJavaRDD(javaRDD);
JavaRDD<List<Tuple2<Integer, Integer>>> glom = javaPairRDD.glom();
for (List<Tuple2<Integer, Integer>> tuple2s : glom.collect()) {
    System.out.println(tuple2s);
}

Default partitioning and HashPartitioner

The default partitioner is the HashPartitioner, so the default is not described separately; below is how to use HashPartitioner.

Using the mapPartitionsWithIndex example from the previous section, we can build a method to inspect an RDD's partitions:

val rdd: RDD[(Int, Int)] = sc.parallelize(List((1,1),(1,2),(2,3),(2,4)))
def mapPar(i:Int,iter:Iterator[(Int,Int)]):Iterator[(Int,(Int,Int))]={
  var tuples: List[(Int, (Int, Int))] = List[(Int,(Int,Int))]()
  while (iter.hasNext){
    val tuple: (Int, Int) = iter.next()
    tuples=tuples.::(i,tuple)
  }
  tuples.iterator
}
def printMapPar(rdd:RDD[(Int,Int)]): Unit ={
  val rdd1: RDD[(Int, (Int, Int))] = rdd.mapPartitionsWithIndex(mapPar)
  rdd1.foreach(println)
}

printMapPar(rdd)

HashPartitioner partitioning, Scala

Using pairRdd.partitionBy(new spark.HashPartitioner(n)) splits the data into n partitions.

val rdd: RDD[(Int, Int)] = sc.parallelize(List((1,1),(1,2),(2,3),(2,4)))
val rdd1: RDD[(Int, Int)] = rdd.partitionBy(new spark.HashPartitioner(3))
rdd1.foreach(println)

How HashPartitioner determines the partition: many descriptions of this found online are problematic; one description from abroad puts it well: it "uses Java's Object.hashCode method to determine the partition as partition = key.hashCode() % numPartitions". In other words, the Java object's hashCode decides the partition: for a pair RDD the partition is key.hashCode() % numPartitions. Since 3 % 3 = 0, the element (3, 6) goes to partition 0; since 4 % 3 = 1, the element (4, 8) goes to partition 1.
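
A small sketch of the computation described above (this mirrors what HashPartitioner does; it is not the library source):

// non-negative modulo of the key's hashCode, as described above
def hashPartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}
println(hashPartition(3, 3))  // 0, so (3, 6) lands in partition 0
println(hashPartition(4, 3))  // 1, so (4, 8) lands in partition 1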

RangePartitioner

I think of it as a range partitioner.

It uses ranges and assigns the keys that fall in a range to the corresponding partition. This approach works when the keys have a natural ordering and are not negative. This article only covers how to use it; the internals are left for later study. The following snippet shows how RangePartitioner is used.

val rdd: RDD[(Int, Int)] = sc.parallelize(List((1,1),(1,2),(2,3),(2,4)))
def mapPar(i:Int,iter:Iterator[(Int,Int)]):Iterator[(Int,(Int,Int))]={
  var tuples: List[(Int, (Int, Int))] = List[(Int,(Int,Int))]()
  while (iter.hasNext){
    val tuple: (Int, Int) = iter.next()
    tuples=tuples.::(i,tuple)
  }
  tuples.iterator
}
def printMapPar(rdd:RDD[(Int,Int)]): Unit ={
  val rdd1: RDD[(Int, (Int, Int))] = rdd.mapPartitionsWithIndex(mapPar)
  rdd1.foreach(println)
}

printMapPar(rdd)
println("----------------")
val rdd1: RDD[(Int, Int)] = rdd.partitionBy(new RangePartitioner(3,rdd))
printMapPar(rdd1)

The RDD above is generated in arbitrary order, but we let RangePartitioner split it into three ranges: by range, keys 1 and 2 go to the first partition, keys 3 and 4 to the second, and key 5 to the third.

Custom partitioning

To implement a custom partitioner you need to extend the org.apache.spark.Partitioner class and implement the following methods:
- numPartitions: Int: returns the number of partitions to create.
- getPartition(key: Any): Int: returns the partition index (0 to numPartitions-1) for the given key.
Below I define a custom partitioner that puts keys >= 4 in the first partition, keys >= 2 and < 4 in the second partition, and everything else in the third partition.
Scala version
The custom partitioner:

class RddScala(numParts:Int) extends Partitioner{
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    if (key.toString.toInt>=4){
      0
    }else if(key.toString.toInt>=2&&key.toString.toInt<4){
      1
    }else{
      2
    }
  }
}

Partition the RDD and then call the printMapPar method we wrote earlier to print the elements in each partition:

printMapPar(rdd)
println("----------------")
val rdd1: RDD[(Int, Int)] = rdd.partitionBy(new RddScala(3))
printMapPar(rdd1)
(0,(3,5))
(1,(1,2))
(0,(2,4))
(1,(2,3))
(0,(5,9))
(1,(4,8))
(0,(5,10))
(1,(4,7))
(0,(1,1))
(1,(3,6))
----------------
(1,(2,3))
(1,(3,6))
(1,(3,5))
(1,(2,4))
(0,(4,8))
(0,(4,7))
(0,(5,9))
(0,(5,10))
(2,(1,2))
(2,(1,1))

Java: using partitioners

Likewise, write a method that can print the elements in each partition of an RDD.

The printPartRdd function that prints the elements of each partition:

public static void printPartRdd(JavaPairRDD<Integer,Integer> pairRDD){
    JavaRDD<Tuple2<Integer, Tuple2<Integer, Integer>>> mapPartitionsWithIndex = pairRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<Tuple2<Integer, Integer>>, Iterator<Tuple2<Integer, Tuple2<Integer, Integer>>>>() {
        @Override
        public Iterator<Tuple2<Integer, Tuple2<Integer, Integer>>> call(Integer v1, Iterator<Tuple2<Integer, Integer>> v2) throws Exception {
            ArrayList<Tuple2<Integer, Tuple2<Integer, Integer>>> list = new ArrayList<>();
            while (v2.hasNext()) {
                Tuple2<Integer, Integer> next = v2.next();
                list.add(new Tuple2<>(v1, next));
            }
            return list.iterator();
        }
    }, false);
    for (Tuple2<Integer, Tuple2<Integer, Integer>> tuple2 : mapPartitionsWithIndex.collect()) {
        System.out.println(tuple2);
    }
}

Java HashPartitioner partitioning

JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(new Tuple2<Integer, Integer>(1, 1), new Tuple2<Integer, Integer>(1, 2)
        , new Tuple2<Integer, Integer>(2, 3), new Tuple2<Integer, Integer>(2, 4)
        , new Tuple2<Integer, Integer>(3, 5), new Tuple2<Integer, Integer>(3, 6)
        , new Tuple2<Integer, Integer>(4, 7), new Tuple2<Integer, Integer>(4, 8)
        , new Tuple2<Integer, Integer>(5, 9), new Tuple2<Integer, Integer>(5, 10)
), 3);
JavaPairRDD<Integer, Integer> javaPairRDD = JavaPairRDD.fromJavaRDD(javaRDD);
JavaPairRDD<Integer, Integer> partitionRDD = javaPairRDD.partitionBy(new HashPartitioner(3));
printPartRdd(partitionRDD);
(0,(3,5))
(0,(3,6))
(1,(1,1))
(1,(1,2))
(1,(4,7))
(1,(4,8))
(2,(2,3))
(2,(2,4))
(2,(5,9))
(2,(5,10))

Java custom partitioning

A custom partitioner: keys >= 4 fall in the first partition, keys in [2, 4) in the second, and everything else in the third.

public class JavaCustomPart extends Partitioner {
    int i = 1;

    public JavaCustomPart(int i) {
        this.i = i;
    }

    public JavaCustomPart() {
    }

    @Override
    public int numPartitions() {
        return i;
    }

    @Override
    public int getPartition(Object key) {
        int keyCode = Integer.parseInt(key.toString());
        if (keyCode >= 4) {
            return 0;
        } else if (keyCode >= 2 && keyCode < 4) {
            return 1;
        } else {
            return 2;
        }
    }

}

Partition and print:

JavaPairRDD<Integer, Integer> javaPairRDD = JavaPairRDD.fromJavaRDD(javaRDD);
JavaPairRDD<Integer, Integer> partitionRDD = javaPairRDD.partitionBy(new JavaCustomPart(3));
printPartRdd(partitionRDD);
(0,(4,7))
(0,(4,8))
(0,(5,9))
(0,(5,10))
(1,(2,3))
(1,(2,4))
(1,(3,5))
(1,(3,6))
(2,(1,1))
(2,(1,2))