Table of Contents
- parallelize
- makeRDD
- textFile
- filter
- map
- flatMap
- distinct
- union
- intersection
- subtract
- cartesian
- mapToPair
- flatMapToPair
- combineByKey
- reduceByKey
- foldByKey
- SortByKey
- groupByKey
- cogroup
- subtractByKey
- join
- fullOuterJoin
- leftOuterJoin
- rightOuterJoin
- first
- take
- collect
- count
- countByValue
- reduce
- aggregate
- fold
- top
- takeOrdered
- foreach
- countByKey
- collectAsMap
- saveAsTextFile
- saveAsSequenceFile
- saveAsObjectFile
- saveAsHadoopFile
- saveAsHadoopDataset
- saveAsNewAPIHadoopFile
- saveAsNewAPIHadoopDataset
- mapPartitions
- mapPartitionsWithIndex
- Default partitioning and HashPartitioner
- RangePartitioner
- Custom partitioner
- Partitioner usage in Java
parallelize
Call SparkContext's parallelize() to turn an existing collection into an RDD. This approach is well suited to learning Spark and writing small Spark tests.
Scala version
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
- The first parameter is a Seq collection
- The second parameter is the number of partitions
- The return value is an RDD[T]
val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7))
Java version
def parallelize[T](list : java.util.List[T], numSlices : scala.Int) : org.apache.spark.api.java.JavaRDD[T] = { /* compiled code */ }
- The first parameter is a java.util.List
- The second parameter is the number of partitions; it can be omitted (an overload without it exists)
- The return value is a JavaRDD[T]
The Java version only accepts a List collection
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
makeRDD
Only the Scala API has makeRDD
def makeRDD[T](seq : scala.Seq[T], numSlices : scala.Int = { /* compiled code */ }) : org.apache.spark.rdd.RDD[T] = { /* compiled code */ }
It behaves like parallelize
val rdd: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,7))
textFile
Call SparkContext.textFile() to create an RDD from data in external storage.
For example, my local input directory contains a file word.txt with some arbitrary content; the following reads it into an RDD.
Scala version
val rdd: RDD[String] = sc.textFile("in/word.txt")
Java version
JavaRDD<String> stringJavaRDD = sc.textFile("in/word.txt");
Note: textFile supports specifying the number of partitions and glob patterns, e.g. reading every .txt file under the in directory into one RDD
var lines = sc.textFile("in/*.txt")
Multiple paths can be comma-separated, and a minimum partition count can be passed as the second argument, for example
var lines = sc.textFile("dir1,dir2",3)
filter
For example, the file sample.txt contains the following
aa bb cc aa aa aa dd dd ee ee ee ee
ff aa bb zks
ee kks
ee zz zks
We want to find the lines that contain zks
Scala version
val rdd: RDD[String] = sc.textFile("in/sample.txt")
val rdd2: RDD[String] = rdd.filter(x=>x.contains("zks"))
rdd2.foreach(println)
Java version
JavaRDD<String> rdd2 = sc.textFile("in/sample.txt");
JavaRDD<String> filterRdd = rdd2.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String v1) throws Exception {
return v1.contains("zks");
}
});
List<String> collect3 = filterRdd.collect();
for (String s : collect3) {
System.out.println(s);
}
map
map() takes a function, applies it to every element of the RDD, and uses the values the function returns as the elements of the resulting RDD.
The mapping from input elements to output elements is one-to-one.
Scala version
//read the data
scala> val lines = sc.textFile("F:\\sparktest\\sample.txt")
//use map: each line is split on whitespace into an array, so the mapping is one-to-one
scala> var mapRDD = lines.map(line => line.split("\\s+"))
---------------output-----------
scala> mapRDD.collect
res0: Array[Array[String]] = Array(Array(aa, bb, cc, aa, aa, aa, dd, dd, ee, ee, ee, ee), Array(ff, aa, bb, zks), Array(ee, kks), Array(ee, zz, zks))
//take the first element
scala> mapRDD.first
---output----
res1: Array[String] = Array(aa, bb, cc, aa, aa, aa, dd, dd, ee, ee, ee, ee)
Java version
JavaRDD<String> stringJavaRDD = sc.textFile("in/sample.txt");
JavaRDD<Iterable> mapRdd = stringJavaRDD.map(new Function<String, Iterable>() {
@Override
public Iterable call(String v1) throws Exception {
String[] split = v1.split(" ");
return Arrays.asList(split);
}
});
List<Iterable> collect = mapRdd.collect();
for (Iterable iterable : collect) {
Iterator iterator = iterable.iterator();
while (iterator.hasNext()) System.out.println(iterator.next());
}
System.out.println(mapRdd.first());
flatMap
Sometimes we want one input element to produce several output elements; the operation that does this is flatMap().
The function passed to flatMap is applied to every element and returns, for each element, an iterator over the produced elements (see Scala's flatMap and map for more details).
For example, splitting the data into words.
Scala version
val rdd: RDD[String] = sc.textFile("in/sample.txt")
rdd.flatMap(x=>x.split(" ")).foreach(println)
Java version, Spark below 2.0
JavaRDD<String> lines = sc.textFile("/in/sample.txt");
JavaRDD<String> flatMapRDD = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterable<String> call(String s) throws Exception {
String[] split = s.split("\\s+");
return Arrays.asList(split);
}
});
//print the first element
System.out.println(flatMapRDD.first());
------------output----------
aa
Java version, Spark 2.0 and above
In Spark 2.0+ the flatMap method changed slightly: the function returns an Iterator instead of an Iterable
JavaRDD<String> flatMapRdd = stringJavaRDD.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterator<String> call(String s) throws Exception {
String[] split = s.split("\\s+");
return Arrays.asList(split).iterator();
}
});
List<String> collect = flatMapRdd.collect();
for (String s : collect) {
System.out.println(s);
}
distinct
distinct removes duplicates. A generated RDD may contain duplicate elements; distinct drops them, but it requires a shuffle, so it is expensive.
Scala version
val rdd: RDD[Int] = sc.parallelize(List(1,1,1,2,3,4,5,6))
val rdd2: RDD[Int] = rdd.distinct()
rdd2.collect.foreach(println)
Java version
JavaRDD<String> javaRDD = sc.parallelize(Arrays.asList("aa", "aa", "cc", "dd"));
JavaRDD<String> distinctRdd = javaRDD.distinct();
List<String> collect = distinctRdd.collect();
for (String s : collect) {
System.out.println(s);
}
union
Merges two RDDs into one (duplicates are kept).
Scala version
val rdd1: RDD[Int] = sc.parallelize(List(1,1,1,1))
val rdd2: RDD[Int] = sc.parallelize(List(2,2,2,2))
val rdd3: RDD[Int] = rdd1.union(rdd2)
rdd3.collect.foreach(println)
Java version
JavaRDD<String> javaRDD = sc.parallelize(Arrays.asList("aa", "aa", "cc", "dd"));
JavaRDD<String> javaRDD2 = sc.parallelize(Arrays.asList("aa", "aa", "cc", "dd"));
JavaRDD<String> unionRdd = javaRDD.union(javaRDD2);
List<String> collect = unionRdd.collect();
for (String s : collect) {
System.out.print(s+",");
}
intersection
RDD1.intersection(RDD2) returns the intersection of the two RDDs, with duplicates removed.
intersection needs a shuffle, so it is relatively expensive.
Scala version
val rdd1: RDD[String] = sc.parallelize(List("aa","aa","bb","cc"))
val rdd2: RDD[String] = sc.parallelize(List("aa","aa","bb","ff"))
val intersectionRdd: RDD[String] = rdd1.intersection(rdd2)
intersectionRdd.collect.foreach(println)
Java version
JavaRDD<String> javaRDD = sc.parallelize(Arrays.asList("aa", "aa", "cc", "dd"));
JavaRDD<String> javaRDD2 = sc.parallelize(Arrays.asList("aa", "aa", "cc", "ff"));
List<String> collect = javaRDD.intersection(javaRDD2).collect();
for (String s : collect) {
System.out.print(s+",");
}
subtract
RDD1.subtract(RDD2) returns the elements that appear in RDD1 but not in RDD2, without deduplication.
Scala version
val rdd1: RDD[String] = sc.parallelize(List("aa","aa","bb","cc"))
val rdd2: RDD[String] = sc.parallelize(List("aa","aa","bb","ff"))
val subtractRdd: RDD[String] = rdd1.subtract(rdd2)
subtractRdd.collect.foreach(println)
Java version
JavaRDD<String> javaRDD = sc.parallelize(Arrays.asList("aa", "aa", "cc", "dd"));
JavaRDD<String> javaRDD2 = sc.parallelize(Arrays.asList("aa", "aa", "cc", "ff"));
List<String> collect = javaRDD.subtract(javaRDD2).collect();
for (String s : collect) {
System.out.print(s+",");
}
cartesian
RDD1.cartesian(RDD2) returns the Cartesian product of RDD1 and RDD2; this is very expensive.
Scala version
val rdd1: RDD[String] = sc.parallelize(List("aa","aa","bb","cc"))
val rdd2: RDD[String] = sc.parallelize(List("aa","aa","bb","ff"))
val rdd3: RDD[(String, String)] = rdd1.cartesian(rdd2)
rdd3.collect.foreach(println)
Java version
JavaRDD<String> javaRDD = sc.parallelize(Arrays.asList("1","2","3"));
JavaRDD<String> javaRDD2 = sc.parallelize(Arrays.asList("aa", "aa", "cc", "ff"));
List<Tuple2<String, String>> collect = javaRDD.cartesian(javaRDD2).collect();
for (Tuple2<String, String> tuple2 : collect) {
System.out.println(tuple2);
}
mapToPair
Create a pair RDD with the first word of each line as the key and 1 as the value.
Scala version
Scala has no mapToPair; a plain map is enough
val rdd: RDD[String] = sc.textFile("in/sample.txt")
val rdd2: RDD[(String, Int)] = rdd.map(x=>(x.split(" ")(0),1))
rdd2.collect.foreach(println)
Java version
JavaRDD<String> javaRDD = sc.textFile("in/sample.txt");
JavaPairRDD<String, Integer> mapToPair = javaRDD.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) throws Exception {
String key = s.split(" ")[0];
return new Tuple2<>(key, 1);
}
});
List<Tuple2<String, Integer>> collect = mapToPair.collect();
for (Tuple2<String, Integer> tuple2 : collect) {
System.out.println(tuple2);
}
flatMapToPair
mapToPair is one-to-one: each element produces exactly one pair. flatMapToPair can produce several pairs per element, which is equivalent to a flatMap followed by a mapToPair.
Example: emit (word, 1) for every word of every line.
Scala version
val rdd1: RDD[String] = sc.textFile("in/sample.txt")
val flatRdd: RDD[String] = rdd1.flatMap(x=>x.split(" "))
val pairs: RDD[(String, Int)] = flatRdd.map(x=>(x,1))
pairs.collect.foreach(println)
Java version, Spark below 2.0
JavaPairRDD<String, Integer> wordPairRDD = lines.flatMapToPair(new PairFlatMapFunction<String, String, Integer>() {
@Override
public Iterable<Tuple2<String, Integer>> call(String s) throws Exception {
ArrayList<Tuple2<String, Integer>> tpLists = new ArrayList<Tuple2<String, Integer>>();
String[] split = s.split("\\s+");
for (int i = 0; i <split.length ; i++) {
Tuple2 tp = new Tuple2<String,Integer>(split[i], 1);
tpLists.add(tp);
}
return tpLists;
}
});
Java version, Spark 2.0 and above
Again, the only difference is Iterator vs Iterable in the function's return type
JavaRDD<String> javaRDD = sc.textFile("in/sample.txt");
JavaPairRDD<String, Integer> flatMapToPair = javaRDD.flatMapToPair(new PairFlatMapFunction<String, String, Integer>() {
@Override
public Iterator<Tuple2<String, Integer>> call(String s) throws Exception {
ArrayList<Tuple2<String, Integer>> list = new ArrayList<>();
String[] split = s.split(" ");
for (int i = 0; i < split.length; i++) {
String key = split[i];
Tuple2<String, Integer> tuple2 = new Tuple2<>(key, 1);
list.add(tuple2);
}
return list.iterator();
}
});
List<Tuple2<String, Integer>> collect = flatMapToPair.collect();
for (Tuple2<String, Integer> tuple2 : collect) {
System.out.println("key "+tuple2._1+" value "+tuple2._2);
}
combineByKey
Aggregating data is straightforward when it sits in one place, but how do we aggregate a distributed dataset? This is where combineByKey comes in: it is the ancestor of the various aggregation operations and is well worth understanding (see the Scala API).
A brief overview
def combineByKey[C](createCombiner: (V) => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C): RDD[(K, C)]
- createCombiner: combineByKey() walks through all the elements of a partition, so each element's key has either been seen before or not. If the key is new, combineByKey() uses the createCombiner() function to create the initial value of that key's accumulator
- mergeValue: if the key has already been seen in the current partition, mergeValue() merges the key's current accumulator with the new value
- mergeCombiners: since each partition is processed independently, the same key can end up with several accumulators. If two or more partitions hold an accumulator for the same key, the user-supplied mergeCombiners() merges the per-partition results
Example: computing students' average scores
The example is adapted from https://www.edureka.co/blog/apache-spark-combinebykey-explained (GitHub source); below I walk through it
Create a class describing a student's score
case class ScoreDetail(studentName:String,subject:String,score:Float)
Below is some test data; load it as a collection where key = student's name and value = a ScoreDetail instance
val scores = List(
ScoreDetail("xiaoming", "Math", 98),
ScoreDetail("xiaoming", "English", 88),
ScoreDetail("wangwu", "Math", 75),
ScoreDetail("wangwu", "English", 78),
ScoreDetail("lihua", "Math", 90),
ScoreDetail("lihua", "English", 80),
ScoreDetail("zhangsan", "Math", 91),
ScoreDetail("zhangsan", "English", 80))
Turn the collection into tuples (you can think of it as a map), using a for/yield comprehension
val scoresWithKey = for { i <- scores } yield (i.studentName, i)
Create the RDD and give it three partitions
val scoresWithKeyRDD: RDD[(String, ScoreDetail)] = sc.parallelize(scoresWithKey).partitionBy(new HashPartitioner(3)).cache()
Print the contents of each partition to see how the data was distributed across the three partitions
scoresWithKeyRDD.foreachPartition(partitions=>{
partitions.foreach(x=>println(x._1,x._2.subject,x._2.score))
})
Aggregate to compute the averages and then print them
val avgScoresRdd: RDD[(String, Float)] = scoresWithKeyRDD.combineByKey(
(x: ScoreDetail) => (x.score, 1),
(acc: (Float, Int), x: ScoreDetail) => (acc._1 + x.score, acc._2 + 1),
(acc1: (Float, Int), acc2: (Float, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map({ case (key, value) => (key, value._1 / value._2) })
avgScoresRdd.collect.foreach(println)
An explanation of scoresWithKeyRDD.combineByKey
createCombiner: (x: ScoreDetail) => (x.score, 1)
The first time a key such as zhangsan is seen, this function converts the value into a new type: (zhangsan, ScoreDetail) becomes (zhangsan, (91, 1))
mergeValue: (acc: (Float, Int), x: ScoreDetail) => (acc._1 + x.score, acc._2 + 1) merges the accumulator with the new value when zhangsan is seen again in the same partition: (zhangsan, (91, 1)) plus another (zhangsan, ScoreDetail) becomes (zhangsan, (171, 2))
mergeCombiners: (acc1: (Float, Int), acc2: (Float, Int)) merges zhangsan's accumulators from different partitions; here all of zhangsan's records land in the same partition, so this function is not actually exercised
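To see the three functions on their own, here is a minimal sketch of my own (not part of the original example; it assumes the same sc as above) that uses combineByKey to collect each key's values into a List, which is roughly what groupByKey does:
val listRdd: RDD[(String, Int)] = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)))
val grouped: RDD[(String, List[Int])] = listRdd.combineByKey(
  (v: Int) => List(v),                                 // createCombiner: first value seen for a key within a partition
  (acc: List[Int], v: Int) => v :: acc,                // mergeValue: another value of the same key in the same partition
  (acc1: List[Int], acc2: List[Int]) => acc1 ::: acc2  // mergeCombiners: merge the lists built by different partitions
)
grouped.collect.foreach(println)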
Java version
The ScoreDetail class
package nj.zb.CombineByKey;
import java.io.Serializable;
public class ScoreDetailsJava implements Serializable {
public String stuName;
public Integer score;
public String subject;
public ScoreDetailsJava(String stuName, String subject,Integer score) {
this.stuName = stuName;
this.score = score;
this.subject = subject;
}
}
The CombineByKey test class
package nj.zb.CombineByKey;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.Map;
public class CombineByKey {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("cby").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);
ArrayList<ScoreDetailsJava> scoreDetails = new ArrayList<>();
scoreDetails.add(new ScoreDetailsJava("xiaoming", "Math", 98));
scoreDetails.add(new ScoreDetailsJava("xiaoming", "English", 88));
scoreDetails.add(new ScoreDetailsJava("wangwu", "Math", 75));
scoreDetails.add(new ScoreDetailsJava("wangwu", "English", 78));
scoreDetails.add(new ScoreDetailsJava("lihua", "Math", 90));
scoreDetails.add(new ScoreDetailsJava("lihua", "English", 80));
scoreDetails.add(new ScoreDetailsJava("zhangsan", "Math", 91));
scoreDetails.add(new ScoreDetailsJava("zhangsan", "English", 80));
JavaRDD<ScoreDetailsJava> scoreDetailsRDD = sc.parallelize(scoreDetails);
JavaPairRDD<String, ScoreDetailsJava> pairRdd = scoreDetailsRDD.mapToPair(new PairFunction<ScoreDetailsJava, String, ScoreDetailsJava>() {
@Override
public Tuple2<String, ScoreDetailsJava> call(ScoreDetailsJava scoreDetailsJava) throws Exception {
return new Tuple2<>(scoreDetailsJava.stuName, scoreDetailsJava);
}
});
//createCombine
Function<ScoreDetailsJava, Tuple2<Integer, Integer>> createCombine = new Function<ScoreDetailsJava, Tuple2<Integer, Integer>>() {
@Override
public Tuple2<Integer, Integer> call(ScoreDetailsJava v1) throws Exception {
return new Tuple2<>(v1.score, 1);
}
};
//mergeValue
Function2<Tuple2<Integer, Integer>, ScoreDetailsJava, Tuple2<Integer, Integer>> mergeValue = new Function2<Tuple2<Integer, Integer>, ScoreDetailsJava, Tuple2<Integer, Integer>>() {
@Override
public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> v1, ScoreDetailsJava v2) throws Exception {
return new Tuple2<>(v1._1 + v2.score, v1._2 + 1);
}
};
//mergeCombiners
Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>> mergeCombiners = new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
@Override
public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> v1, Tuple2<Integer, Integer> v2) throws Exception {
return new Tuple2<>(v1._1 + v2._1, v1._2 + v2._2);
}
};
JavaPairRDD<String, Tuple2<Integer, Integer>> combineByRdd = pairRdd.combineByKey(createCombine, mergeValue, mergeCombiners);
Map<String, Tuple2<Integer, Integer>> stringTuple2Map = combineByRdd.collectAsMap();
for (String s : stringTuple2Map.keySet()) {
System.out.println(s+":"+stringTuple2Map.get(s)._1/stringTuple2Map.get(s)._2);
}
}
}
reduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
Takes a function and reduces the values that share the same key, similar to Scala's reduce.
For example, reducing the RDD {(1, 2), (3, 4), (3, 6)}
val rdd: RDD[(Int, Int)] = sc.parallelize(List((1,2),(3,4),(3,6)))
val rdd2: RDD[(Int, Int)] = rdd.reduceByKey((x,y)=>{println("one:"+x+"two:"+y);x+y})
rdd2.collect.foreach(println)
Another example
Word count
sample.txt contains the following
aa bb cc aa aa aa dd dd ee ee ee ee
ff aa bb zks
ee kks
ee zz zks
Scala version
val rdd: RDD[String] = sc.textFile("in/sample.txt")
rdd.flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).foreach(println)
Java version
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.*;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
public class RddJava {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("RddJava").setMaster("local[1]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> javaRDD = sc.textFile("in/sample.txt");
PairFlatMapFunction<String, String, Integer> pairFlatMapFunction = new PairFlatMapFunction<String, String, Integer>(){
@Override
public Iterator<Tuple2<String, Integer>> call(String s) throws Exception {
String[] split = s.split("\\s+");
ArrayList<Tuple2<String, Integer>> list = new ArrayList<>();
for (String str : split) {
Tuple2<String, Integer> tuple2 = new Tuple2<>(str, 1);
list.add(tuple2);
}
return list.iterator();
}
};
Function2<Integer, Integer, Integer> function2 = new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer v1, Integer v2) throws Exception {
return v1 + v2;
}
};
JavaPairRDD<String, Integer> javaPairRDD = javaRDD.flatMapToPair(pairFlatMapFunction).reduceByKey(function2);
List<Tuple2<String, Integer>> collect = javaPairRDD.collect();
for (Tuple2<String, Integer> tuple2 : collect) {
System.out.println(tuple2);
}
}
}
foldByKey
def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)]
foldByKey folds/merges the values of an RDD[K, V] by key. The zeroValue parameter is first combined with a key's values using the supplied function (once per key per partition), and the folding then continues over the remaining values.
For background, see my earlier post on Scala's fold.
Unlike reduce, the first element folded in is not an element of the collection but the supplied zero value (the second example below, after the basic one, makes this visible).
Scala example (from LXW's blog)
val rdd: RDD[(String, Int)] = sc.parallelize(List(("a",2),("a",3),("b",3)))
rdd.foldByKey(0)((x,y)=>{println("one:"+x+"two:"+y);x+y}).collect.foreach(println)
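To make the zero value's role visible, here is a small sketch of my own (same sc assumed; a single partition so the arithmetic is easy to follow) with a zero value of 10, which is folded in once per key before that key's values:
val foldRdd: RDD[(String, Int)] = sc.parallelize(List(("a", 2), ("a", 3), ("b", 3)), 1)
foldRdd.foldByKey(10)(_ + _).collect.foreach(println)
// (a,15)   10 + 2 + 3
// (b,13)   10 + 3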
SortByKey
def sortByKey(ascending : scala.Boolean = { /* compiled code */ }, numPartitions : scala.Int = { /* compiled code */ }) : org.apache.spark.rdd.RDD[scala.Tuple2[K, V]] = { /* compiled code */ }
sortByKey sorts a pair RDD by key. The first parameter can be true or false and defaults to true (ascending); a descending example follows the ascending one below.
Scala example
val rdd: RDD[(Int, Int)] = sc.parallelize(Array((3, 4),(1, 2),(4,4),(2,5), (6,5), (5, 6)))
rdd.sortByKey().collect.foreach(println)
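Passing false sorts in descending order instead; a quick sketch reusing the rdd above:
rdd.sortByKey(false).collect.foreach(println)
// the keys now come out as 6, 5, 4, 3, 2, 1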
groupByKey
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
groupByKey groups an RDD[key, value] by key into RDD[key, Iterable[value]], somewhat like SQL's GROUP BY (or MySQL's group_concat).
In this example we group students' scores by student.
Scala version
val scoreDetail = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val scoreGroup: RDD[(String, Iterable[Int])] = scoreDetail.groupByKey()
scoreGroup.collect.foreach(x=>{
x._2.foreach(y=>{
println(x._1,y)
})
})
Java version
JavaRDD<Tuple2<String, Integer>> scoreDetail = sc.parallelize(Arrays.asList(new Tuple2<>("xiaoming", 75),
new Tuple2<>("xiaoming", 90),
new Tuple2<>("lihua", 95),
new Tuple2<>("lihua", 100),
new Tuple2<>("xiaofeng", 85)));
JavaPairRDD<String, Integer> scoreMapRdd = JavaPairRDD.fromJavaRDD(scoreDetail);
Map<String, Iterable<Integer>> collect = scoreMapRdd.groupByKey().collectAsMap();
for (String s : collect.keySet()) {
for (Integer score : collect.get(s)) {
System.out.println(s+":"+score);
}
}
cogroup
groupByKey groups the data of a single RDD; there is also a function called cogroup() that groups multiple RDDs sharing the same key type.
For example
RDD1.cogroup(RDD2) groups RDD1 and RDD2 by the same key and produces (key, (Iterable[value1], Iterable[value2]))
cogroup can also group more than two RDDs:
for example RDD1.cogroup(RDD2, RDD3, ..., RDDN) produces (key, (Iterable[value1], Iterable[value2], Iterable[value3], ..., Iterable[valueN]))
Case: two RDDs hold different sets of student scores, and we want each student's scores from both RDDs grouped together (a three-RDD sketch follows the Scala example)
Scala version
val rdd = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val rdd1 = sc.parallelize(List(("xiaoming",85),("xiaoming",95),("lisi",95),("lisi",100),("xiaofeng",90)))
val coRdd: RDD[(String, (Iterable[Int], Iterable[Int]))] = rdd.cogroup(rdd1)
coRdd.foreach(println)
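As mentioned above, cogroup can also take more than one other RDD. A minimal sketch of my own (reusing rdd and rdd1 from this example, with a hypothetical third RDD) that produces one Iterable per input RDD:
val rdd2 = sc.parallelize(List(("xiaoming", 60), ("lisi", 65)))
val coRdd3: RDD[(String, (Iterable[Int], Iterable[Int], Iterable[Int]))] = rdd.cogroup(rdd1, rdd2)
coRdd3.collect.foreach(println)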
Java version
JavaRDD<Tuple2<String, Integer>> scoreDetail = sc.parallelize(Arrays.asList(new Tuple2<>("xiaoming", 75),
new Tuple2<>("xiaoming", 90),
new Tuple2<>("lihua", 95),
new Tuple2<>("lihua", 100),
new Tuple2<>("xiaofeng", 85)));
JavaRDD<Tuple2<String, Integer>> scoreDetail2 = sc.parallelize(Arrays.asList(
new Tuple2<>("lisi", 90),
new Tuple2<>("lihua", 95),
new Tuple2<>("lihua", 100),
new Tuple2<>("xiaomi", 85)
));
JavaPairRDD<String, Integer> rdd1 = JavaPairRDD.fromJavaRDD(scoreDetail);
JavaPairRDD<String, Integer> rdd2 = JavaPairRDD.fromJavaRDD(scoreDetail2);
JavaPairRDD<String, Tuple2<Iterable<Integer>, Iterable<Integer>>> cogroupRdd = rdd1.cogroup(rdd2);
Map<String, Tuple2<Iterable<Integer>, Iterable<Integer>>> myMap = cogroupRdd.collectAsMap();
Set<String> keys = myMap.keySet();
for (String key : keys) {
Tuple2<Iterable<Integer>, Iterable<Integer>> tuple2 = myMap.get(key);
System.out.println(key+":"+tuple2);
}
subtractByKey
Function definition
def subtractByKey[W](other: RDD[(K, W)])(implicit arg0: ClassTag[W]): RDD[(K, V)]
def subtractByKey[W](other: RDD[(K, W)], numPartitions: Int)(implicit arg0: ClassTag[W]): RDD[(K, V)]
def subtractByKey[W](other: RDD[(K, W)], p: Partitioner)(implicit arg0: ClassTag[W]): RDD[(K, V)]
Similar to subtract: removes the elements of the RDD whose keys also appear in the other RDD
val rdd = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val rdd1 = sc.parallelize(List(("xiaoming",85),("xiaoming",95),("lisi",95),("lisi",100),("xiaofeng",90)))
rdd.subtractByKey(rdd1).collect.foreach(println)
(lihua,95)
(lihua,100)
join
Function definition
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
RDD1.join(RDD2)
joins the entries of RDD1 and RDD2 that share the same key, similar to SQL's JOIN
val rdd = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val rdd1 = sc.parallelize(List(("xiaoming",85),("xiaoming",95),("lisi",95),("lisi",100),("xiaofeng",90)))
rdd.join(rdd1).collect.foreach(println)
(xiaoming,(75,85))
(xiaoming,(75,95))
(xiaoming,(90,85))
(xiaoming,(90,95))
(xiaofeng,(85,90))
fullOuterJoin
Like join, but a full outer join
val rdd = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val rdd1 = sc.parallelize(List(("xiaoming",85),("xiaoming",95),("lisi",95),("lisi",100),("xiaofeng",90)))
rdd.fullOuterJoin(rdd1).collect.foreach(println)
(lihua,(Some(95),None))
(lihua,(Some(100),None))
(xiaoming,(Some(75),Some(85)))
(xiaoming,(Some(75),Some(95)))
(xiaoming,(Some(90),Some(85)))
(xiaoming,(Some(90),Some(95)))
(lisi,(None,Some(95)))
(lisi,(None,Some(100)))
(xiaofeng,(Some(85),Some(90)))
leftOuterJoin
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))]
A left outer join of the two RDDs, similar to SQL's LEFT OUTER JOIN
val rdd = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val rdd1 = sc.parallelize(List(("xiaoming",85),("xiaoming",95),("lisi",95),("lisi",100),("xiaofeng",90)))
rdd.leftOuterJoin(rdd1).collect.foreach(println)
(lihua,(95,None))
(lihua,(100,None))
(xiaoming,(75,Some(85)))
(xiaoming,(75,Some(95)))
(xiaoming,(90,Some(85)))
(xiaoming,(90,Some(95)))
(xiaofeng,(85,Some(90)))
rightOuterJoin
A right outer join of the two RDDs, similar to SQL's RIGHT OUTER JOIN: when a key exists on the left side its value is wrapped in Some, otherwise None is used, as the code and output below show
val rdd = sc.parallelize(List(("xiaoming",75),("xiaoming",90),("lihua",95),("lihua",100),("xiaofeng",85)))
val rdd1 = sc.parallelize(List(("xiaoming",85),("xiaoming",95),("lisi",95),("lisi",100),("xiaofeng",90)))
rdd.rightOuterJoin(rdd1).collect.foreach(println)
(xiaoming,(Some(75),85))
(xiaoming,(Some(75),95))
(xiaoming,(Some(90),85))
(xiaoming,(Some(90),95))
(lisi,(None,95))
(lisi,(None,100))
(xiaofeng,(Some(85),90))
Java version (subtractByKey, join, leftOuterJoin, rightOuterJoin)
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.Optional;
import scala.Tuple2;
import java.util.Arrays;
import java.util.Map;
public class RddJava {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setMaster("local[3]").setAppName("join");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(
new Tuple2<>(1, 2),
new Tuple2<>(2, 9),
new Tuple2<>(3, 8),
new Tuple2<>(4, 10),
new Tuple2<>(5, 20)
));
JavaRDD<Tuple2<Integer, Integer>> javaRDD1 = sc.parallelize(Arrays.asList(
new Tuple2<>(3, 15),
new Tuple2<>(4, 19),
new Tuple2<>(5, 20),
new Tuple2<>(6, 2),
new Tuple2<>(7, 23)
));
//convert the JavaRDD to a JavaPairRDD
JavaPairRDD<Integer, Integer> rdd = JavaPairRDD.fromJavaRDD(javaRDD);
JavaPairRDD<Integer, Integer> other = JavaPairRDD.fromJavaRDD(javaRDD1);
//subtractByKey
JavaPairRDD<Integer, Integer> subtractByKey = rdd.subtractByKey(other);
Map<Integer, Integer> subMap = subtractByKey.collectAsMap();
System.out.println("----substractByKey----");
for (Integer key : subMap.keySet()) {
System.out.println(key+":"+subMap.get(key));
}
//join
JavaPairRDD<Integer, Tuple2<Integer, Integer>> join = rdd.join(other);
Map<Integer, Tuple2<Integer, Integer>> joinMap = join.collectAsMap();
System.out.println("----join----");
for (Integer key : joinMap.keySet()) {
System.out.println(key+":"+joinMap.get(key));
}
//leftoutjoin
JavaPairRDD<Integer, Tuple2<Integer, Optional<Integer>>> leftoutjoin = rdd.leftOuterJoin(other);
Map<Integer, Tuple2<Integer, Optional<Integer>>> leftjoinMap = leftoutjoin.collectAsMap();
System.out.println("----leftjoin----");
for (Integer key : leftjoinMap.keySet()) {
System.out.println(key+":"+leftjoinMap.get(key));
}
//rightoutjoin
JavaPairRDD<Integer, Tuple2<Optional<Integer>, Integer>> rightOuterJoin = rdd.rightOuterJoin(other);
Map<Integer, Tuple2<Optional<Integer>, Integer>> rightooutMap = rightOuterJoin.collectAsMap();
System.out.println("----rightjoin----");
for (Integer key : rightooutMap.keySet()) {
System.out.println(key+":"+rightooutMap.get(key));
}
}
}
----substractByKey----
2:9
1:2
----join----
5:(20,20)
4:(10,19)
3:(8,15)
----leftjoin----
2:(9,Optional.empty)
5:(20,Optional[20])
4:(10,Optional[19])
1:(2,Optional.empty)
3:(8,Optional[15])
----rightjoin----
5:(Optional[20],20)
4:(Optional[10],19)
7:(Optional.empty,23)
3:(Optional[8],15)
6:(Optional.empty,2)
first
Returns the first element
scala
val rdd: RDD[Int] = sc.parallelize(List(1,2,4,5,6,7,8,9))
println(rdd.first())
java
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
System.out.println(javaRDD.first());
take
rdd.take(n) returns the first n elements of the RDD
scala
val rdd: RDD[Int] = sc.parallelize(List(1,2,4,5,6,7,8,9))
println(rdd.take(3).toList)
java
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
System.out.println(javaRDD.take(3));
collect
rdd.collect() returns all the elements of the RDD (to the driver)
scala
val rdd: RDD[Int] = sc.parallelize(List(1,2,4,5,6,7,8,9))
rdd.collect().foreach(println)
java
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
System.out.println(javaRDD.collect());
count
rdd.count() returns the number of elements in the RDD
scala
val rdd: RDD[Int] = sc.parallelize(List(1,2,4,5,6,7,8,9))
println(rdd.count())
java
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
System.out.println(javaRDD.count());
countByValue
Counts how many times each element appears in the RDD and returns {(element1, count), (element2, count), ..., (elementN, count)}
scala
val rdd: RDD[Int] = sc.parallelize(List(1,2,2,5,6,7,8,9))
rdd.countByValue().foreach(println)
java
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
System.out.println(javaRDD.countByValue());
reduce
rdd.reduce(func)
Aggregates all the elements of the RDD in parallel, similar to reduce on a Scala collection
scala
val rdd: RDD[Int] = sc.parallelize(List(1,2,2,5,6,7,8,9))
println(rdd.reduce(_ + _))
java
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
Integer res = javaRDD.reduce(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer v1, Integer v2) throws Exception {
return v1 + v2;
}
});
System.out.println(res);
aggregate
Similar to reduce(), but it can return a result of a different type from the element type; it is rarely needed in everyday code.
Note that the zero value interacts with the number of partitions, as shown after the examples.
scala
val rdd: RDD[Int] = sc.parallelize(List(1,2,2,5,6,7,8,9))
val i: Int = rdd.aggregate(5)(_+_,_+_)
println(i)
java
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
Function2<Integer, Integer, Integer> seqop = new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer v1, Integer v2) throws Exception {
return v1 + v2;
}
};
Function2<Integer, Integer, Integer> comop = new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer v1, Integer v2) throws Exception {
return v1 + v2;
}
};
Integer res = javaRDD.aggregate(5, seqop, comop);
System.out.println(res);
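As noted above, the zero value 5 takes part once in every partition's seqOp and once more when the partition results are combined, so the result depends on the number of partitions. A sketch of my own (assuming the same sc) with an explicit partition count:
val aggRdd: RDD[Int] = sc.parallelize(List(1, 2, 2, 5, 6, 7, 8, 9), 2)  // element sum = 40
println(aggRdd.aggregate(5)(_ + _, _ + _))  // 40 + 5 * (2 partitions + 1 final merge) = 55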
fold
rdd.fold(num)(func) (rarely needed directly)
Like reduce(), but an initial value num is supplied. The initial value is folded into each partition's computation first, and the per-partition results are then folded together again with the same initial value, as the sketch after the Java example shows.
scala
val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7))
val i: Int = rdd.fold(5)(_+_)
println(i)
java
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
Integer res = javaRDD.fold(5, new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer v1, Integer v2) throws Exception {
return v1 + v2;
}
});
System.out.println(res);
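The partition effect described above can be seen directly; a sketch of my own (assuming sc) with an explicit partition count:
val foldRdd2: RDD[Int] = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7), 2)  // element sum = 28
println(foldRdd2.fold(5)(_ + _))  // 28 + 5 * (2 partitions + 1 final merge) = 43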
top
rdd.top(n)
Returns the first n elements in descending order, or according to a supplied ordering
scala
val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7))
val res: Array[Int] = rdd.top(3)
println(res.toList)
java
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
List<Integer> res = javaRDD.top(5);
for (Integer re : res) {
System.out.println(re);
}
takeOrdered
rdd.takeOrdered(n)
Sorts the RDD in ascending order and returns the first n elements; a custom comparator can also be supplied (not covered here). It is essentially the opposite of top
scala
val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7))
val res: Array[Int] = rdd.takeOrdered(2)
println(res.toList)
java
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
List<Integer> res = javaRDD.takeOrdered(2);
for (Integer re : res) {
System.out.println(re);
}
foreach
Applies the given function to every element of the RDD
scala
val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7))
rdd.foreach(println)
java
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
javaRDD.foreach(new VoidFunction<Integer>() {
@Override
public void call(Integer integer) throws Exception {
System.out.println(integer);
}
});
countByKey
def countByKey(): Map[K, Long]
For the RDD {(1, 2), (2, 4), (2, 5), (3, 4), (3, 5), (3, 6)}, rdd.countByKey returns {(1, 1), (2, 2), (3, 3)}
Scala example
val rdd: RDD[(Int, Int)] = sc.parallelize(Array((1, 2),(2,4),(2,5), (3, 4),(3,5), (3, 6)))
val rdd2: collection.Map[Int, Long] = rdd.countByKey()
rdd2.foreach(println)
Java example
JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(
new Tuple2<>(1, 2),
new Tuple2<>(2, 3),
new Tuple2<>(2, 2),
new Tuple2<>(3, 2),
new Tuple2<>(3, 2)
));
JavaPairRDD<Integer, Integer> javaRDD1 = JavaPairRDD.fromJavaRDD(javaRDD);
Map<Integer, Long> count = javaRDD1.countByKey();
for (Integer key : count.keySet()) {
System.out.println(key+":"+count.get(key));
}
collectAsMap
Converts a pair (key-value) RDD into a Map; note that when a key appears more than once, only one of its values survives. Same data as above.
Scala example
val rdd: RDD[(Int, Int)] = sc.parallelize(Array((1, 2),(2,4),(2,2), (3, 4),(3,5), (3, 6),(2,0),(1,0)))
val rdd2: collection.Map[Int, Int] = rdd.collectAsMap()
rdd2.foreach(println)
Java example
JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(
new Tuple2<>(1, 2),
new Tuple2<>(2, 3),
new Tuple2<>(2, 2),
new Tuple2<>(3, 2),
new Tuple2<>(3, 2)
));
JavaPairRDD<Integer, Integer> pairRDD = javaRDD.mapToPair(new PairFunction<Tuple2<Integer, Integer>, Integer, Integer>() {
@Override
public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> tp) throws Exception {
return new Tuple2<>(tp._1, tp._2);
}
});
Map<Integer, Integer> collectAsMap = pairRDD.collectAsMap();
for (Integer key : collectAsMap.keySet()) {
System.out.println(key+":"+collectAsMap.get(key));
}
saveAsTextFile
def saveAsTextFile(path: String): Unit
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit
saveAsTextFile stores the RDD as text files in a file system.
The codec parameter specifies the compression codec class.
val rdd: RDD[(Int, Int)] = sc.parallelize(Array((1, 2),(2,4),(2,2), (3, 4),(3,5), (3, 6),(2,0),(1,0)))
rdd.saveAsTextFile("in/test")
Note: to save to HDFS, pass an HDFS URI, e.g. rdd.saveAsTextFile("hdfs://ip:port/path").
Saving with a specified compression codec
<dependency>
<groupId>org.anarres.lzo</groupId>
<artifactId>lzo-hadoop</artifactId>
<version>1.0.0</version>
<scope>compile</scope>
</dependency>
val rdd: RDD[(Int, Int)] = sc.parallelize(Array((1, 2),(2,4),(2,2), (3, 4),(3,5), (3, 6),(2,0),(1,0)))
rdd.saveAsTextFile("in/test",classOf[com.hadoop.compression.lzo.LzopCodec])
saveAsSequenceFile
saveAsSequenceFile stores the RDD on HDFS (or another file system) in the SequenceFile format.
Usage is the same as saveAsTextFile; a small sketch follows.
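A minimal sketch of my own (the output path in/testseq is just a placeholder): the RDD must be a pair RDD whose key and value types can be converted to Hadoop Writables, e.g. String and Int:
val seqRdd: RDD[(String, Int)] = sc.parallelize(List(("A", 2), ("B", 6), ("C", 7)))
seqRdd.saveAsSequenceFile("in/testseq")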
saveAsObjectFile
def saveAsObjectFile(path: String): Unit
saveAsObjectFile serializes the RDD's elements and writes them to a file.
On HDFS the data is stored as a SequenceFile by default; reading it back is shown after the example.
val rdd: RDD[Int] = sc.makeRDD(1 to 10)
rdd.saveAsObjectFile("in/testob")
saveAsHadoopFile
def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], codec: Class[_ <: CompressionCodec]): Unit
def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], conf: JobConf = …, codec: Option[Class[_ <: CompressionCodec]] = None): Unit
saveAsHadoopFile stores the RDD in files on HDFS using the old-style Hadoop API.
You can specify the outputKeyClass, outputValueClass and a compression codec.
Each partition produces one output file.
var rdd1 = sc.makeRDD(Array(("A",2),("A",1),("B",6),("B",3),("B",7)))
import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable
rdd1.saveAsHadoopFile("/tmp/test/",classOf[Text],classOf[IntWritable],classOf[TextOutputFormat[Text,IntWritable]])
rdd1.saveAsHadoopFile("/tmp/.test/",classOf[Text],classOf[IntWritable],classOf[TextOutputFormat[Text,IntWritable]],classOf[com.hadoop.compression.lzo.LzopCodec])
saveAsHadoopDataset
def saveAsHadoopDataset(conf: JobConf): Unit
saveAsHadoopDataset saves an RDD to storage systems other than HDFS, such as HBase.
In the JobConf you normally need to set five things:
the output path, the key class, the value class, the RDD's OutputFormat, and optionally compression parameters.
# Use saveAsHadoopDataset to save the RDD to HDFS
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import SparkContext._
import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.mapred.JobConf
var rdd1 = sc.makeRDD(Array(("A",2),("A",1),("B",6),("B",3),("B",7)))
var jobConf = new JobConf()
jobConf.setOutputFormat(classOf[TextOutputFormat[Text,IntWritable]])
jobConf.setOutputKeyClass(classOf[Text])
jobConf.setOutputValueClass(classOf[IntWritable])
jobConf.set("mapred.output.dir","/tmp/test/")
rdd1.saveAsHadoopDataset(jobConf)
# Save data to HBase
Create the HBase table:
create 'test',{NAME => 'f1',VERSIONS => 1},{NAME => 'f2',VERSIONS => 1},{NAME => 'f3',VERSIONS => 1}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import SparkContext._
import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
var conf = HBaseConfiguration.create()
var jobConf = new JobConf(conf)
jobConf.set("hbase.zookeeper.quorum","zkNode1,zkNode2,zkNode3")
jobConf.set("zookeeper.znode.parent","/hbase")
jobConf.set(TableOutputFormat.OUTPUT_TABLE,"test")
jobConf.setOutputFormat(classOf[TableOutputFormat])
var rdd1 = sc.makeRDD(Array(("A",2),("B",6),("C",7)))
rdd1.map(x =>
{
var put = new Put(Bytes.toBytes(x._1))
put.add(Bytes.toBytes("f1"), Bytes.toBytes("c1"), Bytes.toBytes(x._2))
(new ImmutableBytesWritable,put)
}
).saveAsHadoopDataset(jobConf)
Note: when saving to HBase, the HBase jars must be added to SPARK_CLASSPATH at runtime.
Reference: http://lxw1234.com/archives/2015/07/332.htm
saveAsNewAPIHadoopFile
def saveAsNewAPIHadoopFile[F <: OutputFormat[K, V]](path: String)(implicit fm: ClassTag[F]): Unit
def saveAsNewAPIHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: OutputFormat[_, _]], conf: Configuration = self.context.hadoopConfiguration): Unit
saveAsNewAPIHadoopFile saves RDD data to HDFS using the new Hadoop API.
Usage is basically the same as saveAsHadoopFile.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import SparkContext._
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable
var rdd1 = sc.makeRDD(Array(("A",2),("A",1),("B",6),("B",3),("B",7)))
rdd1.saveAsNewAPIHadoopFile("/tmp/lxw1234/",classOf[Text],classOf[IntWritable],classOf[TextOutputFormat[Text,IntWritable]])
saveAsNewAPIHadoopDataset
def saveAsNewAPIHadoopDataset(conf: Configuration): Unit
Same purpose as saveAsHadoopDataset, but with the new Hadoop API.
Writing to HBase as an example:
Create the HBase table:
create 'lxw1234',{NAME => 'f1',VERSIONS => 1},{NAME => 'f2',VERSIONS => 1},{NAME => 'f3',VERSIONS => 1}
The complete Spark application:
package com.lxw1234.test
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import SparkContext._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put
object Test {
def main(args : Array[String]) {
val sparkConf = new SparkConf().setMaster("spark://lxw1234.com:7077").setAppName("lxw1234.com")
val sc = new SparkContext(sparkConf);
var rdd1 = sc.makeRDD(Array(("A",2),("B",6),("C",7)))
sc.hadoopConfiguration.set("hbase.zookeeper.quorum ","zkNode1,zkNode2,zkNode3")
sc.hadoopConfiguration.set("zookeeper.znode.parent","/hbase")
sc.hadoopConfiguration.set(TableOutputFormat.OUTPUT_TABLE,"lxw1234")
var job = new Job(sc.hadoopConfiguration)
job.setOutputKeyClass(classOf[ImmutableBytesWritable])
job.setOutputValueClass(classOf[Result])
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
rdd1.map(
x => {
var put = new Put(Bytes.toBytes(x._1))
put.add(Bytes.toBytes("f1"), Bytes.toBytes("c1"), Bytes.toBytes(x._2))
(new ImmutableBytesWritable,put)
}
).saveAsNewAPIHadoopDataset(job.getConfiguration)
sc.stop()
}
}
mapPartitions
mapPartitions can be read backwards: partition first, then map a function over each whole partition.
When to use it
If the mapping has to create expensive extra objects, mapPartitions is much more efficient than map.
For example, when writing all the data of an RDD to a database over JDBC, map would need a connection per element, which is very costly; with mapPartitions one connection per partition is enough.
The example below squares every element (a Scala sketch of the same thing follows the Java code).
Java: square every element
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
JavaRDD<Integer> javaRDD1 = javaRDD.mapPartitions(new FlatMapFunction<Iterator<Integer>, Integer>() {
@Override
public Iterator<Integer> call(Iterator<Integer> it) throws Exception {
ArrayList<Integer> res = new ArrayList<>();
while (it.hasNext()) {
Integer i = it.next();
res.add(i * i);
}
return res.iterator();
}
});
for (Integer integer : javaRDD1.collect()) {
System.out.println(integer);
}
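For comparison, a Scala sketch of the same squaring logic (my own, assuming sc); iter.map is lazy, so the whole partition is processed without materializing an intermediate list:
val squareRdd: RDD[Int] = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7))
squareRdd.mapPartitions(iter => iter.map(i => i * i)).foreach(println)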
Turn every number i into a pair (i, i*i)
Java: turn every element into (i, i*i)
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
JavaRDD<Tuple2<Integer, Integer>> tuple2JavaRDD = javaRDD.mapPartitions(new FlatMapFunction<Iterator<Integer>, Tuple2<Integer, Integer>>() {
@Override
public Iterator<Tuple2<Integer, Integer>> call(Iterator<Integer> it) throws Exception {
ArrayList<Tuple2<Integer, Integer>> tuple2s = new ArrayList<>();
while (it.hasNext()) {
Integer i = it.next();
tuple2s.add(new Tuple2<>(i, i * i));
}
return tuple2s.iterator();
}
});
for (Tuple2<Integer, Integer> tuple2 : tuple2JavaRDD.collect()) {
System.out.println(tuple2);
}
Scala: turn every element into (i, i*i)
val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6))
def mapPar(iter:Iterator[Int]): Iterator[(Int,Int)] ={
var res: List[(Int, Int)] = List[(Int,Int)]()
while (iter.hasNext){
val i: Int = iter.next()
res= res.+:(i,i*i)
}
res.iterator
}
val rdd1: RDD[(Int, Int)] = rdd.mapPartitions(mapPar)
rdd1.foreach(println)
mapPartitions on a key-value RDD: turn (i, j) into (i, j*j)
Scala version
val rdd: RDD[(Int, Int)] = sc.parallelize(List((1,1),(1,2),(1,3)))
def mapPar(iter:Iterator[(Int,Int)]): Iterator[(Int,Int)] ={
var res: List[(Int, Int)] = List[(Int,Int)]()
while (iter.hasNext){
val tuple: (Int, Int) = iter.next()
res= res.+:(tuple._1,tuple._2*tuple._2)
}
res.iterator
}
val rdd1: RDD[(Int, Int)] = rdd.mapPartitions(mapPar)
rdd1.foreach(println)
Java version
JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(
new Tuple2<>(1, 1),
new Tuple2<>(1, 2),
new Tuple2<>(1, 3)
));
JavaRDD<Tuple2<Integer, Integer>> tuple2JavaRDD = javaRDD.mapPartitions(new FlatMapFunction<Iterator<Tuple2<Integer, Integer>>, Tuple2<Integer, Integer>>() {
@Override
public Iterator<Tuple2<Integer, Integer>> call(Iterator<Tuple2<Integer, Integer>> tuple2Iterator) throws Exception {
ArrayList<Tuple2<Integer, Integer>> tuple2s = new ArrayList<>();
while (tuple2Iterator.hasNext()) {
Tuple2<Integer, Integer> next = tuple2Iterator.next();
tuple2s.add(new Tuple2<Integer, Integer>(next._1, next._2 * next._2));
}
return tuple2s.iterator();
}
});
for (Tuple2<Integer, Integer> tuple2 : tuple2JavaRDD.collect()) {
System.out.println(tuple2);
}
mapPartitionsWithIndex
Similar to mapPartitions, this also maps over each partition, but the function additionally receives the partition index. The example below tags each element with the partition it lives in (with a small change you could count the elements per partition).
java
JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
JavaRDD<Tuple2<Integer, Integer>> tuple2JavaRDD = javaRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<Integer>, Iterator<Tuple2<Integer, Integer>>>() {
@Override
public Iterator<Tuple2<Integer, Integer>> call(Integer v1, Iterator<Integer> v2) throws Exception {
ArrayList<Tuple2<Integer, Integer>> tuple2s = new ArrayList<>();
while (v2.hasNext()) {
Integer next = v2.next();
tuple2s.add(new Tuple2<>(v1, next));
}
return tuple2s.iterator();
}
}, false);
for (Tuple2<Integer, Integer> tuple2 : tuple2JavaRDD.collect()) {
System.out.println(tuple2);
}
scala
val rdd: RDD[Int] = sc.parallelize(List(1,2,3,4,5))
def mapPar(i:Int,iter:Iterator[Int]):Iterator[(Int,Int)]={
var tuples: List[(Int, Int)] = List[(Int,Int)]()
while (iter.hasNext){
val x: Int = iter.next()
tuples=tuples.+:(i,x)
}
tuples.iterator
}
val res: RDD[(Int, Int)] = rdd.mapPartitionsWithIndex(mapPar)
res.foreach(println)
mapPartitionsWithIndex: tag the elements of a key-value RDD with their partition index
Scala version
val rdd: RDD[(Int, Int)] = sc.parallelize(List((1,1),(1,2),(2,3),(2,4)))
def mapPar(i:Int,iter:Iterator[(Int,Int)]):Iterator[(Int,(Int,Int))]={
var tuples: List[(Int, (Int, Int))] = List[(Int,(Int,Int))]()
while (iter.hasNext){
val tuple: (Int, Int) = iter.next()
tuples=tuples.::(i,tuple)
}
tuples.iterator
}
val res: RDD[(Int, (Int, Int))] = rdd.mapPartitionsWithIndex(mapPar)
res.foreach(println)
Java version
JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(
new Tuple2<>(1, 1),
new Tuple2<>(1, 2),
new Tuple2<>(2, 1),
new Tuple2<>(2, 2),
new Tuple2<>(3, 1)
));
JavaPairRDD<Integer, Integer> javaPairRDD = JavaPairRDD.fromJavaRDD(javaRDD);
JavaRDD<Tuple2<Integer, Tuple2<Integer, Integer>>> mapPartitionIndexRDD = javaPairRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<Tuple2<Integer, Integer>>, Iterator<Tuple2<Integer, Tuple2<Integer, Integer>>>>() {
@Override
public Iterator<Tuple2<Integer, Tuple2<Integer, Integer>>> call(Integer partIndex, Iterator<Tuple2<Integer, Integer>> tuple2Iterator) {
ArrayList<Tuple2<Integer, Tuple2<Integer, Integer>>> tuple2s = new ArrayList<>();
while (tuple2Iterator.hasNext()) {
Tuple2<Integer, Integer> next = tuple2Iterator.next();
tuple2s.add(new Tuple2<Integer, Tuple2<Integer, Integer>>(partIndex, next));
}
return tuple2s.iterator();
}
}, false);
for (Tuple2<Integer, Tuple2<Integer, Integer>> tuple2 : mapPartitionIndexRDD.collect()) {
System.out.println(tuple2);
}
Addendum: to print the contents of each partition you can also use glom
JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(
new Tuple2<>(1, 1),
new Tuple2<>(1, 2),
new Tuple2<>(2, 1),
new Tuple2<>(2, 2),
new Tuple2<>(3, 1)
));
JavaPairRDD<Integer, Integer> javaPairRDD = JavaPairRDD.fromJavaRDD(javaRDD);
JavaRDD<List<Tuple2<Integer, Integer>>> glom = javaPairRDD.glom();
for (List<Tuple2<Integer, Integer>> tuple2s : glom.collect()) {
System.out.println(tuple2s);
}
Default partitioning and HashPartitioner
The default partitioner is the HashPartitioner, so the default behaviour is not described separately; below is how to use HashPartitioner explicitly.
Using the mapPartitionsWithIndex example from the previous section, we can build a helper method that shows how an RDD is partitioned
val rdd: RDD[(Int, Int)] = sc.parallelize(List((1,1),(1,2),(2,3),(2,4)))
def mapPar(i:Int,iter:Iterator[(Int,Int)]):Iterator[(Int,(Int,Int))]={
var tuples: List[(Int, (Int, Int))] = List[(Int,(Int,Int))]()
while (iter.hasNext){
val tuple: (Int, Int) = iter.next()
tuples=tuples.::(i,tuple)
}
tuples.iterator
}
def printMapPar(rdd:RDD[(Int,Int)]): Unit ={
val rdd1: RDD[(Int, (Int, Int))] = rdd.mapPartitionsWithIndex(mapPar)
rdd1.foreach(println)
}
printMapPar(rdd)
HashPartitioner partitioning, Scala
Use pairRdd.partitionBy(new HashPartitioner(n)) (org.apache.spark.HashPartitioner) to split the data into n partitions
val rdd: RDD[(Int, Int)] = sc.parallelize(List((1,1),(1,2),(2,3),(2,4)))
val rdd1: RDD[(Int, Int)] = rdd.partitionBy(new org.apache.spark.HashPartitioner(3))
rdd1.foreach(println)
How HashPartitioner decides the partition: many explanations circulating online are inaccurate; a better description from an English source is: "Uses Java's Object.hashCode method to determine the partition as partition = key.hashCode() % numPartitions." In other words, the key's hashCode determines the partition; for a pair RDD the partition is key.hashCode() % numPartitions, so 3 % 3 = 0 puts the element (3,6) in partition 0 and 4 % 3 = 1 puts (4,8) in partition 1.
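You can check the formula directly with HashPartitioner.getPartition; a quick sketch:
val hp = new org.apache.spark.HashPartitioner(3)
println(hp.getPartition(3))  // 3.hashCode % 3 = 0, so key 3 lands in partition 0
println(hp.getPartition(4))  // 4.hashCode % 3 = 1, so key 4 lands in partition 1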
RangePartitioner
I think of it as a range partitioner.
It assigns keys to partitions by ranges of keys, which suits keys that have a natural ordering and are non-negative. This article only covers usage (the internals are left for another time); the snippet below shows RangePartitioner in action.
val rdd: RDD[(Int, Int)] = sc.parallelize(List((1,1),(1,2),(2,3),(2,4)))
def mapPar(i:Int,iter:Iterator[(Int,Int)]):Iterator[(Int,(Int,Int))]={
var tuples: List[(Int, (Int, Int))] = List[(Int,(Int,Int))]()
while (iter.hasNext){
val tuple: (Int, Int) = iter.next()
tuples=tuples.::(i,tuple)
}
tuples.iterator
}
def printMapPar(rdd:RDD[(Int,Int)]): Unit ={
val rdd1: RDD[(Int, (Int, Int))] = rdd.mapPartitionsWithIndex(mapPar)
rdd1.foreach(println)
}
printMapPar(rdd)
println("----------------")
val rdd1: RDD[(Int, Int)] = rdd.partitionBy(new RangePartitioner(3,rdd))
printMapPar(rdd1)
The RDD above is generated in an arbitrary order, but RangePartitioner splits the keys into three contiguous ranges, so keys in the same range land in the same partition (with keys 1 to 5, for example, keys 1 and 2 would fall into the first partition, 3 and 4 into the second, and 5 into the third).
Custom partitioner
To implement your own partitioner, extend org.apache.spark.Partitioner and implement the methods below
- numPartitions: Int: returns the number of partitions to create.
- getPartition(key: Any): Int: returns the partition id (0 to numPartitions-1) for the given key.
Below I define a partitioner that sends keys >= 4 to the first partition, keys >= 2 and < 4 to the second partition, and everything else to the third partition.
Scala version
The custom partitioner
class RddScala(numParts:Int) extends Partitioner{
override def numPartitions: Int = numParts
override def getPartition(key: Any): Int = {
if (key.toString.toInt>=4){
0
}else if(key.toString.toInt>=2&&key.toString.toInt<4){
1
}else{
2
}
}
}
Partition the data, then call the printMapPar method we wrote earlier to print the contents of each partition
printMapPar(rdd)
println("----------------")
val rdd1: RDD[(Int, Int)] = rdd.partitionBy(new RddScala(3))
printMapPar(rdd1)
(0,(3,5))
(1,(1,2))
(0,(2,4))
(1,(2,3))
(0,(5,9))
(1,(4,8))
(0,(5,10))
(1,(4,7))
(0,(1,1))
(1,(3,6))
----------------
(1,(2,3))
(1,(3,6))
(1,(3,5))
(1,(2,4))
(0,(4,8))
(0,(4,7))
(0,(5,9))
(0,(5,10))
(2,(1,2))
(2,(1,1))
Partitioner usage in Java
Similarly, we first write a method that prints every element of each partition of an RDD
The printPartRdd helper that prints each partition's elements
public static void printPartRdd(JavaPairRDD<Integer,Integer> pairRDD){
JavaRDD<Tuple2<Integer, Tuple2<Integer, Integer>>> mapPartitionsWithIndex = pairRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<Tuple2<Integer, Integer>>, Iterator<Tuple2<Integer, Tuple2<Integer, Integer>>>>() {
@Override
public Iterator<Tuple2<Integer, Tuple2<Integer, Integer>>> call(Integer v1, Iterator<Tuple2<Integer, Integer>> v2) throws Exception {
ArrayList<Tuple2<Integer, Tuple2<Integer, Integer>>> list = new ArrayList<>();
while (v2.hasNext()) {
Tuple2<Integer, Integer> next = v2.next();
list.add(new Tuple2<>(v1, next));
}
return list.iterator();
}
}, false);
for (Tuple2<Integer, Tuple2<Integer, Integer>> tuple2 : mapPartitionsWithIndex.collect()) {
System.out.println(tuple2);
}
}
Java HashPartitioner partitioning
JavaRDD<Tuple2<Integer, Integer>> javaRDD = sc.parallelize(Arrays.asList(new Tuple2<Integer, Integer>(1, 1), new Tuple2<Integer, Integer>(1, 2)
, new Tuple2<Integer, Integer>(2, 3), new Tuple2<Integer, Integer>(2, 4)
, new Tuple2<Integer, Integer>(3, 5), new Tuple2<Integer, Integer>(3, 6)
, new Tuple2<Integer, Integer>(4, 7), new Tuple2<Integer, Integer>(4, 8)
, new Tuple2<Integer, Integer>(5, 9), new Tuple2<Integer, Integer>(5, 10)
), 3);
JavaPairRDD<Integer, Integer> javaPairRDD = JavaPairRDD.fromJavaRDD(javaRDD);
JavaPairRDD<Integer, Integer> partitionRDD = javaPairRDD.partitionBy(new HashPartitioner(3));
printPartRdd(partitionRDD);
(0,(3,5))
(0,(3,6))
(1,(1,1))
(1,(1,2))
(1,(4,7))
(1,(4,8))
(2,(2,3))
(2,(2,4))
(2,(5,9))
(2,(5,10))
Java custom partitioner
A custom partitioner: keys >= 4 go to the first partition, keys in [2, 4) go to the second, and the rest go to the third
public class JavaCustomPart extends Partitioner {
int i = 1;
public JavaCustomPart(int i) {
this.i = i;
}
public JavaCustomPart() {
}
@Override
public int numPartitions() {
return i;
}
@Override
public int getPartition(Object key) {
int keyCode = Integer.parseInt(key.toString());
if (keyCode >= 4) {
return 0;
} else if (keyCode >= 2 && keyCode < 4) {
return 1;
} else {
return 2;
}
}
}
Partition and print
JavaPairRDD<Integer, Integer> javaPairRDD = JavaPairRDD.fromJavaRDD(javaRDD);
JavaPairRDD<Integer, Integer> partitionRDD = javaPairRDD.partitionBy(new JavaCustomPart(3));
printPartRdd(partitionRDD);
(0,(4,7))
(0,(4,8))
(0,(5,9))
(0,(5,10))
(1,(2,3))
(1,(2,4))
(1,(3,5))
(1,(3,6))
(2,(1,1))
(2,(1,2))