Spark RDD Operators (3): distinct, union, intersection, subtract, cartesian

distinct

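distinct removes duplicates. An RDD we build may contain repeated elements, and calling distinct() drops them. The method requires a shuffle, however, so it is an expensive operation.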
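Why distinct needs a shuffle can be seen in a small plain-Java model. The sketch below is not Spark code and is not Spark's implementation. It only assumes that equal elements are routed to the same bucket by their hash, after which each bucket can drop repeats locally. The names DistinctShuffleSketch and distinctByHash are hypothetical, made up for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DistinctShuffleSketch {
    // Route every element to a bucket chosen by its hash, so equal elements
    // always meet in the same bucket; each bucket's set then drops repeats.
    static <T> List<Set<T>> distinctByHash(List<T> input, int numBuckets) {
        List<Set<T>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new LinkedHashSet<>());
        }
        for (T element : input) {
            // floorMod keeps the bucket index non-negative for negative hashes
            int bucket = Math.floorMod(element.hashCode(), numBuckets);
            buckets.get(bucket).add(element);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 2, 6);
        System.out.println(distinctByHash(input, 3));
        // prints [[3, 6, 9], [1, 4, 7], [2, 5, 8]]
    }
}
```

With the input from the Scala example and three buckets, the model groups the values as {3, 6, 9}, {1, 4, 7} and {2, 5, 8}. That is consistent with the grouping visible in the Scala output of this post (6,3,9 then 4,1,7 then 8,5,2) when run with local[3]. The order inside each group is an implementation detail that this sketch does not model.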

union

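Merges two RDDs into one. Duplicates are kept; chain distinct() afterwards if you need each element only once.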

intersection

RDD1.intersection(RDD2) returns the elements that appear in both RDDs, with duplicates removed.
intersection has to shuffle data, so it is relatively expensive.

subtract

RDD1.subtract(RDD2) returns the elements that appear in RDD1 but not in RDD2. The result is not deduplicated.
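The difference in duplicate handling between intersection and subtract is easy to miss, so here is a plain-Java model of the two behaviors described above. It uses only java.util, not Spark, and the names SetOpsSketch, intersection and subtract are hypothetical helpers written for this illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SetOpsSketch {
    // Models intersection: keep elements found in both lists, each one once.
    static <T> List<T> intersection(List<T> left, List<T> right) {
        Set<T> result = new LinkedHashSet<>(left); // drops repeats from left
        result.retainAll(new HashSet<>(right));    // keeps only shared elements
        return new ArrayList<>(result);
    }

    // Models subtract: drop every element that occurs in right, but keep
    // whatever repeats remain in left.
    static <T> List<T> subtract(List<T> left, List<T> right) {
        Set<T> toRemove = new HashSet<>(right);
        List<T> result = new ArrayList<>();
        for (T element : left) {
            if (!toRemove.contains(element)) {
                result.add(element);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> left = List.of("aa", "aa", "bb", "cc", "dd", "dd");
        List<String> right = List.of("aa", "bb", "cc");
        System.out.println(intersection(left, right)); // prints [aa, bb, cc]
        System.out.println(subtract(left, right));     // prints [dd, dd]
    }
}
```

The inputs are the same lists used in the Java examples of this post. The model returns elements in input order, whereas Spark gives no ordering guarantee for either operator.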

cartesian

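RDD1.cartesian(RDD2) returns the Cartesian product of RDD1 and RDD2, that is, every pair (a, b) with a taken from RDD1 and b taken from RDD2. The result has |RDD1| × |RDD2| elements, so this operation is very expensive.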
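To get a feel for how quickly the result grows, the following plain-Java sketch builds the same kind of pair list with two nested loops. It is a model of the semantics only, not Spark code, and the names CartesianSketch and cartesian are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class CartesianSketch {
    // Pair every element of left with every element of right.
    // The result always holds left.size() * right.size() pairs.
    static <A, B> List<Map.Entry<A, B>> cartesian(List<A> left, List<B> right) {
        List<Map.Entry<A, B>> pairs = new ArrayList<>(left.size() * right.size());
        for (A a : left) {
            for (B b : right) {
                pairs.add(Map.entry(a, b));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // 3 x 3 inputs, as in the Java example of this post
        System.out.println(cartesian(List.of("1", "2", "3"), List.of("a", "b", "c")).size()); // prints 9
        // 5 x 4 inputs, as in the Scala example of this post
        System.out.println(cartesian(List.of(1, 2, 3, 5, 6), List.of(1, 2, 3, 4)).size());    // prints 20
    }
}
```

Two inputs of one million elements each would already produce 10^12 pairs, which is why the operator is rarely practical on anything but small RDDs.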

Code examples

Scala code

import org.apache.spark.{SparkConf, SparkContext}

// Run locally with 3 worker threads
val conf = new SparkConf().setMaster("local[3]").setAppName("rdddemo")
val sc = SparkContext.getOrCreate(conf)

println("---------distinct----------")
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9,9,2,6))
val rdd2 = rdd1.distinct()
println(rdd2.collect.mkString(","))

println("---------union-----------")
val RDD1 = sc.parallelize(List(1,2,3,5,6))
val RDD2 = sc.parallelize(List(1,2,3,4))
val unionRdd3 = RDD1.union(RDD2)
print(unionRdd3.collect.mkString(","))
println()

println("---------++-----------")
val jiajiaRdd1 = RDD1 ++ RDD2
print(jiajiaRdd1.collect.mkString(","))
println()

println("---------intersection-----------")
val intersectionRdd = RDD1.intersection(RDD2)
print(intersectionRdd.collect.mkString(","))
println()

println("---------subtract--------")
val subtractRdd = RDD1.subtract(RDD2)
print(subtractRdd.collect.mkString(","))
println()

println("-------cartesian-------")
val cartesianRdd = RDD1.cartesian(RDD2)
println(cartesianRdd.collect.mkString(","))
println()

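Output

The order of elements may differ between runs, because Spark does not guarantee ordering after a shuffle.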

---------distinct----------
6,3,9,4,1,7,8,5,2
---------union-----------
1,2,3,5,6,1,2,3,4
---------++-----------
1,2,3,5,6,1,2,3,4
---------intersection-----------
3,1,2
---------subtract--------
6,5
-------cartesian-------
(1,1),(1,2),(1,3),(1,4),(2,1),(3,1),(2,2),(3,2),(2,3),(2,4),(3,3),(3,4),(5,1),(6,1),(5,2),(6,2),(5,3),(5,4),(6,3),(6,4)

Java code

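import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// The examples below reuse variable names (strings, collect, ...),
// so run each one in its own scope, for example its own method.
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("distinctjava");
JavaSparkContext sc = new JavaSparkContext(conf);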

// distinct: remove duplicate elements
System.out.println("-----distinct-----");
List<String> strings = Arrays.asList("aa", "aa", "bb", "cc", "dd", "dd");
JavaRDD<String> strRdd = sc.parallelize(strings);
JavaRDD<String> distinctRdd = strRdd.distinct();
List<String> collect = distinctRdd.collect();
for (String str : collect) {
    System.out.println(str);
}

// union: concatenate two RDDs, keeping duplicates
System.out.println("-----union-----");
List<String> strings = Arrays.asList("aa", "aa", "bb", "cc", "dd", "dd");
List<String> strings2 = new ArrayList<String>();
strings2.add("aa");
strings2.add("bb");
strings2.add("cc");

JavaRDD<String> strRdd1 = sc.parallelize(strings);
JavaRDD<String> strRdd2 = sc.parallelize(strings2);
JavaRDD<String> unionRdd = strRdd1.union(strRdd2);
List<String> collect = unionRdd.collect();
for (String str : collect) {
    System.out.print(str + " ");
}
System.out.println(); // end the line after the space-separated elements

// intersection: elements present in both RDDs, deduplicated
System.out.println("-----intersection-----");
List<String> strings = Arrays.asList("aa", "aa", "bb", "cc", "dd", "dd");
List<String> strings2 = new ArrayList<String>();
strings2.add("aa");
strings2.add("bb");
strings2.add("cc");

JavaRDD<String> strRdd1 = sc.parallelize(strings);
JavaRDD<String> strRdd2 = sc.parallelize(strings2);
JavaRDD<String> intersectionRdd = strRdd1.intersection(strRdd2);
List<String> collect = intersectionRdd.collect();
for (String str : collect) {
    System.out.println(str);
}

// subtract: elements of strRdd1 that do not appear in strRdd2, not deduplicated
System.out.println("-----subtract-----");
List<String> strings = Arrays.asList("aa", "aa", "bb", "cc", "dd", "dd");
List<String> strings2 = new ArrayList<String>();
strings2.add("aa");
strings2.add("bb");
strings2.add("cc");

JavaRDD<String> strRdd1 = sc.parallelize(strings);
JavaRDD<String> strRdd2 = sc.parallelize(strings2);
JavaRDD<String> subtractRdd = strRdd1.subtract(strRdd2);
List<String> collect = subtractRdd.collect();
for (String str : collect) {
    System.out.println(str);
}

// cartesian: every pair (a, b) with a from rdd1 and b from rdd2
System.out.println("-----cartesian-----");
JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList("1", "2", "3"));
JavaRDD<String> rdd2 = sc.parallelize(Arrays.asList("a", "b", "c"));
JavaPairRDD<String, String> cartesianRdd = rdd1.cartesian(rdd2);
List<Tuple2<String, String>> collect = cartesianRdd.collect();
for (Tuple2<String, String> pair : collect) {
    System.out.print(pair); // print on one line, matching the output shown below
}
System.out.println();

Output

-----distinct-----
aa
dd
bb
cc
-----union-----
aa aa bb cc dd dd aa bb cc 
-----intersection-----
aa
bb
cc
-----subtract-----
dd
dd
-----cartesian-----
(1,a)(1,b)(1,c)(2,a)(2,b)(2,c)(3,a)(3,b)(3,c)