3 Getting Started with Spark: distinct, union, intersection, subtract, cartesian, and other set-style operations

This post records a few simple Spark operations, such as deduplication, union, and intersection. Whether or not you end up needing them, it serves as a reference for later.

distinct: remove duplicates

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;

/**
 * Removes duplicate elements. Note that this involves a shuffle, which makes it an expensive operation.
 * @author wuweifeng wrote on 2018/4/16.
 */
public class TestDistinct {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("TestDistinct").master("local").getOrCreate();
        // obtain a JavaSparkContext from the SparkSession so we can create RDDs from local collections
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());
        List<Integer> data = Arrays.asList(1, 1, 2, 3, 4, 5);
        JavaRDD<Integer> originRDD = javaSparkContext.parallelize(data);
        List<Integer> results = originRDD.distinct().collect();
        System.out.println(results);
    }
}

The result is [4, 1, 3, 5, 2]. The order is not guaranteed, because distinct redistributes the data during the shuffle.
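distinct also has an overload that takes a target number of partitions, which controls the parallelism of the shuffle it performs. A minimal sketch (my own addition, reusing originRDD from the example above):

// distinct(numPartitions) shuffles the deduplicated data into the given number of partitions
List<Integer> resultsInTwoPartitions = originRDD.distinct(2).collect();
System.out.println(resultsInTwoPartitions); // same elements; order is still not guaranteed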

union: merge without deduplication

This simply concatenates two RDDs into one.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;

/**
 * Merges two RDDs into one, keeping duplicates.
 * @author wuweifeng wrote on 2018/4/16.
 */
public class TestUnion {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("TestUnion").master("local").getOrCreate();
        // obtain a JavaSparkContext from the SparkSession so we can create RDDs from local collections
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());
        List<Integer> one = Arrays.asList(1, 2, 3, 4, 5);
        List<Integer> two = Arrays.asList(1, 6, 7, 8, 9);
        JavaRDD<Integer> oneRDD = javaSparkContext.parallelize(one);
        JavaRDD<Integer> twoRDD = javaSparkContext.parallelize(two);
        List<Integer> results = oneRDD.union(twoRDD).collect();
        System.out.println(results);
    }
}

The result is [1, 2, 3, 4, 5, 1, 6, 7, 8, 9]. Note that 1 appears twice: union simply concatenates and does not deduplicate (and, unlike distinct, it does not shuffle).
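If you want set-union semantics instead, you can chain distinct() onto the union. A minimal sketch (my own addition, reusing oneRDD and twoRDD from the example above); note that this reintroduces the cost of a shuffle:

// union keeps duplicates; a following distinct() removes them at the cost of a shuffle
List<Integer> deduped = oneRDD.union(twoRDD).distinct().collect();
System.out.println(deduped); // 1 through 9, each exactly once, in no guaranteed order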

intersection: elements common to both RDDs

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;

/**
 * Returns the intersection of two RDDs.
 * @author wuweifeng wrote on 2018/4/16.
 */
public class TestIntersection {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("TestIntersection").master("local").getOrCreate();
        // obtain a JavaSparkContext from the SparkSession so we can create RDDs from local collections
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());
        List<Integer> one = Arrays.asList(1, 2, 3, 4, 5);
        List<Integer> two = Arrays.asList(1, 6, 7, 8, 9);
        JavaRDD<Integer> oneRDD = javaSparkContext.parallelize(one);
        JavaRDD<Integer> twoRDD = javaSparkContext.parallelize(two);
        List<Integer> results = oneRDD.intersection(twoRDD).collect();
        System.out.println(results);
    }
}

The result is [1]. Unlike union, intersection deduplicates its output, and like distinct it involves a shuffle.
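To make the deduplication behavior concrete, here is a small sketch of my own (not from the original): even when a common element is duplicated in the inputs, it appears only once in the result.

// intersection removes duplicates from its output, even if the inputs contain them
JavaRDD<Integer> dupRDD = javaSparkContext.parallelize(Arrays.asList(1, 1, 2, 2));
JavaRDD<Integer> otherRDD = javaSparkContext.parallelize(Arrays.asList(1, 2, 3));
System.out.println(dupRDD.intersection(otherRDD).collect()); // [1, 2] (order not guaranteed), each element once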

subtract

RDD1.subtract(RDD2) returns the elements that appear in RDD1 but not in RDD2, without deduplicating them.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;

/**
 * Returns the elements that appear in one RDD but not in another (subtract).
 * @author wuweifeng wrote on 2018/4/16.
 */
public class TestSubtract {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("TestSubtract").master("local").getOrCreate();
        // obtain a JavaSparkContext from the SparkSession so we can create RDDs from local collections
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());
        List<Integer> one = Arrays.asList(1, 2, 3, 4, 5);
        List<Integer> two = Arrays.asList(1, 6, 7, 8, 9);
        JavaRDD<Integer> oneRDD = javaSparkContext.parallelize(one);
        JavaRDD<Integer> twoRDD = javaSparkContext.parallelize(two);

        List<Integer> results = oneRDD.subtract(twoRDD).collect();
        System.out.println(results);
    }
}

The result is [2, 3, 4, 5].
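To illustrate the "no deduplication" point, here is a small sketch of my own (not from the original): duplicates in the left-hand RDD survive the subtraction.

// duplicates on the left side are kept; subtract is not a strict set difference
JavaRDD<Integer> left = javaSparkContext.parallelize(Arrays.asList(1, 1, 2, 3));
JavaRDD<Integer> right = javaSparkContext.parallelize(Arrays.asList(3));
System.out.println(left.subtract(right).collect()); // [1, 1, 2] (order not guaranteed): the duplicate 1 is kept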

cartesian: the Cartesian product

The Cartesian product pairs every element of one RDD with every element of the other, so it is very expensive. For example, if A is ["a","b","c"] and B is ["1","2","3"], then A.cartesian(B) is (a,1) (a,2) (a,3) (b,1) (b,2) (b,3) (c,1) (c,2) (c,3).

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;

/**
 * Returns the Cartesian product of two RDDs; very expensive.
 * @author wuweifeng wrote on 2018/4/16.
 */
public class TestCartesian {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("TestCartesian").master("local").getOrCreate();
        // obtain a JavaSparkContext from the SparkSession so we can create RDDs from local collections
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());
        List<Integer> one = Arrays.asList(1, 2, 3);
        List<Integer> two = Arrays.asList(1, 4, 5);
        JavaRDD<Integer> oneRDD = javaSparkContext.parallelize(one);
        JavaRDD<Integer> twoRDD = javaSparkContext.parallelize(two);
        List<Tuple2<Integer, Integer>> results = oneRDD.cartesian(twoRDD).collect();
        System.out.println(results);
    }
}

Note that what comes back is a pair RDD, so the collected result is a list of Tuple2 key-value pairs:

[(1,1), (1,4), (1,5), (2,1), (2,4), (2,5), (3,1), (3,4), (3,5)]
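Because cartesian returns a JavaPairRDD, you can keep transforming the tuples directly. A minimal sketch of my own (it additionally assumes an import of org.apache.spark.api.java.JavaPairRDD), multiplying each pair:

// map each (a, b) tuple of the product to a * b
JavaPairRDD<Integer, Integer> pairs = oneRDD.cartesian(twoRDD);
List<Integer> products = pairs.map(t -> t._1() * t._2()).collect();
System.out.println(products); // e.g. [1, 4, 5, 2, 8, 10, 3, 12, 15]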

