Spark (8): Spark RDD API (Course 16)

1. RDD Operations

Transformation:

A transformation of data state, i.e. an operator: it creates a new RDD from an existing one.

Action:

Triggers a job; it is the final operation that actually produces a result. Because RDDs are lazily evaluated, Spark traces the lineage backwards from the action before executing anything, which makes the engine very efficient. Examples are foreach/reduce/saveAsTextFile, which can save results to HDFS or return them to the Driver.

Controller:

Support for performance, efficiency, and fault tolerance, namely cache/persist/checkpoint.
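
A minimal PySpark sketch of the three controller operations, assuming an existing SparkContext sc (as the snippets below do); the checkpoint directory path is an illustrative assumption:

# cache / persist / checkpoint
from pyspark import StorageLevel

rdd = sc.parallelize(range(100))
cached = rdd.map(lambda v: v * v).cache()        # cache() == persist(MEMORY_ONLY)
persisted = rdd.filter(lambda v: v % 2 == 0).persist(StorageLevel.MEMORY_AND_DISK)

sc.setCheckpointDir("/tmp/spark-checkpoints")    # illustrative path
persisted.checkpoint()                           # truncates the lineage at this RDD
print(cached.count(), persisted.count())         # actions trigger the caching/checkpointing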


2. Transformation API

① map(func)

Passes each element of the RDD through a user-defined function and builds a new RDD from the returned elements.
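
A quick PySpark illustration in the style of the snippets later in this post (sample data is mine):

# map
x = sc.parallelize([1, 2, 3])
y = x.map(lambda v: v * 2)
print(y.collect())   # [2, 4, 6]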

② filter

filter evaluates a predicate on each element of the RDD: elements for which it returns true are kept, those returning false are dropped.
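
Sketch (my own sample data):

# filter
x = sc.parallelize([1, 2, 3, 4, 5])
y = x.filter(lambda v: v % 2 == 0)
print(y.collect())   # [2, 4]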

③ flatMap

flatMap is similar to map, but each input element may produce zero or more output elements.
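
Sketch:

# flatMap
x = sc.parallelize(["hello spark", "hello java"])
y = x.flatMap(lambda line: line.split(" "))
print(y.collect())   # ['hello', 'spark', 'hello', 'java']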

④ groupByKey

Groups the elements by key; each key maps to an Iterable of its values.
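
Sketch (sorting the result only to make the output deterministic):

# groupByKey
x = sc.parallelize([('B', 1), ('B', 2), ('A', 3), ('A', 4)])
y = x.groupByKey()   # each key maps to an iterable of its values
print(sorted((k, sorted(v)) for k, v in y.collect()))   # [('A', [3, 4]), ('B', [1, 2])]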


⑤ reduceByKey

Reduces all the values of each key with the given function.
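
Sketch:

# reduceByKey
x = sc.parallelize([('B', 1), ('B', 2), ('A', 3), ('A', 4)])
y = x.reduceByKey(lambda a, b: a + b)
print(sorted(y.collect()))   # [('A', 7), ('B', 3)]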

⑥ sortByKey

Sorts the RDD's elements by key.
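
Sketch:

# sortByKey
x = sc.parallelize([('C', 3), ('A', 1), ('B', 2)])
print(x.sortByKey().collect())                   # [('A', 1), ('B', 2), ('C', 3)]
print(x.sortByKey(ascending=False).collect())    # [('C', 3), ('B', 2), ('A', 1)]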

⑦ join

Joins two RDDs of key-value pairs; for every key present in both RDDs, each matching pair of values is emitted and can be passed to a user-defined function for processing.


# join
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('A',8),('B',7),('A',6),('D',5)])
z = x.join(y)
print(x.collect())
print(y.collect())
print(z.collect())

[('C', 4), ('B', 3), ('A', 2), ('A', 1)]
[('A', 8), ('B', 7), ('A', 6), ('D', 5)]
[('A', (2, 8)), ('A', (2, 6)), ('A', (1, 8)), ('A', (1, 6)), ('B', (3, 7))]

⑧ cogroup

Like join, except that for each key the entire Iterable of values from each RDD is passed to the user-defined function.
How cogroup differs from join:
in effect, all the values that a key matches are gathered into one Iterable per RDD, rather than producing one output record per matching pair of values.
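
Reusing x and y from the join snippet above, a minimal sketch (key order in the output may vary):

# cogroup
z = x.cogroup(y)
print([(k, (sorted(a), sorted(b))) for k, (a, b) in z.collect()])
# e.g. [('A', ([1, 2], [6, 8])), ('B', ([3], [7])), ('C', ([4], [])), ('D', ([], [5]))]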


A full Java example covering ①–⑧:

package cn.whbing.spark.SparkApps.cores;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;


public class RDDApi{
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("RDD API Test").setMaster("local");

        JavaSparkContext sc = new JavaSparkContext(conf);
        ArrayList<Integer> list = new ArrayList<Integer>();
        for(int i=0;i<100;i++){
            list.add(i+1);
        }   
        JavaRDD<Integer> rdd =  sc.parallelize(list);

        /*1.map*/
        JavaRDD<Integer> rdd2 = rdd.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer v1) throws Exception {
                return v1*2;
            }
        });
        System.out.println("1.map结果:key*2:");
        rdd2.foreach(new VoidFunction<Integer>() {          
            @Override
            public void call(Integer t) throws Exception {
                System.out.print(t+" ");                
            }
        });
        /*end map*/     
        /*2.filter*/
        JavaRDD<Integer> rdd3 = rdd.filter(new Function<Integer, Boolean>() {

            @Override
            public Boolean call(Integer v1) throws Exception {
                if(v1%3==0){
                    return false; // drop multiples of 3
                }
                return true;
            }
        });
        System.out.println("2.filter结果:过滤3的倍数:");
        rdd3.foreach(new VoidFunction<Integer>() {          
            @Override
            public void call(Integer t) throws Exception {
                System.out.print(t+" ");                
            }
        });
        /*end filter*/
        /*3.flatMap*/
        List<String> list2 =  Arrays.asList("hello spark !","hello java","hello today");
        JavaRDD<String> rddFlatMap = sc.parallelize(list2);
        JavaRDD<String> rddFlatMap2 = rddFlatMap.flatMap(new FlatMapFunction<String, String>() {

            @Override
            public Iterator<String> call(String t) throws Exception {

                return Arrays.asList(t.split(" ")).iterator();
            }
        });
        System.out.println("3.flatMap原数据:");
        rddFlatMap.foreach(new VoidFunction<String>() {         
            @Override
            public void call(String t) throws Exception {
                System.out.println(t);              
            }
        });
        System.out.println("3.flatMap结果:对每个key以空格分开");
        rddFlatMap2.foreach(new VoidFunction<String>() {            
            @Override
            public void call(String t) throws Exception {
                System.out.println(t);              
            }
        });
        /*end flatMap*/
        /*4.groupByKey*/
        List<Tuple2<String, Integer>> scoreList = Arrays.asList(
                new Tuple2<String, Integer>("class1", 80),
                new Tuple2<String, Integer>("class2", 90),
                new Tuple2<String, Integer>("class1", 100),
                new Tuple2<String, Integer>("class1", 60)
        );
        JavaPairRDD<String, Integer> rddPair = sc.parallelizePairs(scoreList);
        JavaPairRDD<String, Iterable<Integer>> rddGroupByKey2 = rddPair.groupByKey();
        rddGroupByKey2.foreach(new VoidFunction<Tuple2<String, Iterable<Integer>>>() {

            @Override
            public void call(Tuple2<String, Iterable<Integer>> t) throws Exception {
                System.out.println("4.groupbykey");
                System.out.println("class:"+t._1);
                System.out.println(t._2);                           
            }
        });
        /*end groupbykey*/
        /*5.reduceByKey: total score per class*/
        JavaPairRDD<String, Integer> rddReduceByKey = rddPair.reduceByKey(new Function2<Integer, Integer,Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1+v2;
            }
        }); 
        System.out.println("5.reduce by key:");
        rddReduceByKey.foreach(new VoidFunction<Tuple2<String,Integer>>() {

            @Override
            public void call(Tuple2<String, Integer> t) throws Exception {
                System.out.println(t._1);
                System.out.println(t._2);
            }
        });
        /*end reduce by key*/
        /*6.sortByKey*/
        List<Tuple2<Integer, String>> li = Arrays.asList(
            new Tuple2<Integer, String>(99, "anni"),
            new Tuple2<Integer, String>(88, "tony"),
            new Tuple2<Integer, String>(100, "whb"),
            new Tuple2<Integer, String>(60, "miss")
        );
        JavaPairRDD<Integer, String> rddbysort = sc.parallelizePairs(li);
        JavaPairRDD<Integer, String> rddBysort2 = rddbysort.sortByKey(); // ascending by default
        System.out.println("6. sortByKey result (ascending by default):");
        rddBysort2.foreach(new VoidFunction<Tuple2<Integer,String>>() {         
            @Override
            public void call(Tuple2<Integer, String> t) throws Exception {
                System.out.println(t._1);               
                System.out.println(t._2);               
            }
        }); 
        /*7.join: relate (id, name) and (id, score) pairs by student id*/
        List<Tuple2<Integer, String>> l1 = Arrays.asList(
            new Tuple2<Integer,String>(1,"anni"),
            new Tuple2<Integer,String>(2,"tony"),
            new Tuple2<Integer,String>(3,"ted"),
            new Tuple2<Integer,String>(4,"lucy")
        );

        List<Tuple2<Integer, Integer>> l2 = Arrays.asList(
            new Tuple2<Integer,Integer>(1,80),
            new Tuple2<Integer,Integer>(1,90),
            new Tuple2<Integer,Integer>(2,40),
            new Tuple2<Integer,Integer>(3,66),
            new Tuple2<Integer,Integer>(4,50)
        );
        JavaPairRDD<Integer,String> rddList1 = sc.parallelizePairs(l1);
        JavaPairRDD<Integer,Integer> rddList2 = sc.parallelizePairs(l2);
        JavaPairRDD<Integer, Tuple2<String, Integer>> rddjoin = rddList1.join(rddList2);
        System.out.println("7.join:");
        rddjoin.foreach(new VoidFunction<Tuple2<Integer,Tuple2<String,Integer>>>() {

            @Override
            public void call(Tuple2<Integer, Tuple2<String, Integer>> t) throws Exception {
                System.out.println(t._1);
                System.out.println(t._2);
                System.out.println("id:"+t._1+",name:"+t._2._1+",score:"+t._2._2);
            }
        });
        /*end join*/
        // cogroup differs from join:
        // all the values a key matches are gathered into one Iterable per RDD
        JavaPairRDD<Integer, Tuple2<Iterable<String>, Iterable<Integer>>> rddcogroup = rddList1.cogroup(rddList2);
        System.out.println("8.cogroup:");
        rddcogroup.foreach(new VoidFunction<Tuple2<Integer,Tuple2<Iterable<String>,Iterable<Integer>>>>() {         
            @Override
            public void call(Tuple2<Integer, Tuple2<Iterable<String>, Iterable<Integer>>> t) throws Exception {
                System.out.println("id:"+t._1);
                System.out.println("name:"+t._2._1);
                System.out.println("score:"+t._2._2);
            }
        });
    }
}

Results:

//1. map result: each element * 2:
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 ...

//2. filter result: multiples of 3 removed:
1 2 4 5 7 8 10 11 13 14 16 17 19 20 ...

//3. flatMap input:
hello spark !
hello java
hello today
3. flatMap result: each line split on spaces
hello
spark
!
hello
java
hello
today

//4. groupByKey
class:class1
[80, 100, 60]
4. groupByKey
class:class2
[90]

//5. reduceByKey:
class1
240
class2
90

//6. sortByKey result (ascending by default):
60
miss
88
tony
99
anni
100
whb

//7.join:
4
(lucy,50)
id:4,name:lucy,score:50
1
(anni,80)
id:1,name:anni,score:80
1
(anni,90)
id:1,name:anni,score:90
3
(ted,66)
id:3,name:ted,score:66
2
(tony,40)
id:2,name:tony,score:40

//8.cogroup:
id:4
name:[lucy]
score:[50]
id:1
name:[anni]
score:[80, 90]
id:3
name:[ted]
score:[66]
id:2
name:[tony]
score:[40]

3. Action API

① reduce

reduce aggregates all elements of the RDD: the first and second elements are combined, the result is combined with the third element, then with the fourth, and so on.
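
Sketch:

# reduce
x = sc.parallelize([1, 2, 3, 4])
print(x.reduce(lambda a, b: a + b))   # ((1+2)+3)+4 = 10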

② collect

collect fetches all of the RDD's elements back to the local driver program.

③ count

count returns the total number of elements in the RDD.

④ take(n)

take(n) returns the first n elements of the RDD.
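
A combined sketch of ② collect, ③ count, and ④ take (sample data is mine):

# collect / count / take
x = sc.parallelize(range(1, 11))
print(x.collect())   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(x.count())     # 10
print(x.take(3))     # [1, 2, 3]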

⑤ saveAsTextFile

Saves the RDD's elements to a text file, calling toString on each element to turn it into a line of text.
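
Sketch; the output path is an illustrative assumption and must not already exist:

# saveAsTextFile
x = sc.parallelize([1, 2, 3])
x.saveAsTextFile("/tmp/rdd-out")   # writes part-* files, one element per line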

⑥ countByKey

Counts the number of values for each key.

# countByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
y = x.countByKey()
print(x.collect())
print(y)

[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
defaultdict(<type 'int'>, {'A': 3, 'B': 2})

⑦ foreach

Iterates over the RDD, applying a function to each element; the function runs on the executors rather than the driver.
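
Sketch; note that in cluster mode the prints appear in the executor logs, not the driver console:

# foreach
def show(v):
    print(v)

x = sc.parallelize([1, 2, 3])
x.foreach(show)   # runs on the executors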
