Spark in Depth (Part 1)

0. Outline

  • A deeper look at the Spark getting-started example

  • The execution flow of a Spark job

  • Operations on Spark RDDs

1. A Deeper Look at the Spark Getting-Started Example

1.1. Building an aggregated (multi-module) project

1.1.1. Creating the parent project

The parent project is generally used only for project management and contains no actual code. "Management" here means managing the dependencies, versions, and plugins shared across the project.


Specify the Maven coordinates


Specify the project directory


Delete the src directory (this is the parent project, so it contains no code)


Manage the dependencies:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.desheng.parent</groupId>
    <artifactId>spark-parent-1903</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.2.2</spark.version>
        <junit.version>4.12</junit.version>
        <mysql.version>5.1.46</mysql.version>
    </properties>
    <dependencyManagement><!-- Dependency management: keeps child modules from inheriting every dependency automatically -->
        <dependencies>
            <dependency>
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-library</artifactId>
                <version>${scala.version}</version>
            </dependency>
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit</artifactId>
                <version>${junit.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.11</artifactId>
                <version>${spark.version}</version>
            </dependency>
            <dependency>
                <groupId>mysql</groupId>
                <artifactId>mysql-connector-java</artifactId>
                <version>${mysql.version}</version>
            </dependency>
        </dependencies>
    </dependencyManagement>
</project>

1.1.2. Creating the child modules

1.1.2.1. Spark-common


Specify the Maven coordinates


Project directory


Specify the Maven dependencies:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>spark-parent-1903</artifactId>
        <groupId>com.desheng.parent</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.desheng.bigdata</groupId>
    <artifactId>spark-common</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>
    </dependencies>
</project>

At this point a modules section is added to the pom of spark-parent-1903.


1.1.2.2. Spark-core

Right-click and create the module


Specify the Maven coordinates


Project directory


Specify the Maven dependencies:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>spark-parent-1903</artifactId>
        <groupId>com.desheng.parent</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.desheng.bigdata</groupId>
    <artifactId>spark-core</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
        </dependency>
        <dependency>
            <groupId>com.desheng.bigdata</groupId>
            <artifactId>spark-common</artifactId>
            <version>1.0-SNAPSHOT</version>
            <exclusions>
                <exclusion>
                    <groupId>junit</groupId>
                    <artifactId>junit</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>
</project>

A corresponding module entry is also added to the pom of spark-parent-1903.


1.2. Getting-started example in Java

1.2.1. Spark programming steps

Step 1: construct the SparkContext
A SparkContext is built from a SparkConf.
Step 2: load external data and turn it into an RDD
textFile() —> loads a file from an external file system (local, HDFS)
parallelize() —> creates an RDD from an in-memory collection (list, map, set, seq, etc.)
Step 3: run the business logic on the RDD
Operations come in two kinds:
transformation: transformation operators
action: action operators
Step 4: release resources
sparkContext.stop()
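
The four steps map onto code roughly as in the minimal Scala sketch below (the object name and file path are only placeholders; the full examples follow in 1.2.3 and 1.3.2):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StepsSketch {
  def main(args: Array[String]): Unit = {
    // Step 1: build the SparkContext from a SparkConf
    val conf = new SparkConf().setAppName("StepsSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Step 2: load external data into an RDD (the path is only a placeholder)
    val lines = sc.textFile("file:/E:/data/hello.txt")

    // Step 3: transformations (lazy) plus an action that triggers execution
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.foreach(println)

    // Step 4: release the resources
    sc.stop()
  }
}
```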

1.2.2. Master URL values used when writing Spark programs
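
The original table image is lost; as a reference, the master URL forms most commonly passed to setMaster are sketched below (values taken from the Spark documentation; the host name is just an example):

```scala
import org.apache.spark.SparkConf

object MasterUrlSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MasterUrlSketch")

    // Commonly used master URL forms (see the Spark documentation for the full list):
    conf.setMaster("local")                     // run locally with a single worker thread
    // conf.setMaster("local[4]")               // run locally with 4 threads
    // conf.setMaster("local[*]")               // run locally with as many threads as logical cores
    // conf.setMaster("spark://bigdata01:7077") // connect to a standalone cluster master
    // conf.setMaster("yarn")                   // run on YARN; deploy mode is chosen at submit time

    println(conf.get("spark.master"))           // prints: local
  }
}
```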

1.2.3. Writing the code

package com.desheng.bigdata.spark.p2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;

/**
 * Spark WordCount example, Java version
 */
public class JavaSparkWordCountOps {
    public static void main(String[] args) {

        SparkConf conf = new SparkConf();
        conf.setAppName("JavaSparkWordCountOps");
        conf.setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        JavaRDD<String> lines = jsc.textFile("file:/E:/data/hello.txt");

        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String line) throws Exception {
                String[] words = line.split("\\s+");
                return Arrays.asList(words).iterator();
            }
        });

        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });
        /*
            1+ ... + 10
            int sum = 0;
            for(int i = 1; i <= 10; i++) {
                sum += i;
            }
         */
        JavaPairRDD<String, Integer> ret = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
        // the action triggers the job
        ret.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            public void call(Tuple2<String, Integer> t) throws Exception {
                System.out.println(t._1 + "--->" + t._2);
            }
        });

        jsc.stop();
    }
}

1.2.4. Java lambda version

  • When the code is rewritten with lambda expressions, a compile error appears at first.

To fix it, raise the Java compiler level to 1.8.

  1. Option 1:

    Change the Java language level in the IDE's project settings (the two step-by-step screenshots are unavailable).

    This approach is generally not used, because the setting reverts as soon as the Maven project is re-imported!

  2. Option 2:

    The once-and-for-all fix: add the JDK-version plugin to the pom of spark-core

    <!-- Add the Maven compiler plugin to pin the JDK version -->
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

  • Concrete code implementation

```java
/**
 * Lambda-expression version
 */
public class JavaSparkWordCountOps2 {
    public static void main(String[] args) {

        SparkConf conf = new SparkConf();
        conf.setAppName("JavaSparkWordCountOps");
        conf.setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        JavaRDD<String> lines = jsc.textFile("file:/E:/data/hello.txt");
        int partitions = lines.getNumPartitions();
        System.out.println("##############partition: " + partitions);
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());

        JavaPairRDD<String, Integer> pairs = words.mapToPair(word -> new Tuple2<String, Integer>(word, 1));
        JavaPairRDD<String, Integer> ret = pairs.reduceByKey((v1, v2) -> {
//            int i = 1 / 0; // verify laziness
            return v1 + v2;
        });
        // the action triggers the job
        ret.foreach(t -> System.out.println(t._1 + "--->" + t._2));

        jsc.stop();
    }
}
```

1.2.5. Managing logging in Spark code

  • Local control

    Set the log levels directly in the class itself.

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;
    
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN);
    Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN);
    Logger.getLogger("org.spark_project").setLevel(Level.WARN);
    
  • Global control

    Place a log4j.properties file on the classpath (typically under src/main/resources):

    # Set everything to be logged to the console
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    
    # Set the default spark-shell log level to WARN. When running the spark-shell, the
    # log level for this class is used to overwrite the root logger's log level, so that
    # the user can have different defaults for the shell and regular Spark apps.
    log4j.logger.org.apache.spark.repl.Main=WARN
    
    # Settings to quiet third party logs that are too verbose
    log4j.logger.org.spark_project.jetty=WARN
    log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
    log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
    log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
    log4j.logger.org.apache.parquet=ERROR
    log4j.logger.parquet=ERROR
    
    # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
    log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
    log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
    

1.3. Getting-started example in Scala

1.3.1. Adjust the SDK to support Scala development

(Screenshots: adding the Scala SDK in the IDE's project settings; images unavailable)

1.3.2. Complete Scala code

object _01ScalaWordCountOps {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
                    .setAppName("ScalaWordCount")
                    .setMaster("local[*]")
        val sc = new SparkContext(conf)

        val lines = sc.textFile("file:/E:/data/hello.txt")

        val ret = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_+_)

//      ret.foreach(t => println(t._1 + "--->" + t._2))
        // pattern-matching style
        ret.foreach{case (word, count) => println(word + "--->" + count)}
        sc.stop()
    }
}

1.3.3. Reading a file from HDFS

The code is written the same way, but it fails with an exception: the HDFS nameservice ns1 cannot be resolved.

How to solve it

Put the configuration that defines what ns1 is on the project classpath, i.e. hdfs-site.xml and core-site.xml (typically under src/main/resources).

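The code itself was only shown in the lost screenshot; a minimal sketch of what it looks like, reading through the ns1 nameservice (the class name is illustrative, the path matches the one used in the submit scripts below):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HdfsWordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // With hdfs-site.xml and core-site.xml on the classpath, the nameservice ns1 can be resolved.
    val lines = sc.textFile("hdfs://ns1/data/spark/hello.log")
    lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).foreach(println)

    sc.stop()
  }
}
```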

1.3.4. Compile, package, and run on the cluster

1.3.4.0. Compiling and packaging
  • From the IDE

    Use the package phase of the Maven lifecycle to build the jar.


    But an error appears, meaning that the spark-core module cannot find its spark-common dependency.


    The dependencies in the pom are all resolved from Maven repositories (central, private, local). The local repository does not yet contain spark-common, and the central repository certainly does not; the dependency first has to be installed into a repository.

    How to install the dependency into the repository

    Run the install goal on the corresponding Maven project/module (mvn install).


    Once that succeeds, run package again to build the jar.

  • From the command line

    The same steps can also be run from the command line with Maven, e.g. mvn install on spark-common followed by mvn package on spark-core (the original screenshot is unavailable).

  • Packaging the Scala code

    The final pom configuration:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <parent>
            <artifactId>spark-parent-1903</artifactId>
            <groupId>com.desheng.parent</groupId>
            <version>1.0-SNAPSHOT</version>
        </parent>
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>com.desheng.bigdata</groupId>
        <artifactId>spark-core</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.11</artifactId>
                <!--
                    scope: the scope in which this dependency applies
                        compile  : the default; available in every phase
                        test     : only available to code under the src/test directory
                        runtime  : not needed to compile the sources, only at run time
                                   (the classic example is a JDBC driver)
                        provided : needed to compile the sources, but not at run time;
                                   the cluster already provides this dependency
                -->
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>com.desheng.bigdata</groupId>
                <artifactId>spark-common</artifactId>
                <version>1.0-SNAPSHOT</version>
                <exclusions>
                    <exclusion>
                        <groupId>junit</groupId>
                        <artifactId>junit</artifactId>
                    </exclusion>
                </exclusions>
            </dependency>
        </dependencies>
    <!-- Build plugins: Scala compilation, JDK version, and packaging with dependencies -->
        <build>
            <sourceDirectory>src/main/scala</sourceDirectory>
            <plugins>
                <plugin>
                    <groupId>org.scala-tools</groupId>
                    <artifactId>maven-scala-plugin</artifactId>
                    <executions>
                        <execution>
                            <goals>
                                <goal>compile</goal>
                                <goal>testCompile</goal>
                            </goals>
                        </execution>
                    </executions>
                    <configuration>
                        <scalaVersion>${scala.version}</scalaVersion>
                        <args>
                            <arg>-target:jvm-1.8</arg>
                        </args>
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-eclipse-plugin</artifactId>
                    <configuration>
                        <downloadSources>true</downloadSources>
                        <buildcommands>
                            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
                        </buildcommands>
                        <additionalProjectnatures>
                            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
                        </additionalProjectnatures>
                        <classpathContainers>
                            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
                            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
                        </classpathContainers>
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.8.1</version>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                    </configuration>
                </plugin>
                <!-- Package third-party dependencies into a single jar -->
                <plugin>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <configuration>
                        <descriptorRefs>
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                        <archive>
                            <!--<manifest>
                              <mainClass></mainClass>
                            </manifest>-->
                        </archive>
                    </configuration>
                    <executions>
                        <execution>
                            <id>make-assembly</id>
                            <phase>package</phase>
                            <goals>
                                <goal>single</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    </project>
    
1.3.4.1. Standalone mode

For submitting the application, refer to the official Spark documentation: http://spark.apache.org/docs/2.2.2/submitting-applications.html

  • client
#!/bin/sh


SPARK_HOME=/home/bigdata/app/spark

$SPARK_HOME/bin/spark-submit \
--master spark://bigdata01:7077 \
--deploy-mode client \
--class com.desheng.bigdata.spark.p2._02ScalaWordCountRemoteOps \
--executor-memory 600M \
--executor-cores 1 \
--total-executor-cores 1 \
/home/bigdata/jars/spark/1903-bd/spark-core-1.0-SNAPSHOT-jar-with-dependencies.jar \
hdfs://ns1/data/spark/hello.log
  • cluster

The submit script:

#!/bin/sh


SPARK_HOME=/home/bigdata/app/spark

$SPARK_HOME/bin/spark-submit \
--master spark://bigdata01:7077 \
--deploy-mode cluster \
--class com.desheng.bigdata.spark.p2._02ScalaWordCountRemoteOps \
--executor-memory 600M \
--executor-cores 1 \
--driver-cores 1 \
--supervise \
--total-executor-cores 2 \
/home/bigdata/jars/spark/1903-bd/spark-core-1.0-SNAPSHOT-jar-with-dependencies.jar \
hdfs://ns1/data/spark/hello.log

An error is reported.

The reason:

In cluster deploy mode the driver is started on one of the worker nodes, so the application jar (like the input data) is loaded from that worker. The submit script runs on bigdata01, while the workers are on bigdata02 and bigdata03, so the local jar path cannot be found there.

The fix is to upload the jar to HDFS.

#!/bin/sh
SPARK_HOME=/home/bigdata/app/spark
$SPARK_HOME/bin/spark-submit \
--master spark://bigdata01:7077 \
--deploy-mode cluster \
--class com.desheng.bigdata.spark.p2._02ScalaWordCountRemoteOps \
--executor-memory 600M \
--executor-cores 1 \
--driver-cores 1 \
--supervise \
--total-executor-cores 2 \
hdfs://ns1/jars/spark/1903-bd/spark-core-1.0-SNAPSHOT-jar-with-dependencies.jar \
hdfs://ns1/data/spark/hello.log
1.3.4.2. YARN mode (how jobs are run in most cases)

The YARN cluster needs to be started.

  • client
SPARK_HOME=/home/bigdata/app/spark
export HADOOP_CONF_DIR=/home/bigdata/app/hadoop/etc/hadoop

$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode client \
--class com.desheng.bigdata.spark.p2._02ScalaWordCountRemoteOps \
--executor-memory 600M \
--executor-cores 1 \
--driver-cores 1 \
--num-executors 1 \
--driver-memory 600M \
/home/bigdata/jars/spark/1903-bd/spark-core-1.0-SNAPSHOT-jar-with-dependencies.jar \
hdfs://ns1/data/spark/hello.log
  • cluster
#!/bin/sh


SPARK_HOME=/home/bigdata/app/spark
export HADOOP_CONF_DIR=/home/bigdata/app/hadoop/etc/hadoop

$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--class com.desheng.bigdata.spark.p2._02ScalaWordCountRemoteOps \
--executor-memory 600M \
--executor-cores 1 \
--driver-cores 1 \
--num-executors 1 \
--driver-memory 600M \
hdfs://ns1/jars/spark/1903-bd/spark-core-1.0-SNAPSHOT-jar-with-dependencies.jar \
hdfs://ns1/data/spark/hello.log

Note:

When running on YARN, the submission will often fail with an error saying the (virtual) memory limit has been exceeded, similar to:

Container killed by YARN for exceeding memory limits. 15.6 GB of 15.5 GB physical memory used.

Solution:

Add the following two properties to yarn-site.xml:

<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

2. The Execution Flow of a Spark Job

2.1. How a Spark job executes, studied through the WordCount example

(Figure: Spark job execution analyzed step by step through WordCount; original diagram unavailable)

2.2. Architecture of Spark job execution

(Figure: Spark runtime architecture diagram; original image unavailable)

3. Spark Operations

A Chinese translation of the official Spark programming guide: https://www.cnblogs.com/BYRans/p/5057110.html

Spark has two important abstractions: the RDD, and shared variables (of which there are exactly two kinds: broadcast variables, Broadcast, and accumulators, Accumulator).

3.1. RDD operations

3.1.1. A brief description of RDDs

(Figure: RDD overview; original image unavailable)

Broadly, RDD operations fall into two categories: transformations and actions. Transformations are lazy: nothing actually runs until an action triggers the job. At a finer granularity one can also distinguish input operators, transformation operators, caching operators, and action operators.
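
A quick way to see the lazy nature of transformations is a sketch like the following: the println inside map only runs once an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyDemo").setMaster("local[*]"))

    // Only a transformation: nothing is printed or computed at this point.
    val doubled = sc.parallelize(1 to 3).map { n =>
      println(s"computing $n") // runs only once an action submits the job
      n * 2
    }

    // The action triggers the actual computation.
    println(doubled.collect().mkString(", ")) // 2, 4, 6
    sc.stop()
  }
}
```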

3.1.2. RDD operators

3.1.2.1. transformation operations
  • map

    map(func): runs func once for every record of the RDD and returns a new RDD.

    Note: every input record produces exactly one output record, so map is a one-to-one operation.

    e.g. multiply every element of a collection by 7

    object _03SparkTransformationOps {
        def main(args: Array[String]): Unit = {
            val conf = new SparkConf()
                        .setAppName("SparkTransformation")
                        .setMaster("local[*]")
            val sc = new SparkContext(conf)
            val list = 1 to 7
            val listRDD:RDD[Int] = sc.parallelize(list)
    
    //        listRDD.map(num => num * 7)
            val retRDD:RDD[Int] = listRDD.map(_ * 7)
    
            retRDD.foreach(println)
            sc.stop()
        }
    }
    
  • flatMap

  • filter

  • sample

  • union

  • groupByKey

  • join

  • reduceByKey

  • sortByKey

  • combineByKey

  • aggregateByKey
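
The remaining operators in this list are not worked through individually here; as a quick reference, the following sketch exercises several of them on a small made-up data set:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TransformationSketch").setMaster("local[*]"))

    val words  = sc.parallelize(Seq("hello spark", "hello scala")).flatMap(_.split("\\s+")) // flatMap
    val longer = words.filter(_.length > 4)                                                 // filter
    val pairs  = words.union(longer).map((_, 1))                                            // union

    val counts  = pairs.reduceByKey(_ + _)            // reduceByKey: aggregate values per key
    val grouped = pairs.groupByKey().mapValues(_.sum) // groupByKey: collect values per key
    val sorted  = counts.sortByKey()                  // sortByKey: sort by the word
    val joined  = counts.join(grouped)                // join: (word, (countA, countB))

    sorted.foreach(println)
    joined.foreach(println)
    sc.stop()
  }
}
```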

3.1.2.2. action operations

foreach
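
foreach is only one of the action operators; for reference, a minimal sketch of a few other common actions (collect, count, take):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("ActionSketch").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 10)

    rdd.foreach(println)                 // runs on the executors, prints each element
    println(rdd.count())                 // 10
    println(rdd.take(3).mkString(", "))  // 1, 2, 3
    println(rdd.collect().sum)           // 55 -- collect pulls everything back to the driver
    sc.stop()
  }
}
```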

3.2. Operations on shared variables
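
Section 3 named exactly two kinds of shared variables: broadcast variables and accumulators. A minimal sketch of both (the stop-word data is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariableSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SharedVariableSketch").setMaster("local[*]"))

    // Broadcast variable: read-only data shipped once to every executor.
    val stopWords = sc.broadcast(Set("a", "the", "of"))

    // Accumulator: executors only add to it; the driver reads the aggregated value.
    val dropped = sc.longAccumulator("dropped words")

    val words = sc.parallelize(Seq("a", "spark", "the", "rdd"))
    val kept = words.filter { w =>
      val keep = !stopWords.value.contains(w)
      // Accumulator updates inside a transformation may be re-applied if a task is retried;
      // updates inside an action are counted exactly once.
      if (!keep) dropped.add(1)
      keep
    }

    println(kept.collect().mkString(", "))  // spark, rdd
    println(s"dropped: ${dropped.value}")   // 2
    sc.stop()
  }
}
```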

