019 Big Data: Spark

小哥哥咯

Last modified 2022-08-14 22:07:07


Category: Big Data. Tags: spark, big data

First published 2022-02-24 22:46:09

Article link: https://blog.csdn.net/qq_24964575/article/details/123120942


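1. Spark Overview

Spark is a fast, general-purpose, scalable engine for big data analytics that computes mainly in memory. In the vast majority of data-processing scenarios Spark has the advantage over MapReduce. Because Spark relies on memory, however, a job in a real production environment can fail when memory resources run short; in that case MapReduce is actually the better choice, so Spark cannot completely replace MR.
Spark Core:
Spark Core provides Spark's most basic and essential functionality. Spark's other components, such as Spark SQL, Spark Streaming, GraphX, and MLlib, are all extensions built on top of Spark Core.
Spark SQL:
Spark SQL is the component Spark uses to work with structured data. With Spark SQL, users can query data using SQL or the Apache Hive dialect of SQL (HQL).
Spark Streaming:
Spark Streaming is the component of the Spark platform for stream processing of real-time data; it provides a rich API for handling data streams.
Spark MLlib:
MLlib is the machine learning library provided by Spark. Besides additional features such as model evaluation and data import, MLlib also provides some lower-level machine learning primitives.
Spark GraphX:
GraphX is the framework and algorithm library Spark provides for graph computation.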

2. Getting Started with Spark

2.1 Local Mode

Creating and packaging a Scala project with Maven

spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[2] \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10

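1) --class is the main class of the program to run; replace it with the main class of your own application;
2) --master local[2] sets the deployment mode. The default is local mode; the number is the number of virtual CPU cores (i.e. threads) to allocate, and local[*] uses all available virtual cores;
3) spark-examples_2.12-3.0.0.jar is the jar that contains the application class; in practice, point this at the jar you built yourself;
4) The number 10 is an argument passed to the program's entry point; here it sets the number of tasks for the application.
Note: ① the jar must contain the compiled class files; ② the paths of the program's input files and of the jar are relative to the directory from which spark-submit is run.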

2.2 Running Spark on YARN

Configuring Spark on YARN and the Spark history server

[atguigu@hadoop102 conf]$ cat spark-env.sh 
#!/usr/bin/env bash
# export JAVA_HOME=/opt/module/jdk1.8.0_212
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# Point to YARN's configuration files when running Spark on YARN
YARN_CONF_DIR=/opt/module/hadoop-3.1.3/etc/hadoop

# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080 
-Dspark.history.fs.logDirectory=hdfs://hadoop102:9820/directory 
-Dspark.history.retainedApplications=30"

# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Options for launcher
# - SPARK_LAUNCHER_OPTS, to set config properties and Java options for the launcher (e.g. "-Dx=y")

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE  Run the proposed command in the foreground. It will not output a PID file.
# Options for native BLAS, like Intel MKL, OpenBLAS, and so on.
# You might get better performance to enable these options if using native BLAS (see SPARK-21305).
# - MKL_NUM_THREADS=1        Disable multi-threading of Intel MKL
# - OPENBLAS_NUM_THREADS=1   Disable multi-threading of OpenBLAS

[atguigu@hadoop102 conf]$ cat spark-defaults.conf 
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://hadoop102:9820/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
# The history server address uses the master node's hostname, hadoop102
spark.yarn.historyServer.address=hadoop102:18080
spark.history.ui.port=18080

[atguigu@hadoop102 conf]$ sbin/start-dfs.sh
[atguigu@hadoop102 conf]$ hadoop fs -mkdir /directory

Examples of submitting an application in cluster mode and in client mode

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10


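3. Spark Runtime Architecture

Terminology
[Figure: table of term definitions]
When a Spark application is submitted to run on YARN, there are generally two deployment modes: Client and Cluster. The main difference between them is where the Driver program runs. In Client mode, the Driver module that handles monitoring and scheduling runs on the client rather than inside YARN, so this mode is generally used for testing. In Cluster mode, the Driver module is started on resources inside the YARN cluster, so this mode is generally used in production.
RDD (Resilient Distributed Dataset) is the most basic data-processing model in Spark. In code it is an abstract class that represents a resilient, immutable, partitionable collection whose elements can be computed in parallel.

From a computation point of view, processing data requires computing resources (memory and CPU) and a computation model (logic). At execution time the two have to be coordinated and combined.

When the Spark framework runs, it first requests resources and then breaks the application's data-processing logic down into individual computation tasks. The tasks are sent to the compute nodes that have been allocated resources, where the data is computed according to the specified computation model, and finally the result is obtained. RDD is the core model Spark uses for data processing. How RDDs work in a YARN environment:
[Figure: RDD workflow in a YARN environment]
As the flow above shows, the RDD's main role in the whole process is to encapsulate the logic and generate Tasks that are sent to Executor nodes for computation. The number of partitions of an RDD determines the total number of Tasks; how the number of partitions is determined is described below.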

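Setting map and reduce parallelism

 # mapred.map.tasks only takes effect when it is larger than the number of map tasks actually needed
 --jobconf mapred.map.tasks=20
 # takes effect as soon as it is set
 --jobconf mapred.reduce.tasks=5

When Spark (the map side) reads files as input, it parses them with the InputFormat that corresponds to the data format. One or more blocks are typically combined into one input split, called an InputSplit; note that an InputSplit cannot span files.

Task concurrency = number of Executors * cores per Executor (= total number of virtual cores)

In Spark tuning, increasing the number of RDD partitions (map side: determined by the InputSplits; reduce side: determined by the shuffle) increases task parallelism and keeps resources from sitting idle.

Number of execution rounds for a single RDD = ceil(number of RDD partitions / task concurrency)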
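The relationship between partitions, concurrency, and rounds can be checked with a few lines of plain Scala. Each partition becomes one task, so the number of rounds is the partition count divided by the task concurrency, rounded up. This is a minimal sketch, not Spark API: the names `taskConcurrency` and `executionRounds` and the sample numbers (3 executors with 4 cores each) are assumptions made up for illustration.

```scala
object ParallelismSketch {
  // Tasks that can run at the same time = executors * cores per executor.
  def taskConcurrency(numExecutors: Int, coresPerExecutor: Int): Int =
    numExecutors * coresPerExecutor

  // Rounds needed to run one task per partition, rounded up.
  def executionRounds(numPartitions: Int, concurrency: Int): Int =
    (numPartitions + concurrency - 1) / concurrency

  def main(args: Array[String]): Unit = {
    val concurrency = taskConcurrency(numExecutors = 3, coresPerExecutor = 4)
    println(s"concurrency = $concurrency")                                      // 12
    println(s"rounds for 30 partitions = ${executionRounds(30, concurrency)}") // 3
    println(s"rounds for 12 partitions = ${executionRounds(12, concurrency)}") // 1
    println(s"rounds for 5 partitions = ${executionRounds(5, concurrency)}")   // 1, with 7 cores idle
  }
}
```

With fewer partitions than cores some cores stay idle, which is why increasing the partition count can help.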

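Similarities and differences between Spark (fast, high resource requirements: CPU and memory) and MapReduce (slow, low resource requirements: CPU and memory)
① When a computation does not need to exchange data with other nodes, Spark can complete the operations in memory in one pass; when the computation does involve data exchange, Spark also writes the shuffle data to disk.
② Spark's DAGScheduler can chain stages such as map->reduce->reduce.
③ Spark requests its resources once, up front; MapReduce requests resources step by step.
④ Spark's programming models (RDD/DataFrame/DataSet) are more flexible.
⑤ A MapReduce task has its maximum memory fixed inside the JVM at startup and cannot exceed it. After exceeding its configured maximum memory, Spark can use operating-system memory, which guarantees basic memory usage while avoiding the waste of allocating too much memory up front.
⑥ In MapReduce one process runs one task, in sequence; in Spark one thread runs one task, which increases parallelism.

Map and Reduce in Spark are not algorithms; they are phases, and each phase offers many operators: map, flatMap, filter, keyBy and others in the Map phase, and reduceByKey, sortByKey, mean, groupBy, sort and others in the Reduce phase. Beyond these, Spark supports operations on several data models, RDD (an RDD encapsulates computation logic and does not store data), DataFrame, and DataSet, so its programming model is more flexible. Because Spark can use operating-system memory after exceeding its configured maximum, YARN may kill its containers for exceeding their allocation. To prevent that, modify the Hadoop configuration file /opt/module/hadoop-3.1.3/etc/hadoop/yarn-site.xml: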

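<!-- Whether to start a thread that checks how much physical memory each task is using; a task that exceeds its allocation is killed outright. Default: true -->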
<property>
     <name>yarn.nodemanager.pmem-check-enabled</name>
     <value>false</value>
</property>

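<!-- Whether to start a thread that checks how much virtual memory each task is using; a task that exceeds its allocation is killed outright. Default: true -->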
<property>
     <name>yarn.nodemanager.vmem-check-enabled</name>
     <value>false</value>
</property>

How to resolve YARN memory usage exceeding the configured yarn.nodemanager.resource.memory-mb
Spark memory management: on-heap and off-heap memory explained in detail

4. Spark Core Programming

For the log4j configuration file, see this blog post: 031 Log4j Logging Framework

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

  <modelVersion>4.0.0</modelVersion>
  <groupId>com.jieky.studySpark</groupId>
  <artifactId>studySpark</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>

  <properties>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
  </properties>

  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.12</artifactId>
      <version>3.0.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

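    <!-- slf4j-log4j12 is the SLF4J binding for log4j 1.x; log4j-slf4j-impl is the binding for log4j 2.x -->
    <!-- This dependency must be placed before the bridge dependencies, otherwise an error is reported -->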
    <!--The Apache Log4j SLF4J API binding to Log4j 2 Core-->
    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-slf4j-impl</artifactId>
      <version>2.9.1</version>
    </dependency>

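    <!-- To deal with several logging frameworks coexisting, Ceki's SLF4J offers a solution: the
    bridging described below (bridging legacy APIs). Put simply, it intercepts all third-party log output and
    redirects it to SLF4J, unifying the upper-level logging API (coding) and the lower-level implementation (output location and format) -->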
    <!--JCL 1.2 implemented over SLF4J-->
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>jcl-over-slf4j</artifactId>
      <version>1.7.36</version>
    </dependency>
    <!--JUL to SLF4J bridge-->
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>jul-to-slf4j</artifactId>
      <version>1.7.36</version>
    </dependency>
  </dependencies>
</project>

package com.jieky.studySpark
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object App  {
  def main(args: Array[String]): Unit = {
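    // Set the master; local[*] uses as many worker threads as the local machine has virtual cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    // Set the default parallelism, i.e. the default number of partitions (and therefore tasks) when none is specified
    sparkConf.set("spark.default.parallelism", "4")
    val sparkContext = new SparkContext(sparkConf)
    // Set the number of partitions: each partition is processed by one thread, and one thread can be reused for several partitions
    // Extreme case: with only one thread but several partitions, the partitions are processed serially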
    val dataRDD: RDD[Int] = sparkContext.makeRDD(List(1,2,3,4), 5)
    val fileRDD: RDD[String] = sparkContext.textFile("data",6)
    dataRDD.collect().foreach(println)
    fileRDD.collect().foreach(println)
    sparkContext.stop()
  }
}

4.1 Data from a collection is partitioned according to the configured parallelism

val rdd1 : RDD[Int] = sc.makeRDD(Seq(1,2,3,4,5))

def parallelize[T: ClassTag](
     seq: Seq[T],
     numSlices: Int = defaultParallelism): RDD[T] = withScope {
   assertNotStopped()
   new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
 }

override def getPartitions: Array[Partition] = {
   val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
   slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
 }

def slice[T: ClassTag](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
   if (numSlices < 1) {
     throw new IllegalArgumentException("Positive number of partitions required")
   }
   // Sequences need to be sliced at the same set of index positions for operations
   // like RDD.zip() to behave as expected
   def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
     (0 until numSlices).iterator.map { i =>
       val start = ((i * length) / numSlices).toInt
       val end = (((i + 1) * length) / numSlices).toInt
       (start, end)
     }
   }
   seq match {
     case r: Range =>
       positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
         // If the range is inclusive, use inclusive range for the last slice
         if (r.isInclusive && index == numSlices - 1) {
           new Range.Inclusive(r.start + start * r.step, r.end, r.step)
         }
         else {
           new Range(r.start + start * r.step, r.start + end * r.step, r.step)
         }
       }.toSeq.asInstanceOf[Seq[Seq[T]]]
     case nr: NumericRange[_] =>
       // For ranges of Long, Double, BigInteger, etc
       val slices = new ArrayBuffer[Seq[T]](numSlices)
       var r = nr
       for ((start, end) <- positions(nr.length, numSlices)) {
         val sliceSize = end - start
         slices += r.take(sliceSize).asInstanceOf[Seq[T]]
         r = r.drop(sliceSize)
       }
       slices
     case _ =>
       val array = seq.toArray // To prevent O(n^2) operations for List etc
       positions(array.length, numSlices).map { case (start, end) =>
           array.slice(start, end).toSeq
       }.toSeq
   }
 }
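The index arithmetic in `positions` is easier to follow with concrete numbers. The sketch below copies that arithmetic into plain Scala, without Spark, and applies it to two small inputs. The object name `SliceSketch` and the simplified `slice` helper, which handles only the generic `case _` branch, are assumptions for illustration and not the Spark implementation.

```scala
object SliceSketch {
  // Same arithmetic as the `positions` helper in ParallelCollectionRDD.slice:
  // slice i covers indices [i * length / numSlices, (i + 1) * length / numSlices).
  def positions(length: Long, numSlices: Int): Seq[(Int, Int)] =
    (0 until numSlices).map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }

  // Cut a local sequence into numSlices pieces using those positions.
  def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "Positive number of partitions required")
    positions(seq.length, numSlices).map { case (start, end) => seq.slice(start, end) }
  }

  def main(args: Array[String]): Unit = {
    // 5 elements in 3 slices: (0,1), (1,3), (3,5)
    println(positions(5, 3))
    // [1], [2, 3], [4, 5]
    println(slice(Seq(1, 2, 3, 4, 5), 3))
    // 4 elements in 5 slices: the first slice is empty, then [1], [2], [3], [4]
    println(slice(List(1, 2, 3, 4), 5))
  }
}
```

When there are more slices than elements, as in makeRDD(List(1,2,3,4), 5), some partitions are simply empty.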

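4.2 Spark reads files through Hadoop's file reading, so the final number of partitions is the number of splits Hadoop produces for the files

// Set the expected minimum number of splits (partitions)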
val rdd: RDD[String] = sc.textFile("data/word*.txt", 2)

def textFile(
     path: String,
     minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
   assertNotStopped()
   hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
     minPartitions).map(pair => pair._2.toString).setName(path)
 }

public class TextInputFormat extends FileInputFormat<LongWritable, Text>

public InputSplit[] getSplits(JobConf job, int numSplits)
  throws IOException {
  StopWatch sw = new StopWatch().start();
  FileStatus[] files = listStatus(job);
  
  // Save the number of input files for metrics/loadgen
  job.setLong(NUM_INPUT_FILES, files.length);
  long totalSize = 0;                           // compute total size
  for (FileStatus file: files) {                // check we have valid files
    if (file.isDirectory()) {
      throw new IOException("Not a file: "+ file.getPath());
    }
    totalSize += file.getLen();
  }
  // Target number of bytes each split (partition) should handle
  long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
  long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
    FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

  // generate splits
  ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
  NetworkTopology clusterMap = new NetworkTopology();
  for (FileStatus file: files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      FileSystem fs = path.getFileSystem(job);
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(fs, path)) {
        long blockSize = file.getBlockSize();
        // Compute the final split size; minSize defaults to 1
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        long bytesRemaining = length;
        // SPLIT_SLOP = 1.1: a remainder within 10% of splitSize does not get a new split
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
              length-bytesRemaining, splitSize, clusterMap);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
              splitHosts[0], splitHosts[1]));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, length
              - bytesRemaining, bytesRemaining, clusterMap);
          splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
              splitHosts[0], splitHosts[1]));
        }
      } else {
        String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,0,length,clusterMap);
        splits.add(makeSplit(path, 0, length, splitHosts[0], splitHosts[1]));
      }
    } else { 
      //Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
  }
  return splits.toArray(new FileSplit[splits.size()]);
}
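To see how goalSize, splitSize, and SPLIT_SLOP interact, the loop in getSplits can be replayed on plain numbers. The sketch below is a simplified model and not Hadoop code: it assumes a single splittable, non-empty file (so totalSize equals the file length), ignores block locations, and the names `SplitSketch` and `splitsForFile` are made up for illustration. `computeSplitSize` follows Hadoop's formula max(minSize, min(goalSize, blockSize)).

```scala
import scala.collection.mutable.ArrayBuffer

object SplitSketch {
  val SplitSlop = 1.1 // same value as SPLIT_SLOP in FileInputFormat

  def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  // Returns (offset, length) pairs for one splittable, non-empty file.
  def splitsForFile(fileLength: Long, numSplits: Int,
                    blockSize: Long = 128L * 1024 * 1024,
                    minSize: Long = 1L): Seq[(Long, Long)] = {
    val goalSize = fileLength / (if (numSplits == 0) 1 else numSplits)
    val splitSize = computeSplitSize(goalSize, minSize, blockSize)
    val splits = ArrayBuffer.empty[(Long, Long)]
    var bytesRemaining = fileLength
    // Keep cutting full-size splits while more than 1.1 split sizes remain.
    while (bytesRemaining.toDouble / splitSize > SplitSlop) {
      splits += ((fileLength - bytesRemaining, splitSize))
      bytesRemaining -= splitSize
    }
    // Whatever is left becomes the last split.
    if (bytesRemaining != 0) splits += ((fileLength - bytesRemaining, bytesRemaining))
    splits.toList
  }

  def main(args: Array[String]): Unit = {
    // 7-byte file, numSplits = 2: goalSize = 3, so three splits (0,3), (3,3), (6,1)
    println(splitsForFile(7, 2))
    // 105-byte file, 100-byte blocks: 1.05 <= 1.1, so a single split (0,105)
    println(splitsForFile(105, 1, blockSize = 100))
    // 120-byte file, 100-byte blocks: 1.2 > 1.1, so two splits (0,100), (100,20)
    println(splitsForFile(120, 1, blockSize = 100))
  }
}
```

The first case shows why the requested number is only a minimum: asking for 2 splits of a 7-byte file yields 3 partitions.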

4.3 When reading files, Hadoop decides which data goes into each partition

Reading and saving files with Spark RDDs

val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")
val sc = new SparkContext(conf)
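// 1. How the data is divided among partitions is also decided by Hadoop.
// 2. Hadoop uses different logic for computing the splits and for reading the data.
// 3. Spark reads file data through Hadoop underneath, so Hadoop's reading rules apply:
//         3.1 Hadoop reads data line by line, not byte by byte
//         3.2 Hadoop reads data by offset
//         3.3 Hadoop never reads the same offset twice
val rdd = sc.textFile("data/word.txt", 2)
rdd.saveAsTextFile("output")
sc.stop()
/*
Data in the file (7 bytes; @@ stands for \r\n): 1\r\n, 2\r\n, 3
1@@ => offsets 012
2@@ => offsets 345
3   => offset  6

With minPartitions = 2: goalSize = 7 / 2 = 3 bytes, and 1 byte is left over, so there are 3 partitions

Computed read offsets => data
[0, 3] => [1, 2]
[3, 6] => [3]
[6, 7] => []
*/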
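The offset table can be reproduced with a simplified model of the line-reading rule: the first split reads every line that starts at an offset up to and including its end offset, and each later split skips the line at its start offset and then reads every line that starts after its start and at or before its end. The sketch below is plain Scala under that assumption; it is not Hadoop's LineRecordReader, it treats one character as one byte, and the names `LineAssignmentSketch` and `linesPerSplit` are hypothetical.

```scala
object LineAssignmentSketch {
  // rawLines: the file's lines including their line terminators.
  // splits: inclusive (start, end) offset pairs as listed in the comment above.
  def linesPerSplit(rawLines: Seq[String], splits: Seq[(Int, Int)]): Seq[Seq[String]] = {
    // Start offset of every line: 0, then the running total of line lengths.
    val offsets = rawLines.scanLeft(0)(_ + _.length).init
    val lines = offsets.zip(rawLines.map(_.stripLineEnd))
    splits.map { case (start, end) =>
      lines.collect {
        // The first split keeps the line at offset 0; later splits skip the line at `start`.
        case (offset, text) if (start == 0 || offset > start) && offset <= end => text
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val file = Seq("1\r\n", "2\r\n", "3") // 7 bytes in total
    val splits = Seq((0, 3), (3, 6), (6, 7))
    // Lines per split: [1, 2], [3], []
    println(linesPerSplit(file, splits))
  }
}
```

Every line is read exactly once, even though the split boundaries fall in the middle of the data, and the last partition ends up empty.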

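4.4 Operators (distributed computation differs from single-machine computation)

The number of partitions generally stays the same
The partition an element belongs to generally stays the same
Data is ordered within a partition and unordered across partitions
For a single element within a partition, the processing logic (the chained RDD operations) runs in order
Across multiple elements within a partition, the processing logic (the chained RDD operations) is not ordered relative to each other

Differences between Spark map and mapPartitions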

package com.jieky.studySpark
import org.apache.spark.{SparkConf, SparkContext}

object App  {
  def main(args: Array[String]): Unit = {
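    // Set the master; local[*] uses as many worker threads as the local machine has virtual cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    // Set the default parallelism, i.e. the default number of partitions (and therefore tasks) when none is specified
    sparkConf.set("spark.default.parallelism", "4")
    val sparkContext = new SparkContext(sparkConf)

    // map operates on every element of the RDD; if the function inserts data into a database, every record opens its own connection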
    println("1.map--------------------------------")
    val aa   = sparkContext.parallelize(1 to 9, 4)
    val aa_res = aa.map(temp => (temp, temp*2))
    println(aa.getNumPartitions)
    println(aa_res.collect().mkString)

    // mapPartitions operates on the iterator of each partition of the RDD; if the function inserts data into a database, one connection is opened per partition
    println("2.mapPartitions--------------------------------")
    val bb   = sparkContext.parallelize(1 to 9, 4)
    val bb_res =bb.mapPartitions(temp =>{
      var result = List[(Int,Int)]()
      while (temp.hasNext){
        val cur = temp.next()
        result = (cur,cur*2)::result
      }
      result.iterator
    })
    println(bb.getNumPartitions)
    println(bb_res.collect().mkString)

    // mapPartitionsWithIndex differs from mapPartitions in that the function also receives the partition index
    println("3.mapPartitionsWithIndex--------------------------------")
    val cc   = sparkContext.parallelize(1 to 9, 4)
    val cc_res =cc.mapPartitionsWithIndex((index,temp) =>{
      var result = List[(Int,Int,Int)]()
      while (temp.hasNext){
        val cur = temp.next()
        result = (index,cur,cur*2)::result
      }
      result.iterator
    })
    println(cc.getNumPartitions)
    println(cc_res.collect().mkString)

    sparkContext.stop()
  }
}

1.map--------------------------------
4
(1,2)(2,4)(3,6)(4,8)(5,10)(6,12)(7,14)(8,16)(9,18)
2.mapPartitions--------------------------------
4
(2,4)(1,2)(4,8)(3,6)(6,12)(5,10)(9,18)(8,16)(7,14)
3.mapPartitionsWithIndex--------------------------------
4
(0,2,4)(0,1,2)(1,4,8)(1,3,6)(2,6,12)(2,5,10)(3,9,18)(3,8,16)(3,7,14)
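The database remark in the code comments can be made concrete without Spark. The sketch below models the four partitions as nested Scala collections and counts how many "connections" each style opens. `FakeConnection` is a hypothetical stand-in for a real database client, and the whole block is an illustration of the idea rather than Spark code.

```scala
import java.util.concurrent.atomic.AtomicInteger

object ConnectionCountSketch {
  // Hypothetical stand-in for an expensive resource such as a database connection.
  final class FakeConnection(opened: AtomicInteger) {
    opened.incrementAndGet()                // count every connection that is created
    def insert(value: Int): Int = value * 2 // pretend to write, return the doubled value
  }

  // map-style: the function runs once per element, so it connects once per element.
  def perElement(partitions: Seq[Seq[Int]]): (Seq[Int], Int) = {
    val opened = new AtomicInteger(0)
    val out = partitions.flatMap(_.map(x => new FakeConnection(opened).insert(x)))
    (out, opened.get())
  }

  // mapPartitions-style: the function runs once per partition and reuses one connection.
  def perPartition(partitions: Seq[Seq[Int]]): (Seq[Int], Int) = {
    val opened = new AtomicInteger(0)
    val out = partitions.flatMap { part =>
      val conn = new FakeConnection(opened)
      part.iterator.map(conn.insert).toList
    }
    (out, opened.get())
  }

  def main(args: Array[String]): Unit = {
    // 1 to 9 laid out in 4 partitions, matching the output shown above.
    val partitions = Seq(Seq(1, 2), Seq(3, 4), Seq(5, 6), Seq(7, 8, 9))
    println(perElement(partitions)._2)   // 9 connections
    println(perPartition(partitions)._2) // 4 connections
  }
}
```

Both styles produce the same values; only the number of times the setup cost is paid differs.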

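An in-depth look at Spark wide and narrow dependencies (ShuffleDependency & NarrowDependency)

Put simply, with a NarrowDependency all the data of one or more partitions of the parent RDD flows into one or more partitions of the child RDD as a whole, whereas with a ShuffleDependency each partition of the parent RDD is divided into parts that flow into different partitions of the child RDD.

Spark divides dependencies into NarrowDependency and ShuffleDependency in order to classify the dependency types, make clear how data flows out and in, and so generate the corresponding physical execution plan more easily. A NarrowDependency needs no shuffle and can be pipelined. A ShuffleDependency requires a shuffle, and a new stage is cut wherever a shuffle occurs.

Transformation operators: lazily evaluated; an Action is needed to trigger execution
①Narrow-dependency transformations: filter, map, flatMap, sample, union, mapPartitions, mapPartitionsWithIndex, zip
②Wide-dependency transformations: sortBy, sortByKey, reduceByKey, join, leftOuterJoin, rightOuterJoin, fullOuterJoin, distinct, cogroup, repartition, intersection
③coalesce can reduce the number of partitions without a shuffle (the default, a narrow dependency); increasing the number of partitions requires shuffle = true, which makes it a wide dependency (repartition is coalesce called with shuffle set to true).

Action operators: trigger execution of the transformations; each action in an application produces a job
①List: foreach, count, collect, first, take, foreachPartition, reduce, countByKey, countByValue

Persistence operators:
①List: cache, persist

A detailed look at the difference between map and flatMap in Spark
map turns an RDD of N elements into another RDD of N elements, where each element is whatever the function returns (possibly a collection). flatMap applies the function in the same way and then flattens the N returned collections, so the result is a single RDD whose length is the total number of elements across those collections.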

package com.jieky.studySpark
import org.apache.spark.{SparkConf, SparkContext}

object App  {
  def main(args: Array[String]): Unit = {
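    // Set the master; local[*] uses as many worker threads as the local machine has virtual cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    // Set the default parallelism, i.e. the default number of partitions (and therefore tasks) when none is specified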
    sparkConf.set("spark.default.parallelism", "4")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.parallelize(List("coffee panda","happy panda","happiest panda party"),20)

    val temp1 = rdd.map(x=>x.split("\\s+")).collect()
    // Print the number of elements and the runtime type of the result
    println(temp1.size,temp1.getClass.getSimpleName())
    temp1.foreach(_.foreach(println(_)))

    println("-"*20)

    val temp2 = rdd.flatMap(x=>x.split("\\s+")).collect()
    // Print the number of elements and the runtime type of the result
    println(temp2.size,temp2.getClass.getSimpleName())
    temp2.foreach(println(_))
    sc.stop()
  }
}

(3,String[][])
coffee
panda
happy
panda
happiest
panda
party
--------------------
(7,String[])
coffee
panda
happy
panda
happiest
panda
party

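Understanding Spark partitions / the difference between coalesce and repartition

repartition is simply coalesce called with shuffle set to true
①. With multiple executors, if the result needs more files than the source RDD has partitions, coalesce (with shuffle set to false) cannot do it. For example, with 4 small files (4 partitions), coalesce cannot produce 5 files; in other words, without a shuffle the number of files cannot grow.
②. If you have only 1 executor (1 core) and the source RDD has 5 partitions, and you use coalesce to produce 2 files, the partitions are pre-assigned on that executor: for example, partitions 0-2 are processed first, and then partitions 3-4 are processed on the same executor. It is the same executor throughout, reading different data serially one group after another. This differs considerably from repartition(2) in how partitions are read (partitions 0-4 are read one after another and their records are redistributed across the 2 target partitions).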
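The first point can be checked directly on a small RDD. The sketch below assumes Spark 3.0.0 on the classpath (as in the pom.xml earlier) and a local master; the object name `CoalesceSketch` and the helper `partitionCounts` are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  // Partition counts after four different calls on a 4-partition RDD.
  def partitionCounts(sc: SparkContext): Seq[Int] = {
    val rdd = sc.parallelize(1 to 8, 4)
    Seq(
      rdd.coalesce(2).getNumPartitions,                 // fewer partitions, no shuffle needed
      rdd.coalesce(5).getNumPartitions,                 // asking for more without a shuffle
      rdd.coalesce(5, shuffle = true).getNumPartitions, // more partitions, with a shuffle
      rdd.repartition(5).getNumPartitions               // same as the previous line
    )
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc).mkString(", ")) // 2, 4, 5, 5
    finally sc.stop()
  }
}
```

Without a shuffle, coalesce(5) leaves the RDD at 4 partitions; only the shuffle variants reach 5.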

How the Spark distinct operator removes duplicates

package com.jieky.studySpark
import org.apache.spark.{SparkConf, SparkContext}

object App  {
  def main(args: Array[String]): Unit = {
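    // Set the master; local[*] uses as many worker threads as the local machine has virtual cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    // Set the default parallelism, i.e. the default number of partitions (and therefore tasks) when none is specified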
    sparkConf.set("spark.default.parallelism", "4")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.parallelize(List("coffee panda","happy panda","happiest panda party"),20)

    val temp2 = rdd.flatMap(x=>x.split("\\s+")).distinct().collect()
    // Print the number of elements and the runtime type of the result
    println(temp2.size,temp2.getClass.getSimpleName())
    temp2.foreach(println(_))

    sc.stop()
  }
}

(5,String[])
coffee
panda
happiest
party
happy
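One way to picture what distinct does: when the RDD has no suitable partitioner, Spark's implementation is roughly map(x => (x, null)).reduceByKey((x, _) => x).map(_._1). The sketch below imitates those three steps on ordinary Scala collections so the data movement is visible. It is a model under that assumption, not Spark code, and `distinctLike` and `numReducers` are hypothetical names.

```scala
object DistinctSketch {
  def distinctLike[T](partitions: Seq[Seq[T]], numReducers: Int): Seq[Seq[T]] = {
    // Step 1 (map side): turn every element into a (key, null) pair and drop
    // duplicates inside each partition, as a map-side combine would.
    val mapSide: Seq[Seq[(T, Null)]] = partitions.map(_.map(x => (x, null)).distinct)
    // Step 2 (shuffle): send each key to the reducer chosen by its hash code.
    val shuffled = mapSide.flatten.groupBy { case (k, _) =>
      val m = k.hashCode % numReducers
      if (m < 0) m + numReducers else m
    }
    // Step 3 (reduce side): keep one pair per key, then strip the null value.
    (0 until numReducers).map { r =>
      shuffled.getOrElse(r, Seq.empty).map(_._1).distinct
    }
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq("coffee", "panda"),
      Seq("happy", "panda"),
      Seq("happiest", "panda", "party"))
    val result = distinctLike(partitions, numReducers = 2)
    // Five distinct words remain: coffee, happiest, happy, panda, party
    println(result.flatten.sorted.mkString(", "))
  }
}
```

Because equal keys always hash to the same reducer, the duplicates of "panda" from all three partitions meet in one place and collapse to a single element. The shuffle is also why the output order above differs from the input order.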

Spark source analysis: cases where the sort operators sortBy and sortByKey appear not to sort

package com.jieky.studySpark
import org.apache.spark.{SparkConf, SparkContext}

object App  {
  def main(args: Array[String]): Unit = {
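    // Set the master; local[*] uses as many worker threads as the local machine has virtual cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    // Set the default parallelism, i.e. the default number of partitions (and therefore tasks) when none is specified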
    sparkConf.set("spark.default.parallelism", "4")
    val sc = new SparkContext(sparkConf)

    val array_left = 1 until 4 // the range 1, 2, 3 (until excludes 4)
    val array_right = Array("工单", "电力", "展示")
    val result = array_left.zip(array_right)

    // Calling the foreach action: output is ordered within a partition but not across partitions
    println(sc.parallelize(result,2).sortBy(_._1,true).foreach(println(_)))
    println(sc.parallelize(result,2).sortByKey(true).foreach(println(_)))

    println("*"*20)

    // Calling the foreach action: output is ordered within a partition but not across partitions (only one partition here)
    println(sc.parallelize(result,1).sortBy(_._1,true).foreach(println(_)))
    println(sc.parallelize(result,1).sortByKey(true).foreach(println(_)))

    println("*"*20)

    // Calling the collect action: the result is globally ordered; the foreach here is the Scala collection method, not the Spark operator
    println(sc.parallelize(result,2).sortBy(_._1,true).collect().foreach(println(_)))
    println(sc.parallelize(result,2).sortByKey(true).collect().foreach(println(_)))

    sc.stop()
  }
}

(1,工单)
(3,展示)
(2,电力)
()
(3,展示)
(1,工单)
(2,电力)
()
********************
(1,工单)
(2,电力)
(3,展示)
()
(1,工单)
(2,电力)
(3,展示)
()
********************
(1,工单)
(2,电力)
(3,展示)
()
(1,工单)
(2,电力)
(3,展示)
()

Scala closures: definition and usage

def makeIncreaser(more:Int) = (x:Int) => x + more
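// inc1 and inc9999 are closures: each captures the value of more that was passed in when it was created (more is a method parameter here, so it cannot be reassigned)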
val inc1=makeIncreaser(1)
val inc9999=makeIncreaser(9999)
println(inc1(10))
println(inc9999(10))

11
10009
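A closure can also capture a `var`, in which case it sees later changes to that variable and can modify it. The sketch below is a small self-contained illustration in plain Scala; the names `makeCounter`, `increment`, and `current` are made up for this example.

```scala
object ClosureSketch {
  // Returns two closures that share the same captured variable.
  def makeCounter(): (() => Int, () => Int) = {
    var count = 0 // free variable captured by both closures
    val increment = () => { count += 1; count }
    val current = () => count
    (increment, current)
  }

  def main(args: Array[String]): Unit = {
    val (increment, current) = makeCounter()
    increment()
    increment()
    println(current()) // 2: both closures refer to the same count

    var more = 1
    val addMore = (x: Int) => x + more
    println(addMore(10)) // 11
    more = 9999          // the closure sees the new value of the captured var
    println(addMore(10)) // 10009
  }
}
```

In Spark the closures passed to operators are serialized and shipped to executors, so a change made to a captured variable inside a task is not visible on the Driver; that is the gap accumulators fill.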

Serialization: Kryo serialization

package com.jieky.studySpark
import org.apache.spark.{SparkConf, SparkContext}

object App  {
  def main(args: Array[String]): Unit = {
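    // Set the master; local[*] uses as many worker threads as the local machine has virtual cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    // Set the default parallelism, i.e. the default number of partitions (and therefore tasks) when none is specified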
    sparkConf.set("spark.default.parallelism", "4")
    val sc = new SparkContext(sparkConf)

    val array_left = 1 until 4 // the range 1, 2, 3 (until excludes 4)
    val array_right = Array("工单", "电力", "展示")
    val result = array_left.zip(array_right)

    val temp = sc.parallelize(result)
      .sortBy(_._1)
      .map(_._1)
      .filter(_%2==0)
      .map(_*2)

    // RDD lineage
    println(temp.toDebugString)

    /*
    Wide dependency: ShuffleDependency
    Narrow dependency: OneToOneDependency, RangeDependency, NarrowDependency
    PS: OneToOneDependency and RangeDependency are subclasses of NarrowDependency
     */
    // RDD dependencies
    println(temp.dependencies)
    sc.stop()
  }
}

(3) MapPartitionsRDD[8] at map at App.scala:20 []
 |  MapPartitionsRDD[7] at filter at App.scala:19 []
 |  MapPartitionsRDD[6] at map at App.scala:18 []
 |  MapPartitionsRDD[5] at sortBy at App.scala:17 []
 |  ShuffledRDD[4] at sortBy at App.scala:17 []
 +-(4) MapPartitionsRDD[1] at sortBy at App.scala:17 []
    |  ParallelCollectionRDD[0] at parallelize at App.scala:16 []
List(org.apache.spark.OneToOneDependency@e5cbff2)

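[Spark source] Dividing RDD work into stages and tasks

RDD work is divided into Application, Job, Stage, and Task
① Application: initializing a SparkContext creates an Application; the whole program is one Application, and setAppName in the code gives the main program its name
② Job: each Action operator generates a Job;
③ Stage: the number of Stages equals the number of wide dependencies (ShuffleDependency) plus 1 (the +1 is the ResultStage, the last stage of the whole flow);
④ Task: within a Stage, the number of partitions of the last RDD is the number of Tasks.
PS: Application->Job->Stage->Task is a one-to-many relationship at every level.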
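Applied to the toDebugString output shown earlier, these rules give a quick count. That lineage contains one ShuffledRDD; the last RDD before the shuffle has 4 partitions and the final RDD has 3. The sketch below only does the arithmetic: the `Job` case class is a hypothetical helper, not a Spark type, and it ignores any additional job that sortBy may run to sample the data.

```scala
object StageCountSketch {
  // shuffleDependencies: number of wide dependencies in the job's lineage.
  // lastRddPartitionsPerStage: partition count of the last RDD in each stage, in order.
  final case class Job(shuffleDependencies: Int, lastRddPartitionsPerStage: Seq[Int]) {
    def stages: Int = shuffleDependencies + 1     // +1 for the ResultStage
    def tasks: Int = lastRddPartitionsPerStage.sum // one task per partition per stage
  }

  def main(args: Array[String]): Unit = {
    // Numbers read off the lineage: "(4)" before the shuffle, "(3)" after it.
    val job = Job(shuffleDependencies = 1, lastRddPartitionsPerStage = Seq(4, 3))
    require(job.lastRddPartitionsPerStage.size == job.stages)
    println(s"stages = ${job.stages}") // stages = 2
    println(s"tasks = ${job.tasks}")   // tasks = 7
  }
}
```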

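Spark: RDD data partitioning (partitioners)

Spark currently supports Hash partitioning and Range partitioning, and users can also define custom partitioners. Hash partitioning is the current default. The partitioner directly determines the number of partitions in an RDD, which partition each record belongs to after a shuffle, and the number of reduce tasks.
Note:
(1) Only key-value RDDs have a partitioner; for non-key-value RDDs the partitioner is None
(2) The partition IDs of an RDD range from 0 to numPartitions-1; the ID determines which partition a value belongs to.

Hash partitioner: fast, but may cause data skew
Range partitioner: slower, but avoids data skew to some extent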
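The skew trade-off is easy to see with small numbers. The sketch below reimplements the two placement rules in plain Scala: hash placement uses the key's hashCode modulo the partition count, kept non-negative, with null keys going to partition 0; range placement counts how many boundaries are smaller than the key. It is a simplified model, not Spark's HashPartitioner or RangePartitioner. In particular the `bounds` passed to `rangePartition` are hypothetical, whereas the real RangePartitioner derives its boundaries by sampling the data.

```scala
object PartitionerSketch {
  // Hash rule: null goes to partition 0, everything else to a non-negative hashCode modulo.
  def hashPartition(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _ =>
      val rawMod = key.hashCode % numPartitions
      rawMod + (if (rawMod < 0) numPartitions else 0)
  }

  // Range rule (ascending): the partition index is the number of bounds smaller than the key.
  def rangePartition(key: Int, bounds: Seq[Int]): Int =
    bounds.count(_ < key)

  def main(args: Array[String]): Unit = {
    println(Seq(1, 2, 3).map(hashPartition(_, 3)))    // 1, 2, 0
    println(hashPartition(-1, 3))                     // 2, never negative
    // Keys that are all multiples of 3 collide in one hash partition: skew.
    println(Seq(0, 3, 6, 9).map(hashPartition(_, 3))) // 0, 0, 0, 0
    // The same keys spread out under range partitioning with bounds 3 and 6.
    println(Seq(0, 3, 6, 9).map(rangePartition(_, Seq(3, 6)))) // 0, 0, 1, 2
  }
}
```

Hash placement needs only a hashCode and a modulo, which is why it is fast; range placement needs boundaries computed from the data first, which is the extra cost.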

package com.jieky.studySpark
import org.apache.spark.{HashPartitioner, Partitioner, RangePartitioner, SparkConf, SparkContext}

object App  {
  def main(args: Array[String]): Unit = {
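    // Set the master; local[*] uses as many worker threads as the local machine has virtual cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    // Set the default parallelism, i.e. the default number of partitions (and therefore tasks) when none is specified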
    sparkConf.set("spark.default.parallelism", "4")
    val sc = new SparkContext(sparkConf)

    val array_left = 1 until 4 // the range 1, 2, 3 (until excludes 4)
    val array_right = Array("工单", "电力", "展示")
    val result = array_left.zip(array_right)

    val temp = sc.parallelize(result,5)
    println("Partitioner: ",temp.partitioner)
    println("Partitions: ")
    temp.partitions.foreach(println(_))

    println("-"*20)

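    // The constructor argument 3 of HashPartitioner is the number of partitions, which is also the number of reduce tasks launched,
    // and the length of the array returned by partitions on the child RDD that reduceByKey returns.
    val temp1 = sc.parallelize(result,2).partitionBy(new HashPartitioner(3))
    println("Partitioner: ",temp1.partitioner)
    println("Partitions: ")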
    temp1.partitions.foreach(println(_))

    println("-"*20)

    val nopar = sc.parallelize(List((1,3),(1,2),(2,4),(2,3),(3,6),(3,8)),8)
    //val temp2 = nopar.mapPartitionsWithIndex((index,iter)=>{ Iterator(index.toString+" : "+iter.mkString("|")) }).collect()
    //temp2.foreach(println(_))

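    /*
    If no partitioner is specified explicitly, one is chosen by the following rules:
    1. Check whether the parent RDD has a partitioner; if so, use the parent's partitioner
    2. Check whether spark.default.parallelism is defined in sparkConf; if so, return new HashPartitioner(sc.defaultParallelism)
    3. If neither applies, return new HashPartitioner(rdd_parent.partitions.length) as the default partitioner
    * */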
    val hashpar = nopar.partitionBy(new org.apache.spark.HashPartitioner(7))
    println(hashpar.count)
    println(hashpar.partitioner)

    println("-"*20)

    val pairs = sc.parallelize(List((1,1),(2,2),(3,3)))
    val rangePartitioned = pairs.partitionBy(new RangePartitioner(2,pairs))
    println("Partitioner: "+rangePartitioned.partitioner)
    println("Partitions: ")
    rangePartitioned.partitions.foreach(println)

    println("-"*20)

    // Custom partitioner: override numPartitions and getPartition
    val listRDD = sc.makeRDD(List(("a",1),("b",2),("c",3))).partitionBy(new Partitioner{
      override def numPartitions: Int = {
        3
      }
      override def getPartition(key: Any): Int = {
        1
      }
    })
    println("Partitioner: "+listRDD.partitioner)
    println("Partitions: ")
    listRDD.partitions.foreach(println)
    sc.stop()
  }
}

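Spark shared variables: accumulators (with a review of transformations and actions)
Spark persistence (the difference between cache and persist)

To count certain events while Tasks are computing in Spark, filter/reduce would work, but an accumulator is more convenient. A classic use of accumulators is recording the number of certain events in a Spark Streaming application. Keep in mind that only the Driver can read an accumulator's value; on the Task side only the add operation is performed. (Each task adds to its own copy of the accumulator and the Driver merges the copies, so tasks do not compete for writes and the data is not corrupted.)

The Accumulator that Spark provides is mainly used for letting multiple nodes operate on a shared variable. An Accumulator only supports accumulation: values can be added but not subtracted. It can only be created on the Driver and its result can only be read on the Driver; Tasks can only add to it. The accumulator is updated only when an action operator is called.

Note: updates made on the executors are sent back to the Driver program. Spark sends them together with the result of each finished task rather than on every add call, so calling add inside a task is a local operation.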
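The life cycle described above, where each task adds to its own copy and the Driver merges the copies, can be modeled without Spark. The sketch below is a toy model under that assumption: `LongAcc` is a hypothetical stand-in and not Spark's AccumulatorV2, and the "tasks" are just a loop over in-memory partitions.

```scala
object AccumulatorModelSketch {
  // Toy stand-in for a long accumulator.
  final class LongAcc(private var sum: Long = 0L) {
    def add(v: Long): Unit = sum += v
    def merge(other: LongAcc): Unit = sum += other.sum
    def copyAndReset(): LongAcc = new LongAcc(0L)
    def value: Long = sum
  }

  // Counts elements the way an accumulator would: one local copy per task.
  def run(partitions: Seq[Seq[Int]]): Long = {
    val driverAcc = new LongAcc()
    // Each task works on its own zeroed copy, so tasks never write to shared state.
    val taskCopies = partitions.map { part =>
      val local = driverAcc.copyAndReset()
      part.foreach(_ => local.add(1))
      local
    }
    // When the tasks finish, the driver merges their copies into its own instance.
    taskCopies.foreach(driverAcc.merge)
    driverAcc.value
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6, 7, 8, 9, 10))
    println(run(partitions)) // 10
  }
}
```

If the same partitions were computed twice, for example because an RDD is recomputed by a second action, the merge would run twice and the count would double, which is why the example code later in this section caches the RDD.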

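cache() and persist() only mark an RDD for persistence and return that same RDD, so they can be chained directly after a transformation or textFile, or called on the RDD in a separate statement; either way the data is stored only when the first action computes the RDD. The source code shows that cache() is shorthand for persist(): it calls the no-argument version of persist, which uses the MEMORY_ONLY storage level.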

package com.jieky.studySpark
import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{HashPartitioner, Partitioner, RangePartitioner, SparkConf, SparkContext}

object App  {
  def main(args: Array[String]): Unit = {
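    // Set the master; local[*] uses as many worker threads as the local machine has virtual cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    // Set the default parallelism, i.e. the default number of partitions (and therefore tasks) when none is specified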
    sparkConf.set("spark.default.parallelism", "4")
    val sc = new SparkContext(sparkConf)

    // sc.collectionAccumulator[String]("")
    // sc.doubleAccumulator("")
    val accum = sc.longAccumulator("Error2 Accumulator")
    val numberRDD = sc.parallelize(1 to 10).map(n => {
      accum.add(1)
      n + 1
    })

    // Use cache (or persist); otherwise every action recomputes the RDD from scratch and the accumulator is updated again each time
    numberRDD.cache().count()
    println("accum1: " + accum.value)
    numberRDD.reduce(_+_)
    println("accum2: " + accum.value)

    println("-"*20)

    // Custom accumulator
    val myAccum = new MyAccumulatorV2
    sc.register(myAccum,"DIY accumulator")
    val sum: Int = sc.parallelize(
      Array("1", "2a", "3", "4f", "a5", "6", "2a"), 2)
      .filter(line => {
        val pattern = """^-?(\d+)"""
        val flag = line.matches(pattern)
        if (flag) {
          myAccum.add(line)
        }
        flag
      }
    ).map(_.toInt).reduce(_+_)
    println("Sum: "+sum+" = "+ myAccum.value.toArray().mkString("+"))

    sc.stop()
  }
}

class MyAccumulatorV2 extends AccumulatorV2[String, java.util.Set[String]]{

  private val set:java.util.Set[String] = new java.util.HashSet[String]()

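  // Returns whether this accumulator is at its zero value
  override def isZero: Boolean = {
    set.isEmpty
  }

  // Resets the accumulator to its initial state
  override def reset(): Unit = {
    set.clear()
  }

  // Adds a value to the accumulator
  override def add(v: String): Unit = {
    set.add(v)
  }

  // Merges another accumulator of the same type into this one
  override def merge(other: AccumulatorV2[String, java.util.Set[String]]): Unit = {
    other match {
      case o:MyAccumulatorV2 => set.addAll(o.value)
      case _ => throw new UnsupportedOperationException(
        s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
    }
  }

  // Returns the current value of this accumulator
  override def value: java.util.Set[String] = {
    // Returns an unmodifiable view of the specified set
    java.util.Collections.unmodifiableSet(set)
  }

  // Creates a new copy of this accumulator
  override def copy(): MyAccumulatorV2 = {
    val newAcc = new MyAccumulatorV2()
    // Lock the set object while copying it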
    set.synchronized{
      newAcc.set.addAll(set)
    }
    newAcc
  }

}
