A Simple Guide to Using SparkCore

code@fzk

Last modified: 2022-12-29 15:50:16

First published: 2021-08-06 09:54:41

Article link: https://blog.csdn.net/qq_44002865/article/details/119445943

SparkCore

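0. Introduction to RDDs

An RDD (Resilient Distributed Dataset) is the most basic data processing model in Spark. In code it is an abstract class that represents a resilient, immutable, partitionable collection whose elements can be computed in parallel.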

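Resilient
- Resilient storage: automatic switching between memory and disk;
- Resilient fault tolerance: lost data can be recovered automatically;
- Resilient computation: failed computations are retried;
- Resilient partitioning: data can be repartitioned as needed.
Distributed: the data is stored on different nodes of the cluster
Dataset: an RDD encapsulates computation logic and does not store the data itself
Data abstraction: RDD is an abstract class that concrete subclasses must implement
Immutable: the computation logic an RDD encapsulates cannot be changed; to change it, you must create a new RDD that encapsulates the new logic
Partitionable, computed in parallel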
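
To make the "immutable" and "encapsulates computation logic" points concrete, here is a minimal sketch, assuming a local Spark 3.x setup with spark-core on the classpath. The object name `RddImmutabilitySketch` and its `run` method are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object RddImmutabilitySketch {
  // Returns (elements of the source RDD, elements of the derived RDD).
  def run(): (List[Int], List[Int]) = {
    val conf = new SparkConf().setAppName("rdd-immutability-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val source: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4))
      // map does not modify `source`; it returns a new RDD that wraps the extra logic.
      // Nothing is computed at this point because transformations are lazy.
      val doubled: RDD[Int] = source.map(_ * 2)
      // collect() is an action, so the computation actually runs here.
      (source.collect().toList, doubled.collect().toList)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run()
    println(s"source = $before, doubled = $after")
  }
}
```

The source RDD still yields its original elements after `map` has been applied, which is what "to change it, you must create a new RDD" means in practice.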

1. Creating RDDs

Creating an RDD from a collection (in memory)

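import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
    // setAppName: application name
    // setMaster: master URL; "local[2]" runs Spark locally with 2 worker threads
    val sparkConf: SparkConf = new SparkConf().setAppName("fzk").setMaster("local[2]")
    val sparkContext = new SparkContext(sparkConf)

    val data: List[Int] = List(1, 2, 3, 4)
    // Create an RDD from the in-memory collection
    val sourceData: RDD[Int] = sparkContext.makeRDD(data)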

    sourceData.collect().foreach(println)

    sparkContext.stop()
}
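
As a small follow-up, `makeRDD` also accepts the number of partitions as a second argument. The sketch below, assuming a local Spark 3.x setup, uses `glom()` to show how the collection is split; the object name `MakeRddPartitionsSketch` is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MakeRddPartitionsSketch {
  // Returns the elements grouped by the partition they ended up in.
  def run(): List[List[Int]] = {
    val conf = new SparkConf().setAppName("make-rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // The second argument is the number of partitions (slices) to create.
      val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
      // glom() turns each partition into one array so the split becomes visible.
      rdd.glom().collect().map(_.toList).toList
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```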

Creating an RDD from external storage (files)

def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("fzk").setMaster("local[2]")
    val sparkContext = new SparkContext(sparkConf)

    // Read data from a local file
    val sourceData01: RDD[String] = sparkContext.textFile("path/data")
    // Read data from a distributed file system (for example HDFS)
    val sourceData02: RDD[String] = sparkContext.textFile("hdfs://hadoop01:8020/data")

    sparkContext.stop()
}

2. RDD Operators

Transformations

Official documentation for transformations

Some common Spark transformations

Transformation	Meaning
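map(func)	Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func)	Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func)	Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed)	Sample a fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset)	Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset)	Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numPartitions])	Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numPartitions])	When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: if you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: by default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.
reduceByKey(func, [numPartitions])	When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. As in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])	When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different from the input value type, while avoiding unnecessary allocations. As in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numPartitions])	When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numPartitions])	When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numPartitions])	When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset)	When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars])	Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin, and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions)	Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions)	Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner)	Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.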
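
A few of the transformations in the table can be chained into the classic word count. This is only a minimal sketch, assuming a local Spark 3.x setup; the object name `TransformationsSketch` and the sample lines are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsSketch {
  // Counts the words in a few in-memory lines using filter, flatMap, map and reduceByKey.
  def run(): Map[String, Int] = {
    val conf = new SparkConf().setAppName("transformations-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.makeRDD(List("spark core", "spark sql", ""))
      val counts = lines
        .filter(_.nonEmpty)         // drop empty lines
        .flatMap(_.split(" "))      // one line becomes zero or more words
        .map(word => (word, 1))     // build (K, V) pairs
        .reduceByKey(_ + _)         // sum the values of each key
      // collect() is the action that triggers the whole chain.
      counts.collect().toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

`reduceByKey` is used here rather than `groupByKey` followed by a sum, in line with the performance note in the table.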

Actions

Official documentation for actions

Action	Meaning
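reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect()	Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count()	Return the number of elements in the dataset.
first()	Return the first element of the dataset (similar to take(1)).
take(n)	Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed])	Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering])	Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path)	Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala)	Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path) (Java and Scala)	Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey()	Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func)	Run a function func on each element of the dataset. This is usually done for side effects such as updating an accumulator or interacting with external storage systems. Note: modifying variables other than accumulators outside of foreach() may result in undefined behavior. See "Understanding closures" in the official documentation for more details.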
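
The sketch below runs several of these actions on a tiny RDD, assuming a local Spark 3.x setup. The object name `ActionsSketch` is hypothetical. One detail worth knowing: in the Scala API, `countByKey()` returns counts of type `Long`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsSketch {
  // Returns (sum, count, first two elements, two smallest elements, count per key).
  def run(): (Int, Long, List[Int], List[Int], Map[Int, Long]) = {
    val conf = new SparkConf().setAppName("actions-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(5, 3, 1, 4, 2), 2)
      val sum = rdd.reduce(_ + _)                  // commutative and associative function
      val count = rdd.count()
      val firstTwo = rdd.take(2).toList            // follows partition order, not sorted
      val smallestTwo = rdd.takeOrdered(2).toList  // uses the natural ordering of Int
      // Key each number by its parity, then count how many elements each key has.
      val perKey = rdd.map(n => (n % 2, 1)).countByKey().toMap
      (sum, count, firstTwo, smallestTwo, perKey)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```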

3. Saving Output

// Save as text files
rdd.saveAsTextFile("output")

// Serialize the elements as objects and save them to files
rdd.saveAsObjectFile("output1")

// Save as a SequenceFile (only available for key-value RDDs)
rdd.map((_,1)).saveAsSequenceFile("output2")
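
Each of the three formats can be read back with a matching `SparkContext` method. The following is a rough sketch, assuming a local Spark 3.x setup and a writable temporary directory; the object name `SaveAndLoadSketch` and the sub-directory names are invented for this example.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object SaveAndLoadSketch {
  // Writes a small RDD in all three formats and reads each one back.
  def run(): (List[String], List[Int], List[(Int, Int)]) = {
    val conf = new SparkConf().setAppName("save-load-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Spark refuses to write into an existing directory, so use fresh sub-paths.
      val base = Files.createTempDirectory("spark-save-sketch").toString
      val rdd = sc.makeRDD(List(1, 2, 3), 1)

      rdd.saveAsTextFile(s"$base/text")
      rdd.saveAsObjectFile(s"$base/object")
      rdd.map((_, 1)).saveAsSequenceFile(s"$base/sequence")

      // Sort after reading so the result does not depend on file ordering.
      val text = sc.textFile(s"$base/text").collect().toList.sorted
      val objects = sc.objectFile[Int](s"$base/object").collect().toList.sorted
      val pairs = sc.sequenceFile[Int, Int](s"$base/sequence").collect().toList.sorted
      (text, objects, pairs)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```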

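4. Serialization

Serializable: Java's standard serialization mechanism

Kryo: often as much as 10 times faster than Java serialization, and more compact

When an RDD shuffles data, Spark already uses Kryo internally for simple types, arrays of simple types, and strings
⚠️: Even when Kryo is used, classes should still implement the Serializable interface, because Spark always serializes closures with Java serialization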

val sparkConf: SparkConf = new SparkConf()
      .setAppName("fzk")
      .setMaster("local[2]")
      // Replace the default serialization mechanism
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Register the custom classes to serialize with Kryo (Searcher: a custom entity class)
      .registerKryoClasses(Array(classOf[Searcher]))

val sparkContext = new SparkContext(sparkConf)
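
The text does not show what `Searcher` looks like, so the sketch below invents a one-field case class under that name purely for illustration and pushes it through a shuffle with Kryo enabled. It assumes a local Spark 3.x setup; `KryoSketch` is also a hypothetical name.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical entity class; the single `query` field is an assumption.
// Case classes are Serializable by default.
case class Searcher(query: String)

object KryoSketch {
  // Returns the configured serializer and, per query length, how many searchers have it.
  def run(): (String, List[(Int, Int)]) = {
    val conf = new SparkConf()
      .setAppName("kryo-sketch")
      .setMaster("local[2]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Searcher]))
    val sc = new SparkContext(conf)
    try {
      val searchers = sc.makeRDD(List(Searcher("spark"), Searcher("hive"), Searcher("spark")), 2)
      // groupBy shuffles Searcher objects, so they pass through the configured serializer.
      val sizes = searchers
        .groupBy(_.query.length)
        .mapValues(_.size)
        .collect()
        .toList
        .sorted
      (sc.getConf.get("spark.serializer"), sizes)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```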

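5. Persistence

Caching

An RDD caches the result of earlier computation through the cache or persist method; by default the data is cached in JVM heap memory. The data is not cached at the moment either method is called. Only when a later action is triggered is the RDD cached in the memory of the compute nodes, where it can be reused afterwards.
Cached data can be lost, or evicted when memory runs short. RDD fault tolerance guarantees that the computation still completes correctly: lost data is recomputed through the chain of transformations that produced the RDD. Because the partitions of an RDD are relatively independent, only the lost partitions need to be recomputed, not all of them.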

cache

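The cache method caches the result of the preceding RDD computation in JVM heap memory.
⚠️: The data is not cached when the method is called. Only when a later action is triggered is the RDD cached in the memory of the compute nodes, where it can be reused afterwards
⚠️: Internally, cache calls persist(StorageLevel.MEMORY_ONLY), which caches the data in memory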
```
rdd.cache()
```
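
One way to observe the effect of cache is to count how often the upstream function runs. The sketch below does this with a long accumulator, assuming a local Spark 3.x setup with no task failures or cache evictions; `CacheSketch` and the accumulator name are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  // Returns how many times the map function ran across two actions.
  def run(useCache: Boolean): Long = {
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val calls = sc.longAccumulator("map calls")
      val mapped = sc.makeRDD(List(1, 2, 3, 4), 2).map { n =>
        calls.add(1) // side effect used only to count evaluations
        n * 2
      }
      if (useCache) mapped.cache() // only marks the RDD; nothing is cached yet
      mapped.count() // first action: computes the RDD and fills the cache if enabled
      mapped.count() // second action: reuses cached partitions, or recomputes them
      calls.value
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"with cache: ${run(useCache = true)}, without cache: ${run(useCache = false)}")
  }
}
```

With caching, the four elements are mapped once; without it, the second action recomputes them.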

persist

The persist method caches the result of the preceding RDD computation

The following storage levels are available

Level	Storage	Replicas	Serialized (lower memory use, higher CPU cost)
DISK_ONLY	Disk	1	Yes
DISK_ONLY_2	Disk	2	Yes
MEMORY_ONLY	Memory	1	No
MEMORY_ONLY_2	Memory	2	No
MEMORY_ONLY_SER	Memory	1	Yes
MEMORY_ONLY_SER_2	Memory	2	Yes
MEMORY_AND_DISK	Memory & disk	1	No
MEMORY_AND_DISK_2	Memory & disk	2	No
MEMORY_AND_DISK_SER	Memory & disk	1	Yes
MEMORY_AND_DISK_SER_2	Memory & disk	2	Yes

rdd.persist(StorageLevel.MEMORY_AND_DISK)
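
The storage level assigned to an RDD can be inspected with `getStorageLevel` and released again with `unpersist()`. A minimal sketch, assuming a local Spark 3.x setup; the object name `PersistSketch` is hypothetical.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object PersistSketch {
  // Returns the storage level before persist, while persisted, and after unpersist.
  def run(): (StorageLevel, StorageLevel, StorageLevel) = {
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 3, 4)).map(_ * 2)
      val before = rdd.getStorageLevel

      rdd.persist(StorageLevel.MEMORY_AND_DISK)
      rdd.count() // the action that actually materializes the persisted data
      val during = rdd.getStorageLevel

      rdd.unpersist() // drop the persisted blocks once they are no longer needed
      val after = rdd.getStorageLevel
      (before, during, after)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```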

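checkpoint

A checkpoint writes an RDD's intermediate result to disk. When the lineage becomes long, recovering from a failure gets expensive, so it is better to checkpoint at an intermediate stage. If a node fails after the checkpoint, the lineage can be replayed from the checkpoint instead of from the beginning, which reduces the cost.

Calling checkpoint on an RDD does not execute it immediately; an action must run to trigger it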

def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("fzk").setMaster("local[2]")
    val sparkContext = new SparkContext(sparkConf)

    // 1. Set the directory where checkpoint data is stored
    sparkContext.setCheckpointDir("hdfs://hadoop:8020/spark/checkpoint")

    val data: List[Int] = List(1, 2, 3, 4, 5, 6)
    val sourceData: RDD[Int] = sparkContext.makeRDD(data)
    val rdd: RDD[(Int, Int)] = sourceData.map(data => (data, 1))

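    // 2. Call checkpoint on the RDD that should be checkpointed
    rdd.checkpoint()

    // 3. Run an action; the checkpoint is only written when a job runs on the RDD
    rdd.collect().foreach(println)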

    sparkContext.stop()
  }
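
The lazy behaviour of checkpoint can be checked with `isCheckpointed`. The sketch below assumes a local Spark 3.x setup and uses a temporary local directory instead of HDFS; `CheckpointSketch` is a made-up name.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  // Returns whether the RDD is checkpointed before and after the first action.
  def run(): (Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: a local temp directory stands in for the HDFS path used in the article.
      sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint-sketch").toString)

      val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6)).map(n => (n, 1))
      // Caching first is optional, but it avoids computing the RDD a second time
      // when the checkpoint data is written out.
      rdd.cache()
      rdd.checkpoint() // only marks the RDD for checkpointing

      val before = rdd.isCheckpointed
      rdd.count() // the action triggers the job and then the checkpoint
      val after = rdd.isCheckpointed
      (before, after)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```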

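6. Partitioners

Spark currently supports hash partitioning, range partitioning, and user-defined partitioning. Hash partitioning is the current default. The partitioner directly determines the number of partitions in an RDD and which partition each record goes to after a shuffle, and therefore the number of reduce tasks
Only key-value RDDs have a partitioner; for non-key-value RDDs the partitioner is None
Partition IDs in an RDD range from 0 to (numPartitions - 1); the ID determines which partition a record belongs to

Hash partitioning (built in)

For a given key, compute its hashCode and take the remainder after dividing by the number of partitions; the (non-negative) remainder is the partition ID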

class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}
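
The key step in `getPartition` is the non-negative modulo, since `hashCode` can be negative. The following plain-Scala sketch mirrors that idea without depending on Spark; `HashPartitionSketch` and `partitionFor` are illustrative names, and `nonNegativeMod` is rewritten here on the assumption that it behaves like Spark's utility of the same name.

```scala
object HashPartitionSketch {
  // A remainder that is never negative, even when x is negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Same decision as in the listing above: null keys go to partition 0.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _ => nonNegativeMod(key.hashCode, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    // The hashCode of an Int is the value itself, which keeps the example easy to follow.
    println(partitionFor(7, 3))
    println(partitionFor(-7, 3))
    println(partitionFor(null, 3))
  }
}
```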

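Range partitioning (built in)

Maps the keys within a given range to one partition, keeping the amount of data in each partition as even as possible while keeping the partitions ordered relative to one another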

class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true,
    val samplePointsPerPartitionHint: Int = 20)
  extends Partitioner {

  // A constructor declared in order to maintain backward compatibility for Java, when we add the
  // 4th constructor parameter samplePointsPerPartitionHint. See SPARK-22160.
  // This is added to make sure from a bytecode point of view, there is still a 3-arg ctor.
  def this(partitions: Int, rdd: RDD[_ <: Product2[K, V]], ascending: Boolean) = {
    this(partitions, rdd, ascending, samplePointsPerPartitionHint = 20)
  }

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")
  require(samplePointsPerPartitionHint > 0,
    s"Sample points per partition must be greater than 0 but found $samplePointsPerPartitionHint")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      // Cast to double to avoid overflowing ints or longs
      val sampleSize = math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, math.min(partitions, candidates.size))
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition-1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

  override def equals(other: Any): Boolean = other match {
    case r: RangePartitioner[_, _] =>
      r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
    case _ =>
      false
  }

  override def hashCode(): Int = {
    val prime = 31
    var result = 1
    var i = 0
    while (i < rangeBounds.length) {
      result = prime * result + rangeBounds(i).hashCode
      i += 1
    }
    result = prime * result + ascending.hashCode
    result
  }

  @throws(classOf[IOException])
  private def writeObject(out: ObjectOutputStream): Unit = Utils.tryOrIOException {
    val sfactory = SparkEnv.get.serializer
    sfactory match {
      case js: JavaSerializer => out.defaultWriteObject()
      case _ =>
        out.writeBoolean(ascending)
        out.writeObject(ordering)
        out.writeObject(binarySearch)

        val ser = sfactory.newInstance()
        Utils.serializeViaNestedStream(out, ser) { stream =>
          stream.writeObject(scala.reflect.classTag[Array[K]])
          stream.writeObject(rangeBounds)
        }
    }
  }

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException {
    val sfactory = SparkEnv.get.serializer
    sfactory match {
      case js: JavaSerializer => in.defaultReadObject()
      case _ =>
        ascending = in.readBoolean()
        ordering = in.readObject().asInstanceOf[Ordering[K]]
        binarySearch = in.readObject().asInstanceOf[(Array[K], K) => Int]

        val ser = sfactory.newInstance()
        Utils.deserializeViaNestedStream(in, ser) { ds =>
          implicit val classTag = ds.readObject[ClassTag[Array[K]]]()
          rangeBounds = ds.readObject[Array[K]]()
        }
    }
  }
}
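
Most of the class above deals with sampling the keys to find good range bounds. Once the bounds exist, the lookup itself is simple. The plain-Scala sketch below reproduces only the linear-scan branch of `getPartition` for `Int` keys, with hand-picked bounds instead of sampled ones; `RangePartitionSketch` and `partitionFor` are made-up names.

```scala
object RangePartitionSketch {
  // `bounds` holds the sorted upper bounds of the first (numPartitions - 1) partitions.
  def partitionFor(key: Int, bounds: Array[Int], ascending: Boolean = true): Int = {
    var partition = 0
    // Move right until the key is no longer greater than the current upper bound.
    while (partition < bounds.length && key > bounds(partition)) {
      partition += 1
    }
    // For descending order the partition index is mirrored.
    if (ascending) partition else bounds.length - partition
  }

  def main(args: Array[String]): Unit = {
    val bounds = Array(10, 20) // three partitions: <= 10, 11 to 20, > 20
    println(List(5, 10, 15, 25).map(k => partitionFor(k, bounds)))
  }
}
```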

Custom partitioner

Steps:

Extend Partitioner
Implement numPartitions and getPartition
Use it on a key-value RDD

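import org.apache.spark.Partitioner

// 1. Extend Partitioner
class Mypartition extends Partitioner {
    // 2.1 Number of partitions
    override def numPartitions: Int = 3

    /**
     * 2.2 Assign each record to a partition based on its key
     * @param key  the record's key
     * @return  the partition the record goes to (partition IDs start at 0)
     */
    override def getPartition(key: Any): Int = {
        // Partitioning logic goes here; this example takes the key modulo the number of
        // partitions. floorMod keeps the result non-negative even for negative keys.
        Math.floorMod(key.toString.toInt, numPartitions)
    }
}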



def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("fzk").setMaster("local[2]")
    val sparkContext = new SparkContext(sparkConf)

    val data: List[Int] = List(1, 2, 3, 4, 5, 6)
    val sourceData: RDD[Int] = sparkContext.makeRDD(data)
    val rdd: RDD[(Int, Int)] = sourceData.map(data => (data, 1))

    // 3. Use it on a key-value RDD (partitionBy returns a new RDD)
    val partitioned: RDD[(Int, Int)] = rdd.partitionBy(new Mypartition)

    sparkContext.stop()
}
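
To see where each key lands, the partitioned RDD can be inspected with `glom()`. This is a minimal sketch, assuming a local Spark 3.x setup. `ModPartitioner` is a hypothetical variant of the partitioner above that takes the partition count as a constructor argument, and `CustomPartitionerSketch` is likewise an invented name.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: key modulo the number of partitions, never negative.
class ModPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int =
    Math.floorMod(key.toString.toInt, numPartitions)
}

object CustomPartitionerSketch {
  // Returns the keys held by each partition, in partition order.
  def run(): List[List[Int]] = {
    val conf = new SparkConf().setAppName("custom-partitioner-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6)).map(n => (n, 1))
      // partitionBy returns a new RDD; the original RDD keeps its layout.
      val partitioned = rdd.partitionBy(new ModPartitioner(3))
      // Sort the keys inside each partition so the output is deterministic.
      partitioned.glom().collect().map(_.map(_._1).sorted.toList).toList
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```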

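7. Broadcast Variables

Broadcast variables are used to distribute large objects efficiently: a large read-only value is sent to all worker nodes for use by one or more Spark operations
Put simply: without a broadcast variable, every task receives its own copy of a value captured in its closure; with a broadcast variable, the value is sent to each executor once and all tasks on that executor read the same copy

Declare the broadcast variable

Use the broadcast variable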

def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("fzk").setMaster("local[2]")
    val sparkContext = new SparkContext(sparkConf)

    // 1. Declare the broadcast variable
    val mapBroadcast: Broadcast[mutable.Map[Int, String]] = sparkContext.broadcast(mutable.Map(1 -> "a", 2 -> "b", 3 -> "c"))

    val data: List[Int] = List(1, 2, 3)
    val sourceData: RDD[Int] = sparkContext.makeRDD(data)
    val rdd: RDD[(Int, String)] = sourceData.map(data => {
        // 2. Use the broadcast variable
        (data, mapBroadcast.value.getOrElse(data, "abc"))
    })

    sparkContext.stop()
}
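
The example above builds the RDD but never runs an action, so nothing is printed. The sketch below is a variant that collects the result, assuming a local Spark 3.x setup. It broadcasts an immutable `Map` because broadcast values are meant to be read-only; `BroadcastSketch` is a made-up name.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Looks up each number in a broadcast map, falling back to "abc" for missing keys.
  def run(): List[(Int, String)] = {
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // 1. Declare the broadcast variable (an immutable map, since it is read-only).
      val lookup: Broadcast[Map[Int, String]] =
        sc.broadcast(Map(1 -> "a", 2 -> "b", 3 -> "c"))

      // 2. Use it inside a transformation through .value.
      val rdd = sc.makeRDD(List(1, 2, 3, 4)).map(n => (n, lookup.value.getOrElse(n, "abc")))

      // The action that actually runs the tasks and reads the broadcast value.
      rdd.collect().toList
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```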

Maven

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-yarn_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
</dependencies>