RDD Transformation Operators (Part 2): Key-Value (KV) Operators and Source Code Analysis

Transformation operators

A transformation produces a new RDD from an existing one. Transformations are lazy (deferred): the transformation code is not executed right away; it only runs when the program reaches an action operator. This design lets Spark run more efficiently.
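As a minimal sketch of this laziness (assuming a live SparkContext `sc`, e.g. in spark-shell), the side effect inside map below does not run until an action is called:

val doubled = sc.parallelize(1 to 4).map { x =>
  println(s"processing $x")   // not printed when the transformation is defined
  x * 2
}
doubled.collect()             // collect is an action, so the map actually runs here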

KV operators apply to RDDs whose elements are key-value pairs.

Official documentation

RDD Programming Guide - Spark 3.5.1 Documentation

1. groupByKey

groupByKey groups all values that share the same key; the value produced for each key is a collection of those values.

val rdd = sc.parallelize(Array(("red",1), ("yellow",2), ("red", 3),("yellow", 4)))
rdd.groupByKey().collect()

Array[(String, Iterable[Int])] = Array((red,CompactBuffer(1, 3)), (yellow,CompactBuffer(2, 4)))

CompactBuffer: an append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.

Extension:

List(("a", 50), ("b", 70), ("c", 30), ("c", 90),("b", 20), ("a", 80)) 进行分组排序,按字符串进行分组,对整数进行倒序排序。

sc.makeRDD(List(("a", 50), ("b", 70), ("c", 30), ("c", 90),("b", 20), ("a", 80))).groupByKey().map(x => (x._1, x._2.toList.sorted.reverse)).foreach(println)

     toList: converts the CompactBuffer to a List;
     sorted: sorts the List in ascending order;
     reverse: reverses the elements, turning the ascending order into descending order.
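As an alternative sketch, the descending sort can also be done in a single step with sortWith instead of sorted followed by reverse:

sc.makeRDD(List(("a", 50), ("b", 70), ("c", 30), ("c", 90), ("b", 20), ("a", 80)))
  .groupByKey()
  .mapValues(_.toList.sortWith(_ > _))   // sort each group's values in descending order
  .foreach(println)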

Source code analysis

/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
 * within each group is not guaranteed, and may even differ each time the resulting RDD is
 * evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 */
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}
/**
 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 * The ordering of elements within each group is not guaranteed, and may even differ
 * each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
 */
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

CompactBuffer analysis
/**
 * An append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.
 * ArrayBuffer always allocates an Object array to store the data, with 16 entries by default,
 * so it has about 80-100 bytes of overhead. In contrast, CompactBuffer can keep up to two
 * elements in fields of the main object, and only allocates an Array[AnyRef] if there are more
 * entries than that. This makes it more efficient for operations like groupBy where we expect
 * some keys to have very few elements.
 */
private[spark] class CompactBuffer[T: ClassTag] extends Seq[T] with Serializable {

2. reduceByKey(func)

Called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values of each key are aggregated together using the given function.

val rdd = sc.parallelize(Array(("red",1), ("yellow",2), ("red", 3),("yellow", 4)))
rdd.reduceByKey((x,y) => x + y).collect

Array[(String, Int)] = Array((red,4), (yellow,6))

What do x and y stand for?

x is the partial result accumulated so far for a key, and y is the next value encountered for that key.

When the first value of a key is processed there is nothing accumulated yet, so that value simply becomes the initial partial result.

When the next value is processed, x is the partial result so far and it is combined with y, and so on. Because partial results are also merged across partitions, the function must be associative and commutative.
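A small local sketch of how the accumulation works for a single key (for "red" the values are 1 and 3; Spark may combine values in any grouping, which is exactly why the function must be associative and commutative):

// reduceLeft mirrors the step-by-step description above for one key's values
val redValues = List(1, 3)
val redTotal = redValues.reduceLeft((x, y) => x + y)   // x = running result, y = next value
// redTotal == 4, matching (red,4) in the output above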

What if I want to aggregate data that is not in key-value form?

Use reduce:

val rdd5 = sc.parallelize(List(1, 2, 3, 4, 5))
rdd5.reduce(_ + _)

res4: Int = 15

Source code analysis
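The relevant reduceByKey overloads in Spark's PairRDDFunctions (lightly abridged). Note that, unlike groupByKey, map-side combining stays enabled here, because the reduce function shrinks the data before the shuffle:

/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce.
 */
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

/**
 * Merge the values for each key using an associative and commutative reduce function. Output will
 * be hash-partitioned with numPartitions partitions.
 */
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
  reduceByKey(new HashPartitioner(numPartitions), func)
}

/**
 * Merge the values for each key using an associative and commutative reduce function. Output will
 * be hash-partitioned with the existing partitioner/parallelism level.
 */
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}

Here createCombiner is the identity function (v: V) => v, and both mergeValue and mergeCombiners are the user-supplied func, so reduceByKey is just a thin specialization of combineByKeyWithClassTag.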


3. aggregateByKey(zeroValue)(seqOp, combOp)

Aggregates the elements within each partition using seqOp and the initial value (zeroValue), then combines the per-partition results using combOp.

That is, within each partition the values with the same key are aggregated first; the per-partition results are then aggregated across partitions.

val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12),("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
pairRDD.aggregateByKey(1)(_+_, _ + _).collect

res1: Array[(String, Int)] = Array((dog,13), (cat,21), (mouse,7))

Let's walk through the process above (a sketch for checking the partition contents follows the list):

1. First partition: ("cat",2), ("cat",5), ("dog",12); after per-partition aggregation with zero value 1: ("cat",8), ("dog",13)
2. Second partition: ("mouse",4), ("cat",12), ("mouse",2); after per-partition aggregation: ("mouse",7), ("cat",13)
3. Final result after combining partitions: Array[(String, Int)] = Array((dog,13), (cat,21), (mouse,7))
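A quick sketch for verifying which records each partition actually holds (glom() gathers every partition into an Array):

pairRDD.glom().collect().foreach(p => println(p.mkString(", ")))
// expected: partition 0 -> (cat,2), (cat,5), (dog,12)
//           partition 1 -> (mouse,4), (cat,12), (mouse,2)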

Extension

scala> val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12),("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
scala> pairRDD.aggregateByKey(2)(_+_, _ * _).collect

Result: Array[(String, Int)] = Array((dog,14), (cat,126), (mouse,8)) (per partition with zero value 2: cat becomes 9 and 14, dog 14, mouse 8; combining with * gives cat 9*14=126)

Source code analysis

/**
 * Aggregate the values of each key, using given combine functions and a neutral "zero value".
 * This function can return a different result type, U, than the type of the values in this RDD,
 * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
 * as in scala.TraversableOnce. The former operation is used for merging values within a
 * partition, and the latter is used for merging values between partitions. To avoid memory
 * allocation, both of these functions are allowed to modify and return their first argument
 * instead of creating a new U.
 */
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)

  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

  // We will clean the combiner closure later in `combineByKey`
  val cleanedSeqOp = self.context.clean(seqOp)
  combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
    cleanedSeqOp, combOp, partitioner)
}

4. foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]

foldByKey is a simplified form of aggregateByKey. It also aggregates values with the same key, but it takes only one function: the data within each partition is aggregated with this function, and the per-partition results are then aggregated again with the same function.

zeroValue is the initial value: within each partition, the values of each key are combined with zeroValue before being aggregated with func.

val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12),("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
pairRDD.foldByKey(1)(_+_).collect

Analysis (see the equivalence sketch after the breakdown):

First partition: (cat,2) (cat,5) (dog,12); cat: 1+2+5=8; dog: 1+12=13

Second partition: (mouse,4) (cat,12) (mouse,2); mouse: 1+4+2=7; cat: 1+12=13

Merging the partitions: cat: 8+13=21; dog: 13; mouse: 7
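As a sanity-check sketch, foldByKey(zeroValue)(func) should behave like aggregateByKey(zeroValue)(func, func), i.e. the same function is used both within and across partitions:

pairRDD.foldByKey(1)(_ + _).collect
pairRDD.aggregateByKey(1)(_ + _, _ + _).collect
// both return Array((dog,13), (cat,21), (mouse,7))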

Source code analysis

/**
 * Merge the values for each key using an associative function and a neutral "zero value" which
 * may be added to the result an arbitrary number of times, and must not change the result
 * (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
 */
def foldByKey(
    zeroValue: V,
    partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)

  // When deserializing, use a lazy val to create just one instance of the serializer per task
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))

  val cleanedFunc = self.context.clean(func)
  combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
    cleanedFunc, cleanedFunc, partitioner)
}

Comparing the underlying implementations of aggregateByKey and foldByKey shows that foldByKey is a simplified version of aggregateByKey:

combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
      cleanedSeqOp, combOp, partitioner)

combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
      cleanedFunc, cleanedFunc, partitioner)
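Both calls are special cases of the general three-function form. As a hedged sketch of that general form, here is combineByKey (the public counterpart of combineByKeyWithClassTag) computing a per-key average with a (sum, count) pair as the combiner, reusing the pairRDD from the foldByKey example:

val avgByKey = pairRDD
  .combineByKey(
    (v: Int) => (v, 1),                                             // createCombiner: first value seen for a key
    (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),          // mergeValue: fold a value into the per-partition combiner
    (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))   // mergeCombiners: merge combiners across partitions
  .mapValues { case (sum, count) => sum.toDouble / count }
avgByKey.collect()   // roughly: (dog,12.0), (cat,6.33...), (mouse,3.0)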

5. sortByKey([ascending])

Called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs sorted by key.

val rdd = sc.parallelize(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))
rdd.sortByKey(true).collect
rdd.sortByKey(false).collect

res2: Array[(Int, String)] = Array((1,dd), (2,bb), (3,aa), (6,cc))

res1: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))
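A related, hedged sketch: sortByKey resolves an implicit Ordering[K], so placing a custom ordering in scope changes the sort. The data and ordering below are hypothetical, sorting string keys by length:

val strRdd = sc.parallelize(Array(("spark", 1), ("rdd", 2), ("flink", 3)))
implicit val byLength: Ordering[String] = Ordering.by(_.length)   // local implicit takes precedence over the default string ordering
strRdd.sortByKey().collect()   // keys ordered by length: ("rdd",2) first, then the two 5-letter keys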

Source code analysis

/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}

6. sortBy

Sorts the elements by the result of applying func to each element.

val rdd = sc.parallelize(List(2, 1, 4, 3))
rdd.sortBy(x => x).collect()
rdd.sortBy(x => x%2).collect()

res4: Array[Int] = Array(1, 2, 3, 4)

res5: Array[Int] = Array(2, 4, 1, 3)
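A common related sketch: because sortBy works on whole elements, a pair RDD can be sorted by its value rather than its key:

val counts = sc.parallelize(Array(("red", 4), ("yellow", 6), ("blue", 1)))
counts.sortBy(_._2, ascending = false).collect()   // sort by the value, descending
// Array((yellow,6), (red,4), (blue,1))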

Source code analysis

/**
 * Return this RDD sorted by the given key function.
 */
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
      .sortByKey(ascending, numPartitions)
      .values
}

7. mapValues

Applies the given function to the value of each key-value (KV) element in the RDD. The key of each element is kept unchanged and, together with the new value, forms an element of the new RDD. This function therefore only applies to RDDs whose elements are key-value pairs: for an RDD of type (K, V), only V is transformed.

val rdd1 = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val rdd2 = rdd1.map(x => (x.length, x))
rdd2.mapValues("|" + _ + "|").collect

Array[(Int, String)] = Array((3,|dog|), (5,|tiger|), (4,|lion|), (3,|cat|), (7,|panther|), (5,|eagle|))
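A related sketch: flatMapValues also keeps the key (and the partitioning) but can expand each value into zero or more values:

val rdd3 = sc.parallelize(Array((1, "a b"), (2, "c")))
rdd3.flatMapValues(_.split(" ")).collect()   // split each value and keep the key for every piece
// Array((1,a), (1,b), (2,c))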

Source code analysis

/**
 * Pass each value in the key-value pair RDD through a map function without changing the keys;
 * this also retains the original RDD's partitioning.
 */
def mapValues[U](f: V => U): RDD[(K, U)] = self.withScope {
  val cleanF = self.context.clean(f)
  new MapPartitionsRDD[(K, U), (K, V)](self,
    (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
    preservesPartitioning = true)
}

8. join(otherDataset)

Called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (V, W)) in which every pair of elements with the same key is joined together. This is an inner join: keys that appear in only one of the RDDs are dropped.

val rdd1 = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c"),(4, "f")))
val rdd2 = sc.parallelize(Array((1,4),(2,5),(3,6)))
rdd1.join(rdd2).collect()

res7: Array[(Int, (String, Int))] = Array((1,(a,4)), (2,(b,5)), (3,(c,6)))

val rdd1 = sc.parallelize(Array((1,"a"), (3,"b"), (3,"c")))
val rdd2 = sc.parallelize(Array((1,4), (1,5), (3,6), (3,7)))
rdd1.join(rdd2).collect()

res8: Array[(Int, (String, Int))] = Array((1,(a,4)), (1,(a,5)), (3,(b,6)), (3,(b,7)), (3,(c,6)), (3,(c,7)))

val a = sc.parallelize(List("dog", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect

res9: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))

Source code analysis

/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
 */
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}
/**
 * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
 * list of values for that key in `this` as well as `other`.
 */
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}
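A hedged sketch of calling the underlying cogroup directly; each key is paired with the full collections of values from both sides (the exact CompactBuffer rendering may vary):

val left  = sc.parallelize(Array((1, "a"), (2, "b"), (4, "f")))
val right = sc.parallelize(Array((1, 4), (2, 5), (3, 6)))
left.cogroup(right).collect()
// roughly: Array((1,(CompactBuffer(a),CompactBuffer(4))), (2,(CompactBuffer(b),CompactBuffer(5))),
//                (3,(CompactBuffer(),CompactBuffer(6))),  (4,(CompactBuffer(f),CompactBuffer())))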

9. leftOuterJoin(otherDataset)

A left outer join: for each element (k, v) in the left RDD, if the right RDD contains matching keys, one pair (k, (v, Some(w))) is produced for every matching w; if the right RDD has no element with key k, the pair (k, (v, None)) is produced.

Continuing with the RDDs b and d defined above:

b.leftOuterJoin(d).collect

res94: Array[(Int, (String, Option[String]))] = Array((6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (3,(dog,Some(dog))), (3,(dog,Some(cat))), (3,(dog,Some(gnu))), (3,(dog,Some(bee))), (3,(rat,Some(dog))), (3,(rat,Some(cat))), (3,(rat,Some(gnu))), (3,(rat,Some(bee))), (8,(elephant,None)))
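Because the right-hand value is an Option, a default is often substituted with getOrElse. A hedged sketch, continuing with b and d:

b.leftOuterJoin(d)
 .mapValues { case (v, wOpt) => (v, wOpt.getOrElse("-")) }   // replace None with a placeholder string
 .collect()
// unmatched keys now carry "-" instead of None, e.g. (8,(elephant,-))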

Source code analysis

/**
 * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
 * resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
 * pair (k, (v, None)) if no elements in `other` have key k. Uses the given Partitioner to
 * partition the output RDD.
 */
def leftOuterJoin[W](
    other: RDD[(K, W)],
    partitioner: Partitioner): RDD[(K, (V, Option[W]))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    if (pair._2.isEmpty) {
      pair._1.iterator.map(v => (v, None))
    } else {
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
    }
  }
}

10. rightOuterJoin(otherDataset)

The mirror image of leftOuterJoin: for each element (k, w) in the right RDD, if the left RDD contains matching keys, one pair (k, (Some(v), w)) is produced for every matching v; if the left RDD has no element with key k, the pair (k, (None, w)) is produced.

b.rightOuterJoin(d).collect

Array[(Int, (Option[String], String))] = Array((6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)), (3,(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat),dog)), (3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)), (4,(None,wolf)), (4,(None,bear)))
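For completeness, a hedged sketch of fullOuterJoin, which keeps unmatched keys from both sides and wraps both values in Option:

b.fullOuterJoin(d).collect()
// includes matched pairs such as (3,(Some(dog),Some(dog))) as well as
// (8,(Some(elephant),None)) and (4,(None,Some(wolf)))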

Source code analysis

/**
 * Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the
 * resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the
 * pair (k, (None, w)) if no elements in `this` have key k. Uses the given Partitioner to
 * partition the output RDD.
 */
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Option[V], W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    if (pair._1.isEmpty) {
      pair._2.iterator.map(w => (None, w))
    } else {
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (Some(v), w)
    }
  }
}

Debugging the code

pom.xml

   <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.2.0</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.15</version>
        </dependency>
    </dependencies>

Create the KVRDD object

package com.soft863

import org.apache.spark.{SparkConf, SparkContext}

object KVRDD {

  def main(args: Array[String]): Unit = {
    // Create the Spark configuration
    val conf: SparkConf = new SparkConf().setAppName("KV RDD DEMO").setMaster("local[*]")
    // Create the SparkContext, the entry point for submitting the job
    val sc = new SparkContext(conf)

    val rdd1 = sc.parallelize(Array(("red", 1), ("yellow", 2), ("red", 3), ("yellow", 4)))
    rdd1.groupByKey().collect()

    val rdd2 = sc.parallelize(Array(("red", 1), ("yellow", 2), ("red", 3), ("yellow", 4)))
    rdd2.reduceByKey((x, y) => x + y).collect

    val pairRDD3 = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12), ("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
    pairRDD3.aggregateByKey(1)(_ + _, _ + _).collect

    val pairRDD4 = sc.parallelize(List(("cat", 2), ("cat", 5), ("dog", 12), ("mouse", 4), ("cat", 12), ("mouse", 2)), 2)
    pairRDD4.foldByKey(2)(_ + _).collect

    val rdd5 = sc.parallelize(Array((3, "aa"), (6, "cc"), (2, "bb"), (1, "dd")))
    rdd5.sortByKey(true).collect
    rdd5.sortByKey(false).collect

    val rdd6 = sc.parallelize(List(2, 1, 4, 3))
    rdd6.sortBy(x => x).collect()
    rdd6.sortBy(x => x % 2).collect()

    val rdd7 = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
    val rdd71 = rdd7.map(x => (x.length, x))
    rdd71.mapValues("|" + _ + "|").collect

    val rdd8 = sc.parallelize(Array((1, "a"), (2, "b"), (3, "c"), (4, "f")))
    val rdd81 = sc.parallelize(Array((1, 4), (2, 5), (3, 6)))
    rdd8.join(rdd81).collect()

    val a = sc.parallelize(List("dog", "salmon", "rat", "elephant"), 3)
    val b = a.keyBy(_.length)
    val c = sc.parallelize(List("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee"), 3)
    val d = c.keyBy(_.length)
    b.join(d).collect
    b.leftOuterJoin(d).collect
    b.rightOuterJoin(d).collect

    // Stop the SparkContext when the demo finishes
    sc.stop()
  }
}
