An Introduction to Spark RDDs


SherlockYang、


Overview

Resilient Distributed Datasets (RDDs)

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

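An RDD is essentially a collection type with partitions.

A resilient distributed dataset (RDD) can be operated on in parallel and is fault-tolerant. There are two ways to create an RDD:

1) parallelize an existing collection in the driver program (new RDDs are also derived from existing ones through transformations);

2) read a dataset from an external storage system, such as HDFS, HBase, or any data source offering a Hadoop InputFormat.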

Getting Started with RDDs: Examples

Example 1:

Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:

val data = Array(1, 2, 3, 4, 5)
val r1 = sc.parallelize(data)     // let Spark choose the number of partitions
val r2 = sc.parallelize(data, 2)  // explicitly request 2 partitions

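One way to think about an RDD: it is a special collection class provided by Spark. An ordinary collection, such as a plain Array(1, 2, 3, 4, 5), is a single whole. Once it is converted into an RDD, its data can be split into partitions, and partitioning is what makes distribution possible.

For example, if this RDD is given two partitions, it may look like this: RDD (1,2) (3,4,5).

The point of this design is that each partition can be processed separately, which enables distributed computation.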
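To make "distributed computation" concrete, here is a minimal sketch. It assumes Spark is on the classpath and creates its own SparkContext with a local[2] master; in spark-shell the ready-made sc would be used instead. The object and method names (ParallelSumSketch, doubledSum) are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object for illustration; not part of Spark.
object ParallelSumSketch {

  // Doubles every element within its partition, then merges the partial sums.
  def doubledSum(sc: SparkContext): Int = {
    val data = Array(1, 2, 3, 4, 5)
    val rdd = sc.parallelize(data, 2) // request two partitions
    rdd.map(_ * 2).reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // local[2] runs Spark inside this JVM with two worker threads (an assumption for the demo).
    val conf = new SparkConf().setAppName("parallel-sum-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(doubledSum(sc)) // 2 + 4 + 6 + 8 + 10 = 30
    finally sc.stop()
  }
}
```

The map step runs independently on each partition; only the reduce step has to combine results across partitions.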

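Note: there are several ways to create an RDD. Example 1 converts a basic collection type (an Array) into one; parallelize is only one of several such methods, and more are covered later. An RDD can also be created directly when a dataset is read in.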

Example 2:

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

Text file RDDs can be created using SparkContext’s textFile method. This method takes an URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc URI) and reads it as a collection of lines. Here is an example invocation:

val distFile = sc.textFile("data.txt")
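As a small end-to-end sketch of textFile, the following writes three lines to a temporary file and reads them back as an RDD of lines. The temporary file, the local[2] master, and the names TextFileSketch and lineStats are assumptions made for the demo, not part of the original example.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object for illustration; not part of Spark.
object TextFileSketch {

  // Returns (number of lines, total number of characters) for a small temp file.
  def lineStats(sc: SparkContext): (Long, Int) = {
    val path = Files.createTempFile("data", ".txt")
    Files.write(path, "spark\nrdd\npartition\n".getBytes(StandardCharsets.UTF_8))

    // An explicit file: URI avoids depending on the configured default filesystem.
    val distFile = sc.textFile(path.toUri.toString)
    val lineCount = distFile.count()                       // each element is one line
    val totalChars = distFile.map(_.length).reduce(_ + _)  // 5 + 3 + 9
    (lineCount, totalChars)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("text-file-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(lineStats(sc))
    finally sc.stop()
  }
}
```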

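Inspecting an RDD

scala> rdd.collect

collect gathers the data in the RDD and returns it as an Array. The data, which is stored in a distributed fashion, is pulled onto a single machine (the driver) to build that Array.

Use this method with great caution in production: on a large dataset it can easily cause an out-of-memory error.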
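When only a peek at the data is needed, one common alternative is take(n), which returns just the first n elements instead of the whole dataset. The sketch below assumes a local[2] SparkContext; the names TakeSketch and firstFew are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object for illustration; not part of Spark.
object TakeSketch {

  // Fetches only the first three elements rather than collecting all one million.
  def firstFew(sc: SparkContext): Array[Int] = {
    val rdd = sc.parallelize(1 to 1000000, 8)
    rdd.take(3)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("take-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(firstFew(sc).mkString(", ")) // 1, 2, 3
    finally sc.stop()
  }
}
```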

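To see how many partitions an RDD has: scala> rdd.partitions.size

To see the elements in each partition: scala> rdd.glom.collect

glom returns the elements of each partition as an Array, so the result is one Array per partition.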
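The two inspection calls can be tried together on the five-element array from Example 1. This is a sketch under the assumption of a local[2] SparkContext; GlomSketch and layout are illustrative names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object for illustration; not part of Spark.
object GlomSketch {

  // Returns the partition count together with the contents of each partition.
  def layout(sc: SparkContext): (Int, List[List[Int]]) = {
    val rdd = sc.parallelize(Array(1, 2, 3, 4, 5), 2)
    val perPartition = rdd.glom().collect().map(_.toList).toList
    (rdd.partitions.size, perPartition)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("glom-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (count, parts) = layout(sc)
      println(s"partitions: $count")
      parts.foreach(p => println(p.mkString("(", ",", ")")))
    } finally sc.stop()
  }
}
```

Like collect, glom followed by collect brings every element to the driver, so it is best kept to small RDDs.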

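The Concept of Partitions

In the figure above, an RDD holds 25 elements (item1 through item25) in 5 partitions, which are processed on 3 machines.

Beyond glom, Spark has no dedicated built-in tool for printing an RDD's contents partition by partition, so we can write one ourselves.

Sample code: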

import org.apache.spark.rdd.RDD

import scala.reflect.ClassTag

object su {

  // Prints the elements of each partition, grouped by partition index.
  // collect() pulls all data to the driver, so use this on small RDDs only.
  def debug[T: ClassTag](rdd: RDD[T]): Unit = {
    rdd.mapPartitionsWithIndex((i: Int, iter: Iterator[T]) => {
      // Emit a single (partitionIndex, elements) pair per partition.
      Iterator((i, iter.toList))
    }).collect().foreach { case (i, elements) =>
      println(s"partition:[$i]")
      elements.foreach(println)
    }
  }
}
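A variant of the same idea returns the partition contents as a Map instead of printing them, which is easier to check in a test. This is a sketch, not the author's utility: PartitionMapSketch and byPartition are hypothetical names, and the local[2] SparkContext is an assumption for the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

import scala.reflect.ClassTag

// Hypothetical helper object for illustration; not part of Spark.
object PartitionMapSketch {

  // Maps each partition index to the list of elements stored in that partition.
  // Everything is collected to the driver, so this suits small RDDs only.
  def byPartition[T: ClassTag](rdd: RDD[T]): Map[Int, List[T]] =
    rdd.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.toList)))
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-map-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val contents = byPartition(sc.parallelize(1 to 6, 3))
      contents.toSeq.sortBy(_._1).foreach { case (i, elements) =>
        println(s"partition:[$i] ${elements.mkString(",")}")
      }
    } finally sc.stop()
  }
}
```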
