Revisiting Spark, Day 05: Creating RDDs

We now have a rough idea of what an RDD is, so how do we actually create one?

1. Creating an RDD from a collection

1.1 parallelize

Source:

  /** Distribute a local Scala collection to form an RDD.
   *
   * @note Parallelize acts lazily. If `seq` is a mutable collection and is altered after the call
   * to parallelize and before the first action on the RDD, the resultant RDD will reflect the
   * modified collection. Pass a copy of the argument to avoid this.
   * @note avoid using `parallelize(Seq())` to create an empty `RDD`. Consider `emptyRDD` for an
   * RDD with no partitions, or `parallelize(Seq[T]())` for an RDD of `T` with empty partitions.
   * @param seq Scala collection to distribute
   * @param numSlices number of partitions to divide the collection into
   * @return RDD representing distributed collection
   */
  def parallelize[T: ClassTag](
      seq: Seq[T],
      numSlices: Int = defaultParallelism): RDD[T]

parallelize is mainly used for testing. The second parameter sets the number of partitions and can be omitted.

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

Note: if seq is a mutable collection and it is modified after the call to parallelize but before the first action on the RDD, the resulting RDD will reflect the modified collection.
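
A minimal sketch of both points (an explicit partition count and the mutable-collection caveat); the variable names are only illustrative:

import scala.collection.mutable.ArrayBuffer

// Explicit partition count: 4 slices instead of defaultParallelism.
val fourSlices = sc.parallelize(1 to 100, 4)
println(fourSlices.partitions.length)      // 4

// Mutable-collection caveat: the mutation happens before the first action,
// so, per the @note above, the RDD reflects it.
val buf = ArrayBuffer(1, 2, 3)
val fromBuf = sc.parallelize(buf)
buf += 4
println(fromBuf.collect().mkString(", "))  // 1, 2, 3, 4

// Passing a copy decouples the RDD from later mutations.
val safe = sc.parallelize(buf.toList)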

1.2 range

Source:

  /**
   * Creates a new RDD[Long] containing elements from `start` to `end`(exclusive), increased by
   * `step` every element.
   *
   * @note if we need to cache this RDD, we should make sure each partition does not exceed limit.
   *
   * @param start the start value.
   * @param end the end value.
   * @param step the incremental step
   * @param numSlices number of partitions to divide the collection into
   * @return RDD representing distributed range
   */
  def range(
      start: Long,
      end: Long,
      step: Long = 1,
      numSlices: Int = defaultParallelism): RDD[Long]

Creates a new RDD[Long] containing the elements from start to end (end excluded), with a step of step, which defaults to 1.

val rangeRdd = sc.range(1, 1000, 2)

Note: if we need to cache this RDD, we should make sure each partition stays within the limit: a single partition cannot exceed Integer.MAX_VALUE bytes, which is roughly 2 GB.
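
A quick sketch checking the step, the element count, and an explicit partition count (the names are only illustrative):

val oddRdd = sc.range(1, 1000, 2, numSlices = 4)
println(oddRdd.partitions.length)        // 4
println(oddRdd.take(5).mkString(", "))   // 1, 3, 5, 7, 9
println(oddRdd.count())                  // 500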

1.3 makeRDD

makeRDD has two overloads. The first takes two parameters and is identical to parallelize; internally it simply delegates to parallelize.

/** Distribute a local Scala collection to form an RDD.
   *
   * This method is identical to `parallelize`.
   */
  def makeRDD[T: ClassTag](
      seq: Seq[T],
      numSlices: Int = defaultParallelism): RDD[T] = withScope {
    parallelize(seq, numSlices)
  }

The second overload takes a single parameter:

  /** Distribute a local Scala collection to form an RDD, with one or more
    * location preferences (hostnames of Spark nodes) for each object.
    * Create a new partition for each collection item. */
  def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
    assertNotStopped()
    val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
    new ParallelCollectionRDD[T](this, seq.map(_._1), seq.size, indexToPrefs)
  }

Here the number of partitions is determined by the size of seq. An example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName("testMakeRDD").setMaster("local[1]")
val sc = SparkContext.getOrCreate(conf)

// The partition count follows the size of the seq that is passed in: 2 here.
val rdd_with_loca_ref: RDD[Int] = sc.makeRDD(List(
  (1, List("nick", "song", "0723")),
  (2, List("nick", "song"))
))

val rdd: RDD[String] = sc.makeRDD(List("aaa", "bbb", "ccc"), 2)

println("partitions of rdd: " + rdd.partitions.length,
  "partitions of rdd_with_loca_ref: " + rdd_with_loca_ref.partitions.length)

// preferredLocations lets us look at a single partition's location preferences.
println("first partition of rdd_with_loca_ref",
  rdd_with_loca_ref.preferredLocations(rdd_with_loca_ref.partitions(0)).mkString(", "))
println("second partition of rdd_with_loca_ref",
  rdd_with_loca_ref.preferredLocations(rdd_with_loca_ref.partitions(1)).mkString(", "))
// rdd also has two partitions, but it carries no location preferences.
println("second partition of rdd",
  rdd.preferredLocations(rdd.partitions(1)).mkString(", "))

rdd_with_loca_ref.foreach(println)

Notes:

This makeRDD overload does not let you choose the number of partitions; it is fixed to the size of the seq argument.

rdd_with_loca_ref has type RDD[Int]. The RDD has two partitions, each holding a single record, so the foreach at the end prints only 1 and 2. Where did the two Lists go? They are the per-partition location preferences, and they are retrieved through the preferredLocations method, as the println calls above show.

2. Creating an RDD from external storage

2.1 textFile

  /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   */
  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String]

Notes:

  • path can be a file on the cluster (hdfs://…) or a local file (file:///…); inside a project, a relative path is resolved against the project root. For a local file, every node must be able to reach the file under that path, and file permissions need to be checked as well.
  • The file is split into records on CRLF or LF, i.e. one line per record. I have not yet found a way to configure a different delimiter.
  • For the details of how textFile partitions its input, see https://blog.csdn.net/u014756380/article/details/78727386
  • path can also be a directory or a glob pattern, which reads a whole batch of files into a single RDD (see the sketch after this list). If the partition count is not set manually, partitioning follows the files: each file gets at least one partition, and a file that exceeds the split-size threshold gets more.
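
A hedged sketch of the cases above; the paths and hostnames are hypothetical:

// Single file on HDFS; Spark picks the partition count (hypothetical path).
val hdfsLines = sc.textFile("hdfs://namenode:8020/data/access.log")

// Local file, readable at this path on every node, with at least 4 partitions.
val localLines = sc.textFile("file:///tmp/access.log", 4)

// Directory / glob pattern: one RDD over many files.
val batchLines = sc.textFile("hdfs://namenode:8020/logs/*.log")

println(hdfsLines.partitions.length + " / " + batchLines.partitions.length)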

2.2 wholeTextFiles

Very similar to textFile, but the input is a set of files (a glob pattern or a directory).

/**
   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
   * key-value pair, where the key is the path of each file, the value is the content of each file.
   *
   * <p> For example, if you have the following files:
   * {{{
   *   hdfs://a-hdfs-path/part-00000
   *   hdfs://a-hdfs-path/part-00001
   *   ...
   *   hdfs://a-hdfs-path/part-nnnnn
   * }}}
   *
   * Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
   *
   * <p> then `rdd` contains
   * {{{
   *   (a-hdfs-path/part-00000, its content)
   *   (a-hdfs-path/part-00001, its content)
   *   ...
   *   (a-hdfs-path/part-nnnnn, its content)
   * }}}
   *
   * @note Small files are preferred, large file is also allowable, but may cause bad performance.
   * @note On some filesystems, `.../path/&#42;` can be a more efficient way to read all files
   *       in a directory rather than `.../path/` or `.../path`
   * @note Partitioning is determined by data locality. This may result in too few partitions
   *       by default.
   *
   * @param path Directory to the input data files, the path can be comma separated paths as the
   *             list of inputs.
   * @param minPartitions A suggestion value of the minimal splitting number for input data.
   * @return RDD representing tuples of file path and the corresponding file content
   */
  def wholeTextFiles(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[(String, String)]

Notes:

  • The return type is RDD[(String, String)]: the first element of each tuple is the file path, the second is the full content of that file.
  • Each file becomes a single record in the RDD.
  • Partitioning is not tied to the number of files; per the source comment, it is determined by data locality.
  • Best suited for a large number of small files; large files are allowed but can hurt performance (a short sketch follows this list).
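
A small sketch; the directory path is hypothetical:

// One (path, content) pair per file (hypothetical directory).
val files = sc.wholeTextFiles("hdfs://namenode:8020/data/small-files")

// For example, count the lines of each file from its whole-file content.
val linesPerFile = files.mapValues(content => content.split("\n").length)
linesPerFile.collect().foreach { case (path, n) => println(s"$path -> $n lines") }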

2.3 binaryFiles

The same as wholeTextFiles, except that the return type is RDD[(String, PortableDataStream)].

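A minimal sketch of how the PortableDataStream values might be consumed; the directory path is hypothetical:

val bins = sc.binaryFiles("hdfs://namenode:8020/data/blobs")  // hypothetical path

// PortableDataStream is lazy; toArray() reads the file's bytes on the executor.
val sizes = bins.map { case (path, stream) => (path, stream.toArray().length) }
sizes.collect().foreach { case (path, n) => println(s"$path: $n bytes") }
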
2.4 binaryRecords

Reads a flat binary file whose records all have the same fixed length.

  /**
   * Load data from a flat binary file, assuming the length of each record is constant.
   *
   * @note We ensure that the byte array for each record in the resulting RDD
   * has the provided record length.
   *
   * @param path Directory to the input data files, the path can be comma separated paths as the
   *             list of inputs.
   * @param recordLength The length at which to split the records
   * @param conf Configuration for setting up the dataset.
   *
   * @return An RDD of data with values, represented as byte arrays
   */
  def binaryRecords(
      path: String,
      recordLength: Int,
      conf: Configuration = hadoopConfiguration): RDD[Array[Byte]]

Note: every element of the RDD, i.e. every byte array, has the same size, namely the second parameter recordLength.
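
A sketch assuming fixed-length 16-byte records; the path and record layout are hypothetical:

import java.nio.ByteBuffer

// Each element is an Array[Byte] of exactly 16 bytes (the recordLength argument).
val records = sc.binaryRecords("hdfs://namenode:8020/data/fixed-width.bin", 16)

// For example, decode the first 8 bytes of every record as a big-endian Long.
val ids = records.map(bytes => ByteBuffer.wrap(bytes).getLong)
println(ids.take(3).mkString(", "))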

2.5 hadoopRDD

2.6 hadoopFile

2.7 newAPIHadoopFile

2.8 newAPIHadoopRDD

2.9 sequenceFile

2.10 objectFile

3. Creating an RDD from another RDD

Existing RDDs produce new RDDs through the various transformation operators.
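
For example, each transformation below returns a new RDD and leaves its parent untouched:

val nums = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)        // new RDD via filter
val squares = evens.map(n => n * n)        // new RDD via map
println(squares.collect().mkString(", "))  // 4, 16, 36, 64, 100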

4. Creating an empty RDD

sc.emptyRDD()
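
emptyRDD takes only a type parameter and yields an RDD with no partitions; contrast that with parallelize(Seq()), which, as the @note in section 1.1 says, gives an RDD whose partitions are all empty. A short sketch:

import org.apache.spark.rdd.RDD

val empty: RDD[Int] = sc.emptyRDD[Int]
println(empty.partitions.length)                       // 0
println(sc.parallelize(Seq[Int]()).partitions.length)  // defaultParallelism, all partitions empty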