Revisiting Spark, Day 05: Creating RDDs

We now have a rough idea of what an RDD is, so how do we actually create one?

1. Creating an RDD from a collection

1.1 parallelize

Source:

  /** Distribute a local Scala collection to form an RDD.
   *
   * @note Parallelize acts lazily. If `seq` is a mutable collection and is altered after the call
   * to parallelize and before the first action on the RDD, the resultant RDD will reflect the
   * modified collection. Pass a copy of the argument to avoid this.
   * @note avoid using `parallelize(Seq())` to create an empty `RDD`. Consider `emptyRDD` for an
   * RDD with no partitions, or `parallelize(Seq[T]())` for an RDD of `T` with empty partitions.
   * @param seq Scala collection to distribute
   * @param numSlices number of partitions to divide the collection into
   * @return RDD representing distributed collection
   */
  def parallelize[T: ClassTag](
      seq: Seq[T],
      numSlices: Int = defaultParallelism): RDD[T]

parallelize is mainly used for testing. The second parameter sets the number of partitions and can be omitted.

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

Note: if seq is a mutable collection and it is modified after the call to parallelize but before the first action on the RDD, the resulting RDD will reflect the modified collection.
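
A minimal sketch of both points (an explicit partition count and the mutable-collection caveat); the variable names are only illustrative:

import scala.collection.mutable.ArrayBuffer

// Explicit partition count: 4 slices instead of defaultParallelism.
val fourSlices = sc.parallelize(1 to 100, 4)
println(fourSlices.partitions.length)      // 4

// Mutable-collection caveat: the mutation happens before the first action,
// so, per the @note above, the RDD reflects it.
val buf = ArrayBuffer(1, 2, 3)
val fromBuf = sc.parallelize(buf)
buf += 4
println(fromBuf.collect().mkString(", "))  // 1, 2, 3, 4

// Passing a copy decouples the RDD from later mutations.
val safe = sc.parallelize(buf.toList)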

1.2 range

Source:

  /**
   * Creates a new RDD[Long] containing elements from `start` to `end`(exclusive), increased by
   * `step` every element.
   *
   * @note if we need to cache this RDD, we should make sure each partition does not exceed limit.
   *
   * @param start the start value.
   * @param end the end value.
   * @param step the incremental step
   * @param numSlices number of partitions to divide the collection into
   * @return RDD representing distributed range
   */
  def range(
      start: Long,
      end: Long,
      step: Long = 1,
      numSlices: Int = defaultParallelism): RDD[Long]

Creates a new RDD[Long] containing the elements from start to end (end excluded), with a step of step, which defaults to 1.

val rangeRdd = sc.range(1, 1000, 2)

Note: if we need to cache this RDD, we should make sure each partition stays within the limit: a single partition cannot exceed Integer.MAX_VALUE bytes, which is roughly 2 GB.
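
A quick sketch checking the step, the element count, and an explicit partition count (the names are only illustrative):

val oddRdd = sc.range(1, 1000, 2, numSlices = 4)
println(oddRdd.partitions.length)        // 4
println(oddRdd.take(5).mkString(", "))   // 1, 3, 5, 7, 9
println(oddRdd.count())                  // 500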

1.3 makeRDD

makeRDD has two overloads. The first takes two parameters and is identical to parallelize; internally it simply delegates to parallelize.

/** Distribute a local Scala collection to form an RDD.
   *
   * This method is identical to `parallelize`.
   */
  def makeRDD[T: ClassTag](
      seq: Seq[T],
      numSlices: Int = defaultParallelism): RDD[T] = withScope {
    parallelize(seq, numSlices)
  }

The second overload takes a single parameter:

  /** Distribute a local Scala collection to form an RDD, with one or more
    * location preferences (hostnames of Spark nodes) for each object.
    * Create a new partition for each collection item. */
  def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
    assertNotStopped()
    val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
    new ParallelCollectionRDD[T](this, seq.map(_._1), seq.size, indexToPrefs)
  }

Here the number of partitions is determined by the size of seq. An example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName("testMakeRDD").setMaster("local[1]")
val sc = SparkContext.getOrCreate(conf)

// The partition count follows the size of the seq that is passed in: 2 here.
val rdd_with_loca_ref: RDD[Int] = sc.makeRDD(List(
  (1, List("nick", "song", "0723")),
  (2, List("nick", "song"))
))

val rdd: RDD[String] = sc.makeRDD(List("aaa", "bbb", "ccc"), 2)

println("partitions of rdd: " + rdd.partitions.length,
  "partitions of rdd_with_loca_ref: " + rdd_with_loca_ref.partitions.length)

// preferredLocations lets us look at a single partition's location preferences.
println("first partition of rdd_with_loca_ref",
  rdd_with_loca_ref.preferredLocations(rdd_with_loca_ref.partitions(0)).mkString(", "))
println("second partition of rdd_with_loca_ref",
  rdd_with_loca_ref.preferredLocations(rdd_with_loca_ref.partitions(1)).mkString(", "))
// rdd also has two partitions, but it carries no location preferences.
println("second partition of rdd",
  rdd.preferredLocations(rdd.partitions(1)).mkString(", "))

rdd_with_loca_ref.foreach(println)

Notes:

This makeRDD overload does not let you choose the number of partitions; it is fixed to the size of the seq argument.

rdd_with_loca_ref has type RDD[Int]. The RDD has two partitions, each holding a single record, so the foreach at the end prints only 1 and 2. Where did the two Lists go? They are the per-partition location preferences, and they are retrieved through the preferredLocations method, as the println calls above show.

2. Creating an RDD from external storage

2.1 textFile

  /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   */
  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String]

Notes:

  • path can be a file on the cluster (hdfs://…) or a local file (file:///…); inside a project, a relative path is resolved against the project root. For a local file, every node must be able to reach the file under that path, and file permissions need to be checked as well.
  • The file is split into records on CRLF or LF, i.e. one line per record. I have not yet found a way to configure a different delimiter.
  • For the details of how textFile partitions its input, see https://blog.csdn.net/u014756380/article/details/78727386
  • path can also be a directory or a glob pattern, which reads a whole batch of files into a single RDD (see the sketch after this list). If the partition count is not set manually, partitioning follows the files: each file gets at least one partition, and a file that exceeds the split-size threshold gets more.
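
A hedged sketch of the cases above; the paths and hostnames are hypothetical:

// Single file on HDFS; Spark picks the partition count (hypothetical path).
val hdfsLines = sc.textFile("hdfs://namenode:8020/data/access.log")

// Local file, readable at this path on every node, with at least 4 partitions.
val localLines = sc.textFile("file:///tmp/access.log", 4)

// Directory / glob pattern: one RDD over many files.
val batchLines = sc.textFile("hdfs://namenode:8020/logs/*.log")

println(hdfsLines.partitions.length + " / " + batchLines.partitions.length)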

2.2 wholeTextFiles

Very similar to textFile, but the input is a set of files (a glob pattern or a directory).

/**
   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
   * key-value pair, where the key is the path of each file, the value is the content of each file.
   *
   * <p> For example, if you have the following files:
   * {{{
   *   hdfs://a-hdfs-path/part-00000
   *   hdfs://a-hdfs-path/part-00001
   *   ...
   *   hdfs://a-hdfs-path/part-nnnnn
   * }}}
   *
   * Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
   *
   * <p> then `rdd` contains
   * {{{
   *   (a-hdfs-path/part-00000, its content)
   *   (a-hdfs-path/part-00001, its content)
   *   ...
   *   (a-hdfs-path/part-nnnnn, its content)
   * }}}
   *
   * @note Small files are preferred, large file is also allowable, but may cause bad performance.
   * @note On some filesystems, `.../path/&#42;` can be a more efficient way to read all files
   *       in a directory rather than `.../path/` or `.../path`
   * @note Partitioning is determined by data locality. This may result in too few partitions
   *       by default.
   *
   * @param path Directory to the input data files, the path can be comma separated paths as the
   *             list of inputs.
   * @param minPartitions A suggestion value of the minimal splitting number for input data.
   * @return RDD representing tuples of file path and the corresponding file content
   */
  def wholeTextFiles(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[(String, String)]

Notes:

  • The return type is RDD[(String, String)]: the first element of each tuple is the file path, the second is the full content of that file.
  • Each file becomes a single record in the RDD.
  • Partitioning is not tied to the number of files; per the source comment, it is determined by data locality.
  • Best suited for a large number of small files; large files are allowed but can hurt performance (a short sketch follows this list).
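
A small sketch; the directory path is hypothetical:

// One (path, content) pair per file (hypothetical directory).
val files = sc.wholeTextFiles("hdfs://namenode:8020/data/small-files")

// For example, count the lines of each file from its whole-file content.
val linesPerFile = files.mapValues(content => content.split("\n").length)
linesPerFile.collect().foreach { case (path, n) => println(s"$path -> $n lines") }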

2.3 binaryFiles

The same as wholeTextFiles, except that the return type is RDD[(String, PortableDataStream)].

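A minimal sketch of how the PortableDataStream values might be consumed; the directory path is hypothetical:

val bins = sc.binaryFiles("hdfs://namenode:8020/data/blobs")  // hypothetical path

// PortableDataStream is lazy; toArray() reads the file's bytes on the executor.
val sizes = bins.map { case (path, stream) => (path, stream.toArray().length) }
sizes.collect().foreach { case (path, n) => println(s"$path: $n bytes") }
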
2.4 binaryRecords

Reads a flat binary file whose records all have the same fixed length.

  /**
   * Load data from a flat binary file, assuming the length of each record is constant.
   *
   * @note We ensure that the byte array for each record in the resulting RDD
   * has the provided record length.
   *
   * @param path Directory to the input data files, the path can be comma separated paths as the
   *             list of inputs.
   * @param recordLength The length at which to split the records
   * @param conf Configuration for setting up the dataset.
   *
   * @return An RDD of data with values, represented as byte arrays
   */
  def binaryRecords(
      path: String,
      recordLength: Int,
      conf: Configuration = hadoopConfiguration): RDD[Array[Byte]]

Note: every element of the RDD, i.e. every byte array, has the same size, namely the second parameter recordLength.
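
A sketch assuming fixed-length 16-byte records; the path and record layout are hypothetical:

import java.nio.ByteBuffer

// Each element is an Array[Byte] of exactly 16 bytes (the recordLength argument).
val records = sc.binaryRecords("hdfs://namenode:8020/data/fixed-width.bin", 16)

// For example, decode the first 8 bytes of every record as a big-endian Long.
val ids = records.map(bytes => ByteBuffer.wrap(bytes).getLong)
println(ids.take(3).mkString(", "))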

2.5 hadoopRDD

2.6 hadoopFile

2.7 newAPIHadoopFile

2.8 newAPIHadoopRDD

2.9 sequenceFile

2.10 objectFile

3. Creating an RDD from another RDD

Existing RDDs produce new RDDs through the various transformation operators.
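
For example, each transformation below returns a new RDD and leaves its parent untouched:

val nums = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)        // new RDD via filter
val squares = evens.map(n => n * n)        // new RDD via map
println(squares.collect().mkString(", "))  // 4, 16, 36, 64, 100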

4. Creating an empty RDD

sc.emptyRDD()
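
emptyRDD takes only a type parameter and yields an RDD with no partitions; contrast that with parallelize(Seq()), which, as the @note in section 1.1 says, gives an RDD whose partitions are all empty. A short sketch:

import org.apache.spark.rdd.RDD

val empty: RDD[Int] = sc.emptyRDD[Int]
println(empty.partitions.length)                       // 0
println(sc.parallelize(Seq[Int]()).partitions.length)  // defaultParallelism, all partitions empty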