Now that we have a basic idea of what an RDD is, how do we actually create one?
Creating RDDs
1. Creating an RDD from a collection
1.1 parallelize
Source:
/** Distribute a local Scala collection to form an RDD.
*
* @note Parallelize acts lazily. If `seq` is a mutable collection and is altered after the call
* to parallelize and before the first action on the RDD, the resultant RDD will reflect the
* modified collection. Pass a copy of the argument to avoid this.
* @note avoid using `parallelize(Seq())` to create an empty `RDD`. Consider `emptyRDD` for an
* RDD with no partitions, or `parallelize(Seq[T]())` for an RDD of `T` with empty partitions.
* @param seq Scala collection to distribute
* @param numSlices number of partitions to divide the collection into
* @return RDD representing distributed collection
*/
def parallelize[T: ClassTag](
seq: Seq[T],
numSlices: Int = defaultParallelism): RDD[T]
Mainly used for testing. The second parameter sets the number of partitions and can be omitted.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Note: if seq is a mutable collection and is modified after the call to parallelize but before the first action on the RDD, the resulting RDD will reflect the modified collection. Pass a copy of the argument to avoid this.
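A minimal sketch of that caveat, assuming the local SparkContext sc from the examples above:
import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer(1, 2, 3)
val rdd = sc.parallelize(buf)          // lazy: the data is not captured yet
buf += 4                               // mutate before the first action
println(rdd.collect().mkString(", "))  // typically prints 1, 2, 3, 4; pass a copy (e.g. buf.clone()) to avoid this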
1.2 range
Source:
/**
* Creates a new RDD[Long] containing elements from `start` to `end`(exclusive), increased by
* `step` every element.
*
* @note if we need to cache this RDD, we should make sure each partition does not exceed limit.
*
* @param start the start value.
* @param end the end value.
* @param step the incremental step
* @param numSlices number of partitions to divide the collection into
* @return RDD representing distributed range
*/
def range(
start: Long,
end: Long,
step: Long = 1,
numSlices: Int = defaultParallelism): RDD[Long]
Creates a new RDD[Long] containing the elements from start to end (end excluded), with a step of step (default 1).
val rangeRdd = sc.range(1, 1000, 2)
Note: if we need to cache this RDD, we should make sure each partition stays within the limit; a single partition cannot exceed Integer.MAX_VALUE bytes, roughly 2 GB.
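A short sketch (the numbers are just illustrative) showing the exclusive end, the step, and an explicit partition count:
val r = sc.range(0, 10, step = 3, numSlices = 2)
println(r.collect().mkString(", "))   // 0, 3, 6, 9 -- end (10) is excluded
println(r.partitions.length)          // 2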
1.3 makeRDD
makeRDD has two overloads. The first takes two parameters and is identical to the parallelize method; internally it simply calls parallelize:
/** Distribute a local Scala collection to form an RDD.
*
* This method is identical to `parallelize`.
*/
def makeRDD[T: ClassTag](
seq: Seq[T],
numSlices: Int = defaultParallelism): RDD[T] = withScope {
parallelize(seq, numSlices)
}
The second overload takes a single parameter:
/** Distribute a local Scala collection to form an RDD, with one or more
* location preferences (hostnames of Spark nodes) for each object.
* Create a new partition for each collection item. */
def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
assertNotStopped()
val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
new ParallelCollectionRDD[T](this, seq.map(_._1), seq.size, indexToPrefs)
}
Here the number of partitions is determined by the size of seq. An example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName("testMakeRDD").setMaster("local[1]")
val sc = SparkContext.getOrCreate(conf)
// The partition count follows the size of the seq that is passed in; here it is 2
val rdd_with_loca_ref: RDD[Int] = sc.makeRDD(List(
  (1, List("nick", "song", "0723")),
  (2, List("nick", "song"))
))
val rdd: RDD[String] = sc.makeRDD(List("aaa", "bbb", "ccc"), 2)
println("partitions of rdd: " + rdd.partitions.length + ", partitions of rdd_with_loca_ref: " + rdd_with_loca_ref.partitions.length)
// The location preferences of each partition can be read back individually
println("preferred locations of rdd_with_loca_ref partition 0: " + rdd_with_loca_ref.preferredLocations(rdd_with_loca_ref.partitions(0)).mkString(", "))
println("preferred locations of rdd_with_loca_ref partition 1: " + rdd_with_loca_ref.preferredLocations(rdd_with_loca_ref.partitions(1)).mkString(", "))
// rdd also has two partitions, but no location preferences were attached to them
println("preferred locations of rdd partition 1: " + rdd.preferredLocations(rdd.partitions(1)).mkString(", "))
rdd_with_loca_ref.foreach(println)
Note:
This overload of makeRDD does not let you choose the number of partitions yourself; it is fixed to the size of the seq argument.
rdd_with_loca_ref has type RDD[Int]. It has two partitions, each holding exactly one record, so the foreach above only prints 1 and 2. Where did the two Lists go? They are the location preferences, and are retrieved through the preferredLocations method.
2. Creating an RDD from external storage
2.1 textFile
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*/
def textFile(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[String]
Note:
- The path can be a file on the cluster (hdfs://…) or a local file (file:///…); inside a project, a relative path is resolved against the project root. For a local file, every node must be able to reach the file through the same path, and watch out for file permissions.
- The file is split into records on CRLF or LF, i.e. one line per record. I have not yet found a way to configure a different delimiter.
- For the details of how textFile determines partitions, see https://blog.csdn.net/u014756380/article/details/78727386
- The path can also be a directory or a glob pattern, which reads a batch of files into a single RDD. If the partition count is not set manually, partitioning follows the number of files: each file gets at least one partition, and files larger than the split threshold get more.
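A minimal usage sketch (the paths below are hypothetical):
// a single local file; every node must be able to see this path
val lines = sc.textFile("file:///tmp/input.txt")
// an HDFS glob with a suggested minimum of 4 partitions
val logs = sc.textFile("hdfs://namenode:8020/logs/*.log", minPartitions = 4)
println(s"${lines.count()} lines in the local file, ${logs.partitions.length} partitions for the logs")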
2.2 wholeTextFiles
Similar to textFile, but the input is a set of files (a glob pattern or a directory).
/**
* Read a directory of text files from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI. Each file is read as a single record and returned in a
* key-value pair, where the key is the path of each file, the value is the content of each file.
*
* <p> For example, if you have the following files:
* {{{
* hdfs://a-hdfs-path/part-00000
* hdfs://a-hdfs-path/part-00001
* ...
* hdfs://a-hdfs-path/part-nnnnn
* }}}
*
* Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
*
* <p> then `rdd` contains
* {{{
* (a-hdfs-path/part-00000, its content)
* (a-hdfs-path/part-00001, its content)
* ...
* (a-hdfs-path/part-nnnnn, its content)
* }}}
*
* @note Small files are preferred, large file is also allowable, but may cause bad performance.
* @note On some filesystems, `.../path/*` can be a more efficient way to read all files
* in a directory rather than `.../path/` or `.../path`
* @note Partitioning is determined by data locality. This may result in too few partitions
* by default.
*
* @param path Directory to the input data files, the path can be comma separated paths as the
* list of inputs.
* @param minPartitions A suggestion value of the minimal splitting number for input data.
* @return RDD representing tuples of file path and the corresponding file content
*/
def wholeTextFiles(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
Note:
- The return type is RDD[(String, String)]; the first element of each tuple is the file path, the second is the entire content of that file.
- Each file becomes exactly one record in the RDD.
- The partitioning is not tied to the number of files (per the note above, it is determined by data locality).
- It is meant for large numbers of small files; large files are allowed but may hurt performance.
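A small sketch (hypothetical path) that lists each file together with its content length:
val files = sc.wholeTextFiles("hdfs://namenode:8020/data/small-files/")
files.map { case (path, content) => (path, content.length) }
  .collect()
  .foreach(println)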
2.3 binaryFiles
The same as wholeTextFiles, except that the return type is RDD[(String, PortableDataStream)].
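A brief sketch (hypothetical path); the PortableDataStream is lazy, and toArray() pulls the whole file into a byte array:
val bins = sc.binaryFiles("hdfs://namenode:8020/data/images/")
bins.map { case (path, stream) => (path, stream.toArray().length) }
  .collect()
  .foreach(println)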
2.4 binaryRecords
Reads a flat binary file whose records all have the same fixed length.
/**
* Load data from a flat binary file, assuming the length of each record is constant.
*
* @note We ensure that the byte array for each record in the resulting RDD
* has the provided record length.
*
* @param path Directory to the input data files, the path can be comma separated paths as the
* list of inputs.
* @param recordLength The length at which to split the records
* @param conf Configuration for setting up the dataset.
*
* @return An RDD of data with values, represented as byte arrays
*/
def binaryRecords(
path: String,
recordLength: Int,
conf: Configuration = hadoopConfiguration): RDD[Array[Byte]]
Note: every element of the RDD, i.e. every byte array, has the same size, namely the second parameter recordLength.
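For instance, with 8-byte records (hypothetical path):
// each element is an Array[Byte] of exactly recordLength bytes
val records = sc.binaryRecords("hdfs://namenode:8020/data/fixed.bin", recordLength = 8)
println(records.first().length)   // 8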
2.5 hadoopRDD
2.6 hadoopFile
2.7 newAPIHadoopFile
2.8 newAPIHadoopRDD
2.9 sequenceFile
2.10 objectFile
3. Creating an RDD from another RDD
Applying any transformation operator to an existing RDD returns a new RDD.
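For example:
val source = sc.parallelize(1 to 10)
val doubled = source.map(_ * 2)           // map returns a new RDD
val filtered = doubled.filter(_ > 10)     // so does filter; source and doubled are unchanged
println(filtered.collect().mkString(", "))   // 12, 14, 16, 18, 20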
4. Creating an empty RDD
val empty = sc.emptyRDD[Int]   // an RDD[Int] with no partitions