Spark Operations: Creation Operations

  1. Parallelized creation operations
  2. Creation operations from external storage
 

Parallelized Creation Operations

  • parallelize[T](seq: Seq[T], numSlices: Int=defaultParallelism):RDD[T]
# Parallelize the data set 1 to 10; it is split into multiple partitions based on the number of executors that can be launched, and one task is started per partition
scala> var rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.collect
res5: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd.partitions.size
res6: Int = 4

# Same as above, except that the number of partitions is specified explicitly
scala> var rdd = sc.parallelize(1 to 10, 5)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24

scala> rdd.collect
res7: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd.partitions.size
res8: Int = 5
  • makeRDD[T](seq: Seq[(T, Seq[String])]):RDD[T]
  • makeRDD[T](seq: Seq[T], numSlices:Int=defaultParallelism):RDD[T]
This method is similar to parallelize; the difference is that makeRDD lets you specify the preferred location(s) for each partition.
scala> var collect = Seq((1 to 10, Seq("master","slave1")),(11 to 15, Seq("slave2","slave3")))
collect: Seq[(scala.collection.immutable.Range.Inclusive, Seq[String])] = List((Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),List(master, slave1)), (Range(11, 12, 13, 14, 15),List(slave2, slave3)))

scala> var rdd = sc.makeRDD(collect)
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[5] at makeRDD at <console>:26

scala> rdd.partitions.size
res9: Int = 2

scala> rdd.preferredLocations(rdd.partitions(0))
res10: Seq[String] = List(master, slave1)

scala> rdd.preferredLocations(rdd.partitions(1))
res11: Seq[String] = List(slave2, slave3)

Creation Operations from External Storage

Spark can turn any storage resource supported by Hadoop into an RDD, such as local files, HDFS files, Cassandra, HBase, Amazon S3, and so on. Text files, SequenceFiles, and any other Hadoop InputFormat are supported.
  • textFile(path:String, minPartitions:Int=defaultMinPartitions):RDD[String]
textFile accepts a second parameter to request a larger number of partitions, but it cannot be smaller than the number of HDFS blocks; by default one block corresponds to one partition (a sketch with an explicit partition count follows the example below).
scala> var rdd = sc.textFile("/Users/lyf/Desktop/data.txt")
rdd: org.apache.spark.rdd.RDD[String] = /Users/lyf/Desktop/data.txt MapPartitionsRDD[7] at textFile at <console>:24

scala> rdd.collect
res12: Array[String] = Array(Hello World, Hello Tom, Hello Jerry)

scala> rdd.count
res13: Long = 3
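A minimal sketch reusing the same file: passing an explicit second argument requests more partitions (the number actually obtained is determined by the input splits, never fewer than the number of blocks).
# Same file as above, but requesting at least 4 partitions
scala> var rdd = sc.textFile("/Users/lyf/Desktop/data.txt", 4)

scala> rdd.partitions.size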
  • wholeTextFiles(path:String, minPartitions:Int=defaultMinPartitions):RDD[(String,String)]
Reads the small files in a directory and returns (file path, file content) pairs
scala> var rdd = sc.wholeTextFiles("/Users/lyf/Desktop/test")
rdd: org.apache.spark.rdd.RDD[(String, String)] = /Users/lyf/Desktop/test MapPartitionsRDD[9] at wholeTextFiles at <console>:24

scala> rdd.collect
res14: Array[(String, String)] =
Array((file:/Users/lyf/Desktop/test/data1.txt,"Hello World
Hello Tom
Hello Jerry
"), (file:/Users/lyf/Desktop/test/data2.txt,"This is a spark test
Hello World
"))
  • sequenceFile[K,V](path:String, minPartitions:Int=defaultMinPartitions):RDD[(K,V)]
  • sequenceFile[K,V](path:String, keyClass:Class[K], valueClass:Class[V]):RDD[(K,V)]
  • sequenceFile[K,V](path:String, keyClass:Class[K], valueClass:Class[V], minPartitions:Int):RDD[(K,V)]
The sequenceFile[K,V]() operation converts a SequenceFile into an RDD.
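A minimal sketch (the path /Users/lyf/Desktop/seq is illustrative): a pair RDD is first written out with saveAsSequenceFile, then read back as an RDD with sequenceFile.
# Write a (String, Int) pair RDD as a SequenceFile, then read it back
scala> var pairs = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)))

scala> pairs.saveAsSequenceFile("/Users/lyf/Desktop/seq")

scala> var rdd = sc.sequenceFile[String, Int]("/Users/lyf/Desktop/seq")

scala> rdd.collect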
 
  • hadoopFile[K,V,F<:InputFormat[K,V]](path:String):RDD[(K,V)]
  • hadoopFile[K,V,F<:InputFormat[K,V]](path:String, minPartitions:Int):RDD[(K,V)]
  • hadoopFile[K,V](path:String, inputFormatClass:Class[_<:InputFormat[K,V]], keyClass:Class[K], valueClass:Class[V], minPartitions:Int=defaultMinPartitions):RDD[(K,V)]
  • newAPIHadoopFile[K,V,F<:InputFormat[K,V]](path:String, fClass:Class[F],kClass:Class[K], vClass:Class[V], conf:Configuration=hadoopConfiguration):RDD[(K,V)]
  • newAPIHadoopFile[K,V,F<:InputFormat[K,V]](path:String)(implicit km:ClassTag[K], vm:ClassTag[V], fm:ClassTag[F]):RDD[(K,V)]
  • hadoopRDD[K,V](conf:JobConf, inputFormatClass:Class[_ <:InputFormat[K,V]], keyClass:Class[K], valueClass:Class[V], minPartitions:Int=defaultMinPartitions):RDD[(K,V)]
  • newAPIHadoopRDD[K,V,F<:InputFormat[K,V]](conf:Configuration=hadoopConfiguration, fClass:Class[F], kClass:Class[K], vClass:Class[V]):RDD[(K,V)]
The hadoopRDD (and newAPIHadoopRDD) operations can convert any other Hadoop input type into an RDD for use.
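As a minimal sketch (reusing the illustrative text file from above), the same data can be read through the new Hadoop API with TextInputFormat; the keys are byte offsets and the values are the lines.
# Read a plain text file via the new Hadoop API, then drop the offset keys
scala> import org.apache.hadoop.io.{LongWritable, Text}

scala> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

scala> var rdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("/Users/lyf/Desktop/data.txt")

scala> rdd.map(pair => pair._2.toString).collect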
 
 
 
 
 


 