Spark Core: Two Ways to Create RDDs and Some Notes

1 Creating an RDD with parallelize

Official docs: Resilient Distributed Datasets (RDDs)
http://spark.apache.org/docs/2.4.2/rdd-programming-guide.html#resilient-distributed-datasets-rdds
There are two ways to create RDDs:

  • parallelizing an existing collection in your driver program
    Parallelize an existing collection on the driver side. This approach is typically used for development and testing, where the amount of data is very small.

  • referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
    This approach is the one used in real production environments.

Open spark-shell and follow the execution in the Web UI:

[hadoop@vm01 bin]$ ./spark-shell --master local[2]
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> distData.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5)
Web UI: http://vm01:4040
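Once created, the distributed dataset can be operated on in parallel. As a minimal sketch in the same session (summing Array(1, 2, 3, 4, 5) should return 15):

scala> distData.reduce((a, b) => a + b)   // aggregate the elements across partitions; should return 15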

2 The Relationship Between Partitions and Tasks

  • Spark will run one task for each partition of the cluster.
    Spark runs one task per partition, so one partition corresponds to one task.

  • Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster.
    Note: a large number of tasks produces a correspondingly large number of output files. If the data is small but the partition count is high, you end up with many small files, which consumes a lot of memory; this is a common tuning point. A sketch of setting the default parallelism explicitly follows this list.
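The automatic default can be overridden when launching the shell or application via spark.default.parallelism. A minimal sketch; the value 4 is purely illustrative:

[hadoop@vm01 bin]$ ./spark-shell --master local[2] --conf spark.default.parallelism=4
scala> sc.defaultParallelism                       // should report the configured value, 4
scala> sc.parallelize(1 to 10).getNumPartitions    // parallelize without an explicit count picks up this default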

scala> val distData = sc.parallelize(data,5)  // specify 5 partitions here
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:26

scala> distData.collect
res2: Array[Int] = Array(1, 2, 3, 4, 5)

In the Web UI you can see that 5 tasks were generated.
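The partition count can also be checked directly in the shell instead of the UI, assuming the distData created above with 5 partitions:

scala> distData.getNumPartitions    // should report 5
scala> distData.partitions.length   // equivalent check via the partitions array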

3 Creating an RDD from External Datasets

In production this is essentially the only approach used.

  • Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

Reading a file from the local filesystem with textFile:

scala> val distFile = sc.textFile("/home/hadoop/data/test.txt")
distFile: org.apache.spark.rdd.RDD[String] = /home/hadoop/data/test.txt MapPartitionsRDD[3] at textFile at <console>:24

scala> distFile.collect
res3: Array[String] = Array(hello spark, hello mr, hello yarn, hello hive, hello spark)

scala> distFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
res4: Array[(String, Int)] = Array((hive,1), (hello,5), (yarn,1), (spark,2), (mr,1))

Or, with an explicit file:// scheme:

scala> val distFile = sc.textFile("file:///home/hadoop/data/test.txt")
distFile: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/test.txt MapPartitionsRDD[8] at textFile at <console>:24

scala> distFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).foreach(println)
(hive,1)
(hello,5)
(yarn,1)
(spark,2)
(mr,1)

Reading a file from HDFS with textFile:

[hadoop@vm01 data]$ hdfs dfs -copyFromLocal /home/hadoop/data/test.txt  /
[hadoop@vm01 data]$ hdfs dfs -ls /
-rw-r--r--   3 hadoop supergroup         55 2019-07-31 06:56 /test.txt
scala> val distFile = sc.textFile("hdfs://192.168.137.130:9000/test.txt")
distFile: org.apache.spark.rdd.RDD[String] = hdfs://192.168.137.130:9000/test.txt MapPartitionsRDD[32] at textFile at <console>:24

scala> distFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).foreach(println)
(spark,2)
(mr,1)
(hive,1)
(hello,5)
(yarn,1)
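A small follow-on sketch: the same word count collected in descending order of count, which should put (hello,5) first:

scala> distFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(-_._2).collect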

4 Some Notes on Reading Files with Spark


  • If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

  • All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

  • The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
    The optional second argument corresponds to the minPartitions parameter. A short sketch of the wildcard and minPartitions behaviour follows this list.
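To illustrate the wildcard and minPartitions points with the test.txt already uploaded to HDFS above (the partition count 4 is arbitrary):

scala> val wild = sc.textFile("hdfs://192.168.137.130:9000/*.txt")         // wildcard: matches the test.txt uploaded earlier
scala> val split = sc.textFile("hdfs://192.168.137.130:9000/test.txt", 4)  // minPartitions: ask for at least 4 partitions
scala> split.getNumPartitions                                              // typically reports at least 4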

[hadoop@vm01 data]$ hdfs dfs -mkdir /data
[hadoop@vm01 data]$ hdfs dfs -copyFromLocal test.txt  /data/test1.txt
[hadoop@vm01 data]$ hdfs dfs -copyFromLocal test.txt  /data/test2.txt
[hadoop@vm01 data]$ hdfs dfs -copyFromLocal test.txt  /data/
[hadoop@vm01 data]$ hdfs dfs -ls /data
-rw-r--r--   3 hadoop supergroup         55 2019-07-31 07:28 /data/test.txt
-rw-r--r--   3 hadoop supergroup         55 2019-07-31 07:28 /data/test1.txt
-rw-r--r--   3 hadoop supergroup         55 2019-07-31 07:28 /data/test2.txt
scala> val distFile = sc.textFile("hdfs://192.168.137.130:9000/data/")
distFile: org.apache.spark.rdd.RDD[String] = hdfs://192.168.137.130:9000/data/ MapPartitionsRDD[37] at textFile at <console>:24

scala> distFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).foreach(println)
(spark,6)
(hive,3)
(mr,3)
(hello,15)
(yarn,3)

scala> distFile.collect
res15: Array[String] = Array(hello spark, hello mr, hello yarn, hello hive, hello spark, hello spark, hello mr, hello yarn, hello hive, hello spark, hello spark, hello mr, hello yarn, hello hive, hello spark)

5 Other Supported Data Formats

Apart from text files, Spark’s Scala API also supports several other data formats:

  • SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file. Partitioning is determined by data locality which, in some cases, may result in too few partitions. For those cases, wholeTextFiles provides an optional second argument for controlling the minimal number of partitions.

  • For SequenceFiles, use SparkContext’s sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop’s Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.

  • For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the “new” MapReduce API (org.apache.hadoop.mapreduce).

  • RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
    A short sketch of the SequenceFile and object-file round trips appears after this list.
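A minimal sketch of both round trips; the HDFS output paths /out_seq and /out_obj are hypothetical and must not already exist:

scala> val pairs = sc.parallelize(Seq((1, "spark"), (2, "hadoop")))
scala> pairs.saveAsSequenceFile("hdfs://192.168.137.130:9000/out_seq")             // write as a Hadoop SequenceFile
scala> sc.sequenceFile[Int, String]("hdfs://192.168.137.130:9000/out_seq").collect
scala> pairs.saveAsObjectFile("hdfs://192.168.137.130:9000/out_obj")               // write as serialized Java objects
scala> sc.objectFile[(Int, String)]("hdfs://192.168.137.130:9000/out_obj").collect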

scala> val distFile = sc.wholeTextFiles("hdfs://192.168.137.130:9000/data/")
distFile: org.apache.spark.rdd.RDD[(String, String)] = hdfs://192.168.137.130:9000/data/ MapPartitionsRDD[45] at wholeTextFiles at <console>:24

scala> distFile.collect
res17: Array[(String, String)] =
Array((hdfs://192.168.137.130:9000/data/test.txt,"hello spark
hello mr
hello yarn
hello hive
hello spark
"), (hdfs://192.168.137.130:9000/data/test1.txt,"hello spark
hello mr
hello yarn
hello hive
hello spark
"), (hdfs://192.168.137.130:9000/data/test2.txt,"hello spark
hello mr
hello yarn
hello hive
hello spark
"))

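Because wholeTextFiles returns (filename, content) pairs, per-file processing is straightforward. A sketch of a per-file word count against the same /data directory (each five-line test file should yield 10):

scala> distFile.mapValues(content => content.split("\\s+").length).collect   // (filename, word count) per file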
The contents that were read can also be saved to another location:

scala> val distFile = sc.textFile("hdfs://192.168.137.130:9000/test.txt")
scala> distFile.saveAsTextFile("hdfs://192.168.137.130:9000/out/")
[hadoop@vm01 data]$ hdfs dfs -ls /out
Found 3 items
-rw-r--r--   3 hadoop supergroup          0 2019-07-31 07:49 /out/_SUCCESS
-rw-r--r--   3 hadoop supergroup         32 2019-07-31 07:49 /out/part-00000
-rw-r--r--   3 hadoop supergroup         23 2019-07-31 07:49 /out/part-00001

[hadoop@vm01 data]$ hdfs dfs -text /out/*
hello spark
hello mr
hello yarn
hello hive
hello spark
[hadoop@vm01 data]$ 
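saveAsTextFile writes one part file per partition, which ties back to the small-files note in section 2. When a single output file is acceptable, the RDD can be coalesced before saving. A minimal sketch; the /out_single path is hypothetical:

scala> distFile.coalesce(1).saveAsTextFile("hdfs://192.168.137.130:9000/out_single")   // one partition, so a single part-00000 file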