Ruoze Data (若泽数据) Bilibili Video Series, Spark Basics 05: Creating Spark RDDs

zhikanjiani

Article link: https://blog.csdn.net/zhikanjiani/article/details/90613976

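Video link for this chapter:
Source: Ruoze Data (若泽数据)	http://www.ruozedata.com/
If you are studying this material too, feel free to contact me on QQ: 2032677340
Link: https://pan.baidu.com/s/1QYHbZsiTAK9O1S86ocDYIw
Extraction code: jqd4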

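I. Review of the previous lesson

Ruoze Data (若泽数据) Bilibili Video Series, Spark Basics 04: SparkContext and SparkConf in Detail:

https://blog.csdn.net/zhikanjiani/article/details/90602529

Review:

1、The lesson covered SparkContext, SparkConf, and some commonly used spark-shell options.

The first step in developing a Spark application is to build a SparkContext, and before that you must build a SparkConf. SparkConf is where you set application-level properties such as appName and master. Best practice is not to hard-code these values but to pass them in through spark-submit. If they are hard-coded (think of a Spark application that runs every day or every hour), how would you tell from the job name which run you are looking at? Normally the launching shell script appends the time to the name.
Every Spark application has one SparkContext. The shell script that Spark ships for interactive use is spark-shell, located under /spark/bin; spark-shell is itself a Spark application.
When spark-shell starts, it creates a SparkContext for us and exposes it under the alias sc.

spark-shell, spark-sql and spark-submit all submit through spark-submit under the hood.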
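A minimal sketch of the "no hard-coding" practice from the review, assuming Spark 2.x is on the classpath. The object name `RddApp`, the helper `run`, and the jar name in the comment are hypothetical; `--master`, `--name` and `--class` are standard spark-submit options.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical application object, for illustration only.
object RddApp {
  // Builds a SparkContext from the given conf and reports (appName, master).
  def run(conf: SparkConf): (String, String) = {
    val sc = new SparkContext(conf)
    try {
      (sc.appName, sc.master)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    // No setMaster / setAppName here: both are supplied at submit time, e.g.
    //   spark-submit --master local[2] \
    //     --name "rdd-app-$(date +%Y%m%d%H)" \
    //     --class RddApp rdd-app.jar      (jar name is an assumption)
    val (name, master) = run(new SparkConf())
    println(s"appName=$name master=$master")
  }
}
```

Because the name is built by the shell at submit time, each hourly run shows up under a distinct name without recompiling the application.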

II. Two ways to create an RDD

Definition from the official documentation:

1、There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

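There are two ways to create an RDD: (1) turn an existing collection into an RDD; (2) reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source that offers a Hadoop InputFormat.

2.1、The first way to create an RDD: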

1、Parallelized Collections:

Parallelizing an existing collection in your driver program, that is, turning a collection into an RDD.

Call SparkContext's parallelize method:

Parallelized collections are created by calling SparkContext's parallelize method.
Scaladoc for parallelize: Distribute a local Scala collection to form an RDD.

def parallelize[T: ClassTag](
      seq: Seq[T],
      numSlices: Int = defaultParallelism): RDD[T] = withScope {
    assertNotStopped()
    new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
  }
  
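  About the parallelize method:
  It takes two parameters. The first is a Seq; the second, numSlices, is an Int with a default value (defaultParallelism), so it can be omitted. In the simplest case you only need to pass the seq.

Hands-on in spark-shell: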

1、 [hadoop@hadoop002 spark-2.4.2-bin-2.6.0-cdh5.7.0]$ spark-shell --master local[2]

2、 scala> val data = Array(1,2,3,4,5) // define an array
data: Array[Int] = Array(1, 2, 3, 4, 5)

3、scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

4、 scala> distData.collect // a job is triggered only when an action is called
Exercise 1: In the web UI, the number of tasks for this job is 2. Question: where does the 2 come from?
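One way to investigate Exercise 1 is to compare `sc.defaultParallelism` with the partition counts of the resulting RDDs. The sketch below assumes Spark 2.x on the classpath; the object name `ParallelizeDemo` and the helper `partitionCounts` are made up for illustration, and the master is hard-coded only so the example is self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeDemo {
  // Returns (defaultParallelism, partitions when numSlices is omitted,
  //          partitions when numSlices = 5).
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    val data = Array(1, 2, 3, 4, 5)
    val byDefault = sc.parallelize(data)     // numSlices falls back to defaultParallelism
    val explicit  = sc.parallelize(data, 5)  // numSlices given explicitly
    (sc.defaultParallelism, byDefault.getNumPartitions, explicit.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Hard-coded master for demo purposes only; see the review section on why
    // real applications should take it from spark-submit.
    val conf = new SparkConf().setMaster("local[2]").setAppName("parallelize-demo")
    val sc = new SparkContext(conf)
    try {
      println(partitionCounts(sc))
    } finally {
      sc.stop()
    }
  }
}
```

With `--master local[2]` and no `spark.default.parallelism` set elsewhere, the first two numbers should both be 2, which matches the two tasks seen in the UI.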

Once created, the distributed dataset (distData) can be operated on in parallel.

For example, we might call distData.reduce((a, b) => a + b) to add up the elements of the array. We describe operations on distributed datasets later on.

One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster.

The number of partitions we set determines how many pieces the dataset is cut into.
The operation below still yields (1, 2, 3, 4, 5) when collected.

Spark runs one task for each partition. We set 5 partitions, so there are 5 tasks.
5 partitions = 5 tasks
In Spark, the number of tasks equals the number of partitions by default: one partition, one task.
```
val data = Array(1,2,3,4,5)
val distdata = sc.parallelize(data,5)
```
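To see how the elements are actually spread over partitions, `glom()` can be used to turn each partition into an array. This is a small sketch assuming Spark 2.x on the classpath; `GlomDemo` and `layout` are illustrative names, not part of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GlomDemo {
  // Returns the contents of each partition as a separate List.
  def layout(sc: SparkContext, numSlices: Int): List[List[Int]] =
    sc.parallelize(Array(1, 2, 3, 4, 5), numSlices)
      .glom()        // RDD[Array[Int]]: one array per partition
      .collect()
      .map(_.toList)
      .toList

  def main(args: Array[String]): Unit = {
    // Hard-coded master for demo purposes only.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("glom-demo"))
    try {
      println(layout(sc, 5)) // five partitions, one element each
      println(layout(sc, 2)) // two partitions sharing the five elements
    } finally {
      sc.stop()
    }
  }
}
```

Whatever the partition count, concatenating the partitions gives back the original five elements, which is why `collect` returns the same result either way.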


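Typically, you want 2-4 partitions for each CPU in your cluster.
Why?

A core runs one task at a time. With 2 to 4 partitions per core, a core that finishes one task can pick up the next one right away, so cores are not left idle and resources are not wasted.

Spark tries to set the number of partitions based on your cluster. When processing HDFS files, consider two cases: a file smaller than the block size (128 MB) and a file larger than the block size (128 MB). Note: the more tasks there are, the more small output files are produced, and those small files may later need to be merged.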
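The two HDFS cases can be made concrete with a little arithmetic. The sketch below is a simplified model written for this post, not Spark's or Hadoop's code: it only counts how many 128 MB blocks a file occupies. The real split calculation in Hadoop also takes minPartitions and other settings into account, so treat the result as an estimate. `SplitEstimate` and `estimatedBlocks` are hypothetical names.

```scala
object SplitEstimate {
  // HDFS default block size quoted in the text: 128 MB.
  val DefaultBlockSize: Long = 128L * 1024 * 1024

  // Number of blocks a file of the given size occupies (ceiling division).
  def estimatedBlocks(fileSizeBytes: Long, blockSize: Long = DefaultBlockSize): Long = {
    require(fileSizeBytes >= 0 && blockSize > 0, "file size must be >= 0 and block size > 0")
    if (fileSizeBytes == 0) 0L else (fileSizeBytes + blockSize - 1) / blockSize
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    println(estimatedBlocks(10 * mb))  // file smaller than one block
    println(estimatedBlocks(300 * mb)) // file larger than one block
  }
}
```

A 10 MB file fits in one block and a 300 MB file spans three, so by the one-partition-per-block default the second file would be read by roughly three tasks.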


2.2、The second way to create an RDD (about 95% of what you do at work)

External Datasets

1、Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

Spark can create RDDs directly from files on any of the storage systems above. Supported formats: text files, SequenceFiles, and other Hadoop InputFormats.

Q: Does Spark support ORC files?

2、Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc URI) and reads it as a collection of lines.

Scaladoc of SparkContext's textFile method:
Read a text file from HDFS, a local file system (available on all nodes), or any
Hadoop-supported file system URI, and return it as an RDD of Strings.

// Reads a text file, for example from HDFS. If the file is on the local file system, it must be accessible on every node. Suppose we run in standalone mode with 1 master + 100 workers and compute a word count over a local input source. With a distributed computing framework we cannot know which machine will run the work. If it runs on, say, the 50th machine and that machine does not have the file, the job fails with: file not found.

def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }

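What does it mean that a local file must be accessible on all nodes?
Say we set up standalone mode with 1 Master + 100 Workers   ==>  wc inputsource(local)
A distributed computing framework does not tell you in advance which node your job will be scheduled on. If the node that runs the job does not have the input source, it will inevitably fail with a file not found error.

Recommendation: on a cluster, read files from HDFS. Avoid testing with local files in standalone mode, because you cannot know which machine will execute the job.

Test run:

1、Reading a local file: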

1.1：scala> val distFile = sc.textFile("file:///home/hadoop/data/ruozeinput.txt")
distFile: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/ruozeinput.txt MapPartitionsRDD[1] at textFile at <console>:24

Reading the local file with sc.textFile returns an RDD whose elements are of type String.

The RDD has now been created:

Once created, the distributed dataset (distFile here, like distData earlier) can be operated on in parallel. For example, with the earlier array we might call distData.reduce((a, b) => a + b) to add up the elements of the array.

1.2：scala> distFile.collect
res2: Array[String] = Array(hello world john, hello world, hello)

1.3、Once created, distFile can be acted on by dataset operations.
For example, we can add up the sizes of all the lines using the map and reduce operations as follows:
scala> distFile.map(s => s.length).reduce((a, b) => a + b)
res5: Int = 32

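This computes the total number of characters across all lines.
collect returns an Array[String], which confirms the statement "reads it as a collection of lines": ruozeinput.txt has 3 lines, so the array has 3 elements (separated by two commas in the output).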
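The result 32 can be checked by hand with plain Scala collections, using the three lines shown in the collect output. No Spark is needed for this sketch; `LineLengths` is just an illustrative name.

```scala
object LineLengths {
  // The three lines of ruozeinput.txt as shown by distFile.collect.
  val lines: Seq[String] = Seq("hello world john", "hello world", "hello")

  // Length of each line, mirroring map(s => s.length).
  def perLine: Seq[Int] = lines.map(s => s.length)

  // Sum of the lengths, mirroring reduce((a, b) => a + b).
  def total: Int = perLine.reduce((a, b) => a + b)

  def main(args: Array[String]): Unit = {
    println(perLine) // 16, 11 and 5 characters
    println(total)   // 16 + 11 + 5 = 32
  }
}
```

Line terminators are not counted, because textFile strips them when it splits the file into lines.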

2、Reading a file from HDFS:

scala> val distfile = sc.textFile("hdfs://10.0.0.132/wordcount/input/ruozeinput.txt")
distfile: org.apache.spark.rdd.RDD[String] = hdfs://10.0.0.132/wordcount/input/ruozeinput.txt MapPartitionsRDD[6] at textFile at <console>:24

III. Notes on creating RDDs:

Some notes on reading files with Spark:

1）、If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

If the file is local, for example "file:///home/hadoop/data/*.txt", it must be accessible on every node: either copy it to all workers or use a network-mounted shared file system.
Local path: every node must have the file.

2）、All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

Test: hadoop fs -mkdir /data/
hadoop fs -put ruozeinput.txt /data/1
hadoop fs -put ruozeinput.txt /data/2

val distFile = sc.textFile("hdfs://10.0.0.132:8020/data/")
distFile.collect
Summary: the path can point to a directory as well as to a single file.

3）、textFile minPartitions
The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
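A small sketch of the optional second argument, assuming Spark 2.x on the classpath. It writes a temporary local file so that it runs without HDFS; `MinPartitionsDemo`, `sampleFile` and `partitions` are illustrative names, and the file content reuses the three lines from ruozeinput.txt.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object MinPartitionsDemo {
  // Writes a small temporary text file and returns its file:// URI.
  def sampleFile(): String = {
    val path = Files.createTempFile("ruozeinput", ".txt")
    val content = "hello world john\nhello world\nhello\n"
    Files.write(path, content.getBytes(StandardCharsets.UTF_8))
    path.toUri.toString
  }

  // Number of partitions textFile produces for the requested minimum.
  def partitions(sc: SparkContext, path: String, minPartitions: Int): Int =
    sc.textFile(path, minPartitions).getNumPartitions

  def main(args: Array[String]): Unit = {
    // Hard-coded master for demo purposes only.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("min-partitions-demo"))
    try {
      val file = sampleFile()
      println(partitions(sc, file, 1))
      println(partitions(sc, file, 4))
    } finally {
      sc.stop()
    }
  }
}
```

The value is a minimum rather than an exact count, so the second call may report more than 4 partitions; the number of lines read stays the same.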

Apart from text files, Spark’s Scala API also supports several other data formats:

SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file. Partitioning is determined by data locality which, in some cases, may result in too few partitions. For those cases, wholeTextFiles provides an optional second argument for controlling the minimal number of partitions.

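val data = sc.wholeTextFiles("hdfs://10.0.0.132:8020/data/1")
data.collect
wholeTextFiles returns key-value pairs: the first element of each pair is the file name and the second is the file content. It can also read a whole directory.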
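The difference between the two methods shows up when both read the same directory. The sketch below assumes Spark 2.x on the classpath and uses a temporary local directory with two made-up files instead of HDFS; `WholeTextFilesDemo`, `sampleDir` and `counts` are illustrative names.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesDemo {
  // Creates a temporary directory with two small text files (3 lines in total).
  def sampleDir(): String = {
    val dir = Files.createTempDirectory("wholetext-demo")
    Files.write(dir.resolve("1.txt"), "hello world\nhello\n".getBytes(StandardCharsets.UTF_8))
    Files.write(dir.resolve("2.txt"), "hello spark\n".getBytes(StandardCharsets.UTF_8))
    dir.toUri.toString
  }

  // Returns (records from textFile, records from wholeTextFiles).
  def counts(sc: SparkContext, dir: String): (Long, Long) =
    (sc.textFile(dir).count(), sc.wholeTextFiles(dir).count())

  def main(args: Array[String]): Unit = {
    // Hard-coded master for demo purposes only.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("wholetext-demo"))
    try {
      val dir = sampleDir()
      println(counts(sc, dir)) // one record per line vs. one record per file
      sc.wholeTextFiles(dir).keys.collect().foreach(println) // file names
    } finally {
      sc.stop()
    }
  }
}
```

textFile yields one record per line (3 here), while wholeTextFiles yields one (filename, content) pair per file (2 here).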
For SequenceFiles, use SparkContext’s sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop’s Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the “new” MapReduce API (org.apache.hadoop.mapreduce).
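As a starting point for working with SequenceFiles, here is a hedged round-trip sketch: it writes a tiny pair RDD with saveAsSequenceFile and reads it back with sequenceFile[Int, String], relying on the native-type support mentioned above. It assumes Spark 2.x on the classpath and writes to a temporary local directory; `SequenceFileDemo`, `roundTrip` and the sample pairs are made up for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object SequenceFileDemo {
  // Writes two (Int, String) pairs as a SequenceFile and reads them back.
  def roundTrip(sc: SparkContext, dir: String): List[(Int, String)] = {
    sc.parallelize(Seq((1, "hello"), (2, "world")), 1)
      .saveAsSequenceFile(dir)            // stored as IntWritable / Text
    sc.sequenceFile[Int, String](dir)     // converted back to Int / String
      .collect()
      .sortBy(_._1)
      .toList
  }

  def main(args: Array[String]): Unit = {
    // Hard-coded master for demo purposes only.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sequencefile-demo"))
    try {
      // The output directory must not exist yet, so use a fresh child path.
      val dir = Files.createTempDirectory("seq-demo").resolve("out").toUri.toString
      println(roundTrip(sc, dir))
    } finally {
      sc.stop()
    }
  }
}
```

On a cluster the same calls would take an hdfs:// path in place of the temporary directory.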

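Hands-on:

val distFile = sc.textFile("hdfs://10.0.0.132:8020/data/ruozeinput.txt")
distFile.saveAsTextFile("hdfs://10.0.0.132:8020/data/output")
The old question again: why are there two output files?
One of the output files is empty. Tuning point: set the number of partitions to avoid generating many empty files.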
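One common way to apply this tuning point is to reduce the partition count with coalesce before saving, since saveAsTextFile writes one part file per partition. The sketch below assumes Spark 2.x on the classpath and writes to a temporary local directory rather than HDFS; `OutputFilesDemo` and `partFiles` are illustrative names.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object OutputFilesDemo {
  // Saves three lines with the given partitioning and returns the number of part files written.
  def partFiles(sc: SparkContext, numPartitions: Int, coalesceToOne: Boolean): Int = {
    // saveAsTextFile requires a directory that does not exist yet.
    val out = Files.createTempDirectory("output-demo").resolve("out")
    val rdd = sc.parallelize(Seq("hello world john", "hello world", "hello"), numPartitions)
    val toSave = if (coalesceToOne) rdd.coalesce(1) else rdd
    toSave.saveAsTextFile(out.toUri.toString)
    out.toFile.listFiles().count(f => f.getName.startsWith("part-"))
  }

  def main(args: Array[String]): Unit = {
    // Hard-coded master for demo purposes only.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("output-files-demo"))
    try {
      println(partFiles(sc, 2, coalesceToOne = false)) // one part file per partition
      println(partFiles(sc, 2, coalesceToOne = true))  // a single part file
    } finally {
      sc.stop()
    }
  }
}
```

Collapsing to a single partition also means a single task writes all the data, so for large outputs a moderate partition count is usually a better trade-off than 1.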

Exercise 1: Use Spark to read a SequenceFile.
