What is an RDD: Spark provides an abstraction called the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel. An RDD is created when the driver program reads a file from the Hadoop file system (or any other Hadoop-supported file system), or by transforming an existing Scala collection in the driver program. Users can also ask Spark to persist an RDD in memory so it can be reused efficiently across parallel operations.
In short, RDDs automatically recover from node failures.
Quoted from the official documentation:
At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
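The "automatically recover from node failures" part works because an RDD records its lineage: the source data plus the sequence of transformations that produced it, so any lost partition can simply be recomputed. Below is a minimal conceptual sketch in plain Python (the `SketchRDD` class is hypothetical, not Spark's actual API) showing how replaying a recorded lineage recovers a lost partition:

```python
# Conceptual sketch (plain Python, NOT the real Spark API) of lineage-based
# recovery: each partition can be recomputed from the source data plus the
# recorded transformations, so a lost partition is rebuilt, not restored
# from a backup copy.

class SketchRDD:
    """Toy stand-in for an RDD: source partitions plus a lineage of functions."""

    def __init__(self, partitions, lineage=()):
        self.source = [list(p) for p in partitions]  # immutable source data
        self.lineage = lineage                       # recorded transformations

    def map(self, f):
        # Recording f instead of applying it is what makes recovery possible.
        return SketchRDD(self.source, self.lineage + (f,))

    def compute(self, i):
        # Recompute partition i from scratch by replaying the lineage.
        part = self.source[i]
        for f in self.lineage:
            part = [f(x) for x in part]
        return part

rdd = SketchRDD([[1, 2], [3, 4]]).map(lambda x: x * 10)
cached = [rdd.compute(0), rdd.compute(1)]   # materialized partitions
cached[1] = None                            # simulate losing a partition
cached[1] = rdd.compute(1)                  # recovered by replaying the lineage
```

The key design point mirrored here is that an RDD is a recipe, not a mutable buffer: because the lineage is immutable, recomputation always yields the same partition contents.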
Types of RDD operations (the following is my own translation from the official docs):
RDDs support two types of operations: transformations and actions.
A transformation creates a new dataset from an existing one.
An action returns a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each element of the dataset through a function and returns a new RDD representing the results.
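The transformation/action split can be sketched in plain Python (the `LazySeq` class below is hypothetical, not Spark's API): transformations like `map` and `filter` are lazy and only build up a plan, while an action like `collect` or `count` forces evaluation and returns a result to the caller:

```python
# Conceptual sketch (plain Python, NOT the real Spark API) of lazy
# transformations versus eager actions.

class LazySeq:
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    def map(self, f):
        # Transformation: returns a new plan; computes nothing yet.
        return LazySeq(self.data, self.ops + (("map", f),))

    def filter(self, pred):
        # Another lazy transformation, also just recorded in the plan.
        return LazySeq(self.data, self.ops + (("filter", pred),))

    def collect(self):
        # Action: runs the recorded plan and returns the results.
        out = list(self.data)
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        # Action built on top of collect.
        return len(self.collect())

nums = LazySeq(range(1, 6))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has been computed yet; collect() triggers evaluation:
result = evens_squared.collect()   # [4, 16]
```

Deferring work until an action runs is what lets Spark see the whole chain of transformations at once and schedule it efficiently across the cluster, instead of materializing every intermediate dataset.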