What is an RDD: Spark provides an abstraction called the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel. An RDD is created when the driver program reads a file from the Hadoop file system (or any other Hadoop-supported file system), or by transforming an existing Scala collection in the driver program. Users can also ask Spark to persist an RDD in memory so it can be reused efficiently across parallel operations.
In short, RDDs automatically recover from node failures.
Quoted from the official documentation:
At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
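The "automatically recover from node failures" part works because an RDD records its lineage: the source data plus the sequence of transformations that produced it, so any lost partition can simply be recomputed. Below is a minimal conceptual sketch in plain Python (the `SketchRDD` class is hypothetical, not Spark's actual API) showing how replaying a recorded lineage recovers a lost partition:

```python
# Conceptual sketch (plain Python, NOT the real Spark API) of lineage-based
# recovery: each partition can be recomputed from the source data plus the
# recorded transformations, so a lost partition is rebuilt, not restored
# from a backup copy.

class SketchRDD:
    """Toy stand-in for an RDD: source partitions plus a lineage of functions."""

    def __init__(self, partitions, lineage=()):
        self.source = [list(p) for p in partitions]  # immutable source data
        self.lineage = lineage                       # recorded transformations

    def map(self, f):
        # Recording f instead of applying it is what makes recovery possible.
        return SketchRDD(self.source, self.lineage + (f,))

    def compute(self, i):
        # Recompute partition i from scratch by replaying the lineage.
        part = self.source[i]
        for f in self.lineage:
            part = [f(x) for x in part]
        return part

rdd = SketchRDD([[1, 2], [3, 4]]).map(lambda x: x * 10)
cached = [rdd.compute(0), rdd.compute(1)]   # materialized partitions
cached[1] = None                            # simulate losing a partition
cached[1] = rdd.compute(1)                  # recovered by replaying the lineage
```

The key design point mirrored here is that an RDD is a recipe, not a mutable buffer: because the lineage is immutable, recomputation always yields the same partition contents.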
Types of RDD operations (the following is my own translation from the official docs):
RDDs support two types of operations: transformations and actions.
A transformation creates a new dataset from an existing one.
An action returns a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each element of the dataset through a function and returns a new RDD representing the results.
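The transformation/action split can be sketched in plain Python (the `LazySeq` class below is hypothetical, not Spark's API): transformations like `map` and `filter` are lazy and only build up a plan, while an action like `collect` or `count` forces evaluation and returns a result to the caller:

```python
# Conceptual sketch (plain Python, NOT the real Spark API) of lazy
# transformations versus eager actions.

class LazySeq:
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    def map(self, f):
        # Transformation: returns a new plan; computes nothing yet.
        return LazySeq(self.data, self.ops + (("map", f),))

    def filter(self, pred):
        # Another lazy transformation, also just recorded in the plan.
        return LazySeq(self.data, self.ops + (("filter", pred),))

    def collect(self):
        # Action: runs the recorded plan and returns the results.
        out = list(self.data)
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        # Action built on top of collect.
        return len(self.collect())

nums = LazySeq(range(1, 6))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has been computed yet; collect() triggers evaluation:
result = evens_squared.collect()   # [4, 16]
```

Deferring work until an action runs is what lets Spark see the whole chain of transformations at once and schedule it efficiently across the cluster, instead of materializing every intermediate dataset.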