Spark: RDD vs DataFrame vs Dataset


leibnitz09


Category: spark. Tags: big data

Article link: https://blog.csdn.net/leibnitz09/article/details/84828808


In summary, the choice between RDDs and DataFrames and/or Datasets seems obvious. RDDs offer low-level functionality and control, whereas DataFrames and Datasets allow custom views and structure, offer high-level, domain-specific operations, save space, and execute faster.
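To make the contrast concrete, here is a minimal Scala sketch that computes the same per-key average twice, once with the RDD API and once with the DataFrame API. The sample data, the column names (`device_name`, `co2_level`), the local master setting, and the object name are all assumptions made for illustration; they do not come from the original article.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-vs-dataframe")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: (device name, CO2 level).
    val readings = Seq(("sensor-a", 1200), ("sensor-b", 1500), ("sensor-a", 1400))

    // RDD API: you spell out *how* to compute the per-key average.
    val avgByDeviceRdd = spark.sparkContext
      .parallelize(readings)
      .mapValues(level => (level.toLong, 1L))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum.toDouble / count }

    // DataFrame API: you declare *what* you want and let Spark plan the execution.
    val avgByDeviceDf = readings
      .toDF("device_name", "co2_level")
      .groupBy("device_name")
      .agg(avg("co2_level").as("avg_co2"))

    avgByDeviceRdd.collect().foreach(println)
    avgByDeviceDf.show()

    spark.stop()
  }
}
```

The RDD version gives full control over each step, while the DataFrame version is shorter and leaves the execution plan to the engine.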

As we examined the lessons we learned from early releases of Spark (how to simplify Spark for developers, and how to optimize it and make it performant), we decided to elevate the low-level RDD APIs to a high-level abstraction, DataFrame and Dataset, and to build this unified data abstraction across libraries atop the Catalyst optimizer and Tungsten.

Pick one (DataFrames and/or Datasets, or the RDD APIs) that meets your needs and use case, but I would not be surprised if you fall into the camp of most developers who work with structured and semi-structured data.

Note that you can always seamlessly interoperate or convert from a DataFrame and/or Dataset to an RDD with a simple call to the .rdd method.
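As a rough illustration of the `.rdd` conversion, the sketch below filters a typed Dataset and then drops down to an RDD. The `DeviceReading` case class, its fields, and the sample values are hypothetical and exist only for this example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Hypothetical record type, used only for this sketch.
case class DeviceReading(device_name: String, co2_level: Int)

object ToRddExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("to-rdd")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[DeviceReading] = Seq(
      DeviceReading("sensor-a", 1200),
      DeviceReading("sensor-b", 1500)
    ).toDS()

    // High-level filter on the Dataset, then convert to an RDD of the same element type.
    val highCo2: RDD[DeviceReading] = ds.where($"co2_level" > 1300).rdd

    // Calling .rdd on a DataFrame yields an RDD of generic Row objects.
    val rows: RDD[Row] = ds.toDF().rdd

    highCo2.take(10).foreach(println)
    rows.take(10).foreach(println)

    spark.stop()
  }
}
```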

That is:

+---------------------+
|       Dataset       |
| - - - - - - - - - - |
|      DataFrame      |
+---------------------+
+---------------------+
|         RDD         |
+---------------------+
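The shared Dataset/DataFrame layer in the diagram shows up directly in the Scala API from Spark 2.0 onward, where `DataFrame` is a type alias for `Dataset[Row]`. The sketch below moves between the untyped and typed views of the same data; the `Person` case class and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, used only for this sketch.
case class Person(name: String, age: Long)

object UnifiedApiExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("unified-api")
      .getOrCreate()
    import spark.implicits._

    val df: DataFrame = Seq(("Ann", 31L), ("Bob", 45L)).toDF("name", "age")

    // DataFrame is an alias for Dataset[Row], so this assignment compiles as-is.
    val asRows: Dataset[Row] = df

    // Attach a compile-time type to the same data.
    val people: Dataset[Person] = df.as[Person]

    // Typed lambda on the Dataset; field access is checked by the compiler.
    val older: Dataset[Person] = people.filter(p => p.age >= 40)

    // Return to the untyped DataFrame view.
    val backToDf: DataFrame = older.toDF()
    backToDf.show()

    spark.stop()
  }
}
```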

ref:

[1] A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets: When to use them and why

[2] Spark SQL: Relational Data Processing in Spark
