https://www.youtube.com/watch?v=pZQsDloGB4w
RDDs
RDDs are the basic building block of Spark, and they're what you see under the covers
They are the abstraction Spark provides, its original low-level API. All the higher-level APIs (SQL, DataFrame, Dataset, Structured Streaming, DStreams) ultimately decompose into RDD operations
Many people still program directly against RDDs
...compile-time type-safe
...lazy
...based on the Scala collections API: if you're familiar with the Scala collections API, the Spark RDD API should look familiar
1 Compile-time type safety is a real win: an RDD of String, an RDD of a case class
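A minimal spark-shell-style sketch of that win (the `Person` case class is purely illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for illustration
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("rdd-types").getOrCreate()
val sc = spark.sparkContext

val words  = sc.parallelize(Seq("a", "b"))           // RDD[String]
val people = sc.parallelize(Seq(Person("Ann", 30)))  // RDD[Person]

// The compiler knows the element types, so mistakes fail at compile time:
// words.map(_.age)   // does not compile: String has no member 'age'
val ages = people.map(_.age).collect()
```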
2 One of the problems people complain about with the RDD library is that RDD code is opaque about what you are trying to accomplish. RDDs are low-level, and they suffer from some problems, including:
> RDDs express the how of a solution better than the what
> RDDs cannot be optimized by Spark
> RDDs are slow in non-JVM languages like Python
> It's too easy to build an inefficient RDD transformation chain
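A classic instance of that last point, sketched for spark-shell: `groupByKey` followed by a sum ships every value across the shuffle, while `reduceByKey` pre-combines within each partition. Spark won't rewrite the slow version for you, because RDD chains are opaque to the optimizer.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rdd-chain").getOrCreate()
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Inefficient: groupByKey shuffles every value, then sums afterwards
val slow = pairs.groupByKey().mapValues(_.sum)

// Efficient: reduceByKey combines locally on each partition before shuffling
val fast = pairs.reduceByKey(_ + _)

val same = slow.collect().toMap == fast.collect().toMap
```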
The DataFrame API
provides a higher-level abstraction (a DSL, Domain-Specific Language); you build a query plan with the DSL query language, and the same query plan can also be produced by running SQL statements against the data
expresses what you are trying to accomplish rather than how you are trying to do it
can be optimized better by Spark
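A sketch of the two equivalent routes to the same query plan (the table name and data are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("dsl-vs-sql").getOrCreate()
import spark.implicits._

val df = Seq(("Ann", 30), ("Bob", 17)).toDF("name", "age")
df.createOrReplaceTempView("people")

// DSL: say *what* you want; Catalyst plans the how
val viaDsl = df.filter(col("age") >= 18).select("name")

// The same query as a SQL statement builds the same plan
val viaSql = spark.sql("SELECT name FROM people WHERE age >= 18")
```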
type safe DSL?
RDDs convert to DataFrames -> schemas are inferred via reflection -> the schema gives the DataFrame columns, and columns have names and types
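For example, a sketch of reflection-based schema inference (again assuming a hypothetical `Person` case class):

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)  // hypothetical

val spark = SparkSession.builder().master("local[*]").appName("reflect").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30)))

// Spark reflects over Person's fields to build the schema:
// the DataFrame gets a "name" column (string) and an "age" column (int)
val df = rdd.toDF()
val cols = df.columns
```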
DataFrame queries are optimized by the Catalyst optimizer
Parquet files: column-based storage. Spark has built-in intelligence for reading Parquet files
Faster
Lost Type Safety
org.apache.spark.sql.Row: Row isn't type-safe
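A sketch of the problem: `Row.getAs` compiles for any requested type, and asking for the wrong one only fails at runtime.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("row-unsafe").getOrCreate()
import spark.implicits._

val row = Seq(("Ann", 30)).toDF("name", "age").first()  // org.apache.spark.sql.Row

val age = row.getAs[Int]("age")  // fine
// Compiles, but throws ClassCastException when the value is used:
// val broken: String = row.getAs[String]("age")
```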
Datasets
We'd like to get back our compile-time type safety without giving up the optimizations Catalyst can provide us.
Datasets Are
An Extension to the DataFrame API
Conceptually similar to RDDs: you can use lambdas and types
Use Tungsten's fast in-memory encoding (as opposed to JVM objects or serialized objects on the heap)
Expose expressions and fields to the DataFrame query planner, where the optimizer can use them to make decisions. (This can't happen with RDDs)
Interoperate more easily with the DataFrame API
Datasets: A bit of both RDDs and DataFrame
An efficient Tungsten-based column store, whereas RDDs deal with pure JVM objects
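A sketch of the "bit of both" idea (`Person` is again a made-up type): typed lambdas as with RDDs, plus direct interop with the DataFrame API.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)  // hypothetical

val spark = SparkSession.builder().master("local[*]").appName("ds").getOrCreate()
import spark.implicits._

// Dataset[Person]: typed like an RDD, planned and encoded like a DataFrame
val ds = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()

// Typed lambda, checked at compile time, run over Tungsten's compact format
val adults = ds.filter(_.age >= 18)

// And it interoperates with the untyped DataFrame API directly
val names = adults.toDF().select("name")
```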
Datasets and Memory
Datasets use less memory
Spark generates encoders that translate in-memory objects back and forth to a compact format, and often doesn't need to serialize or deserialize at all to operate on the data
Datasets and Serialization
Spark has to serialize data ...a lot
Because of the efficiency of the code-generated encoders, serialization can be significantly faster than either native Java or Kryo serialization
The resulting serialized data is often up to 2x smaller, which reduces disk and network use
2016 Dataset Limitations
experimental, APIs might change
APIs incomplete: lacks some aggregators (like sum()) and lacks a sortBy() function