https://www.youtube.com/watch?v=pZQsDloGB4w
RDDs
RDDs are the basic building block of Spark, and they're what you see under the covers
They are the abstraction Spark provides, its original low-level API. All the higher-level APIs (SQL, DataFrame, Dataset, Structured Streaming, DStreams) ultimately decompose into RDD operations
Many people still program directly against RDDs
...compile-time type-safe
...lazy
...based on the Scala collections API: if you're familiar with the Scala collections API, the Spark RDD API should look familiar
1 Compile-time type safety is a real win: an RDD of String, an RDD of a case class
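A minimal spark-shell-style sketch of that win (the `Person` case class is purely illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for illustration
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("rdd-types").getOrCreate()
val sc = spark.sparkContext

val words  = sc.parallelize(Seq("a", "b"))           // RDD[String]
val people = sc.parallelize(Seq(Person("Ann", 30)))  // RDD[Person]

// The compiler knows the element types, so mistakes fail at compile time:
// words.map(_.age)   // does not compile: String has no member 'age'
val ages = people.map(_.age).collect()
```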
2 One of the problems people complain about with the RDD library is that RDD code is opaque about what you are trying to accomplish. RDDs are low-level, and they suffer from some problems, including:
> RDDs express the how of a solution better than the what
> RDDs cannot be optimized by Spark
> RDDs are slow in non-JVM languages like Python
> It's too easy to build an inefficient RDD transformation chain
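A classic instance of that last point, sketched for spark-shell: `groupByKey` followed by a sum ships every value across the shuffle, while `reduceByKey` pre-combines within each partition. Spark won't rewrite the slow version for you, because RDD chains are opaque to the optimizer.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rdd-chain").getOrCreate()
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Inefficient: groupByKey shuffles every value, then sums afterwards
val slow = pairs.groupByKey().mapValues(_.sum)

// Efficient: reduceByKey combines locally on each partition before shuffling
val fast = pairs.reduceByKey(_ + _)

val same = slow.collect().toMap == fast.collect().toMap
```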
The DataFrame API
provides a higher-level abstraction (a DSL, Domain-Specific Language); you build a query plan with the DSL query language, and the same query plan can also be produced by running SQL statements against the data
expresses what you are trying to accomplish rather than how you are trying to do it
can be optimized better by Spark
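A sketch of the two equivalent routes to the same query plan (the table name and data are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("dsl-vs-sql").getOrCreate()
import spark.implicits._

val df = Seq(("Ann", 30), ("Bob", 17)).toDF("name", "age")
df.createOrReplaceTempView("people")

// DSL: say *what* you want; Catalyst plans the how
val viaDsl = df.filter(col("age") >= 18).select("name")

// The same query as a SQL statement builds the same plan
val viaSql = spark.sql("SELECT name FROM people WHERE age >= 18")
```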
type safe DSL?
RDDs convert to DataFrames -> schemas are inferred via reflection -> the schema gives the DataFrame columns, and columns have names and types
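For example, a sketch of reflection-based schema inference (again assuming a hypothetical `Person` case class):

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)  // hypothetical

val spark = SparkSession.builder().master("local[*]").appName("reflect").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30)))

// Spark reflects over Person's fields to build the schema:
// the DataFrame gets a "name" column (string) and an "age" column (int)
val df = rdd.toDF()
val cols = df.columns
```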
DataFrame queries are optimized by the Catalyst optimizer
Parquet files: column-based storage. Spark has built-in intelligence for reading Parquet files
Faster
Lost Type Safety
org.apache.spark.sql.Row: Row isn't type-safe
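A sketch of the problem: `Row.getAs` compiles for any requested type, and asking for the wrong one only fails at runtime.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("row-unsafe").getOrCreate()
import spark.implicits._

val row = Seq(("Ann", 30)).toDF("name", "age").first()  // org.apache.spark.sql.Row

val age = row.getAs[Int]("age")  // fine
// Compiles, but throws ClassCastException when the value is used:
// val broken: String = row.getAs[String]("age")
```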
Datasets
We'd like to get back our compile-time type safety without giving up the optimizations Catalyst can provide us.
Datasets Are
An Extension to the DataFrame API
Conceptually similar to RDDs: you can use lambdas and types
Use Tungsten's fast in-memory encoding (as opposed to JVM objects or serialized objects on the heap)
Expose expressions and fields to the DataFrame query planner, where the optimizer can use them to make decisions. (This can't happen with RDDs)
Interoperate more easily with the DataFrame API
Datasets: A bit of both RDDs and DataFrame
An efficient Tungsten-based column store, whereas RDDs deal with pure JVM objects
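A sketch of the "bit of both" idea (`Person` is again a made-up type): typed lambdas as with RDDs, plus direct interop with the DataFrame API.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)  // hypothetical

val spark = SparkSession.builder().master("local[*]").appName("ds").getOrCreate()
import spark.implicits._

// Dataset[Person]: typed like an RDD, planned and encoded like a DataFrame
val ds = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()

// Typed lambda, checked at compile time, run over Tungsten's compact format
val adults = ds.filter(_.age >= 18)

// And it interoperates with the untyped DataFrame API directly
val names = adults.toDF().select("name")
```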
Datasets and Memory
Datasets use less memory
Spark generates encoders that translate in-memory objects back and forth to a compact format, and often doesn't need to serialize or deserialize at all to operate on the data
Datasets and Serialization
Spark has to serialize data ...a lot
Because of the efficiency of the code-generated encoders, serialization can be significantly faster than either native Java or Kryo serialization
The resulting serialized data is often up to 2x smaller, which reduces disk and network use
2016 Dataset Limitations
experimental, APIs might change
APIs incomplete: lacks some aggregators (like sum()) and lacks a sortBy() function