RDDs, DataFrames and Datasets in Apache Spark - NE Scala 2016

https://www.youtube.com/watch?v=pZQsDloGB4w

RDDs 

    RDDs are the basic building block of Spark; they're what you see under the covers

    They are the abstraction Spark provides, its original low-level API; every higher-level API (SQL/DataFrame/Dataset/Structured Streaming/DStreams) ultimately decomposes into RDD operations

    Many people still program directly against RDDs

...compile-time type-safe

...lazy

...based on the Scala collections API: if you're familiar with the Scala collections API, the Spark RDD API should look familiar
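
A minimal sketch of those three properties, assuming a SparkContext named sc is already in scope (as in the spark-shell) and a hypothetical people.txt input file:

```scala
// Assumes `sc: org.apache.spark.SparkContext` is in scope (e.g. the spark-shell).
case class Person(name: String, age: Int)

val lines = sc.textFile("people.txt")        // RDD[String] (hypothetical input file)

// Collections-style API: map/filter read just like Scala collections.
val people = lines.map { line =>
  val Array(name, age) = line.split(",")
  Person(name, age.trim.toInt)
}                                            // RDD[Person] -- compile-time type-safe

val adults = people.filter(_.age >= 18)      // still lazy: nothing has executed yet

adults.count()                               // an action finally triggers the job
```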

1 Compile-time type safety is a real win: an RDD of String, an RDD of a case class

2 One of the complaints about the RDD library is that it is opaque: it's hard to see what you are trying to accomplish. RDDs are low-level, and they suffer from some problems, including:

    > RDDs express the how of a solution better than the what

    > RDDs cannot be optimized by Spark

    > RDDs are slow on non-JVM languages like Python

    > It's too easy to build an inefficient RDD transformation chain (see the sketch below)
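
As one illustration of that last point, a classic inefficiency is using groupByKey where reduceByKey would do; because RDDs only express the how, Spark won't rewrite the slow version for you (reusing the hypothetical `sc` from the sketch above):

```scala
// Word count over a small hypothetical RDD.
val pairs = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1))

// Inefficient: shuffles every (word, 1) pair across the network, then sums.
val slow = pairs.groupByKey().mapValues(_.sum)

// Efficient: combines partial counts on each partition before the shuffle.
val fast = pairs.reduceByKey(_ + _)

// Both produce the same result, but Spark runs exactly what you wrote;
// no optimizer turns `slow` into `fast`.
```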


The DataFrame API

    provides a higher-level abstraction (a DSL, Domain-Specific Language); you use the DSL query language to build a query plan, and the same query plan can also be built from a SQL statement operating on the data

    express what you are trying to accomplish rather than how you are trying to do it

    can be optimized better by Spark

    type safe DSL?

    RDDs convert to DataFrames -> schemas are inferred via reflection -> the schema gives the DataFrame columns, and columns have names and types (see the sketch below)
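
A sketch of that reflection-based conversion. This uses the Spark 2.x SparkSession API (the 2016 talk used SQLContext, but the idea is identical):

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder.appName("demo").getOrCreate()
import spark.implicits._   // enables .toDF() and the $"col" syntax

val peopleRDD = spark.sparkContext.parallelize(
  Seq(Person("Ann", 35), Person("Bob", 17)))

// Reflection over the Person case class supplies column names and types.
val df = peopleRDD.toDF()
df.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)

// The same query plan, built from the DSL or from a SQL statement:
df.filter($"age" >= 18).select($"name").show()
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age >= 18").show()
```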


    DataFrame queries are optimized: the Catalyst Optimizer


    Parquet files: column-based storage. Spark has built-in intelligence for reading Parquet files

    Faster 

    Lost type safety

    org.apache.spark.sql.Row: Row isn't type-safe (see the sketch below)
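
A sketch of what losing type safety means in practice, reusing the hypothetical `spark` session from the sketch above (the file path is hypothetical too):

```scala
// Reading Parquet yields a DataFrame, i.e. a collection of Rows.
val df = spark.read.parquet("people.parquet")

val row = df.first()                    // org.apache.spark.sql.Row

// Access is positional or by name, with types checked only at runtime:
val name = row.getString(0)             // fails at runtime if column 0 isn't a String
val age  = row.getAs[Int]("age")

// This compiles fine and only blows up when the job actually runs:
val oops = row.getAs[String]("age")     // ClassCastException at runtime
```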



Datasets

We'd like to get back our compile-time type safety without giving up the optimizations Catalyst can provide us. 

Datasets Are

    An Extension to the DataFrame API

    Conceptually similar to RDDs: you can use lambdas and types

    Use Tungsten's fast in-memory encoding (as opposed to JVM objects or serialized objects on the heap)

    Expose expressions and fields to the DataFrame query planner, where the optimizer can use them to make decisions. (This can't happen with RDDs)

    Interoperate more easily with the DataFrame API

Datasets: a bit of both RDDs and DataFrames

    Datasets use the efficient Tungsten-based column store, while RDDs deal with pure JVM objects (see the sketch below)
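
A minimal Dataset sketch, again using the Spark 2.x API and assuming the `spark` session and Person case class from the sketches above:

```scala
import spark.implicits._

// A Dataset[Person]: typed like an RDD, planned like a DataFrame.
val ds = Seq(Person("Ann", 35), Person("Bob", 17)).toDS()

// Lambdas and compile-time types, as with RDDs:
val adults = ds.filter(p => p.age >= 18)   // misspelling `age` would not compile

// The query still goes through Catalyst, and rows live in Tungsten's
// compact encoded form rather than as JVM objects on the heap.

// Easy interop with the DataFrame API:
val df  = ds.toDF()        // Dataset[Person] -> DataFrame (a Dataset[Row])
val ds2 = df.as[Person]    // and back, via the generated encoder
```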

Datasets and Memory

    Datasets use less memory

    Spark generates encoders that translate in-memory objects back and forth to a compact binary format; many operations can work directly on that encoded form, without serializing or deserializing at all

Datasets and Serialization

    Spark has to serialize data ...a lot

    Because of the efficiency of the code-generated encoders, serialization can be significantly faster than either native Java or Kryo serialization

    The resulting serialized data will often be up to 2x smaller, which reduces disk use and network use
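
A small sketch of the encoder machinery behind these claims (Encoders.product is part of Spark's public API; Person is the hypothetical case class from earlier):

```scala
import org.apache.spark.sql.{Encoder, Encoders}

case class Person(name: String, age: Int)

// Spark derives a code-generated encoder for the case class:
val enc: Encoder[Person] = Encoders.product[Person]
enc.schema.printTreeString()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)

// The encoder maps Person objects to and from Tungsten's compact binary
// rows; that code-generated path is what makes Dataset serialization
// faster and typically smaller than Java or Kryo serialization.
```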


2016 Dataset Limitations

    Experimental: APIs might change

    APIs incomplete: lacks some aggregators (like sum()) and lacks a sortBy() function
