RDD的属性

RDD resilient Distributed Dataset
  • properties:
  1. Immutable 
  2. lazy evaluated
  3. Cacheable
  4. Type inferred

What's Immutable?
  1. once created never changes
  2. Big data by default immutable in nature
  3. Immutability helps to: (1) Parallelize; (2) Caching

Why Big Data is immutable?
  1. Parallelize for free, no need to lock;
  2. Caching is safe, no worry for other change
  3. immutability is about value not about reference

Immutability in collections
  1. uses transformation for change.  e.g. MAP
  2. creates a new copy of collection leaves collection intact.
  3. uses loop for updating mutable collections in place

Chanllenges of  Immutability
  1. good for parallelism but no good for space
  2. multiple transformations result in: (1) Multiple of copies of data; (2) multiple passes of data
  3. poor performance for multiple of copies and passes of data.

Get lazy for the chanllenges
  1. don't computing transformations till it's need
  2. defers evaluation
  3. separate execution from evaluation
  4. multiple transformations are combined in one

Laziness and immutability
  1. you can be lazy only if the underneath data is immutable
  2. you cannot combine transformation if transformation has side effect
  3. combining laziness and immutability gives better performance and distributed processing

Chanllenges of Laziness   :type inference
  1. Laziness poses chanllenges in terms of data type
  2. if laziness deters execution, determining the type of variable becomes chanllenging
  3. if we can't determine the right type, it allows to have semantic issues
  4. running big data programs and getting semantics errors are not fun.

Type inference
  1. part of compiler to determining the type by value
  2. as all the transformation are side effect free, we can determine the type by operation;  v1.count() inferred as Int
  3. every transformation has specific return type; map array gets array
  4. having type inference relieves you think about representation for many transforms

Caching
  1. immutable data allows you to cache data for long time
  2. lazy transformation allows to recreate data on failure; from linear
  3. transformations can be saved also; as linear 
  4. caching data improves execution engine performance

RDD means big collection of data with above properties.


  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值