Properties of RDD

lacknow

Published 2016-01-10 16:14:04

Category: big dat study notes. Tags: RDD

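Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when reposting.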

Article link: https://blog.csdn.net/lacknow/article/details/50492570


RDD: Resilient Distributed Dataset

Properties:

Immutable
Lazily evaluated
Cacheable
Type inferred

What is immutability?

Once created, the data never changes.
Big data is immutable by nature.
Immutability helps with (1) parallelization and (2) caching.

Why is big data immutable?

Parallelism comes for free: no locks are needed.
Caching is safe: no other code can change the cached data.
Immutability is about the value, not about the reference.
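
The lock-free parallelism point can be illustrated with a minimal sketch in plain Python (not Spark). It assumes a thread pool is an acceptable stand-in for distributed workers; the names `data`, `chunks` and `partial_sums` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Immutable input shared by every worker (a tuple cannot be modified).
data = tuple(range(1, 101))

# Split the data into four read-only chunks of 25 elements each.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# Workers only read their chunk, so no lock is required.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

total = sum(partial_sums)
print(total)  # 5050
```

Because no worker can change `data`, the order in which the chunks are processed does not affect the result.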

Immutability in collections

Immutable collections are changed through transformations, e.g. map.
A transformation creates a new copy of the collection and leaves the original intact.
Mutable collections, by contrast, are updated in place with loops.
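
The contrast between the two styles can be sketched in plain Python. This is only an analogy to the collection behaviour described above, not Spark code, and the variable names are illustrative.

```python
# Immutable style: a transformation returns a new collection.
original = (1, 2, 3)                      # tuples cannot be modified in place
doubled = tuple(x * 2 for x in original)  # map-like transformation

# Mutable style: a loop overwrites the elements of the same list.
mutable = [1, 2, 3]
for i in range(len(mutable)):
    mutable[i] = mutable[i] * 2

print(original)  # (1, 2, 3), still intact
print(doubled)   # (2, 4, 6)
print(mutable)   # [2, 4, 6], the old values are gone
```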

Challenges of immutability

Good for parallelism, but costly in space.
Multiple transformations result in (1) multiple copies of the data and (2) multiple passes over the data.
Those extra copies and passes lead to poor performance.
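
A small plain-Python sketch of the cost described above, assuming eager list comprehensions as a stand-in for eagerly evaluated transformations (the names are illustrative):

```python
data = [1, 2, 3, 4]

# Each step walks the whole input and materializes a full intermediate copy.
plus_one = [x + 1 for x in data]            # pass 1, copy 1
times_two = [x * 2 for x in plus_one]       # pass 2, copy 2
large = [x for x in times_two if x > 4]     # pass 3, copy 3

print(plus_one)   # [2, 3, 4, 5]
print(times_two)  # [4, 6, 8, 10]
print(large)      # [6, 8, 10]
```

Only `large` is wanted, yet two intermediate lists were built and the data was traversed three times.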

Getting lazy to meet these challenges

Do not compute a transformation until its result is needed.
Laziness defers evaluation.
It separates describing a computation from executing it.
Multiple transformations are combined into one.
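
Deferred evaluation and combined transformations can be sketched with Python's built-in lazy `map`. This is an analogy rather than Spark's mechanism; `add_one`, `double` and the `trace` list are hypothetical helpers that record when each function actually runs.

```python
trace = []

def add_one(x):
    trace.append(f"add_one({x})")
    return x + 1

def double(x):
    trace.append(f"double({x})")
    return x * 2

data = (1, 2, 3)

# Building the pipeline runs nothing: map() returns a lazy iterator.
pipeline = map(double, map(add_one, data))
calls_before = len(trace)

# Evaluation happens only here, and both functions are applied
# element by element in a single pass with no intermediate copy.
result = list(pipeline)

print(calls_before)  # 0
print(result)        # [4, 6, 8]
print(trace[:2])     # ['add_one(1)', 'double(2)']
```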

Laziness and immutability

You can be lazy only if the underlying data is immutable.
You cannot combine transformations if a transformation has side effects.
Combining laziness with immutability gives better performance and enables distributed processing.
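
Why laziness depends on immutable data can be seen in a short plain-Python sketch, assuming a generator expression as the lazy transformation: if the source changes between defining the transformation and evaluating it, the result silently changes too.

```python
source = [1, 2, 3]                       # a mutable list
lazy_doubled = (x * 2 for x in source)   # nothing is computed yet

# The source is mutated before the lazy transformation is evaluated.
source.append(4)
source[0] = 100

result = list(lazy_doubled)
print(result)  # [200, 4, 6, 8], not the [2, 4, 6] expected at definition time
```

With an immutable source such as a tuple, the mutation step is impossible, so the deferred result is the same no matter when it is evaluated.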

Challenges of laziness: type inference

Laziness poses challenges in terms of data types.
If laziness defers execution, determining the type of a variable becomes challenging.
If the right type cannot be determined, semantic errors can slip in.
Running big data programs only to hit semantic errors is no fun.

Type inference

The part of the compiler that determines a type from the value.
Because all transformations are side-effect free, the type can be determined from the operation; e.g. v1.count() is inferred as Long (the return type of count() on a Spark RDD).
Every transformation has a specific return type; e.g. map over an array returns an array.
Type inference relieves you of thinking about the representation for many transformations.

Caching

Immutable data can be cached safely for a long time.
Lazy transformations allow lost data to be recreated on failure, from its lineage.
The transformations themselves can also be saved, as lineage.
Caching data improves the performance of the execution engine.
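
How caching and lineage fit together can be sketched with a toy class in plain Python. `LazyDataset` is a hypothetical name invented for this illustration and is not Spark's RDD API. It also simplifies one point: every result is cached automatically here, whereas in Spark caching is requested explicitly.

```python
class LazyDataset:
    """Toy dataset that remembers its lineage (parent + function)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source    # immutable base data, for root datasets
        self._parent = parent    # lineage: the dataset this one derives from
        self._fn = fn            # lineage: function applied to each parent element
        self._cache = None
        self.compute_count = 0   # how many times the data was actually computed

    def map(self, fn):
        # Lazy: only records the transformation, computes nothing.
        return LazyDataset(parent=self, fn=fn)

    def collect(self):
        if self._cache is not None:
            return self._cache           # safe to reuse: the data is immutable
        self.compute_count += 1
        if self._parent is None:
            values = tuple(self._source)
        else:
            values = tuple(self._fn(x) for x in self._parent.collect())
        self._cache = values
        return values

    def evict(self):
        # Simulates losing the cached data, e.g. after a failure.
        self._cache = None


base = LazyDataset(source=(1, 2, 3))
squares = base.map(lambda x: x * x)

first = squares.collect()    # computed from lineage
second = squares.collect()   # served from the cache
squares.evict()
third = squares.collect()    # rebuilt from lineage after the loss

print(first, third)            # (1, 4, 9) (1, 4, 9)
print(squares.compute_count)   # 2
```

The second `collect()` does no work, and after `evict()` the same values are rebuilt purely from the recorded parent and function.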

An RDD is a big collection of data with the properties above.
