RDD: Resilient Distributed Dataset
- Properties:
- Immutable
- Lazily evaluated
- Cacheable
- Type inferred
What is immutability?
- Once created, the data never changes
- Big data is immutable by nature
- Immutability helps with: (1) parallelization; (2) caching
Why is big data immutable?
- Parallelism comes for free: no locks are needed
- Caching is safe: no other code can change the cached data
- Immutability is about the value, not the reference
Immutability in collections
- Immutable collections change through transformations, e.g. map
- A transformation creates a new copy of the collection and leaves the original intact
- Mutable collections, by contrast, are updated in place with a loop
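The contrast above can be sketched in plain Python. This is only an illustration, not Spark API; the names prices, doubled and mutable_prices are made up for the example.

```python
# Immutable style: a transformation returns a new collection.
prices = (10, 20, 30)                        # a tuple cannot be modified in place
doubled = tuple(map(lambda p: p * 2, prices))

# Mutable style: a loop overwrites the elements in place.
mutable_prices = [10, 20, 30]
for i in range(len(mutable_prices)):
    mutable_prices[i] *= 2
```

After this runs, prices still holds its original values, while the original contents of mutable_prices are gone.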
Challenges of immutability
- Good for parallelism, but costly in space
- Multiple transformations result in: (1) multiple copies of the data; (2) multiple passes over the data
- These extra copies and passes hurt performance
Get lazy to meet the challenges
- Don't compute transformations until they are needed
- Laziness defers evaluation
- It separates execution from evaluation
- Multiple transformations are combined into one
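A rough sketch of the difference, using plain Python generators as a stand-in for lazy transformations (an assumption made for illustration; Spark's own mechanism is not shown here). The helper functions add_one and is_even are hypothetical.

```python
def add_one(x):
    return x + 1

def is_even(x):
    return x % 2 == 0

data = tuple(range(1, 6))

# Eager: each step makes a full pass over the data and materializes a copy.
step1 = [add_one(x) for x in data]           # pass 1, copy 1
step2 = [x for x in step1 if is_even(x)]     # pass 2, copy 2
eager_result = [x * 10 for x in step2]       # pass 3, copy 3

# Lazy: the generators only describe the work; no element is processed yet.
lazy1 = (add_one(x) for x in data)
lazy2 = (x for x in lazy1 if is_even(x))
lazy3 = (x * 10 for x in lazy2)

# Evaluation is triggered here. Each element flows through all three
# steps in a single pass, with no intermediate lists.
lazy_result = list(lazy3)
```

Both versions produce the same values; the lazy one simply avoids the intermediate copies and the repeated passes.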
Laziness and immutability
- You can be lazy only if the underlying data is immutable
- You cannot combine transformations if a transformation has side effects
- Combining laziness and immutability gives better performance and enables distributed processing
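To see why laziness needs immutable data, consider this small plain-Python sketch (generators stand in for lazy transformations; all names are illustrative). When the source can change between defining a lazy step and evaluating it, the result depends on timing.

```python
# Lazy pipeline over a MUTABLE list: the result depends on when it runs.
mutable_source = [1, 2, 3]
lazy_over_list = (x * 2 for x in mutable_source)
mutable_source.append(4)                     # a change sneaks in before evaluation
surprising = list(lazy_over_list)            # includes the late element

# Lazy pipeline over an immutable tuple: same result whenever it runs.
immutable_source = (1, 2, 3)
lazy_over_tuple = (x * 2 for x in immutable_source)
stable = list(lazy_over_tuple)
```

With the tuple there is no way to alter the source after the pipeline is defined, so deferring the work is safe.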
Challenges of laziness: type inference
- Laziness poses challenges in terms of data types
- If laziness defers execution, determining the type of a variable becomes challenging
- If we can't determine the right type, semantic errors can slip in
- Running big data programs only to hit semantic errors is no fun
Type inference
- The part of the compiler that determines a type from the value
- Because all transformations are side-effect free, the type can be determined from the operation, e.g. the result of v1.count() is inferred as Int
- Every transformation has a specific return type, e.g. map over an array returns an array
- Type inference relieves you from thinking about the data representation across many transformations
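As a loose analogy only: Python has no compile-time inference of its own, so the sketch below uses type hints to show how a return type can be read from the operation without running it. The functions map_list and count are hypothetical helpers, not part of any library, and a static checker such as mypy would be expected to infer the result types from these signatures.

```python
from typing import Callable, List, TypeVar, get_type_hints

T = TypeVar("T")
U = TypeVar("U")

def map_list(items: List[T], fn: Callable[[T], U]) -> List[U]:
    """Transformation: a list goes in, a list comes out."""
    return [fn(x) for x in items]

def count(items: List[T]) -> int:
    """Counting always yields an integer."""
    return len(items)

# The return type is known from the signature, before any data is processed.
count_return_type = get_type_hints(count)["return"]

lengths = map_list(["a", "bb", "ccc"], len)  # a checker would treat this as List[int]
total = count(lengths)
```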
Caching
- Immutable data can be cached safely for a long time
- Lazy transformations allow data to be recreated on failure, from lineage
- The transformations themselves can also be saved, as lineage
- Caching data improves the performance of the execution engine
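The interplay of caching and lineage can be sketched with a toy class. LazyDataset and all of its methods are hypothetical and written only for this illustration; they are not Spark's implementation. Each dataset remembers its parent and the transformation that produced it, so a lost cache can be rebuilt.

```python
class LazyDataset:
    """Hypothetical toy: remembers how to build itself (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # immutable base data (root dataset only)
        self.parent = parent      # the dataset this one was derived from
        self.fn = fn              # the transformation applied to the parent
        self._cache = None        # materialized result, if cached

    def map(self, fn):
        # Records the step; does not compute anything.
        return LazyDataset(parent=self, fn=fn)

    def cache(self):
        self._cache = self.collect()
        return self

    def evict(self):
        # Simulates losing the cached copy, e.g. through a failure.
        self._cache = None

    def collect(self):
        if self._cache is not None:
            return self._cache
        if self.parent is None:
            return self.source
        return tuple(self.fn(x) for x in self.parent.collect())

base = LazyDataset(source=(1, 2, 3))
squared = base.map(lambda x: x * x).cache()
first = squared.collect()       # served from the cache
squared.evict()
recovered = squared.collect()   # recomputed from the recorded lineage
```

Because the base data is immutable and the transformation has no side effects, the recomputed result matches the cached one.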
An RDD is a big collection of data with the above properties.