Understanding Spark Caching

Spark excels at processing data in memory. We are going to look at the available caching options, measure their effects, and (hopefully) offer some tips for optimizing Spark memory caching.

When caching data in Spark, there are two options:

1. Raw (deserialized) caching

2. Serialized caching

Here are some differences between the two options:

Raw caching
- Pretty fast to process
- Can take up 2x-4x more space (for example, 100MB of data cached could consume ~350MB of memory)
- Can put pressure on the JVM and JVM garbage collection
- Usage: rdd.persist(StorageLevel.MEMORY_ONLY) or rdd.cache()

Serialized caching
- Slower to process than raw caching
- Memory overhead is minimal
- Less pressure on the JVM and garbage collection
- Usage: rdd.persist(StorageLevel.MEMORY_ONLY_SER)
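
As a minimal sketch of the two options in the Spark shell (the S3 path below is a placeholder, not part of the original experiment):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("s3n://bucket_path/sample.data")   // placeholder path

// Option 1: raw (deserialized) caching; equivalent to lines.cache()
lines.persist(StorageLevel.MEMORY_ONLY)

// Option 2: serialized caching -- an RDD keeps a single storage level,
// so call lines.unpersist() first if you want to switch
// lines.persist(StorageLevel.MEMORY_ONLY_SER)

lines.count()   // the first action is what actually materializes the cache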

So what are the trade-offs?

Here is a quick experiment.  I cache a bunch of RDDs using both options and measure memory footprint and processing time.  My RDDs range in size from 100MB to 1GB.

Testing environment:

3-node Spark cluster running on Amazon EC2 (m1.large instances, 8GB of memory per node)

Data files are read from an S3 bucket


Testing method:

$ ./bin/spark-shell --driver-memory 8g
> val f = sc.textFile("s3n://bucket_path/1G.data")
> f.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)   // the cache option being tested
> f.count()   // run a few times and measure the times
> // also check the RDD's in-memory size in the Spark application UI, under the 'Storage' tab
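
Instead of eyeballing the UI, the cached size and count() time can also be read from the shell. A rough sketch (sc.getRDDStorageInfo is a developer API, and the timing helper is only illustrative, not a benchmark harness):

// Rough timing helper for an action
def time[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

time { f.count() }   // repeat a few times; the first run fills the cache

// In-memory size of each cached RDD, in MB (getRDDStorageInfo is a DeveloperApi)
sc.getRDDStorageInfo.foreach { info =>
  println(f"${info.name}: ${info.memSize / 1024.0 / 1024.0}%.1f MB in memory")
}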

On to the results:

Data size                     100M        500M        1000M (1G)

Memory footprint (MB)
  raw                         373.8       1,869.2     3,788.8
  serialized                  107.5       537.6       1,075.1

count() time (ms)
  cached raw                  90          130         178
  cached serialized           610         1,802       3,448
  before caching              3,220       27,063      105,618



Conclusions

Raw caching has a bigger in-memory footprint, about 2x-4x the data size (e.g. a 100MB RDD becomes roughly 370MB in memory)

Serialized caching consumes almost the same amount of memory as the underlying data (plus a small overhead)

Raw caching is very fast to process, and it scales pretty well as the data grows

Processing serialized cached data takes noticeably longer, since every access has to deserialize the records first

So what does all this mean?

For small data sets (a few hundred megabytes) we can use raw caching.  Even though this consumes more memory, the small size won't put too much pressure on Java garbage collection.

Raw caching is also a good fit for iterative workloads (say we are doing a bunch of passes over the same data), because the processing is very fast; a sketch of this follows below.
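
For instance, a smallish dataset cached raw and reused across several passes only pays the read-and-parse cost on the first action (the path and the computations here are placeholders, just to show the pattern):

// Placeholder path; .cache() == persist(StorageLevel.MEMORY_ONLY), i.e. raw caching
val data = sc.textFile("s3n://bucket_path/500M.data").cache()

// Several passes over the same data -- only the first action reads from S3
val totalChars  = data.map(_.length.toLong).reduce(_ + _)
val longestLine = data.map(_.length).max()
val errorLines  = data.filter(_.contains("ERROR")).count()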

For medium / large data sets (tens or hundreds of gigabytes) serialized caching is the better choice, because it does not consume nearly as much memory, and garbage collecting many gigabytes of heap can be taxing.
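
A sketch of the serialized setup (the path is a placeholder; switching to Kryo is optional, but it usually makes the serialized cache noticeably smaller than the default Java serializer does):

$ ./bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
> val big = sc.textFile("s3n://bucket_path/50G.data")    // placeholder path
> big.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
> big.count()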


Reposted from: https://my.oschina.net/duanfangwei/blog/535256
