Databricks IO (DBIO) cache


Databricks IO Cache

  • The Databricks IO cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then served locally, which significantly improves read speed.
  • The Databricks IO cache supports reading Parquet files from DBFS, Amazon S3, HDFS, Azure Blob Storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2 (on Databricks Runtime 5.1 and above). It does not support other storage formats such as CSV, JSON, and ORC.

Databricks IO and RDD cache comparison

There are two types of cache available in Databricks:

  • Databricks IO (DBIO) cache
  • Apache Spark RDD (RDD) cache

You can use the DBIO cache and the RDD cache at the same time. This section outlines the key differences between them so that you can choose the best tool for your workflow.

Type of stored data

The DBIO cache contains local copies of remote data. It can improve the performance of a wide range of queries, but cannot be used to store results of arbitrary subqueries.

The RDD cache can store the result of any subquery, as well as data stored in formats other than Parquet (such as CSV, JSON, and ORC).
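The rule described above can be written down as a small decision helper. This is only an illustrative sketch in plain Python: the function name `applicable_caches` and its parameters are made up for this example and are not part of any Databricks or Spark API.

```python
def applicable_caches(source_format: str, is_subquery_result: bool = False) -> set:
    """Return the caches that can hold the given data, per the rules in the text.

    Hypothetical helper: it only encodes the comparison described above.
    """
    # The RDD cache accepts any subquery result and any source format.
    caches = {"RDD"}
    # The DBIO cache only holds local copies of remote Parquet files,
    # never the result of an arbitrary subquery.
    if source_format.strip().lower() == "parquet" and not is_subquery_result:
        caches.add("DBIO")
    return caches


print(sorted(applicable_caches("parquet")))                           # ['DBIO', 'RDD']
print(sorted(applicable_caches("csv")))                               # ['RDD']
print(sorted(applicable_caches("parquet", is_subquery_result=True)))  # ['RDD']
```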

Performance

The data stored in the DBIO cache can be read and operated on faster than the data in the RDD cache. This is because the DBIO cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation.

Automatic vs manual control

When the DBIO cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and does not require any action. However, to preload data into the cache beforehand, you can use the CACHE command (see Cache a subset of the data).
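As a rough sketch of what enabling and preloading might look like from a notebook: the helpers below assume a `spark` session object is already available (as it is in a Databricks notebook), that the cache is toggled with the `spark.databricks.io.cache.enabled` setting, and that preloading uses the `CACHE SELECT ... FROM ... WHERE ...` statement. The helper names and the table and column names (`sales`, `id`, `amount`, `year`) are hypothetical.

```python
def enable_dbio_cache(spark) -> None:
    """Turn on the DBIO cache for the current session (assumed config key)."""
    spark.conf.set("spark.databricks.io.cache.enabled", "true")


def build_cache_select(table: str, columns: list, where: str = None) -> str:
    """Build a CACHE SELECT statement that preloads a subset of a table."""
    if not columns:
        raise ValueError("at least one column is required")
    statement = f"CACHE SELECT {', '.join(columns)} FROM {table}"
    if where:
        statement += f" WHERE {where}"
    return statement


def preload(spark, table: str, columns: list, where: str = None) -> None:
    """Run the preload statement on the given session."""
    spark.sql(build_cache_select(table, columns, where))


# Hypothetical table and column names, for illustration only.
print(build_cache_select("sales", ["id", "amount"], "year = 2019"))
# CACHE SELECT id, amount FROM sales WHERE year = 2019
```

In a notebook, the calls would be `enable_dbio_cache(spark)` followed by `preload(spark, "sales", ["id", "amount"], "year = 2019")`. Without the preload step, the same data would simply be cached the first time a query reads it.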

When you use the RDD cache, you must manually specify the tables and queries to cache.
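For contrast, a minimal sketch of the manual step the RDD cache requires. It relies on the standard DataFrame methods `cache()` and `count()`; the function name `cache_query_result` is made up for this example, and `df` is assumed to be any Spark DataFrame you have already built.

```python
def cache_query_result(df):
    """Explicitly mark a DataFrame for the RDD cache and materialize it.

    cache() is lazy, so an action such as count() is run to force
    the data to be computed and stored.
    """
    cached = df.cache()
    cached.count()
    return cached
```

A typical call might be `cached = cache_query_result(spark.table("sales").filter("year = 2019"))`, where the table name and filter are again hypothetical. Nothing is cached unless you ask for it, which is the key difference from the automatic behavior of the DBIO cache.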

Disk vs memory-based

The DBIO cache is stored entirely on the local disk, so that memory is not taken away from other operations within Spark. Due to the high read speeds of modern SSDs, the DBIO cache can be fully disk-resident without a negative impact on its performance. In contrast, the RDD cache uses memory.
