HDFS random reads: the seek source code

zhuwentaolove

Categories: Big Data, Source Code Analysis

Article link: https://blog.csdn.net/zhuwentaolove/article/details/103417050


The default HDFS block size is 128 MB, and HDFS is designed for streaming through large files rather than for reading one line here and one line there. This is fine when you want to process the whole file, but it makes HDFS a poor fit for some applications, such as using an index to look up small records.

HBase, on the other hand, is great for this. If you want to read a small record, you read only that small record.

HBase uses HDFS as its backing store. So how does it provide efficient record-based access?

HBase loads the tables from HDFS into memory or onto local disk, so most reads do not go to HDFS. Mutations are stored first in an append-only journal. When the journal gets large, it is built into an "addendum" table. When there are too many addendum tables, they are all compacted into a brand new primary table. For reads, the journal is consulted first, then the addendum tables, and finally the primary table. This design means that a full HDFS block is written only when there is a full HDFS block's worth of changes.
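
The layering described above can be sketched as a toy in-memory model. This is only an illustration of the journal, addendum, and primary-table idea, not HBase's actual implementation or API: the class `TinyStore`, its methods, and the two limits are hypothetical names chosen for the example, and plain dictionaries stand in for on-disk tables.

```python
class TinyStore:
    """Toy model of the journal / addendum / primary-table layering.

    Hypothetical sketch: dicts stand in for tables, and the limits are
    entry counts rather than byte sizes.
    """

    def __init__(self, journal_limit=2, addendum_limit=2):
        self.journal_limit = journal_limit
        self.addendum_limit = addendum_limit
        self.journal = {}   # newest mutations
        self.addenda = []   # flushed journals, newest first
        self.primary = {}   # result of the last compaction

    def put(self, key, value):
        self.journal[key] = value
        if len(self.journal) >= self.journal_limit:
            self._flush()

    def _flush(self):
        # The full journal becomes a new addendum table.
        self.addenda.insert(0, self.journal)
        self.journal = {}
        if len(self.addenda) > self.addendum_limit:
            self._compact()

    def _compact(self):
        # Merge oldest first so that newer values overwrite older ones.
        merged = dict(self.primary)
        for table in reversed(self.addenda):
            merged.update(table)
        self.primary = merged
        self.addenda = []

    def get(self, key):
        # Read order: journal, then addenda (newest first), then primary.
        for table in [self.journal, *self.addenda, self.primary]:
            if key in table:
                return table[key]
        return None


store = TinyStore()
for i in range(7):
    store.put(f"row{i}", i)
store.put("row0", "updated")  # newer value shadows the compacted one
```

After these writes, `store.get("row0")` finds the newer value in an addendum table before it ever reaches the primary table, which still holds the old one until the next compaction.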

A more thorough description of this approach is in the Bigtable whitepaper.

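This seek mainly moves the pos cursor. If the target lies within the current block, pos is moved to the correct position. Otherwise, pos is set to the target position but blockEnd is set to -1. As a result, the real work of the seek is actually carried out later, inside the subsequent read.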
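
The lazy-seek behavior described above can be modeled in a few lines. This is a simplified sketch, not the Hadoop source: `LazySeekStream`, `block_opens`, and `_open_block` are hypothetical names, an in-memory byte string stands in for the file, and "opening a block" is reduced to computing its end offset. Only `pos` and `block_end` mirror the `pos` and `blockEnd` fields mentioned in the text.

```python
class LazySeekStream:
    """Toy model of a seek that defers block switching to read()."""

    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size
        self.pos = 0
        self.block_end = -1   # -1 means "no block is currently open"
        self.block_opens = 0  # counts how often read() had to open a block

    def seek(self, target):
        if target < 0 or target > len(self.data):
            raise ValueError("seek target out of range")
        # Outside the current block (or behind pos): invalidate the block.
        if not (self.pos <= target <= self.block_end):
            self.block_end = -1
        self.pos = target

    def _open_block(self):
        # Locate the block containing pos; this is the deferred seek work.
        start = (self.pos // self.block_size) * self.block_size
        self.block_end = min(start + self.block_size, len(self.data)) - 1
        self.block_opens += 1

    def read(self, n):
        out = bytearray()
        while n > 0 and self.pos < len(self.data):
            if self.pos > self.block_end:
                self._open_block()
            take = min(n, self.block_end - self.pos + 1)
            out += self.data[self.pos:self.pos + take]
            self.pos += take
            n -= take
        return bytes(out)


stream = LazySeekStream(b"abcdefghij", block_size=4)
first = stream.read(2)          # opens the first block
stream.seek(3)                  # forward, same block: nothing to reopen
same_block = stream.read(1)
stream.seek(8)                  # another block: only pos and block_end change
end_after_seek = stream.block_end
other_block = stream.read(2)    # the block is located here, not in seek()
```

In this model `seek` never touches the data. The second block is opened only when `read` notices that `pos` is past `block_end`, which is the same division of labor the paragraph describes.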
