Analysis of the AVRO File Structure

Latest recommended article published on 2023-12-13 23:05:08

pany8125


guibin.beijing@gmail.com

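After studying the AVRO specification, I drew a diagram that shows how the contents of a file are laid out. It is for reference only. A detailed explanation follows the diagram.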

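A binary file produced by standard AVRO serialization consists, overall, of three parts: the file header (Header), data blocks (Data Block), and synchronization markers (Synchronization marker).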

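The file header is the large cyan box labeled Header.
The data block is the gray Data Block part directly below the file header.
The synchronization marker is the orange Synchronization marker part immediately below the data block.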

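AVRO uses synchronization markers to divide a large body of data into smaller blocks that are stored one after another in the same file. This makes concurrent processing easier: different threads can work on different data blocks at the same time without affecting each other. For this reason, the bottom-most data block in the diagram above may be followed by further data blocks and synchronization markers, depending on how much data the file holds.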
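To make the role of the synchronization marker more concrete, here is a minimal Python sketch. It assumes the block layout given in the AVRO specification (an object count and a byte size, both encoded as zigzag varint longs, then the serialized data, then the 16-byte synchronization marker). The helper names `write_long`, `read_long`, `make_block` and `blocks_from` are made up for this illustration, and the payloads are plain bytes rather than real AVRO records. The sketch shows how a reader dropped at an arbitrary offset can scan forward to the next marker and pick up whole blocks from there.

```python
import os

SYNC_SIZE = 16  # AVRO sync markers are 16 bytes long


def write_long(n: int) -> bytes:
    """Encode an int as an AVRO long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(buf: bytes, pos: int):
    """Decode an AVRO long at pos; return (value, next position)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def make_block(payload: bytes, count: int, sync: bytes) -> bytes:
    """Hypothetical helper: count, size, payload, then the sync marker."""
    return write_long(count) + write_long(len(payload)) + payload + sync


def blocks_from(buf: bytes, offset: int, sync: bytes):
    """Yield (count, payload) for each block after the first sync marker
    found at or after offset."""
    hit = buf.find(sync, offset)
    if hit < 0:
        return
    pos = hit + SYNC_SIZE
    while pos < len(buf):
        count, pos = read_long(buf, pos)
        size, pos = read_long(buf, pos)
        payload = buf[pos:pos + size]
        pos += size
        if buf[pos:pos + SYNC_SIZE] != sync:
            raise ValueError("sync marker mismatch: corrupt block")
        pos += SYNC_SIZE
        yield count, payload


# The leading marker stands in for the one that ends the file header.
sync = os.urandom(SYNC_SIZE)
data = sync + b"".join(
    make_block(p, 1, sync) for p in (b"alpha", b"bravo", b"charlie")
)

all_blocks = list(blocks_from(data, 0, sync))
# Start in the middle of the first block: reading resumes at the second one.
tail_blocks = list(blocks_from(data, SYNC_SIZE + 3, sync))
print(all_blocks)
print(tail_blocks)
```

A real split-aware reader would also stop once it passes the end of its assigned byte range; that part is left out here to keep the sketch short.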

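The AVRO file header consists of three parts, as shown in the diagram above.

The header begins with four bytes: 'O', 'b', 'j', followed by the byte value 1. These four bytes are usually called the magic.
Immediately after the magic comes the AVRO metadata (Meta Data), a map that includes the schema.
The header ends with the 16-byte synchronization marker.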
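The three header parts listed above can be sketched in Python as follows. This is an illustration under assumptions, not a replacement for a real AVRO library: it relies on the specification's encoding of the metadata as a map from string to bytes (with the keys `avro.schema` and `avro.codec`), and the function names `build_header` and `parse_header` as well as the `User` schema are hypothetical.

```python
import json
import os

MAGIC = b"Obj\x01"  # 'O', 'b', 'j', then the byte 1
SYNC_SIZE = 16


def write_long(n: int) -> bytes:
    """Encode an int as an AVRO long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(buf: bytes, pos: int):
    """Decode an AVRO long at pos; return (value, next position)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def write_bytes(b: bytes) -> bytes:
    """Length-prefixed bytes, also used for UTF-8 strings."""
    return write_long(len(b)) + b


def build_header(schema: dict, codec: str, sync: bytes) -> bytes:
    """Hypothetical helper: magic, metadata map, sync marker."""
    meta = {
        "avro.schema": json.dumps(schema).encode("utf-8"),
        "avro.codec": codec.encode("utf-8"),
    }
    out = bytearray(MAGIC)
    out += write_long(len(meta))  # one map block holding all entries
    for key, value in meta.items():
        out += write_bytes(key.encode("utf-8")) + write_bytes(value)
    out += write_long(0)          # a zero count ends the map
    out += sync
    return bytes(out)


def parse_header(buf: bytes):
    """Return (metadata dict, sync marker, offset of the first data block)."""
    if buf[:4] != MAGIC:
        raise ValueError("not an AVRO object container file")
    pos = 4
    meta = {}
    while True:
        count, pos = read_long(buf, pos)
        if count == 0:
            break
        if count < 0:             # negative count: a byte size follows
            count = -count
            _, pos = read_long(buf, pos)
        for _ in range(count):
            klen, pos = read_long(buf, pos)
            key = buf[pos:pos + klen].decode("utf-8")
            pos += klen
            vlen, pos = read_long(buf, pos)
            meta[key] = buf[pos:pos + vlen]
            pos += vlen
    sync = buf[pos:pos + SYNC_SIZE]
    return meta, sync, pos + SYNC_SIZE


schema = {"type": "record", "name": "User",
          "fields": [{"name": "id", "type": "long"}]}
sync = os.urandom(SYNC_SIZE)
header = build_header(schema, "null", sync)
meta, parsed_sync, data_start = parse_header(header)
print(json.loads(meta["avro.schema"])["name"], meta["avro.codec"])
```

The value returned as `data_start` is where the first data block would begin, and `parsed_sync` is the marker a reader would expect after every block in the same file.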

----------------------------------Question divider------------------------------

What is the "sync marker" used for in the Avro format?

I have been struggling with the "sync marker" part of Avro. The doc says it's used for splitting files, but I'm not sure what that really means. Some questions:

1 How does it use this part to split files? Does it scan the whole file, find such markers, and split there? If yes, wouldn't it be more efficient to just read the size of each data block, jump ahead to the next block, and do the same thing again?

2 When the data block is compressed, is the sync marker compressed too?

3 Why does it have multiple data blocks rather than putting everything into one single data block? Is it because the size of the block is of type long, which limits the length it can hold?

4 Is the data block a logical view? Will all the data blocks still be in a single file in the filesystem?

Thanks for any information on the points above.

Link: http://stackoverflow.com/questions/27360727/what-is-sync-marker-used-for-in-avro-format