The ReiserFS File System

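This article describes the on-disk structure of the ReiserFS file system. It may be incomplete in places; when in doubt, refer to the original file system documentation.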

Prerequisites

It is assumed that you know the partition layout, namely:

· Start of the partition in bytes

· Size of the partition in bytes

· That the partition is indeed a reiserfs partition

The following type definitions are used:

__u16 – 16-bit unsigned integer

__u32 – 32-bit unsigned integer

__u64 – 64-bit unsigned integer

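Superblock

The superblock holds the root information of a ReiserFS file system. It is stored at a fixed offset from the start of the partition: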

REISERFS_DISK_OFFSET_IN_BYTES = (64*1024) = 65536

The superblock is defined in the file reiserfs_fs_sb.h and has the following structure:

struct reiserfs_super_block
{
    __u32 s_block_count;
    __u32 s_free_blocks;   /* number of free blocks */
    __u32 s_root_block;    /* root block number */
    __u32 s_journal_block; /* journal block number */
    __u32 s_journal_dev;   /* journal device number */

    /* Since the journal size is currently a #define in a header file, if
    ** someone creates a disk with a 16M journal and moves it to a system
    ** whose default journal size is 32M, the journal will overflow when
    ** the disk is mounted. s_orig_journal_size, plus some checks while
    ** mounting (inside journal_init), prevents that from happening.
    */
    __u32 s_orig_journal_size;
    __u32 s_journal_trans_max ;      /* max number of blocks in a transaction */
    __u32 s_journal_block_count ;    /* total size of the journal. can change over time */
    __u32 s_journal_max_batch ;      /* max number of blocks to batch into a trans */
    __u32 s_journal_max_commit_age ; /* in seconds, maximum age of a commit */
    __u32 s_journal_max_trans_age ;  /* in seconds, maximum age of a transaction */
    __u16 s_blocksize;               /* block size */
    __u16 s_oid_maxsize;             /* max size of object id array, see get_objectid() commentary */
    __u16 s_oid_cursize;             /* current size of object id array */
    __u16 s_state;                   /* valid or error */
    char s_magic[12];                /* reiserfs magic string, indicates that the file system is reiserfs */
    __u32 s_hash_function_code;      /* indicates which hash function is used to sort names in a directory */
    __u16 s_tree_height;             /* height of disk tree */
    __u16 s_bmap_nr;                 /* amount of bitmap blocks needed to address each block of the file system */
    __u16 s_version;                 /* I'd prefer it if this was a string,
                                        something like "3.6.4", and maybe
                                        16 bytes long mostly unused. We
                                        don't need to save bytes in the
                                        superblock. */
    __u16 s_reserved;
    __u32 s_inode_generation;
    char s_unused[124] ;             /* zero filled by mkreiserfs */
} __attribute__ ((__packed__));

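("__attribute__ ((__packed__))" means that the structure is aligned on 1-byte boundaries, i.e. it contains no padding.)

The s_blocksize field is very important. All ReiserFS data is organized in blocks, and every block has this size (in bytes). Whenever this article mentions "blocks", it means blocks of this size, not whatever your disk information says a blocksize is. In addition, REISERFS_BLOCKSIZE is assumed to be this size.

Note: s_magic should contain "ReIsErFs" or "ReIsEr2Fs".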
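As a rough illustration of the layout above, the following sketch checks a raw superblock buffer for one of the two magic strings and reads s_blocksize. It assumes that the buffer holds the bytes found at REISERFS_DISK_OFFSET_IN_BYTES, that on-disk integers are little-endian, and that the fields are packed exactly as in the structure above (which puts s_blocksize at byte 44 and s_magic at byte 52). The names read_le16 and check_superblock are made up for this example.

```c
#include <stdint.h>
#include <string.h>

#define REISERFS_DISK_OFFSET_IN_BYTES (64 * 1024)
#define SB_BLOCKSIZE_OFFSET 44 /* eleven __u32 fields precede s_blocksize */
#define SB_MAGIC_OFFSET     52 /* four __u16 fields follow, then s_magic  */

/* Read a 16-bit little-endian value (assumed on-disk byte order). */
uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns 1 and stores the block size if sb carries a known magic
 * string, 0 otherwise. */
int check_superblock(const uint8_t *sb, uint16_t *blocksize)
{
    const uint8_t *magic = sb + SB_MAGIC_OFFSET;

    if (memcmp(magic, "ReIsErFs", 8) != 0 &&
        memcmp(magic, "ReIsEr2Fs", 9) != 0)
        return 0;

    *blocksize = read_le16(sb + SB_BLOCKSIZE_OFFSET);
    return 1;
}
```

A caller would read at least 204 bytes (the size of the packed structure) from the partition at offset 65536 and pass them to check_superblock.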

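Used/Free Block Bitmap

The file system needs to know which blocks are free and which are in use. Free blocks are usually unformatted, i.e. the block header structure described below is missing or invalid. For this reason, ReiserFS uses a bitmap. It consists of a series of blocks, where each byte represents 8 free/used bits. Take, for example, the following bytes: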

FF FF FF C7

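0xC7 is 11000111 in binary. Reading each byte starting from its least significant bit, this means that the first 8+8+8+3 blocks are used ("1"), followed by three free blocks ("0"), followed by another two used blocks ("1"). Note: the bit order may differ depending on the endianness of the system. On a Linux system, use the command:

debugreiserfs -b <name of the reiserfs partition>

to see exactly in which order reiserfs interprets the bitmap.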
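The example can be turned into a small lookup helper. This is only a sketch: it assumes the bit order used in the example above (bit 0 of every byte describes the lowest-numbered block), and the function name bitmap_block_is_used is invented here.

```c
#include <stdint.h>

/* Returns 1 if the given block is marked as used in the bitmap bytes,
 * assuming bit 0 of each byte describes the lowest-numbered block. */
int bitmap_block_is_used(const uint8_t *bitmap, uint32_t block)
{
    return (bitmap[block / 8] >> (block % 8)) & 1;
}
```

With the bytes FF FF FF C7, blocks 0 to 26 come out as used, blocks 27 to 29 as free, and blocks 30 and 31 as used, which matches the description above.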

How many bitmap blocks are there?

You can find out how many bitmap blocks by looking at the superblock structure

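reiserfs_super_block.s_bmap_nr – total number of bitmap blocks

Finding the first bitmap block

The first bitmap block immediately follows the superblock, in the block after it. Its position (in bytes) is therefore: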

REISERFS_DISK_OFFSET_IN_BYTES + REISERFS_BLOCKSIZE

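Finding the next bitmap block

The remaining bitmap blocks are located with the following formula:

REISERFS_BLOCKSIZE * REISERFS_BLOCKSIZE * 8 * n

where n is greater than or equal to 1. For example, with a block size of 4096, the second bitmap block is located at offset 4096*4096*8*1 = 134217728 (in bytes), which is block 32768.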
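Both rules can be combined into one helper. The sketch below uses its own convention that n = 0 denotes the first bitmap block (the one right after the superblock) and n >= 1 the following ones; the function name is hypothetical.

```c
#include <stdint.h>

#define REISERFS_DISK_OFFSET_IN_BYTES (64 * 1024)

/* Byte offset of bitmap block n from the start of the partition
 * (n = 0 is the first bitmap block, right after the superblock). */
uint64_t bitmap_block_offset(uint32_t blocksize, uint32_t n)
{
    if (n == 0)
        return (uint64_t)REISERFS_DISK_OFFSET_IN_BYTES + blocksize;
    return (uint64_t)blocksize * blocksize * 8 * n;
}
```

The 64-bit result matters: with a block size of 4096, every bitmap block after the first covers 128 MiB, so the offsets exceed 32 bits after a few dozen bitmap blocks.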

b-tree

Note: this section is only an informal introduction. For a more technical treatment, try searching for "btree".

A btree is a linked structure that looks like a regular tree. There is a parent block, and it has child blocks, and they in turn have child blocks and so on.

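The difference from a "normal" tree is that each block in the tree is identified by a key. A key can be anything that is unique and can be compared with other keys; in our example it is an integer, in ReiserFS it is a combination of four integers. What is important is that you have a defined sort ordering for the keys: no two keys are identical, but any two keys can be compared and sorted.

Example: assume the keys are decimal numbers that compare like ordinary integers. The root block has the following keys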

100 200

This means, it needs three child blocks, one for keys in the decimal range 0..99, one for 100..199 and one for 200..max. Let's look at an imaginary set of child blocks:

0 23 87 105 130 170 175 180 205 209 280

The first three elements belong to the range below key 100, the next five to the range between keys 100 and 200, and the last three are for anything from 200 upward. You may have noticed that not all keys are used: in the example only three keys are used for the range 0..99. This type of ordering goes on for more levels (I think the default for reiserfs is four tree levels).

Leaf nodes are nodes that do not have any children. In ReiserFS, their data format is different from internal nodes (see section on "Leaf Nodes" below)

How to find an item by its key

Again, no algorithm here, just an informal note. You start from the top node. You look for the section that is larger than or equal to the desired key. If you got that, you go to its child node, and so on, until you reach a leaf node.
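To make the descent step a little more concrete, here is a minimal sketch for the integer example above. It picks the child to follow inside one node by counting how many of the node's keys are less than or equal to the wanted key. The function name child_index is an assumption for this example, and real ReiserFS keys would use a four-part comparison instead of plain integers.

```c
/* Given the sorted keys of one node, return the index of the child
 * to descend into: the number of keys that are <= wanted. */
int child_index(const int *keys, int nkeys, int wanted)
{
    int i = 0;

    while (i < nkeys && keys[i] <= wanted)
        i++;
    return i;
}
```

With the root keys 100 and 200, key 87 leads to child 0, keys 100 and 175 lead to child 1, and key 280 leads to child 2.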

Keys used by ReiserFS

注: This section is a bit lacking, since I don't quite understand the concept used fully.

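There are two types of keys, each with four elements:

( directory-id, object-id, offset, type )

Each element is an integer. The main difference is that in "normal" (version 1) keys every element is a 32-bit integer, whereas in version 2 keys the "offset" is a 60-bit number and the type is a 4-bit number.

The key structure is defined as follows: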

struct offset_v1 {

__u32 k_offset;
__u32 k_uniqueness;

};

struct offset_v2 {

__u64 k_offset: 60;
__u64 k_type: 4;

};

struct key {

__u32 k_dir_id; /* packing locality: by default parent directory object id */
__u32 k_objectid; /* object identifier */

    union {
        struct offset_v1 k_offset_v1;
        struct offset_v2 k_offset_v2;
    } u;

} ;

So, interpreted as version 1, you have four 32-bit integers:

( k_dir_id, k_objectid, u.k_offset_v1.k_offset, u.k_offset_v1.k_uniqueness )

Interpreted as version 2, you have four integers like this:

( k_dir_id, k_objectid, u.k_offset_v2.k_offset, u.k_offset_v2.k_type )

It is best practice to expand this key to a "cpu key" in memory, that looks something like this (my definition):

typedef struct reiserfs_cpu_key

{

    __u32 k_dir_id;
    __u32 k_objectid;
    __u64 k_offset;
    __u32 k_type;

} REISERFS_CPU_KEY, *LPREISERFS_CPU_KEY;

This way it is easier to combine keys of the two types.
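Because the layout of C bitfields depends on the compiler, it can be safer to split the 64-bit word of a version 2 key by hand. The sketch below assumes that the word has already been converted to host byte order and that the 4-bit type sits in the most significant bits, with the 60-bit offset below it. Both are assumptions of this example rather than something stated above, and the struct and function names are illustrative.

```c
#include <stdint.h>

/* In-memory key, mirroring the reiserfs_cpu_key definition above. */
struct cpu_key {
    uint32_t k_dir_id;
    uint32_t k_objectid;
    uint64_t k_offset;
    uint32_t k_type;
};

/* Split the 64-bit word of a version 2 key into offset and type.
 * Assumes the type occupies the four most significant bits. */
void split_offset_v2(uint64_t word, struct cpu_key *out)
{
    out->k_offset = word & 0x0FFFFFFFFFFFFFFFULL;
    out->k_type   = (uint32_t)(word >> 60);
}
```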

How to tell the key types apart

Well, I have to admit, I don't know for sure. It seems to work like this: normally, all keys are of type 1. However, in item headers the ih_version field can mark a key as being of type 2. I will explain more on this in the section on "Leaf Nodes" below, but for now just think that most keys on-disk are of type 1, and in memory should always be expanded to a cpu-key.

How to compare keys

You need to be able to compare keys to search through the btree. Well, think of the keys as an array of integers. Check each integer in order, using the normal "less-than" relation.

Here is a C function that compares two cpu keys, which should make this clear:

int CompareCpuKeys(REISERFS_CPU_KEY* a, REISERFS_CPU_KEY *b)
{
    // compare 1. integer
    if( a->k_dir_id < b->k_dir_id )
        return -1;
    if( a->k_dir_id > b->k_dir_id )
        return 1;

    // compare 2. integer
    if( a->k_objectid < b->k_objectid )
        return -1;
    if( a->k_objectid > b->k_objectid )
        return 1;

    // compare 3. integer
    if( a->k_offset < b->k_offset )
        return -1;
    if( a->k_offset > b->k_offset )
        return 1;

    // compare 4. integer
    if( a->k_type < b->k_type )
        return -1;
    if( a->k_type > b->k_type )
        return 1;

    return 0;
}

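Blocks

Access to ReiserFS data on disk is block based. There is no relation between reiserfs blocks and similar concepts such as "head/sector/cylinder". The block size of a reiserfs partition is set in the superblock.

There are three block types:

· Formatted internal blocks: this block type is used for btree data. It contains keys and the associated disk blocks. It is described in detail in the section "Internal nodes" below.

· Formatted leaf blocks: this block type is used for file information. It is described in detail in the section "Leaf nodes" below.

· Unformatted blocks: an unformatted block contains only raw file data. You cannot strictly recognize an unformatted block by looking at it; it is not used by the btree structure of reiserfs.

Strictly speaking, there is one more block type, the superblock, but that is a special case (see above).

Standard block header

Every formatted block has a block header. It is defined as follows: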

struct block_head {

    __u16 blk_level;        /* level of the block in the tree. 1: leaf; 2 and higher: internal */
    __u16 blk_nr_item;      /* number of keys/items in the block */
    __u16 blk_free_space;   /* free space in the block, in bytes */
    __u16 blk_reserved;     /* dump this in v4/planA */
    struct key blk_right_delim_key; /* only used for leaf nodes */

};

Check blk_level to see whether a formatted block is a leaf node.
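A minimal sketch of that check, reading blk_level directly from the first two bytes of the block. It assumes little-endian on-disk integers; the enum and function names are invented for this example. Keep in mind that it only makes sense for blocks reached through the tree, since an unformatted block can contain arbitrary bytes.

```c
#include <stdint.h>

enum block_kind { BLOCK_INVALID = 0, BLOCK_LEAF = 1, BLOCK_INTERNAL = 2 };

/* Classify a formatted block from the first two bytes of its header
 * (blk_level, assumed to be little-endian on disk). */
enum block_kind classify_block(const uint8_t *block)
{
    uint16_t level = (uint16_t)(block[0] | (block[1] << 8));

    if (level == 1)
        return BLOCK_LEAF;
    if (level > 1)
        return BLOCK_INTERNAL;
    return BLOCK_INVALID;
}
```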

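Internal nodes

An internal node is an element in the btree. It contains only pointers to its child nodes, together with the keys that separate them.

Block Header Key 1 Key 2 ... Key n Ptr. 1 Ptr. 2 ... Ptr. n+1 Free Space

The key array contains only standard reiserfs key structures (V1). The pointer array contains elements of the following structure: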

struct disk_child {

__u32       dc_block_number; /* Disk child's block number. */
__u16       dc_size;          /* Disk child's used space.   */
__u16       dc_reserved;

};

As you can see, this is the pointer to the child node. Note that there is one more disk_child than keys, as is familiar from the short introduction on btrees above.

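The blk_level field of the block header is greater than 1. The blk_nr_item field of the block header specifies the number of entries in the key array.

The following code enumerates all keys and their associated "disk-child" objects:

// assume bMemory is the block data and pH points to its block header
// (REISERFS_DISK_KEY is presumably this code's typedef for struct disk_child)

// get a pointer to the array of keys
LPBYTE lpbHeaderData = bMemory+sizeof(REISERFS_BLOCK_HEAD);

// get a pointer to the array of disk childs
LPBYTE lpbPointerData = bMemory+sizeof(REISERFS_BLOCK_HEAD)+(pH->blk_nr_item*sizeof(REISERFS_KEY));

// enumerate the arrays; i is declared outside the loop because it is used afterwards
int i;
for( i = 0; i < pH->blk_nr_item; i++ )
{
    REISERFS_KEY* key = (REISERFS_KEY*)(lpbHeaderData+i*sizeof(REISERFS_KEY));
    REISERFS_DISK_KEY* pointer = (REISERFS_DISK_KEY*) (lpbPointerData+i*sizeof(REISERFS_DISK_KEY));

    // TODO: add evaluation
}

// the last pointer (note: no key!)
REISERFS_DISK_KEY* lastPointer = (REISERFS_DISK_KEY*) (lpbPointerData+i*sizeof(REISERFS_DISK_KEY));

// TODO: add evaluation for last pointer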
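The pointer arithmetic in the code above can also be written with explicit sizes, which avoids depending on how a compiler pads the structures. The sizes below follow from the structure definitions in this article (block header 24 bytes, key 16 bytes, disk_child 8 bytes); the macro and function names are made up for this sketch.

```c
#include <stddef.h>

#define BLOCK_HEAD_SIZE 24 /* 4 * __u16 + 16-byte key           */
#define KEY_SIZE        16 /* 2 * __u32 + 8 bytes offset/type   */
#define DISK_CHILD_SIZE  8 /* __u32 + 2 * __u16                 */

/* Byte offset of key i (0 <= i < nr_item) inside an internal node. */
size_t internal_key_offset(size_t i)
{
    return BLOCK_HEAD_SIZE + i * KEY_SIZE;
}

/* Byte offset of child pointer i (0 <= i <= nr_item). */
size_t internal_child_offset(size_t nr_item, size_t i)
{
    return BLOCK_HEAD_SIZE + nr_item * KEY_SIZE + i * DISK_CHILD_SIZE;
}
```

For a node with two keys, the keys start at bytes 24 and 40, and the three child pointers at bytes 56, 64 and 72.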

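Leaf nodes

A leaf node has the following structure:

Block Header Header 1 Header 2 ... Header n Free Space Data 1 Data 2 ... Data n

The blk_level field of the block header is 1. The number of items is given in the blk_nr_item field of the block header.

Each item header uses the following structure: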

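struct item_head
{
    struct key ih_key; /* everything in the tree is found by searching for its key */

    union {
        __u16 ih_free_space_reserved; /* for an indirect item, the free space in the last unformatted node. For a direct item or stat data item it equals 0xFFFF. Note that it is the key, not this field, that determines the item type, and thus which field this union contains. */
        __u16 ih_entry_count; /* for a directory item, the number of directory entries in the item */
    } u;

    __u16 ih_item_len;           /* total size of the item body */
    __u16 ih_item_location;      /* offset of the item body within the block */
    __u16 ih_version;            /* 0: all old items, 2: new ones.
                                    The highest bit is set temporarily by fsck and cleared after all is done */
} ;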

The members ih_item_location and ih_item_len specify where the item is located (in the current block). When, in the sections below, the item types are described in detail, it is expected that you analyze the data located with these two members.

The following code enumerates all items of a leaf node.

// assume: bMemory is the data of the current block
// assume: pH is a pointer to the block header

// find the start of the item header array
LPBYTE lpbHeaderData = bMemory+sizeof(REISERFS_BLOCK_HEAD);

for( int i = 0; i < pH->blk_nr_item; i++ )
{
    // this is the item header
    LPREISERFS_ITEM_HEAD iH = (LPREISERFS_ITEM_HEAD)lpbHeaderData;

    // this is the item data
    LPBYTE lpbItemData = bMemory + iH->ih_item_location;
    DWORD dwItemSize = iH->ih_item_len;

    // TODO: add implementation

    // skip to next item
    lpbHeaderData += sizeof(REISERFS_ITEM_HEAD);
}
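When reading data from a possibly damaged disk, it may be worth validating ih_item_location and ih_item_len before dereferencing them. The following sketch is one possible sanity check, not something prescribed by ReiserFS itself: it assumes 24-byte block and item headers (as follows from the structures above), and the function name is hypothetical.

```c
#include <stdint.h>

#define BLOCK_HEAD_SIZE 24
#define ITEM_HEAD_SIZE  24 /* 16-byte key + 4 * __u16 */

/* Returns 1 if an item body lies completely inside the block and
 * does not overlap the block header or the item header array. */
int item_in_bounds(uint32_t blocksize, uint16_t nr_item,
                   uint16_t ih_item_location, uint16_t ih_item_len)
{
    uint32_t headers_end = BLOCK_HEAD_SIZE + (uint32_t)nr_item * ITEM_HEAD_SIZE;
    uint32_t body_end = (uint32_t)ih_item_location + ih_item_len;

    return ih_item_location >= headers_end && body_end <= blocksize;
}
```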

A leaf node can contain four types of items:

· Directory items: for directory file names and keys
· Stat items: for file information
· Direct items: content of small files that fits in a block
· Indirect items: array of block numbers for the content of larger files

They are described in more detail below. The type of each item is determined by the value of ih_key.k_type. Note that in ReiserFS this is the only place where V2 keys can appear. Use the following code to distinguish v1 and v2 keys:

// assume: iH is the pointer to the item header
// note: this code tests ih_version for the value 1, although the
// ih_version comment in struct item_head mentions 2 for new items

#define ITEM_VERSION_1 0
#define ITEM_VERSION_2 1

REISERFS_KEY* key = &(iH->ih_key);
REISERFS_CPU_KEY cpukey;
cpukey.k_dir_id = key->k_dir_id;
cpukey.k_objectid = key->k_objectid;

if( iH->ih_version == ITEM_VERSION_1 )
{
    cpukey.k_type = iH->ih_key.u.k_offset_v1.k_uniqueness;
    cpukey.k_offset = iH->ih_key.u.k_offset_v1.k_offset;
}
else if ( iH->ih_version == ITEM_VERSION_2 )
{
    cpukey.k_type = (int) iH->ih_key.u.k_offset_v2.k_type;
    cpukey.k_offset = iH->ih_key.u.k_offset_v2.k_offset;
}
else assert(false);

This code generates a proper cpu-key from the on-disk structure.

Directory items

A directory item is an item in a leaf node with a k_type of 500. It represents entries in a directory, that is, a directory listing. (It does not contain the file stats, though; you have to read them separately.) Its data looks like this:

Dir Entry 1 Dir Entry 2 ... Dir Entry N Filename N ... Filename 2 Filename 1

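Each directory entry uses the following structure:

struct reiserfs_de_head
{
    __u32 deh_offset;     /* third component of the directory entry key */
    __u32 deh_dir_id;     /* objectid of the parent directory of the object referenced by the directory entry */
    __u32 deh_objectid;   /* objectid of the object referenced by the directory entry */
    __u16 deh_location;   /* offset of the name within the whole item */
    __u16 deh_state;      /* whether 1) entry contains stat data (for future), and 2) whether entry is hidden (unlinked) */
} __attribute__ ((__packed__));

Filenames are not zero-terminated, so you have to calculate the length of each filename yourself. The first filename starts at the deh_location offset of the first reiserfs_de_head and ends at the end of the item data. The second filename starts at the deh_location offset of its head and ends at the deh_location offset of the first head, and so on. To find out how many reiserfs_de_head entries there are, you have to enumerate the heads until the start of a head equals the last known deh_location. (The ih_entry_count field of the item header also gives the number of entries.)

If this is confusing, here is sample code that takes an on-disk directory item and dumps its filenames: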

// given:
// SIZE_OF_BLOCK is the size of the item on disk.
// DATA_OF_BLOCK is the data of the directory item on disk. One extra byte must
// be allocated and the buffer must be zero-terminated for the code below to work.

int dh_offset = 0;             // offset of the current entry head
int dh_strpos = SIZE_OF_BLOCK; // start of the most recently seen filename

while( dh_offset < dh_strpos )
{
    // get a pointer to the current entry head
    REISERFS_DIRECTORY_HEAD* pDH = (REISERFS_DIRECTORY_HEAD*) (DATA_OF_BLOCK+dh_offset);

    // the filename starts at deh_location (relative to the item) and is zero-terminated automatically
    printf("name='%s'\n",DATA_OF_BLOCK+pDH->deh_location);

    // make the next string zero-terminated, too.
    (DATA_OF_BLOCK+pDH->deh_location)[0] = 0;

    // this is the max size, used for the loop ending criterion
    dh_strpos = pDH->deh_location;

    // increase array offset.
    dh_offset += sizeof(REISERFS_DIRECTORY_HEAD);
}

Here is an annotated hexdump of such a directory block

0000000: 01000000 | 05000000 | efa40100 | 66000400 ............f... <-- direntry #1
0000010: 02000000 | 04000000 | 05000000 | 64000400 ............d... <-- direntry #2
0000020: 805e2900 | efa40100 | f2a40100 | 61000400 .^).........a... <-- direntry #3
0000030: 8032e44f | efa40100 | f0a40100 | 58000400 .2.O........X... <-- direntry #4
0000040: 80eeb365 | efa40100 | f1a40100 | 50000400 ...e........P... <-- direntry #5
0000050: 70726f66 | 696c6573 | 64656663 | 6f6e6669 profilesdefconfi
0000060: 67746d70 | 2e2e2e gtmp...

You can see that the filenames are, indeed, not zero-terminated. When properly analyzed, this structure reads:

deh_offset  deh_dir_id  deh_objectid  deh_location  deh_state  filename
1           5           107759        102           4          '.'
2           4           5             100           4          '..'
2711168     107759      107762        97            4          'tmp'
1340355200  107759      107760        88            4          'defconfig'
1706290816  107759      107761        80            4          'profiles'

You can see that the current directory (".") has an object id 107759, which is being used as the directory id for the dependent files.
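The same names can be extracted without modifying the buffer by computing each name's length from the neighbouring deh_location values. The sketch below assumes 16-byte entry heads with deh_location at byte 12 (as in the structure and hexdump above) and little-endian integers. It also trims trailing zero bytes on the assumption that they are padding, which the hexdump above does not show. The function name dir_entry_name is invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define DE_HEAD_SIZE 16 /* 3 * __u32 + 2 * __u16 */

/* Locate the name of directory entry `index` inside a directory item.
 * Returns the name length and stores a pointer to the name, which is
 * not zero-terminated, in *name. Returns 0 on inconsistent offsets. */
size_t dir_entry_name(const uint8_t *item, size_t item_len,
                      size_t index, const char **name)
{
    const uint8_t *head = item + index * DE_HEAD_SIZE;
    size_t start = (size_t)(head[12] | (head[13] << 8)); /* deh_location */
    size_t end = item_len;

    /* Every name ends where the name of the previous entry starts. */
    if (index > 0) {
        const uint8_t *prev = head - DE_HEAD_SIZE;
        end = (size_t)(prev[12] | (prev[13] << 8));
    }

    if (start > end || end > item_len) {
        *name = NULL;
        return 0;
    }

    /* Assumption: trailing zero bytes, if any, are padding. */
    while (end > start && item[end - 1] == 0)
        end--;

    *name = (const char *)(item + start);
    return end - start;
}
```

Applied to the 103-byte item from the hexdump, entries 0 to 4 yield names of length 1, 2, 3, 9 and 8, matching the table above.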

Large Directories

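For large directories, the directory description is continued in the leaf node that is the right neighbour of the first one; only k_offset differs.

For example: Block 12041 is an internal node with the following pointers: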

[ 0 ] --> 18291
[ 1 ] --> 19011
[ 2 ] --> 20010

In addition, assume that Block 18291 contains the first part of the directory "/usr/include", specified by the key
( 131, 137, 1L, 500 )

Then, block 19011 will contain a directory description with the key
( 131, 137, < some large number here >, 500 )

that is the continuation of the directory listing.

Stat items

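A stat item resides in a leaf node and has a k_type of 0. It describes the stat details of files and directories. Each stat item is associated with one file or directory. There are two types of stat items; the main difference between them is their size.

Stat Version 1

This is the old stat data, 32 bytes long. Its structure is declared as follows: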

struct stat_data_v1
{
    __u16 sd_mode;      /* file type, permissions */
    __u16 sd_nlink;     /* number of hard links */
    __u16 sd_uid;       /* owner */
    __u16 sd_gid;       /* group */
    __u32 sd_size;      /* file size */
    __u32 sd_atime;     /* time of last access */
    __u32 sd_mtime;     /* time file was last modified */
    __u32 sd_ctime;     /* time inode (stat data) was last changed (except changes to sd_atime and sd_mtime) */

    union {
        __u32 sd_rdev;
        __u32 sd_blocks;/* number of blocks file uses */
    } u;

    __u32 sd_first_direct_byte;

};

Stat Version 2

This is the one used by newer versions of the ReiserFS file system. The main difference is support for large files (filesize is now 64 bit). The size is 44 bytes.

struct stat_data {
    __u16 sd_mode;      /* file type, permissions */
    __u16 sd_reserved;
    __u32 sd_nlink;     /* number of hard links */
    __u64 sd_size;      /* file size */
    __u32 sd_uid;               /* owner */
    __u32 sd_gid;               /* group */
    __u32 sd_atime;     /* time of last access */
    __u32 sd_mtime;     /* time file was last modified */
    __u32 sd_ctime;     /* time inode (stat data) was last changed (except changes to sd_atime and sd_mtime) */
    __u32 sd_blocks;

    union {
        __u32 sd_rdev;
        __u32 sd_generation;
      //__u32 sd_first_direct_byte;

        /* First byte of the file that is stored in a direct item. If it is 1, it is a symlink; if it is ~(__u32)0, there is no direct item. To be replaced by a macro based on sd_size and our tail suppression policy. */

    } u;

} ;

Two comments on structure members:
· The sd_atime/sd_mtime/sd_ctime fields can be interpreted with the standard C library function localtime()
· The sd_mode permissions (Unix terminology) field contains the file attributes (Windows terminology).
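As a small illustration of these two comments, the sketch below tests the file type bits of sd_mode and formats one of the timestamps. It assumes that sd_mode uses the usual Unix mode encoding and that the time fields count seconds since 1970-01-01 UTC. It uses gmtime() rather than localtime() only so that the result does not depend on the local time zone. The function names are hypothetical.

```c
#include <stdint.h>
#include <time.h>

/* Returns 1 if the Unix file type bits of sd_mode denote a directory. */
int sd_mode_is_directory(uint16_t sd_mode)
{
    return (sd_mode & 0170000) == 0040000;
}

/* Format a 32-bit timestamp (seconds since 1970-01-01 UTC) into buf.
 * Returns the number of characters written, or 0 on failure. */
size_t format_sd_time(uint32_t sd_time, char *buf, size_t buflen)
{
    time_t t = (time_t)sd_time;
    struct tm *tm = gmtime(&t);

    if (tm == NULL)
        return 0;
    return strftime(buf, buflen, "%Y-%m-%d %H:%M:%S", tm);
}
```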

Direct Items

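A direct item resides in a leaf node and has a k_type of -1. For small files, it contains the entire binary content of the file. For large files, it may exist (but does not have to); if it does, it contains the tail of the file.

Indirect Items

An indirect item resides in a leaf node and has a k_type of -2. For large files, it contains an array of ULONG values, which are the numbers of the blocks that hold the file content. You can think of them as an index of the file. For example, an indirect item might look like this:

16482 16483 16484

This means that the unformatted blocks 16482, 16483 and 16484, in that order, contain the data of the file.

Note: the last block may not be fully used. You have to determine the file size first to read the data correctly. For example, with a block size of 4096 and a file size of 12232 (=4096+4096+4040) bytes, only 4040 bytes must be read from the block named by the last index entry.

Note: there may be an additional direct item that holds data following the file data in the indirect blocks.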
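The rule for the last block can be written as a small helper. This sketch assumes that the whole file is stored in the unformatted blocks of the indirect item, i.e. that there is no additional direct item holding a tail; the function name is made up for this example.

```c
#include <stdint.h>

/* Number of file bytes stored in the block at position `index` of an
 * indirect item, assuming the whole file lives in unformatted blocks. */
uint32_t bytes_in_indirect_block(uint64_t file_size, uint32_t blocksize,
                                 uint32_t index)
{
    uint64_t start = (uint64_t)index * blocksize;

    if (start >= file_size)
        return 0;
    if (file_size - start >= blocksize)
        return blocksize;
    return (uint32_t)(file_size - start);
}
```

For the 12232-byte file above with a block size of 4096, the three blocks contribute 4096, 4096 and 4040 bytes.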