Inode Table
In a regular UNIX filesystem, the inode stores all the metadata pertaining to the file (time stamps, block maps, extended attributes, etc.), not the directory entry. To find the information associated with a file, one must traverse the directory files to find the directory entry associated with a file, then load the inode to find the metadata for that file. ext4 appears to cheat (for performance reasons) a little bit by storing a copy of the file type (normally stored in the inode) in the directory entry. (Compare all this to FAT, which stores all the file information directly in the directory entry, but does not support hard links and is in general more seek-happy than ext4 due to its simpler block allocator and extensive use of linked lists.)
The inode table is a linear array of struct ext4_inode. The table is sized to have enough blocks to store at least sb.s_inode_size * sb.s_inodes_per_group bytes. The number of the block group containing an inode can be calculated as (inode_number - 1) / sb.s_inodes_per_group, and the offset into the group's table is (inode_number - 1) % sb.s_inodes_per_group. There is no inode 0.
The inode checksum is calculated against the FS UUID, the inode number, and the inode structure itself.
The inode table entry is laid out in struct ext4_inode.
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le16 | i_mode | File mode. |
0x2 | __le16 | i_uid | Lower 16 bits of owner UID. |
0x4 | __le32 | i_size_lo | Lower 32 bits of size in bytes. |
0x8 | __le32 | i_atime | Last access time, in seconds since the epoch. However, if the EA_INODE inode flag is set, this inode stores an extended attribute value and this field contains the checksum of the value. |
0xC | __le32 | i_ctime | Last inode change time, in seconds since the epoch. However, if the EA_INODE inode flag is set, this inode stores an extended attribute value and this field contains the lower 32 bits of the attribute value's reference count. |
0x10 | __le32 | i_mtime | Last data modification time, in seconds since the epoch. However, if the EA_INODE inode flag is set, this inode stores an extended attribute value and this field contains the number of the inode that owns the extended attribute. |
0x14 | __le32 | i_dtime | Deletion time, in seconds since the epoch. |
0x18 | __le16 | i_gid | Lower 16 bits of GID. |
0x1A | __le16 | i_links_count | Hard link count. Normally, ext4 does not permit an inode to have more than 65,000 hard links. This applies to files as well as directories, which means that there cannot be more than 64,998 subdirectories in a directory (each subdirectory's '..' entry counts as a hard link, as does the '.' entry in the directory itself). With the DIR_NLINK feature enabled, ext4 supports more than 64,998 subdirectories by setting this field to 1 to indicate that the number of hard links is not known. |
0x1C | __le32 | i_blocks_lo | Lower 32 bits of "block" count. If the huge_file feature flag is not set on the filesystem, the file consumes i_blocks_lo 512-byte blocks on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in inode.i_flags, then the file consumes i_blocks_lo + (i_blocks_hi << 32) 512-byte blocks on disk. If huge_file is set and EXT4_HUGE_FILE_FL IS set in inode.i_flags, then this file consumes i_blocks_lo + (i_blocks_hi << 32) filesystem blocks on disk. |
0x20 | __le32 | i_flags | Inode flags. |
0x24 | 4 bytes | Union osd1 | |
0x28 | 60 bytes | i_block[EXT4_N_BLOCKS=15] | Block map or extent tree. See the section "The Contents of inode.i_block". |
0x64 | __le32 | i_generation | File version (for NFS). |
0x68 | __le32 | i_file_acl_lo | Lower 32 bits of extended attribute block. ACLs are of course one of many possible extended attributes; I think the name of this field is a result of the first use of extended attributes being for ACLs. |
0x6C | __le32 | i_size_high / i_dir_acl | Upper 32 bits of file/directory size. In ext2/3 this field was named i_dir_acl, though it was usually set to zero and never used. |
0x70 | __le32 | i_obso_faddr | (Obsolete) fragment address. |
0x74 | 12 bytes | Union osd2 | |
0x80 | __le16 | i_extra_isize | Size of this inode - 128. Alternately, the size of the extended inode fields beyond the original ext2 inode, including this field. |
0x82 | __le16 | i_checksum_hi | Upper 16 bits of the inode checksum. |
0x84 | __le32 | i_ctime_extra | Extra change time bits. This provides sub-second precision. See Inode Timestamps section. |
0x88 | __le32 | i_mtime_extra | Extra modification time bits. This provides sub-second precision. |
0x8C | __le32 | i_atime_extra | Extra access time bits. This provides sub-second precision. |
0x90 | __le32 | i_crtime | File creation time, in seconds since the epoch. |
0x94 | __le32 | i_crtime_extra | Extra file creation time bits. This provides sub-second precision. |
0x98 | __le32 | i_version_hi | Upper 32 bits for version number. |
0x9C | __le32 | i_projid | Project ID. |
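As a rough illustration of how the offsets in the table above might be used, the following Python sketch decodes the leading fixed-width fields of a raw inode record and applies the i_blocks rules from the i_blocks_lo row. The function names parse_inode_head and blocks_in_bytes are made up for this example, and i_blocks_hi, the two huge_file booleans, and the filesystem block size are simply taken as arguments rather than read from disk. It is a sketch under those assumptions, not a reference parser.

```python
import struct

# The first 0x24 bytes of the inode, in table order (all little-endian):
# i_mode, i_uid, i_size_lo, i_atime, i_ctime, i_mtime, i_dtime,
# i_gid, i_links_count, i_blocks_lo, i_flags
_HEAD = struct.Struct("<HHIIIIIHHII")


def parse_inode_head(raw):
    """Decode the leading fields of an inode record (hypothetical helper)."""
    (i_mode, i_uid, i_size_lo, i_atime, i_ctime, i_mtime, i_dtime,
     i_gid, i_links_count, i_blocks_lo, i_flags) = _HEAD.unpack_from(raw, 0)
    # i_size_high lives at offset 0x6C and supplies the upper 32 bits of the size.
    (i_size_high,) = struct.unpack_from("<I", raw, 0x6C)
    return {
        "i_mode": i_mode,
        "i_uid": i_uid,
        "i_gid": i_gid,
        "i_links_count": i_links_count,
        "i_blocks_lo": i_blocks_lo,
        "i_flags": i_flags,
        "i_atime": i_atime,
        "i_ctime": i_ctime,
        "i_mtime": i_mtime,
        "i_dtime": i_dtime,
        "size": i_size_lo | (i_size_high << 32),
    }


def blocks_in_bytes(i_blocks_lo, i_blocks_hi, huge_file_feature,
                    huge_file_flag, fs_block_size):
    """Bytes consumed on disk, following the three cases in the table."""
    if not huge_file_feature:
        # Without huge_file, only i_blocks_lo counts, in 512-byte units.
        return i_blocks_lo * 512
    count = i_blocks_lo + (i_blocks_hi << 32)
    # With EXT4_HUGE_FILE_FL set, the count is in filesystem blocks.
    unit = fs_block_size if huge_file_flag else 512
    return count * unit
```

The caller is expected to pass at least the first 128 bytes of the inode record; struct.unpack_from raises struct.error if the buffer is shorter than the fields it reads.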
Inode Size
In ext2 and ext3, the inode structure size was fixed at 128 bytes (EXT2_GOOD_OLD_INODE_SIZE) and each inode had a disk record size of 128 bytes. Starting with ext4, it is possible to allocate a larger on-disk inode at format time for all inodes in the filesystem to provide space beyond the end of the original ext2 inode. The on-disk inode record size is recorded in the superblock as s_inode_size. The number of bytes actually used by struct ext4_inode beyond the original 128-byte ext2 inode is recorded in the i_extra_isize field for each inode, which allows struct ext4_inode to grow for a new kernel without having to upgrade all of the on-disk inodes. Access to fields beyond EXT2_GOOD_OLD_INODE_SIZE should be verified to be within i_extra_isize. By default, ext4 inode records are 256 bytes, and (as of October 2013) the inode structure is 156 bytes (i_extra_isize = 28). The extra space between the end of the inode structure and the end of the inode record can be used to store extended attributes. Each inode record can be as large as the filesystem block size, though this is not terribly efficient.
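The bounds check described above can be sketched in Python as follows. The helper name field_is_present and its argument order are invented for this example; the rule it encodes is the one stated in the text, namely that a field past the first 128 bytes is only valid if it ends within 128 + i_extra_isize. Rejecting an i_extra_isize that runs past the inode record is an extra sanity check assumed here, not something this section prescribes.

```python
GOOD_OLD_INODE_SIZE = 128  # EXT2_GOOD_OLD_INODE_SIZE


def field_is_present(field_offset, field_size, s_inode_size, i_extra_isize):
    """Return True if a field at [offset, offset + size) may be read.

    Hypothetical helper: fields inside the original 128-byte inode are
    always present; later fields must end within 128 + i_extra_isize.
    """
    field_end = field_offset + field_size
    if field_end <= GOOD_OLD_INODE_SIZE:
        return True
    if s_inode_size <= GOOD_OLD_INODE_SIZE:
        # 128-byte inode records carry no extended fields at all.
        return False
    used = GOOD_OLD_INODE_SIZE + i_extra_isize
    if used > s_inode_size:
        # Assumed sanity check: the claimed structure overruns the record.
        return False
    return field_end <= used
```

With the October 2013 figures quoted above (256-byte records, i_extra_isize = 28), field_is_present(0x90, 4, 256, 28) accepts i_crtime, while field_is_present(0x9C, 4, 256, 28) rejects i_projid because it would end at byte 160.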
Finding an Inode
Each block group contains sb->s_inodes_per_group inodes. Because inode 0 is defined not to exist, this formula can be used to find the block group that an inode lives in: bg = (inode_num - 1) / sb->s_inodes_per_group. The particular inode can be found within the block group's inode table at index = (inode_num - 1) % sb->s_inodes_per_group. To get the byte address within the inode table, use offset = index * sb->s_inode_size.
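The three formulas above translate directly into a few lines of Python. The function name locate_inode is made up for this sketch, and the returned offset is relative to the start of the group's inode table; turning it into an absolute disk address requires the inode table location from the block group descriptor, which is outside the scope of this section.

```python
def locate_inode(inode_num, s_inodes_per_group, s_inode_size):
    """Return (block group, index in group table, byte offset in table).

    Hypothetical helper implementing the formulas from the text.
    """
    if inode_num < 1:
        # Inode 0 is defined not to exist.
        raise ValueError("inode numbers start at 1")
    bg, index = divmod(inode_num - 1, s_inodes_per_group)
    offset = index * s_inode_size
    return bg, index, offset
```

For example, with 8192 inodes per group and 256-byte inode records, inode 12 sits in group 0 at index 11 (byte offset 2816), and inode 8193 is the first entry of group 1.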
Inode Timestamps
Four timestamps are recorded in the lower 128 bytes of the inode structure: inode change time (ctime), access time (atime), data modification time (mtime), and deletion time (dtime). The four fields are 32-bit signed integers that represent seconds since the Unix epoch (1970-01-01 00:00:00 GMT), which means that the fields will overflow in January 2038. For inodes that are not linked from any directory but are still open (orphan inodes), the dtime field is overloaded for use with the orphan list. The superblock field s_last_orphan points to the first inode in the orphan list; dtime is then the number of the next orphaned inode, or zero if there are no more orphans.
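The orphan list described above is a singly linked list threaded through the dtime fields. A minimal Python sketch of walking it might look like the following, where walk_orphan_list and the read_dtime callback are hypothetical names: the callback stands in for whatever code loads an inode and returns its raw i_dtime value. The cycle check is a defensive assumption for corrupted images, not part of the on-disk format.

```python
def walk_orphan_list(s_last_orphan, read_dtime):
    """Yield orphan inode numbers in list order.

    s_last_orphan: head of the list from the superblock (0 means empty).
    read_dtime:    hypothetical callable mapping an inode number to the
                   raw i_dtime value stored in that inode.
    """
    seen = set()
    ino = s_last_orphan
    while ino != 0:
        if ino in seen:
            # Guard against a corrupted list that loops back on itself.
            raise ValueError("cycle in orphan list at inode %d" % ino)
        seen.add(ino)
        yield ino
        # For an orphan inode, dtime holds the next inode number, or zero.
        ino = read_dtime(ino)
```

For a quick experiment, the callback can be backed by a plain dictionary, e.g. list(walk_orphan_list(12, {12: 40, 40: 7, 7: 0}.__getitem__)).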
If the inode structure size sb->s_inode_size is larger than 128 bytes and the i_extra_isize field is large enough to encompass the respective i_[cma]time_extra field, the ctime, atime, and mtime inode fields are widened to 64 bits. Within this "extra" 32-bit field, the lower two bits are used to extend the 32-bit seconds field to be 34 bits wide; the upper 30 bits are used to provide nanosecond timestamp accuracy. Therefore, timestamps should not overflow until May 2446. dtime was not widened. There is also a fifth timestamp to record inode creation time (crtime); this field is 64 bits wide and decoded in the same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible through the regular stat() interface, though debugfs will report them.
We use the 32-bit signed time value plus (2^32 * (extra epoch bits)). In other words:
Extra epoch bits | MSB of 32-bit time | Adjustment for signed 32-bit to 64-bit tv_sec | Decoded 64-bit tv_sec | Valid time range |
---|---|---|---|---|
0 0 | 1 | 0 | -0x80000000 - -0x00000001 | 1901-12-13 to 1969-12-31 |
0 0 | 0 | 0 | 0x000000000 - 0x07fffffff | 1970-01-01 to 2038-01-19 |
0 1 | 1 | 0x100000000 | 0x080000000 - 0x0ffffffff | 2038-01-19 to 2106-02-07 |
0 1 | 0 | 0x100000000 | 0x100000000 - 0x17fffffff | 2106-02-07 to 2174-02-25 |
1 0 | 1 | 0x200000000 | 0x180000000 - 0x1ffffffff | 2174-02-25 to 2242-03-16 |
1 0 | 0 | 0x200000000 | 0x200000000 - 0x27fffffff | 2242-03-16 to 2310-04-04 |
1 1 | 1 | 0x300000000 | 0x280000000 - 0x2ffffffff | 2310-04-04 to 2378-04-22 |
1 1 | 0 | 0x300000000 | 0x300000000 - 0x37fffffff | 2378-04-22 to 2446-05-10 |
This is a somewhat odd encoding since there are effectively seven times as many positive values as negative values. There have also been long-standing bugs decoding and encoding dates beyond 2038, which don't seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels incorrectly use the extra epoch bits 1,1 for dates between 1901 and 1970. At some point the kernel will be fixed and e2fsck will fix this situation, assuming that it is run before 2310.
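The intended decoding (signed 32-bit seconds plus 2^32 times the extra epoch bits, with nanoseconds in the upper 30 bits of the extra field) can be sketched in Python as below. The function name decode_timestamp is invented for this example, and it implements the encoding as documented in the table above, not the buggy behavior of older kernels or e2fsprogs releases.

```python
def decode_timestamp(xtime_raw, extra_raw):
    """Decode a 32-bit time field and its *_extra companion.

    xtime_raw: the 32-bit seconds field as an unsigned integer
               (e.g. i_mtime as read from disk).
    extra_raw: the matching 32-bit *_extra field (e.g. i_mtime_extra).
    Returns (seconds since the epoch, nanoseconds).
    """
    # Reinterpret the 32-bit seconds field as a signed value.
    if xtime_raw & 0x80000000:
        seconds = xtime_raw - (1 << 32)
    else:
        seconds = xtime_raw
    epoch_bits = extra_raw & 0x3    # lower two bits extend the seconds
    nanoseconds = extra_raw >> 2    # upper 30 bits hold nanoseconds
    return seconds + (epoch_bits << 32), nanoseconds
```

Checking against the table: a raw seconds value of 0x80000000 with epoch bits 0 1 decodes to 0x080000000 (the start of the 2038-01-19 range), and 0x7fffffff with epoch bits 1 1 decodes to 0x37fffffff (the 2446-05-10 limit).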