An introduction to the ext4 on-disk layout and selected features

1.ext4 disk layout

First, the eMMC configuration used here: capacity 30253514752 bytes (about 28 GiB), block size 4096 bytes. Every layout below is based on this device.
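As a quick check on the numbers used throughout, the group count for this device can be derived from its size. This is a minimal sketch, assuming a group holds 8 * block_size blocks (one bitmap block's worth of bits) and that the first data block is 0, as is usual for 4 KiB blocks; the function name is made up for illustration.

```c
#include <stdint.h>

/* Number of block groups needed to cover a device.
 * Assumption: blocks_per_group = 8 * block_size (one bit per block in a
 * single bitmap block), and the last, partial group still counts. */
static uint64_t ext4_group_count(uint64_t device_bytes, uint32_t block_size)
{
    uint64_t blocks = device_bytes / block_size;
    uint64_t blocks_per_group = 8ULL * block_size;

    return (blocks + blocks_per_group - 1) / blocks_per_group;
}
```

For the 30253514752-byte eMMC with 4096-byte blocks this gives 7386112 blocks spread over 226 groups, the group count used in the sections below.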

2.sparse_super

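Effect: backup copies of the superblock and the GDT (group descriptor table) are kept only in groups 0 and 1 and in groups whose number is a power of 3, 5, or 7.

With 226 block groups and this feature enabled, the superblock and GDT appear only in groups 0 (the primary superblock), 1, 3, 5, 7, 9, 25, 27, 49, 81, and 125. Every copy other than the primary is a backup.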
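The rule above can be sketched as a small predicate. This is an illustration only, not the kernel's implementation; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if n equals base^k for some k >= 1. */
static bool is_power_of(uint32_t n, uint32_t base)
{
    if (n < base)
        return false;
    while (n % base == 0)
        n /= base;
    return n == 1;
}

/* True if the group holds a superblock/GDT copy under sparse_super:
 * groups 0 and 1, plus powers of 3, 5 and 7. */
static bool group_has_super(uint32_t group)
{
    if (group <= 1)
        return true;
    return is_power_of(group, 3) || is_power_of(group, 5) ||
           is_power_of(group, 7);
}

/* Count the groups that carry a copy in a filesystem of group_count groups. */
static uint32_t count_groups_with_super(uint32_t group_count)
{
    uint32_t n = 0;

    for (uint32_t g = 0; g < group_count; g++)
        if (group_has_super(g))
            n++;
    return n;
}
```

For 226 groups the predicate selects 11 groups: the primary in group 0 and ten backups.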

3.flex_bg

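With flex_bg, several block groups are combined into one logical block group, a flex_bg. The bitmap and inode-table areas of the first group in a flex_bg are enlarged so that they also hold the bitmaps and inode tables of the other groups in that flex_bg.

Flexible block groups serve two purposes:
(1) they cluster metadata together, which speeds up metadata loading;
(2) they keep large files as contiguous as possible on disk.

Even with flex_bg enabled, the redundant backups of the superblock and group descriptors still sit at the start of their block groups. The number of groups in a flex_bg is 2^ext4_super_block.s_log_groups_per_flex.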
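The mapping from a block group to its flex_bg follows directly from s_log_groups_per_flex. A minimal sketch with hypothetical helper names; the value 4 (16 groups per flex_bg) used in the note below is an assumption, not something stated in this article.

```c
#include <stdint.h>

/* Index of the flex_bg that a block group belongs to. */
static uint32_t flex_group_of(uint32_t group, uint8_t log_groups_per_flex)
{
    return group >> log_groups_per_flex;
}

/* First block group of that flex_bg, where the packed metadata lives. */
static uint32_t flex_first_group(uint32_t group, uint8_t log_groups_per_flex)
{
    return (group >> log_groups_per_flex) << log_groups_per_flex;
}

/* Number of flex_bgs needed to cover group_count block groups. */
static uint32_t flex_group_count(uint32_t group_count,
                                 uint8_t log_groups_per_flex)
{
    uint32_t per_flex = 1u << log_groups_per_flex;

    return (group_count + per_flex - 1) / per_flex;
}
```

Assuming s_log_groups_per_flex = 4, group 17 belongs to flex_bg 1, whose metadata sits in group 16, and the 226 groups of this device form 15 flex_bgs.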

4.meta_bg

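With a 4096-byte block, a block group is 128 MiB = 2^27 bytes, and a group descriptor is taken as 64 bytes. If an entire block group were filled with group descriptors, it could describe 2^27 / 2^6 = 2^21 groups, that is 2^21 * 128 MiB = 2^48 bytes = 256 TiB. In practice a whole group can never be devoted to descriptors, so the manageable size is smaller than this.

With meta_bg enabled, the group-descriptor limit goes away and the limit is set by block addressing instead.

ext4 currently supports at most 48-bit block addressing, so the largest volume has 2^48 blocks: 2^48 * 2^12 (4096) = 2^60 bytes = 1 EiB. Each group is 128 MiB = 2^27 bytes, so there are 2^60 / 2^27 = 2^33 groups.

The 64bit feature is not enabled here: the eMMC is small, so it is unnecessary.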
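The two limits worked out above can be reproduced in a few lines. This is only the arithmetic from the text, under the same simplification that a whole group could be filled with descriptors; the function names are hypothetical.

```c
#include <stdint.h>

/* Without meta_bg: every group descriptor must fit inside one block group.
 * Returns the largest volume size in bytes that one group full of
 * descriptors could describe. */
static uint64_t max_bytes_without_meta_bg(uint32_t block_size,
                                          uint32_t desc_size)
{
    /* 8 * block_size blocks per group, each block_size bytes. */
    uint64_t group_bytes = 8ULL * block_size * block_size;
    uint64_t max_groups = group_bytes / desc_size;

    return max_groups * group_bytes;
}

/* With meta_bg: limited by 48-bit block numbers instead.
 * Returns the number of block groups in a volume of 2^48 blocks. */
static uint64_t max_groups_48bit(uint32_t block_size)
{
    uint64_t max_blocks = 1ULL << 48;

    return max_blocks / (8ULL * block_size);
}
```

With 4096-byte blocks and 64-byte descriptors the first function returns 2^48 bytes (256 TiB) and the second returns 2^33 groups, matching the figures in the text.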

5.flex_bg+meta_bg

(Figure: block group layout with both flex_bg and meta_bg enabled.)

6.resize_inode

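Effect: when enabled, group descriptor blocks are reserved so that the filesystem can be grown later.

(Figure: layout map with resize_inode enabled, using flex_bg, ^64bit, uninit_bg, sparse_super, and other features.)
(Figure: layout map with resize_inode disabled, using the same features.)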

7.uninit_bg

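Effect: block groups are not fully initialized at format time. With the feature enabled, each group descriptor carries a checksum, and groups whose bitmaps and inode table have not been written are flagged as uninitialized. With the feature disabled, every group is fully initialized, so all unused blocks are explicitly marked as unused.
(Figure: layout with uninit_bg disabled.)

With the feature enabled, every group other than the ones holding flex_bg metadata is left uninitialized.
(Figure: layout with uninit_bg enabled.)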

8.lazy_xx_init

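1.lazy_journal_init: defaults to 0. When disabled, the entire journal area is initialized. When set to 1, only the journal superblock and commit records are initialized.

2.lazy_itable_init: only takes effect when uninit_bg is enabled, and is on by default. When set to 0, all unused inodes are initialized even if uninit_bg is enabled; with a large number of inodes this noticeably lengthens formatting.

(Figure: result of formatting with -E lazy_itable_init=0.)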

9.Superblock

The superblock needs little introduction: it holds the filesystem's most critical information.

struct ext4_super_block {
    /*00*/ __le32 s_inodes_count;      /* Inodes count: total inodes in the filesystem */
    __le32 s_blocks_count_lo;          /* Blocks count: total blocks in the filesystem (low 32 bits) */
    __le32 s_r_blocks_count_lo;        /* Reserved blocks count (low 32 bits) */
    __le32 s_free_blocks_count_lo;     /* Free blocks count (low 32 bits) */
    /*10*/ __le32 s_free_inodes_count; /* Free inodes count */
    __le32 s_first_data_block;         /* First Data Block */
    __le32 s_log_block_size;           /* Block size is 2 ^ (10 + s_log_block_size) bytes */
    __le32 s_log_cluster_size;         /* Allocation cluster size: 2 ^ (10 + s_log_cluster_size) bytes if bigalloc is enabled; otherwise must equal s_log_block_size */
    /*20*/ __le32 s_blocks_per_group;  /* # Blocks per group */
    __le32 s_clusters_per_group;       /* # Clusters per group if bigalloc is enabled; otherwise must equal s_blocks_per_group */
    __le32 s_inodes_per_group;         /* # Inodes per group */
    __le32 s_mtime;                    /* Mount time */
    /*30*/ __le32 s_wtime;             /* Write time */
    __le16 s_mnt_count;                /* Mount count */
    __le16 s_max_mnt_count;            /* Maximal mount count */
    __le16 s_magic;                    /* Magic signature; anything other than 0xEF53 indicates a problem */
    /*
	File system state. Valid values are:
	0x0001	Cleanly umounted
	0x0002	Errors detected
	0x0004	Orphans being recovered
	*/
    __le16 s_state;                    /* File system state */
    /*
    Behaviour when detecting errors. One of:
	1	Continue
	2	Remount read-only
	3	Panic
    */
    __le16 s_errors;                   /* Behaviour when detecting errors */
    __le16 s_minor_rev_level;          /* minor revision level */
    /*40*/ __le32 s_lastcheck;         /* time of last check */
    __le32 s_checkinterval;            /* max. time between checks */
    __le32 s_creator_os;               /* OS */
    __le32 s_rev_level;                /* Revision level, normally the dynamic revision */
    /*50*/ __le16 s_def_resuid;        /* Default uid for reserved blocks */
    __le16 s_def_resgid;               /* Default gid for reserved blocks */
    /*
     * These fields are for EXT4_DYNAMIC_REV superblocks only.
     *
     * Note: the difference between the compatible feature set and
     * the incompatible feature set is that if there is a bit set
     * in the incompatible feature set that the kernel doesn't
     * know about, it should refuse to mount the filesystem.
     *
     * e2fsck's requirements are more strict; if it doesn't know
     * about a feature in either the compatible or incompatible
     * feature set, it must abort and not try to meddle with
     * things it doesn't understand...
     */
    __le32 s_first_ino;                     /* First non-reserved inode, normally 11 (the lost+found directory) */
    __le16 s_inode_size;                    /* size of inode structure */
    __le16 s_block_group_nr;                /* block group # of this superblock */

	/*
	Compatible feature set flags. Kernel can still read/write this fs even if it doesn't understand a flag; e2fsck will
	not attempt to fix a filesystem with any unknown COMPAT flags. Any of:
	
	0x1	Directory preallocation (COMPAT_DIR_PREALLOC).
	
	0x2	"imagic inodes". Used by AFS to indicate inodes that are not linked into the directory namespace. Inodes marked with
	this flag will not be added to lost+found by e2fsck. (COMPAT_IMAGIC_INODES).
	
	0x4	Has a journal (COMPAT_HAS_JOURNAL).
	
	0x8	Supports extended attributes (COMPAT_EXT_ATTR).
	
	0x10 Has reserved GDT blocks for filesystem expansion. Requires RO_COMPAT_SPARSE_SUPER. (COMPAT_RESIZE_INODE).
	
	0x20 Has indexed directories. (COMPAT_DIR_INDEX).
	
	0x40 "Lazy BG". Not in Linux kernel, seems to have been for uninitialized block groups? (COMPAT_LAZY_BG).
	
	0x80 "Exclude inode". Intended for filesystem snapshot feature, but not used. (COMPAT_EXCLUDE_INODE).
	
	0x100 "Exclude bitmap". Seems to be used to indicate the presence of snapshot-related exclude bitmaps? Not defined in
	kernel or used in e2fsprogs. (COMPAT_EXCLUDE_BITMAP).
	
	0x200 Sparse Super Block, v2. If this flag is set, the SB field s_backup_bgs points to the two block groups that
	contain backup superblocks. (COMPAT_SPARSE_SUPER2).
	*/
	#define EXT4_FEATURE_COMPAT_DIR_PREALLOC 0x0001
	#define EXT4_FEATURE_COMPAT_IMAGIC_INODES 0x0002
	#define EXT4_FEATURE_COMPAT_HAS_JOURNAL 0x0004
	#define EXT4_FEATURE_COMPAT_EXT_ATTR 0x0008
	#define EXT4_FEATURE_COMPAT_RESIZE_INODE 0x0010
	#define EXT4_FEATURE_COMPAT_DIR_INDEX 0x0020
	#define EXT4_FEATURE_COMPAT_LAZY_BG 0x0040
	#define EXT4_FEATURE_COMPAT_EXCLUDE_INODE 0x0080
	#define EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP 0x0100
	#define EXT4_FEATURE_COMPAT_SPARSE_SUPER2 0x0200
    __le32 s_feature_compat;                /* compatible feature set, a bit mask */

	/*
	Incompatible feature set. If the kernel or e2fsck doesn't understand one of these bits, it will refuse to mount or
	attempt to repair the filesystem. Any of:
	
	0x1	Compression. Not implemented. (INCOMPAT_COMPRESSION).
	
	0x2	Directory entries record the file type. See ext4_dir_entry_2 below. (INCOMPAT_FILETYPE).
	
	0x4	Filesystem needs journal recovery. (INCOMPAT_RECOVER).
	
	0x8	Filesystem has a separate journal device. (INCOMPAT_JOURNAL_DEV).
	
	0x10	Meta block groups. See the earlier discussion of this feature. (INCOMPAT_META_BG).
	
	0x40	Files in this filesystem use extents. (INCOMPAT_EXTENTS).
	
	0x80	Enable a filesystem size over 2^32 blocks. (INCOMPAT_64BIT).
	
	0x100	Multiple mount protection. Prevent multiple hosts from mounting the filesystem concurrently by updating a
	reserved block periodically while mounted and checking this at mount time to determine if the filesystem is in use on
	another host. (INCOMPAT_MMP).
	
	0x200	Flexible block groups. See the earlier discussion of this feature. (INCOMPAT_FLEX_BG).
	
	0x400	Inodes can be used to store large extended attribute values (INCOMPAT_EA_INODE).
	
	0x1000	Data in directory entry. Allow additional data fields to be stored in each dirent, after struct ext4_dirent. The
	presence of extra data is indicated by flags in the high bits of ext4_dirent file type flags (above EXT4_FT_MAX). The
	flag EXT4_DIRENT_LUFID = 0x10 is used to store a 128-bit File Identifier for Lustre. The flag EXT4_DIRENT_IO64 = 0x20 is
	used to store the high word of 64-bit inode numbers. Feature still in development. (INCOMPAT_DIRDATA).
	
	0x2000	Metadata checksum seed is stored in the superblock. This feature enables the administrator to change the UUID of
	a metadata_csum filesystem while the filesystem is mounted; without it, the checksum definition requires all metadata
	blocks to be rewritten. (INCOMPAT_CSUM_SEED).
	
	0x4000	Large directory >2GB or 3-level htree. Prior to this feature, directories could not be larger than 4GiB and
	could not have an htree more than 2 levels deep. If this feature is enabled, directories can be larger than 4GiB and
	have a maximum htree depth of 3. (INCOMPAT_LARGEDIR).
	
	0x8000	Data in inode. Small files or directories are stored directly in the inode i_blocks and/or xattr space.
	(INCOMPAT_INLINE_DATA).
	
	0x10000	Encrypted inodes are present on the filesystem (INCOMPAT_ENCRYPT).
	*/
	#define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001
	#define EXT4_FEATURE_INCOMPAT_FILETYPE 0x0002
	#define EXT4_FEATURE_INCOMPAT_RECOVER 0x0004     /* Needs recovery */
	#define EXT4_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008 /* Journal device */
	#define EXT4_FEATURE_INCOMPAT_META_BG 0x0010
	#define EXT4_FEATURE_INCOMPAT_EXTENTS 0x0040 /* extents support */
	#define EXT4_FEATURE_INCOMPAT_64BIT 0x0080
	#define EXT4_FEATURE_INCOMPAT_MMP 0x0100
	#define EXT4_FEATURE_INCOMPAT_FLEX_BG 0x0200
	#define EXT4_FEATURE_INCOMPAT_EA_INODE 0x0400 /* EA in inode */
	#define EXT4_FEATURE_INCOMPAT_DIRDATA 0x1000  /* data in dirent */
	#define EXT4_FEATURE_INCOMPAT_CSUM_SEED 0x2000
	#define EXT4_FEATURE_INCOMPAT_LARGEDIR 0x4000    /* >2GB or 3-lvl htree */
	#define EXT4_FEATURE_INCOMPAT_INLINE_DATA 0x8000 /* data in inode */
	#define EXT4_FEATURE_INCOMPAT_ENCRYPT 0x10000
    /*60*/ __le32 s_feature_incompat;       /* incompatible feature set */

	/*
	Readonly-compatible feature set. If the kernel doesn't understand one of these bits, it can still mount read-only, but
	e2fsck will refuse to modify the filesystem. Any of:
	
	0x1	Sparse superblocks. See the earlier discussion of this feature. (RO_COMPAT_SPARSE_SUPER).
	
	0x2	Allow storing files larger than 2GiB (RO_COMPAT_LARGE_FILE).
	
	0x4	Was intended for use with htree directories, but was not needed. Not used in kernel or e2fsprogs
	(RO_COMPAT_BTREE_DIR).
	
	0x8	This filesystem has files whose space usage is stored in i_blocks in units of filesystem blocks, not 512-byte
	sectors. Inodes using this feature will be marked with EXT4_INODE_HUGE_FILE. (RO_COMPAT_HUGE_FILE)
	
	0x10	Group descriptors have checksums. In addition to detecting corruption, this is useful for lazy formatting with
	uninitialized groups (RO_COMPAT_GDT_CSUM).
	
	0x20	Indicates that the old ext3 32,000 subdirectory limit no longer applies. A directory's i_links_count will be set
	to 1 if it is incremented past 64,999. (RO_COMPAT_DIR_NLINK).
	
	0x40	Indicates that large inodes exist on this filesystem, storing extra fields after EXT2_GOOD_OLD_INODE_SIZE.
	(RO_COMPAT_EXTRA_ISIZE).
	
	0x80	This filesystem has a snapshot. Not implemented in ext4. (RO_COMPAT_HAS_SNAPSHOT).
	
	0x100	Quota is handled transactionally with the journal (RO_COMPAT_QUOTA).
	
	0x200	This filesystem supports "bigalloc", which means that filesystem block allocation bitmaps are tracked in units
	of clusters (of blocks) instead of blocks (RO_COMPAT_BIGALLOC).
	
	0x400	This filesystem supports metadata checksumming. (RO_COMPAT_METADATA_CSUM; implies RO_COMPAT_GDT_CSUM, though
	GDT_CSUM must not be set)
	
	0x800	Filesystem supports replicas. This feature is neither in the kernel nor e2fsprogs. (RO_COMPAT_REPLICA).
	
	0x1000	Read-only filesystem image; the kernel will not mount this image read-write and most tools will refuse to write
	to the image. (RO_COMPAT_READONLY).
	
	0x2000	Filesystem tracks project quotas. (RO_COMPAT_PROJECT)
	*/
	#define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
	#define EXT4_FEATURE_RO_COMPAT_LARGE_FILE 0x0002
	#define EXT4_FEATURE_RO_COMPAT_BTREE_DIR 0x0004
	#define EXT4_FEATURE_RO_COMPAT_HUGE_FILE 0x0008
	#define EXT4_FEATURE_RO_COMPAT_GDT_CSUM 0x0010
	#define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020
	#define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE 0x0040
	#define EXT4_FEATURE_RO_COMPAT_QUOTA 0x0100
	#define EXT4_FEATURE_RO_COMPAT_BIGALLOC 0x0200
    __le32 s_feature_ro_compat;             /* readonly-compatible feature set */
    
    /*68*/ __u8 s_uuid[16];                 /* 128-bit uuid for volume */
    /*78*/ char s_volume_name[16];          /* volume name */
    /*88*/ char s_last_mounted[64];         /* directory where last mounted */
    /*C8*/ __le32 s_algorithm_usage_bitmap; /* For compression */

    /*
     * Performance hints.  Directory preallocation should only
     * happen if the EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on.
     */
    __u8 s_prealloc_blocks;       /* Nr of blocks to try to preallocate*/
    __u8 s_prealloc_dir_blocks;   /* Nr to preallocate for dirs */
    __le16 s_reserved_gdt_blocks; /* Per group desc for online growth: number of reserved GDT blocks */

    /*
     * Journaling support valid if EXT4_FEATURE_COMPAT_HAS_JOURNAL set.
     */
    /*D0*/ __u8 s_journal_uuid[16]; /* uuid of journal superblock */
    /*E0*/ __le32 s_journal_inum;   /* inode number of journal file; always 8 when an internal journal is enabled */
    __le32 s_journal_dev;           /* device number of journal file */
    __le32 s_last_orphan;           /* start of list of inodes to delete */
    __le32 s_hash_seed[4];          /* HTREE hash seed */
    __u8 s_def_hash_version;        /* Default hash version to use */
    __u8 s_jnl_backup_type;
    __le16 s_desc_size; /* size of group descriptor */
    /*100*/ __le32 s_default_mount_opts;
    __le32 s_first_meta_bg;  /* First metablock block group */
    __le32 s_mkfs_time;      /* When the filesystem was created */
    __le32 s_jnl_blocks[17]; /* Backup of the journal inode */

    /* 64bit support valid if EXT4_FEATURE_INCOMPAT_64BIT */
    /*150*/ __le32 s_blocks_count_hi; /* Blocks count */
    __le32 s_r_blocks_count_hi;       /* Reserved blocks count */
    __le32 s_free_blocks_count_hi;    /* Free blocks count */
    __le16 s_min_extra_isize;         /* All inodes have at least # bytes */
    __le16 s_want_extra_isize;        /* New inodes should reserve # bytes */
    __le32 s_flags;                   /* Miscellaneous flags */
    __le16 s_raid_stride;             /* RAID stride */
    __le16 s_mmp_update_interval;     /* # seconds to wait in MMP checking */
    __le64 s_mmp_block;               /* Block for multi-mount protection */
    __le32 s_raid_stripe_width;       /* blocks on all data disks (N*stride)*/
    __u8 s_log_groups_per_flex;       /* FLEX_BG group size */
    __u8 s_checksum_type;             /* metadata checksum algorithm used */
    __u8 s_encryption_level;          /* versioning level for encryption */
    __u8 s_reserved_pad;              /* Padding to next 32bits */
    __le64 s_kbytes_written;          /* nr of lifetime kilobytes written */
    __le32 s_snapshot_inum;           /* Inode number of active snapshot */
    __le32 s_snapshot_id;             /* sequential ID of active snapshot */
    __le64 s_snapshot_r_blocks_count; /* reserved blocks for active
                         snapshot's future use */
    __le32 s_snapshot_list;           /* inode number of the head of the
                             on-disk snapshot list */
#define EXT4_S_ERR_START offsetof(struct ext4_super_block, s_error_count)
    __le32 s_error_count;        /* number of fs errors */
    __le32 s_first_error_time;   /* first time an error happened */
    __le32 s_first_error_ino;    /* inode involved in first error */
    __le64 s_first_error_block;  /* block involved of first error */
    __u8 s_first_error_func[32]; /* function where the error happened */
    __le32 s_first_error_line;   /* line number where error happened */
    __le32 s_last_error_time;    /* most recent time of an error */
    __le32 s_last_error_ino;     /* inode involved in last error */
    __le32 s_last_error_line;    /* line number where error happened */
    __le64 s_last_error_block;   /* block involved of last error */
    __u8 s_last_error_func[32];  /* function where the error happened */
#define EXT4_S_ERR_END offsetof(struct ext4_super_block, s_mount_opts)
    __u8 s_mount_opts[64];
    __le32 s_usr_quota_inum;    /* inode for tracking user quota */
    __le32 s_grp_quota_inum;    /* inode for tracking group quota */
    __le32 s_overhead_clusters; /* overhead blocks/clusters in fs */
    __le32 s_backup_bgs[2];     /* groups with sparse_super2 SBs */
    __u8 s_encrypt_algos[4];    /* Encryption algorithms in use  */
    __u8 s_encrypt_pw_salt[16]; /* Salt used for string2key algorithm */
    __le32 s_lpf_ino;           /* Location of the lost+found inode */
    __le32 s_prj_quota_inum;    /* inode for tracking project quota */
    __le32 s_checksum_seed;     /* crc32c(uuid) if csum_seed set */
    __le32 s_reserved[98];      /* Padding to the end of the block */
    __le32 s_checksum;          /* crc32c(superblock) */
};

The dumpe2fs tool is also a quick way to inspect the superblock.
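To make the field offsets concrete, the sketch below pulls a few values out of a raw superblock buffer. It assumes the caller has already read the 1024-byte primary superblock, which starts 1024 bytes into the volume (a fact not stated above); the offsets 0x00, 0x04, 0x18, 0x20, 0x28 and 0x38 follow from the field order in the struct, and the struct and function names here are hypothetical.

```c
#include <stdint.h>

#define EXT4_SB_MAGIC 0xEF53u

/* A few decoded superblock fields (illustrative subset). */
struct sb_summary {
    uint32_t inodes_count;
    uint32_t blocks_count_lo;
    uint32_t block_size;
    uint32_t blocks_per_group;
    uint32_t inodes_per_group;
};

/* On-disk superblock fields are little-endian. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* sb points at the raw superblock. Returns 0 on success, -1 on bad magic. */
static int sb_parse(const uint8_t *sb, struct sb_summary *out)
{
    if (le16(sb + 0x38) != EXT4_SB_MAGIC)
        return -1;
    out->inodes_count = le32(sb + 0x00);
    out->blocks_count_lo = le32(sb + 0x04);
    out->block_size = 1024u << le32(sb + 0x18); /* 2^(10 + s_log_block_size) */
    out->blocks_per_group = le32(sb + 0x20);
    out->inodes_per_group = le32(sb + 0x28);
    return 0;
}
```

Reading the fields byte by byte avoids depending on host endianness or struct packing, at the cost of hard-coding offsets.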

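10.Block group descriptors

Every block group in the filesystem has a block group descriptor. The descriptor table is the second item in a block group, right after the superblock. In the standard configuration every block group carries a copy, unless sparse superblocks are in use.
A block group descriptor records where the bitmaps and the inode table are located.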

/*
 * Structure of a blocks group descriptor Total size is 64 bytes.
 */
struct ext4_group_desc {
    __le32 bg_block_bitmap_lo;      /* Blocks bitmap block: block number of the block bitmap */
    __le32 bg_inode_bitmap_lo;      /* Inodes bitmap block: block number of the inode bitmap */
    __le32 bg_inode_table_lo;       /* Inodes table block: first block of the inode table */
    __le16 bg_free_blocks_count_lo; /* Free blocks count in this group */
    __le16 bg_free_inodes_count_lo; /* Free inodes count in this group */
    __le16 bg_used_dirs_count_lo;   /* Directories count in this group */
    __le16 bg_flags;                /* EXT4_BG_flags (INODE_UNINIT, etc) */
    __le32 bg_exclude_bitmap_lo;    /* Exclude bitmap for snapshots */
    __le16 bg_block_bitmap_csum_lo; /* crc32c(s_uuid+grp_num+bbitmap) LE */
    __le16 bg_inode_bitmap_csum_lo; /* crc32c(s_uuid+grp_num+ibitmap) LE */
    __le16 bg_itable_unused_lo;     /* Unused inodes count in the inode table */
    __le16 bg_checksum;             /* crc16(sb_uuid+group+desc) */

    /* These fields only exist if the 64bit feature is enabled and s_desc_size > 32. */
    /* These bytes are used only when the 64bit feature is enabled */
    /*0x20*/ __le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
    __le32 bg_inode_bitmap_hi;          /* Inodes bitmap block MSB */
    __le32 bg_inode_table_hi;           /* Inodes table block MSB */
    __le16 bg_free_blocks_count_hi;     /* Free blocks count MSB */
    __le16 bg_free_inodes_count_hi;     /* Free inodes count MSB */
    __le16 bg_used_dirs_count_hi;       /* Directories count MSB */
    __le16 bg_itable_unused_hi;         /* Unused inodes count MSB */
    __le32 bg_exclude_bitmap_hi;        /* Exclude bitmap block MSB */
    __le16 bg_block_bitmap_csum_hi;     /* crc32c(s_uuid+grp_num+bbitmap) BE */
    __le16 bg_inode_bitmap_csum_hi;     /* crc32c(s_uuid+grp_num+ibitmap) BE */
    __u32 bg_reserved;
};
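Finding the descriptor for a given group is simple index arithmetic. The sketch below assumes meta_bg is off, so the descriptor table occupies consecutive blocks right after the block that holds the superblock; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Where a group descriptor lives on disk. */
struct gd_location {
    uint64_t block;   /* filesystem block number holding the descriptor */
    uint32_t offset;  /* byte offset of the descriptor inside that block */
};

/* Assumption: no meta_bg, so the GDT starts at first_data_block + 1.
 * desc_size is 32 without the 64bit feature, otherwise s_desc_size. */
static struct gd_location gd_locate(uint32_t group, uint32_t block_size,
                                    uint32_t desc_size,
                                    uint32_t first_data_block)
{
    uint32_t per_block = block_size / desc_size;
    struct gd_location loc;

    loc.block = (uint64_t)first_data_block + 1 + group / per_block;
    loc.offset = (group % per_block) * desc_size;
    return loc;
}
```

With 4096-byte blocks and 32-byte descriptors, one block holds 128 descriptors, so the 226 groups of this device need two GDT blocks.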

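11.Block bitmap and inode bitmap

The block bitmap tracks which data blocks in a block group are in use, and the inode bitmap tracks which inodes are in use. Each bitmap occupies one block; each bit, 0 or 1, records the state of one block in the group or one inode in the inode table.

The block bitmap also covers the blocks used by the superblock and other metadata: a used block has its bit set to 1 (1288+108+7=1111).
Locating the block that holds the block bitmap and dumping it confirms this: free blocks start at 1111, meaning blocks 0-1110 are in use.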
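Checking a dumped bitmap by hand is error-prone, so here is a minimal sketch that scans one. It assumes the bitmap uses little-endian bit order (bit 0 is the least significant bit of byte 0); the function names are hypothetical.

```c
#include <stdint.h>

/* Returns 1 if the given bit (block or inode index) is marked in use. */
static int bitmap_test(const uint8_t *bitmap, uint32_t bit)
{
    return (bitmap[bit / 8] >> (bit % 8)) & 1;
}

/* Returns the index of the first clear bit, or nbits if all are set. */
static uint32_t bitmap_first_free(const uint8_t *bitmap, uint32_t nbits)
{
    for (uint32_t i = 0; i < nbits; i++)
        if (!bitmap_test(bitmap, i))
            return i;
    return nbits;
}
```

On a bitmap where blocks 0-1110 are in use, bitmap_first_free reports 1111, the same value as in the example above.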

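12.Inode table

The inode table stores the inodes themselves. The inode size is given by the s_inode_size field of the superblock.
i_extra_isize: the number of bytes used beyond the first 128, including the i_extra_isize field itself. At present this value is normally 32, so an ext4 inode actually uses 128+32=160 bytes out of a total size of 256.
With the inline_data feature enabled, the otherwise unused part is filled with file data.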
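Given an inode number, its slot in the inode table follows from s_inodes_per_group and s_inode_size. A minimal sketch with hypothetical names; inode numbers start at 1, and the result is relative to the start of the owning group's inode table (bg_inode_table in its descriptor).

```c
#include <stdint.h>

/* Position of an inode relative to its group's inode table. */
struct inode_location {
    uint32_t group;            /* block group that owns the inode */
    uint32_t index;            /* index of the inode within that group */
    uint32_t block_in_table;   /* block offset from the start of the table */
    uint32_t offset_in_block;  /* byte offset inside that block */
};

static struct inode_location inode_locate(uint32_t ino,
                                          uint32_t inodes_per_group,
                                          uint32_t inode_size,
                                          uint32_t block_size)
{
    struct inode_location loc;
    uint32_t index = (ino - 1) % inodes_per_group;
    uint64_t byte_off = (uint64_t)index * inode_size;

    loc.group = (ino - 1) / inodes_per_group;
    loc.index = index;
    loc.block_in_table = (uint32_t)(byte_off / block_size);
    loc.offset_in_block = (uint32_t)(byte_off % block_size);
    return loc;
}
```

For example, assuming 8192 inodes per group (an illustrative value) and 256-byte inodes, the root directory (inode 2) is the second slot of group 0, at byte offset 256 of the first table block.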

struct ext4_inode {
    /* File mode */
	#define EXT4_S_IXOTH 0x1  //(Others may execute)
	#define EXT4_S_IWOTH 0x2  //(Others may write)
	#define EXT4_S_IROTH 0x4  //(Others may read)
	
	#define EXT4_S_IXGRP 0x8   //(Group members may execute)
	#define EXT4_S_IWGRP 0x10  //(Group members may write)
	#define EXT4_S_IRGRP 0x20  //(Group members may read)
	
	#define EXT4_S_IXUSR 0x40   //(Owner may execute)
	#define EXT4_S_IWUSR 0x80   //(Owner may write)
	#define EXT4_S_IRUSR 0x100  //(Owner may read)
	
	#define EXT4_S_ISVTX 0x200  //(Sticky bit)
	#define EXT4_S_ISGID 0x400  //(Set GID)
	#define EXT4_S_ISUID 0x800  //(Set UID)
	
	/* These are mutually-exclusive file types: */
	#define EXT4_S_IFIFO 0x1000   //(FIFO)
	#define EXT4_S_IFCHR 0x2000   //(Character device)
	#define EXT4_S_IFDIR 0x4000   //(Directory)
	#define EXT4_S_IFBLK 0x6000   //(Block device)
	#define EXT4_S_IFREG 0x8000   //(Regular file)
	#define EXT4_S_IFLNK 0xA000   //(Symbolic link)
	#define EXT4_S_IFSOCK 0xC000  //(Socket)
    __le16 i_mode;        /* File mode: file type and permissions */
    
    __le16 i_uid;         /* Low 16 bits of Owner Uid */
    __le32 i_size_lo;     /* Size in bytes of the file this inode describes (low 32 bits) */
    __le32 i_atime;       /* Access time */
    __le32 i_ctime;       /* Inode Change time */
    __le32 i_mtime;       /* Modification time */
    __le32 i_dtime;       /* Deletion Time */
    __le16 i_gid;         /* Low 16 bits of Group Id */
    __le16 i_links_count; /* Links count: number of hard links */
    __le32 i_blocks_lo;   /* Blocks count */

    /* Flag bits are defined by the EXT4_*_FL macros */
    __le32 i_flags; /* File flags */
    union {
        struct {
            __le32 l_i_version;
        } linux1;
        struct {
            __u32 h_i_translator;
        } hurd1;
        struct {
            __u32 m_i_reserved1;
        } masix1;
    } osd1;                        /* OS dependent 1 */
    __le32 i_block[EXT4_N_BLOCKS]; /* Pointers to blocks; discussed in detail below */
    __le32 i_generation;           /* File version (for NFS) */
    __le32 i_file_acl_lo;          /* File ACL */
    __le32 i_size_high;
    __le32 i_obso_faddr; /* Obsoleted fragment address */
    union {
        struct {
            __le16 l_i_blocks_high; /* were l_i_reserved1 */
            __le16 l_i_file_acl_high;
            __le16 l_i_uid_high;    /* these 2 fields */
            __le16 l_i_gid_high;    /* were reserved2[0] */
            __le16 l_i_checksum_lo; /* crc32c(uuid+inum+inode) LE */
            __le16 l_i_reserved;
        } linux2;
        struct {
            __le16 h_i_reserved1; /* Obsoleted fragment number/size which are removed in ext4 */
            __u16 h_i_mode_high;
            __u16 h_i_uid_high;
            __u16 h_i_gid_high;
            __u32 h_i_author;
        } hurd2;
        struct {
            __le16 h_i_reserved1; /* Obsoleted fragment number/size which are removed in ext4 */
            __le16 m_i_file_acl_high;
            __u32 m_i_reserved2[2];
        } masix2;
    } osd2; /* OS dependent 2 */
    __le16 i_extra_isize;
    __le16 i_checksum_hi;  /* crc32c(uuid+inum+inode) BE */
    __le32 i_ctime_extra;  /* extra Change time      (nsec << 2 | epoch) */
    __le32 i_mtime_extra;  /* extra Modification time(nsec << 2 | epoch) */
    __le32 i_atime_extra;  /* extra Access time      (nsec << 2 | epoch) */
    __le32 i_crtime;       /* File Creation time */
    __le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
    __le32 i_version_hi;   /* high 32 bits for 64-bit version */
    __le32 i_projid;       /* Project ID */
};
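The i_mode bits listed at the top of the struct can be decoded into the familiar ls-style string. This is a minimal sketch using the same bit values; the function names are hypothetical, and the setuid, setgid and sticky bits are deliberately ignored to keep it short.

```c
#include <stdint.h>
#include <string.h>

/* Map the file-type nibble of i_mode to the character ls would print. */
static char mode_type_char(uint16_t mode)
{
    switch (mode & 0xF000) {
    case 0x1000: return 'p';  /* FIFO */
    case 0x2000: return 'c';  /* character device */
    case 0x4000: return 'd';  /* directory */
    case 0x6000: return 'b';  /* block device */
    case 0x8000: return '-';  /* regular file */
    case 0xA000: return 'l';  /* symbolic link */
    case 0xC000: return 's';  /* socket */
    default:     return '?';
    }
}

/* Write a string such as "-rw-r--r--" into out (11 bytes including NUL). */
static void mode_to_string(uint16_t mode, char out[11])
{
    static const char rwx[] = "rwxrwxrwx";

    out[0] = mode_type_char(mode);
    memset(out + 1, '-', 9);
    /* 0x100 is owner-read; each shift moves one permission bit to the right. */
    for (int i = 0; i < 9; i++)
        if (mode & (0x100 >> i))
            out[1 + i] = rwx[i];
    out[10] = '\0';
}
```

An i_mode of 0x81A4 decodes to a regular file with permissions 0644, and 0x41ED to a directory with 0755.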

12.1 Special inodes

    Inode    Purpose
    0      Does not exist; there is no inode 0
    1      List of defective blocks
    2      Root directory
    3      User quota
    4      Group quota
    5      Boot loader
    6      Undelete directory
    7      Reserved group descriptors inode (the "resize inode", used when growing the filesystem)
    8      Journal inode
    9      The "exclude" inode, for snapshots(?)
    10     Replica inode, used for some non-upstream feature?
    11     First non-reserved inode, usually the lost+found directory
13.extent

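Extents are new in ext4. First, a look at the older mapping scheme.

The inode contains an array __le32 i_block[15]. The first 12 entries are direct block pointers, and the last three are the single-, double-, and triple-indirect pointers. This scheme is fairly complex to implement, is inefficient for large files, and spends many indirect blocks just on storing the mapping.
With extents enabled, the same 60 bytes in the inode hold 1 header and 4 entries.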

struct ext4_extent_header {
    __le16 eh_magic;      /* probably will support different formats */
    __le16 eh_entries;    /* number of valid entries */
    __le16 eh_max;        /* capacity of store in entries */
    __le16 eh_depth;      /* has tree real underlying blocks? */
    __le32 eh_generation; /* generation of the tree */
};

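If the file is no larger than 512 MiB and fits in four extents, eh_depth in the extent header is 0 and all 4 entries are extents (data entries). Why 512 MiB: the __le16 ee_len field of an extent covers at most 32768 blocks, which is exactly one 128 MiB block group, and 4 of them make 512 MiB. A fragmented file can need more than four extents, and therefore a deeper tree, at a smaller size.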

struct ext4_extent {
    __le32 ee_block;    /* first logical block extent covers */
    __le16 ee_len;      /* number of blocks covered by extent */
    __le16 ee_start_hi; /* high 16 bits of physical block */
    __le32 ee_start_lo; /* low 32 bits of physical block */
};
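Two details of ext4_extent are easy to get wrong when decoding by hand: the physical start block is split across ee_start_hi and ee_start_lo, and ee_len values above 32768 mark an unwritten (preallocated) extent whose real length is ee_len - 32768. The latter is not mentioned above but is how the on-disk format works. A minimal sketch with hypothetical helper names, taking fields already converted to host byte order:

```c
#include <stdint.h>

#define EXTENT_INIT_MAX_LEN 32768u  /* largest length of an initialized extent */

/* Combine the split 48-bit physical block number. */
static uint64_t extent_pblock(uint16_t ee_start_hi, uint32_t ee_start_lo)
{
    return ((uint64_t)ee_start_hi << 32) | ee_start_lo;
}

/* Non-zero if the extent is unwritten (allocated but not yet initialized). */
static int extent_is_unwritten(uint16_t ee_len)
{
    return ee_len > EXTENT_INIT_MAX_LEN;
}

/* Number of blocks the extent really covers. */
static uint32_t extent_len(uint16_t ee_len)
{
    return ee_len <= EXTENT_INIT_MAX_LEN ? ee_len
                                         : ee_len - EXTENT_INIT_MAX_LEN;
}
```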

If the file is larger than 512 MiB, eh_depth becomes 1 and the entries in the inode become index entries.

struct ext4_extent_idx {
    __le32 ei_block;   /* index covers logical blocks from 'block' */
    __le32 ei_leaf_lo; /* pointer to the physical block of the next *
                        * level. leaf or next index could be there */
    __le16 ei_leaf_hi; /* high 16 bits of physical block */
    __u16 ei_unused;
};

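At this point an extra block is needed to store each leaf node. A block is 4 KiB; apart from the 12-byte header, all of it stores ext4_extent records, so it holds (4096 - 12) / 12 = 340 of them, mapping up to 340 * 128 MiB = 42.5 GiB. By the time the tree first splits and a block is used for ext4_extent records, the file is already very large; with eh_depth = 1 the maximum file size grows to 4 * 42.5 GiB = 170 GiB.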
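The capacity figures for each tree depth can be computed rather than worked out by hand. This is a sketch of the arithmetic only: it assumes every extent is maximally long (32768 blocks) and ignores the 32-bit limit on logical block numbers, so it is an upper bound. The macro and function names are hypothetical.

```c
#include <stdint.h>

#define EXTENT_RECORD_SIZE   12u     /* header, extent and index are 12 bytes each */
#define INODE_ROOT_ENTRIES   4u      /* entries that fit in the inode's 60 bytes */
#define EXTENT_MAX_BLOCKS    32768u  /* blocks covered by one initialized extent */

/* Upper bound on the bytes mappable by an extent tree of the given depth. */
static uint64_t extent_tree_max_bytes(unsigned depth, uint32_t block_size)
{
    /* Records per tree block: everything after the 12-byte header. */
    uint64_t per_block = (block_size - EXTENT_RECORD_SIZE) / EXTENT_RECORD_SIZE;
    uint64_t extents = INODE_ROOT_ENTRIES;

    for (unsigned i = 0; i < depth; i++)
        extents *= per_block;
    return extents * EXTENT_MAX_BLOCKS * block_size;
}
```

With 4096-byte blocks, depth 0 gives 512 MiB and depth 1 gives 170 GiB, the same values derived in the text.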

14.inline_data

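When the feature is enabled and the inode has its inline flag set, the data is stored directly in the __le32 i_block[15] array.
The maximum amount of inline data is: 256 - 128 - i_extra_isize(32) - INLINE_DAT_ATTR(16) + 60 = 140 bytes.

With inline data in use, the following structure is stored after the inode fields: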

struct ext4_xattr_entry {
    __u8 e_name_len;      /* length of name */
    __u8 e_name_index;    /* attribute name index */
    __le16 e_value_offs;  /* offset in disk block of value */
    __le32 e_value_block; /* disk block attribute is stored on (n/i) */
    __le32 e_value_size;  /* size of attribute value */
    __le32 e_hash;        /* hash value of name and value */
    char e_name[0];       /* attribute name */
};


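15.Directory entries (linear directories)

Only linear directories are analyzed here. In the classic linear layout, as the data of the root directory inode shows, directory entries are stored flat, one after another.
Note that the "." and ".." entries are present; both point to inode 2, the root directory.
With the filetype feature enabled, directories use the following structure: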

struct ext4_dir_entry_2 {
    __le32 inode;   /* Inode number */
    __le16 rec_len; /* Directory entry length */
    __u8 name_len;  /* Name length */
    __u8 file_type;
    char name[EXT4_NAME_LEN]; /* File name */
};

Otherwise the following structure is used:

struct ext4_dir_entry {
    __le32 inode;             /* Inode number */
    __le16 rec_len;           /* Directory entry length */
    __le16 name_len;          /* Name length */
    char name[EXT4_NAME_LEN]; /* File name */
};
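Walking a linear directory block means hopping from entry to entry by rec_len. The sketch below looks up a name in one raw block. It assumes the filetype layout (1-byte name_len followed by file_type), a block size small enough that rec_len is stored literally (true for 4 KiB blocks), and treats inode 0 as an unused entry; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the inode number for name in one linear directory block,
 * or 0 if the name is absent or an entry looks corrupt. */
static uint32_t dir_lookup(const uint8_t *block, uint32_t block_size,
                           const char *name)
{
    size_t want = strlen(name);
    uint32_t off = 0;

    while (off + 8 <= block_size) {
        const uint8_t *de = block + off;
        uint32_t inode = (uint32_t)de[0] | ((uint32_t)de[1] << 8) |
                         ((uint32_t)de[2] << 16) | ((uint32_t)de[3] << 24);
        uint32_t rec_len = (uint32_t)de[4] | ((uint32_t)de[5] << 8);
        uint8_t name_len = de[6];

        /* rec_len covers the 8-byte header plus the name, padded to 4 bytes. */
        if (rec_len < 8 || rec_len % 4 != 0 ||
            off + rec_len > block_size || 8u + name_len > rec_len)
            return 0;
        if (inode != 0 && name_len == want &&
            memcmp(de + 8, name, want) == 0)
            return inode;
        off += rec_len;
    }
    return 0;
}
```

The last entry in a block normally has a rec_len that stretches to the end of the block, which is why the loop ends exactly at block_size.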

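16.Journal

16.1 has_journal

If the filesystem is formatted with -O ^has_journal, no journal is created at all, and the journal-creation phase disappears from the format output.
Special inode 8 is then all zeros, and mounting reports that there is no journal.
(Figure: the corresponding output with a journal, for comparison.)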

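16.2 The three journaling modes

Journal
data=journal is the most reliable mode. It provides full journaling of both data blocks and metadata blocks: all data is written to the journal first and then to its final location on disk. After a crash, replaying the journal restores both data and metadata to a consistent state. It is also the slowest of the three modes, because every piece of data has to go through the journal, and it does not support delayed allocation or O_DIRECT (direct I/O).

ordered(*)
data=ordered is the default ext4 journaling mode. Only metadata is journaled, but the metadata changes and their associated data blocks are logically grouped into a single unit called a transaction. When the metadata is about to be written to disk, the associated data blocks are written first; that is, data reaches the disk before the metadata is committed to the journal. After a crash, the metadata for unfinished writes is still in the journal, so the filesystem can use the journal to discard those unfinished writes. In ordered mode a crash can therefore corrupt the files that were being written at the time, but the filesystem itself and files that were not being written remain safe. In general, ordered mode is slightly slower than writeback but much faster than journal mode.

Writeback
In data=writeback mode, data may be written to disk directly once the metadata has been committed to the journal. Metadata is journaled, data is not, and there is no guarantee that data reaches the disk before its metadata. Metadata journaling is serialized, so in writeback mode one process writing the journal does not block another, which improves IOPS. writeback is the fastest mode ext4 offers. Although it still protects the integrity of the filesystem itself, file data is more likely to be lost or corrupted when the system crashes.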

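16.3 Journal structure

The journal is laid out as superblock + transaction + transaction + ... Everything after the superblock can be viewed as a circular buffer in which transactions are recycled.

16.3.1 Journal superblock

When journal_dev is not used and journaling is enabled, the journal is created automatically somewhere inside the filesystem, and its superblock can be found through special inode 8.
In the example examined here, the journal size is 32768*4096 = 128 MiB, i.e. one whole block group.
Commit ID is the sequence number from which journal commits are valid, and Start blocknr is the block within the journal area at which the log starts.

16.3.2 Journal transaction structure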

#define JBD2_DESCRIPTOR_BLOCK 1
#define JBD2_COMMIT_BLOCK 2
#define JBD2_SUPERBLOCK_V1 3
#define JBD2_SUPERBLOCK_V2 4
#define JBD2_REVOKE_BLOCK 5

#define JBD2_MAGIC_NUMBER 0xc03b3998U

typedef struct journal_header_s {
    __be32 h_magic;
    __be32 h_blocktype;
    __be32 h_sequence;
} journal_header_t;
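Unlike the rest of ext4, journal structures are big-endian (note the __be32 types). A minimal sketch that classifies a raw journal block by its header; the helper names are hypothetical, and a return value of 0 is used here to mean "no journal header", which is how plain data blocks in the log appear.

```c
#include <stdint.h>

#define JBD2_MAGIC 0xc03b3998U  /* same value as JBD2_MAGIC_NUMBER */

/* Journal fields are stored big-endian on disk. */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns h_blocktype (1..5) if the block starts with a journal header,
 * or 0 if the magic number does not match. */
static uint32_t jbd2_block_type(const uint8_t *block)
{
    if (be32(block) != JBD2_MAGIC)
        return 0;
    return be32(block + 4);
}

/* Returns h_sequence, the transaction ID the block belongs to. */
static uint32_t jbd2_sequence(const uint8_t *block)
{
    return be32(block + 8);
}
```

A block beginning with the bytes c0 3b 39 98 00 00 00 04 is therefore a JBD2_SUPERBLOCK_V2 block.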

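In the hex dump of the journal, the first red box is the superblock: the type that follows the magic number shows it is JBD2_SUPERBLOCK_V2.
The second red box is a JBD2_DESCRIPTOR_BLOCK, the journal descriptor block.
The blocks after it are data blocks.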
