qcow2 Summary

Author: 车泰勇

Article link: https://blog.csdn.net/ctylihuai/article/details/117285629

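This article walks through the qcow2 file format: how its lookup structure compares to multi-level page tables and the TLB, and how qcow2 uses the L1 table, the L2 tables and clusters to locate data quickly. It also covers the effect of the preallocation policy on performance, the role of the refcount table and refcount blocks in the snapshot feature, and finally the snapshot creation process and the cluster allocation algorithm.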

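1. qcow2 file layout

Skip this part if you are not familiar with memory management.
MMU, virtual and physical addresses, and the TLB (Translation Lookaside Buffer)
   With 4 KB pages, every page needs one page-table entry to hold its address: 4 KB --> 4 B, so 4 GB --> 4 MB,
       i.e. about 1 million pages; 100 processes would need 400 MB for page tables alone.
   Split the single-level page table into 1024 second-level page tables, each holding 1024 page-table entries.
   A page table must cover the entire virtual address space. A flat table needs more than 1 million entries to do so,
       whereas with two-level paging only the 1024 top-level entries are mandatory; second-level tables are created only for regions actually in use.
   Two-level paging generalizes to multi-level page tables.
   The cache dedicated to the page-table entries a program accesses most often is the TLB (Translation Lookaside Buffer).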
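
The page-table arithmetic above can be checked with a few lines of Python. This is only a sketch: the 4-byte entry size and the 10/10/12-bit address split are assumptions matching a classic 32-bit two-level layout, not something the text fixes.

```python
PAGE_SIZE = 4 * 1024            # 4 KB pages
PTE_SIZE = 4                    # assumed 4-byte page-table entry
ADDRESS_SPACE = 4 * 1024 ** 3   # 4 GB virtual address space

# Flat table: one entry per page, all of them must exist.
flat_entries = ADDRESS_SPACE // PAGE_SIZE
flat_bytes = flat_entries * PTE_SIZE

# Two-level table: only the 1024-entry top-level table must always exist.
top_level_bytes = 1024 * PTE_SIZE


def split_virtual_address(vaddr):
    """Split a 32-bit address into (top-level index, second-level index, page offset)."""
    return (vaddr >> 22) & 0x3FF, (vaddr >> 12) & 0x3FF, vaddr & 0xFFF


print(flat_entries, flat_bytes, top_level_bytes)
```

The same mask-and-shift pattern is what qcow2 uses to walk from the L1 table to the L2 table to the cluster.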

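qcow2 locates its data in a similar multi-level way: the L1 table, the L2 tables and the data clusters together form a three-level structure
   Goal: find the data quickly from an address
   Data clusters hold the user data, each L2 entry holds the address of a cluster, and each L1 entry holds the start address of an L2 table.
   An address here means an offset within the qcow2 file; following these offsets leads to the user data inside the file.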

2. Structures
qemu-5.1.0/block/qcow2.h
typedef struct QCowHeader {
    uint32_t magic;                     // QCOW magic string ("QFI\xfb")
    uint32_t version;                   // Version number
    uint64_t backing_file_offset;       // Offset of the backing file name within the image
    uint32_t backing_file_size;         // Length of the backing file name, less than 1024 bytes
    uint32_t cluster_bits;              // 1 << cluster_bits is the cluster size, at most 2M
    uint64_t size; /* in bytes */       // Virtual disk size in bytes.
                                        // L1 table limit of 32MB: 2 EB with 2M clusters, 128 GB with 512-byte clusters
                                        // L1/L2 layouts limit 64 PB
    uint32_t crypt_method;              // 0 for no encryption, 1 for AES, 2 for LUKS
    uint32_t l1_size; /* XXX: save number of clusters instead ? */
                                        // Number of entries in L1 table
    uint64_t l1_table_offset;           // Where the active L1 table starts, aligned to a cluster
    uint64_t refcount_table_offset;     // Where the refcount table starts, aligned to a cluster
    uint32_t refcount_table_clusters;   // Number of clusters that the refcount table occupies
    uint32_t nb_snapshots;              // Number of snapshots contained in the image
    uint64_t snapshots_offset;          // Where the snapshot table starts

    /* The following fields are only valid for version >= 3 */
    uint64_t incompatible_features;     // Bitmask of incompatible features
    uint64_t compatible_features;       // Bitmask of compatible features
    uint64_t autoclear_features;        // See https://git.qemu.org/?p=qemu.git;a=blob;f=docs/interop/qcow2.txt

    uint32_t refcount_order;            // Describes the width of a reference count block entry
    uint32_t header_length;             // Length of the header structure in bytes

    /* Additional fields */
    uint8_t compression_type;           // 0: zlib, 1: zstd

    /* header must be a multiple of 8 */
    uint8_t padding[7];
} QEMU_PACKED QCowHeader;

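### Offset calculation (translating a virtual offset on the guest's disk device into the real offset in the host's image file)
Given an offset into the virtual disk, the offset into the image file can be obtained as follows:

// number of L2 entries per cluster (i.e. per L2 table)
l2_entries = (cluster_size / sizeof(uint64_t));

// index of offset's entry within its L2 table
l2_index = (offset / cluster_size) % l2_entries;

// index in the L1 table of the L2 table that covers offset
l1_index = (offset / cluster_size) / l2_entries;

// get the L2 table's start address from L1 and load the table into memory
l2_table = load_cluster(l1_table[l1_index]);

// get the start address of the cluster from the L2 table that covers offset
cluster_offset = l2_table[l2_index];

// real address of offset in the image = cluster start address + offset within the cluster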
return cluster_offset + (offset % cluster_size);
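
The pseudocode above can be turned into a small runnable sketch. Here the image is just a Python byte buffer and `guest_to_host` and `load_table` are hypothetical helper names. One thing the pseudocode leaves out is that real L1/L2 entries carry flag bits, so the sketch masks out bits 9-55 to get the offset. Compressed clusters and backing files are not handled.

```python
import struct

CLUSTER_BITS = 16
CLUSTER_SIZE = 1 << CLUSTER_BITS
L2_ENTRIES = CLUSTER_SIZE // 8          # 8-byte entries per L2 table
OFFSET_MASK = 0x00FFFFFFFFFFFE00        # bits 9-55 of an L1/L2 entry hold the host offset


def load_table(image, file_offset, entries):
    """Read `entries` big-endian 64-bit values starting at file_offset."""
    raw = image[file_offset:file_offset + entries * 8]
    return struct.unpack(f">{entries}Q", raw)


def guest_to_host(image, l1_table, guest_offset):
    """Return the image-file offset for guest_offset, or None if unallocated."""
    cluster_index = guest_offset // CLUSTER_SIZE
    l2_index = cluster_index % L2_ENTRIES
    l1_index = cluster_index // L2_ENTRIES

    l2_offset = l1_table[l1_index] & OFFSET_MASK
    if l2_offset == 0:
        return None                      # no L2 table for this range yet

    l2_table = load_table(image, l2_offset, L2_ENTRIES)
    cluster_offset = l2_table[l2_index] & OFFSET_MASK
    if cluster_offset == 0:
        return None                      # cluster not allocated
    return cluster_offset + guest_offset % CLUSTER_SIZE
```

Returning `None` for an unallocated range is a choice made for this sketch; a real reader would return zeros or fall through to the backing file.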

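### Getting the refcount of the cluster that contains an offset
Every cluster has a reference count entry, 2 bytes wide by default, stored in a "refcount block"; each 8-byte entry of the refcount table holds the start address of one refcount block.
Each refcount block occupies one cluster, and each of its entries is refcount_bits wide (fixed at 16 for version 2 images; 1 << refcount_order for version 3, 16 by default).
Here offset is an offset into the image file.

// number of refcount entries held by one refcount block
refcount_block_entries = (cluster_size * 8 / refcount_bits)

// index of offset's entry within its refcount block
refcount_block_index = (offset / cluster_size) % refcount_block_entries

// index in the refcount table of the refcount block that covers offset
refcount_table_index = (offset / cluster_size) / refcount_block_entries

// load the refcount block that covers offset into memory
refcount_block = load_cluster(refcount_table[refcount_table_index]);

// get the refcount of the cluster that contains offset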
return refcount_block[refcount_block_index];
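
A matching sketch for the refcount lookup, again over an in-memory byte buffer. It assumes 16-bit refcounts and 64K clusters; `get_refcount` is a hypothetical name, and a missing refcount block is treated as "all clusters it would cover are free".

```python
import struct

CLUSTER_SIZE = 1 << 16
REFCOUNT_BITS = 16
REFCOUNT_BLOCK_ENTRIES = CLUSTER_SIZE * 8 // REFCOUNT_BITS   # 32768 entries per block


def get_refcount(image, refcount_table, host_offset):
    """Return the refcount of the cluster containing host_offset (an image-file offset)."""
    cluster_index = host_offset // CLUSTER_SIZE
    block_index = cluster_index % REFCOUNT_BLOCK_ENTRIES
    table_index = cluster_index // REFCOUNT_BLOCK_ENTRIES

    if table_index >= len(refcount_table):
        return 0
    # The low 9 bits of a refcount table entry are reserved; the rest is the block offset.
    block_offset = refcount_table[table_index] & ~0x1FF
    if block_offset == 0:
        return 0                         # refcount block not allocated
    (refcount,) = struct.unpack_from(">H", image, block_offset + block_index * 2)
    return refcount
```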

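In practice each lookup is a chain of mask -> right shift -> mask operations, repeated until the index in the top-level table is found. The addresses here are offsets within the qcow2 file.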
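
Because cluster_size and the number of L2 entries are both powers of two, the divisions and modulo operations can be replaced by shifts and masks. A minimal sketch, assuming 64K clusters (`split_offset` is a hypothetical name):

```python
CLUSTER_BITS = 16                 # log2 of the cluster size
L2_BITS = CLUSTER_BITS - 3        # log2 of the number of 8-byte L2 entries per cluster


def split_offset(offset):
    """Split a guest offset into (l1_index, l2_index, offset_in_cluster)."""
    in_cluster = offset & ((1 << CLUSTER_BITS) - 1)               # mask
    l2_index = (offset >> CLUSTER_BITS) & ((1 << L2_BITS) - 1)    # shift, then mask
    l1_index = offset >> (CLUSTER_BITS + L2_BITS)                 # shift
    return l1_index, l2_index, in_cluster
```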

3. Preallocation policy

Measured on an old version; the figures are not accurate.
preallocation setting   time to create   physical size
off                     0.312s           196K
metadata                0.507s           844K
full                    39.402s          4.0G
falloc                  0.015s           4.0G

For this test each virtual disk is mounted and written to using dd

preallocation setting   time to write    throughput
off                     184.23s          729 kB/s
metadata                85.87s           1.6 MB/s
falloc                  100.77s          1.3 MB/s
full                    84.31s           1.6 MB/s

4. Calculation

Computing the image size https://my.oschina.net/LastRitter/blog/1542075

cluster_size -- cluster size, 65536 bytes (64K) by default
total_size -- size of the image to allocate
refcount_bits -- number of bits per refcount entry, 16 by default (fixed at 16 for version 2 images)

Image size aligned to the cluster size
aligned_total_size = align_offset(total_size, cluster_size);

Header size
header_size = cluster_size;

Number of L2 entries
l2_num = aligned_total_size / cluster_size;
l2_num = align_offset(l2_num, cluster_size / sizeof(uint64_t));

L2 table size
l2_size = l2_num * sizeof(uint64_t);

Number of L1 entries
l1_num = l2_num * sizeof(uint64_t) / cluster_size;
l1_num = align_offset(l1_num, cluster_size / sizeof(uint64_t));

L1 table size
l1_size = l1_num * sizeof(uint64_t);

Size of one refcount entry, and the number of refcount entries per refcount block
refcount_size = refcount_bits / 8;
refcount_num = cluster_size / refcount_size;

/* total size of refcount blocks
*
* note: every host cluster is reference-counted, including metadata
* (even refcount blocks are recursively included).
* Let:
* a = total_size (this is the guest disk size)
* m = meta size not including refcount blocks and refcount tables
* c = cluster size
* y1 = number of refcount blocks entries
* y2 = meta size including everything
* rces = refcount entry size in bytes
* then,
* y1 = (y2 + a)/c
* y2 = y1 * rces + y1 * rces * sizeof(u64) / c + m
* we can get y1:
* y1 = (a + m) / (c - rces - rces * sizeof(u64) / c)
*/
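
The step from the two equations to the closed form for y1 in the comment above is a plain substitution. Written out, with r standing for rces and 8 for sizeof(u64):

```latex
\begin{align*}
c\,y_1 &= y_2 + a = y_1 r + \frac{8\,y_1 r}{c} + m + a \\
y_1 \left( c - r - \frac{8r}{c} \right) &= a + m \\
y_1 &= \frac{a + m}{c - r - 8r/c}
\end{align*}
```

In the expression for y2, the term y1 * r is the space taken by the refcount blocks and y1 * r * 8 / c is the space taken by the refcount table (one 8-byte entry per refcount block).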

Number of refcount entries (one per host cluster; this is y1 in the comment above)
refcount_block_num = (aligned_total_size + header_size + l1_size + l2_size) /
(cluster_size - refcount_size - refcount_size * sizeof(uint64_t) / cluster_size)

Total size of the refcount blocks
refcount_block_size = DIV_ROUND_UP(refcount_block_num, refcount_num) * cluster_size;

Number of refcount table entries
refcount_table_num = refcount_block_num / refcount_num;
refcount_table_num = align_offset(refcount_table_num, cluster_size / sizeof(uint64_t));

Refcount table size
refcount_table_size = refcount_table_num * sizeof(uint64_t);

Total size
With the default parameters (65536-byte clusters, 8-byte addresses, 16-bit i.e. 2-byte refcounts),
ignoring alignment and snapshots, a 10G image works out approximately as follows:

t = total_size;
c = cluster_size;
header_size = c;
l2_size = t/c * 8;
l1_size = t/c / (c/8) * 8;
rb_size = t/c * 2;
rt_size = t/c / (c/2) * 8;
image_size ≈ t + c + t/c * 8 + t/c / (c/8) * 8 + t/c * 2 + t/c / (c/2) * 8
= t + c + t/c*10 + t/(c*c)*80
≈ 10G + 1.63M
So with the default parameters the metadata takes up less than 0.02% of the disk data size. When a new snapshot is created,
the space needed initially is just the size of the L1 table at that moment, about 160 bytes, which is almost negligible
(after modifications the space needed grows substantially, depending on how much is written).
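
The approximation can be reproduced with a short Python function. It follows the simplified formulas above (no alignment, no snapshots); `approx_metadata_size` is a hypothetical name.

```python
def approx_metadata_size(total_size, cluster_size=65536, refcount_bits=16):
    """Approximate qcow2 metadata sizes in bytes, ignoring alignment and snapshots."""
    clusters = total_size // cluster_size
    refcount_size = refcount_bits // 8
    return {
        "header": cluster_size,
        "l2": clusters * 8,
        "l1": clusters // (cluster_size // 8) * 8,
        "refcount_blocks": clusters * refcount_size,
        "refcount_table": clusters // (cluster_size // refcount_size) * 8,
    }


sizes = approx_metadata_size(10 * 2 ** 30)
metadata = sum(sizes.values())
print(sizes, metadata, f"{metadata / (10 * 2 ** 30):.4%}")
```

For a 10G image this gives 1,704,136 bytes of metadata, roughly 1.63M, with an L1 table of 160 bytes.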

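5. The qcow2 refcount table and refcount blocks

refcount table: the reference count table; refcount blocks: the reference count blocks.

The two tables deal with a single problem: the reference count of each cluster. A single table with one entry per cluster
would also do the job, but two levels make lookups more efficient, for the same reason user data is stored through the L1 table, the L2 tables and clusters.

Each entry of a refcount block holds the reference count of one cluster; the refcount table holds the start addresses of the refcount blocks.

Why does qcow2 need to record a reference count for every cluster?

qcow2 implements snapshots, an advanced feature. How? Through copy-on-write (COW), where the unit being copied is the cluster.
The common way to implement snapshots is COW: taking a snapshot marks clusters read-only, and every later write first checks whether the cluster is read-only
and, if so, copies it before writing. So there has to be some marker that says whether a cluster is read-only. A single marker is not enough, though:
the same qcow2 image may be snapshotted many times, and marking a cluster read-only again and again does not help when a snapshot is deleted. For a cluster covered by several snapshots,
how would qcow2 know which clusters can really be freed and which are still referenced by other snapshots?

A plain read-only flag is therefore not sufficient, so qcow2 introduces the refcount table and the refcount blocks to record the reference count of each cluster.

Refcount 0: the cluster is not in use.

Refcount 1: the cluster is in use.

Refcount 2 or more: the cluster is in use and at least one snapshot also contains it; COW must be performed before writing to it.

With reference counting as the foundation, the snapshot feature becomes possible.

In that sense the refcount table and refcount blocks were designed in order to implement qcow2 snapshots.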
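
The three refcount states can be illustrated with a toy model. This is not how QEMU stores refcounts; `RefcountModel` and its method names are made up for the illustration, and clusters are identified by plain integers.

```python
from collections import Counter


class RefcountModel:
    """Toy per-cluster reference counter; missing clusters count as 0 (free)."""

    def __init__(self):
        self.refcount = Counter()

    def allocate(self, cluster):
        assert self.refcount[cluster] == 0, "cluster already in use"
        self.refcount[cluster] = 1

    def add_reference(self, clusters):
        """Taking a snapshot adds one reference to every cluster it contains."""
        for cluster in clusters:
            self.refcount[cluster] += 1

    def drop_reference(self, clusters):
        """Drop one reference per cluster and return the clusters that became free."""
        freed = []
        for cluster in clusters:
            self.refcount[cluster] -= 1
            if self.refcount[cluster] == 0:
                freed.append(cluster)
        return freed

    def needs_cow(self, cluster):
        return self.refcount[cluster] >= 2
```

Deleting a snapshot is then just `drop_reference` on its clusters: a cluster is freed only when the last reference goes away, which a single read-only flag could not express.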

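6. Snapshots

Copy the L1 table, clear the top bit (bit 63, the COPIED flag) of every allocated L1 and L2 entry, and increment the corresponding refcounts by 1;

Allocate a new entry in the snapshot table, point its L1 offset at the newly created L1 table, and increment nb_snapshots in the Header by 1;

When data is later written to the image and the top bit of the corresponding L1 or L2 entry is 0 with the corresponding refcount greater than or equal to 2,
the L2 table and the cluster have to be reallocated: the L1 entry is pointed at the new L2 table and the L2 entry at the newly allocated cluster.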
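
The first snapshot step can be sketched over in-memory tables. Everything here is a simplification for illustration: `create_snapshot` is a hypothetical function, L2 tables and refcounts are plain Python containers keyed by host offset, and the snapshot table and header update are left out.

```python
COPIED = 1 << 63                     # top bit of an L1/L2 entry
OFFSET_MASK = 0x00FFFFFFFFFFFE00     # bits 9-55 hold the host offset


def create_snapshot(l1_table, l2_tables, refcounts):
    """Clear COPIED on all allocated entries, bump refcounts, return the L1 copy.

    l1_table:  list of L1 entries (modified in place)
    l2_tables: dict mapping L2 table host offset -> list of L2 entries
    refcounts: dict mapping cluster host offset -> reference count
    """
    for i, l1_entry in enumerate(l1_table):
        l2_offset = l1_entry & OFFSET_MASK
        if l2_offset == 0:
            continue                              # no L2 table allocated here
        l1_table[i] = l1_entry & ~COPIED
        refcounts[l2_offset] += 1
        l2_table = l2_tables[l2_offset]
        for j, l2_entry in enumerate(l2_table):
            data_offset = l2_entry & OFFSET_MASK
            if data_offset == 0:
                continue                          # cluster not allocated
            l2_table[j] = l2_entry & ~COPIED
            refcounts[data_offset] += 1
    return list(l1_table)                         # the snapshot's own L1 table


def write_needs_cow(entry):
    """A write must copy first when the entry's COPIED bit is clear."""
    return not (entry & COPIED)
```

After the call every shared L2 table and data cluster has a refcount of at least 2 and a cleared COPIED bit, which is exactly the condition the write path checks.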

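Cluster allocation (the algorithm when nothing is preallocated: clusters are allocated on write; reads allocate nothing and unallocated areas read as zeros, or come from the backing file if there is one)

To write data at offset, get the start offset of the L1 table from l1_table_offset in the Header;

Use the top "64 - l2_bits - cluster_bits" bits of offset as the index into the L1 table to fetch the descriptor of the corresponding L2 table
(l2_bits is log2 of the number of L2 entries per cluster, cluster_bits is log2 of the cluster size);

If the top bit of the L2 table descriptor is 0 (not allocated, or shared and subject to COW), a new L2 table has to be allocated;

Find 2 unused clusters through the refcount table and refcount blocks and set their refcounts to 1;

Initialize the first cluster as an L2 table and store its start address in the corresponding L1 entry;

Initialize the second cluster and store its address in the corresponding L2 entry;

The real address is the start address of the second cluster plus the offset within the cluster;

Write the data at that address.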
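
The allocation steps above can be sketched as a toy in-memory image. `ToyImage` is hypothetical and deliberately limited: it only covers the fresh-allocation path, treats cluster 0 as the header, keeps tables in Python containers instead of on disk, and raises on the COW case rather than implementing it.

```python
CLUSTER_BITS = 16
CLUSTER_SIZE = 1 << CLUSTER_BITS
L2_BITS = CLUSTER_BITS - 3
COPIED = 1 << 63
OFFSET_MASK = 0x00FFFFFFFFFFFE00


class ToyImage:
    def __init__(self, l1_entries):
        self.l1 = [0] * l1_entries
        self.l2_tables = {}          # host offset -> list of L2 entries
        self.data = {}               # host offset -> bytearray(CLUSTER_SIZE)
        self.refcounts = {0: 1}      # cluster index -> refcount; cluster 0 is the header

    def alloc_cluster(self):
        """Find the first cluster with refcount 0, set it to 1, return its host offset."""
        index = 0
        while self.refcounts.get(index, 0) != 0:
            index += 1
        self.refcounts[index] = 1
        return index << CLUSTER_BITS

    @staticmethod
    def split(offset):
        l1_index = offset >> (CLUSTER_BITS + L2_BITS)
        l2_index = (offset >> CLUSTER_BITS) & ((1 << L2_BITS) - 1)
        return l1_index, l2_index, offset & (CLUSTER_SIZE - 1)

    def write(self, offset, payload):
        l1_index, l2_index, in_cluster = self.split(offset)
        assert in_cluster + len(payload) <= CLUSTER_SIZE, "single-cluster writes only"
        if not (self.l1[l1_index] & COPIED):
            if self.l1[l1_index] & OFFSET_MASK:
                raise NotImplementedError("COW of a shared L2 table is not modelled")
            l2_offset = self.alloc_cluster()              # first cluster: the L2 table
            self.l2_tables[l2_offset] = [0] * (1 << L2_BITS)
            self.l1[l1_index] = l2_offset | COPIED
        l2_table = self.l2_tables[self.l1[l1_index] & OFFSET_MASK]
        if not (l2_table[l2_index] & COPIED):
            if l2_table[l2_index] & OFFSET_MASK:
                raise NotImplementedError("COW of a shared cluster is not modelled")
            data_offset = self.alloc_cluster()            # second cluster: the data
            self.data[data_offset] = bytearray(CLUSTER_SIZE)
            l2_table[l2_index] = data_offset | COPIED
        cluster_offset = l2_table[l2_index] & OFFSET_MASK
        self.data[cluster_offset][in_cluster:in_cluster + len(payload)] = payload
        return cluster_offset + in_cluster                # real address in the image

    def read(self, offset, length):
        l1_index, l2_index, in_cluster = self.split(offset)
        l2_offset = self.l1[l1_index] & OFFSET_MASK
        if l2_offset == 0:
            return bytes(length)                          # unallocated: zeros
        cluster_offset = self.l2_tables[l2_offset][l2_index] & OFFSET_MASK
        if cluster_offset == 0:
            return bytes(length)
        return bytes(self.data[cluster_offset][in_cluster:in_cluster + length])
```

The first write into an empty image consumes two clusters, one for the L2 table and one for the data, while a second write into the same cluster allocates nothing.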
