1. HugePages
1.1 Physical and virtual memory
Physical memory
Also called real or hardware memory: the capacity of the RAM modules actually installed in the machine.
Virtual memory
A technique that uses disk space to extend physical memory: data and programs that are not currently needed can be moved from physical memory to disk to free up RAM, and loaded back into physical memory when they are needed again.
1.2 The MMU and memory pages
MMU
(Memory Management Unit). The core idea is to use virtual addresses in place of physical ones: the CPU addresses memory with virtual addresses, and the MMU translates them into physical addresses.
Memory paging
A memory-management mechanism built on the MMU. It divides the virtual and physical address spaces into fixed-size (typically 4 KiB) pages and page frames. At the data-structure level this keeps memory access efficient, and it lets the operating system support non-contiguous memory allocation.
The virtual-to-physical mappings are stored in the page table. To further improve performance, modern CPU architectures add a TLB (Translation Lookaside Buffer) that caches the most frequently used page-table entries.
1.3 HugePages
The TLB has a fixed capacity; when a translation is not cached, a TLB miss occurs, and frequent TLB misses degrade program performance quickly. To let the TLB cover more address mappings, the usual approach is to increase the page size. Pages larger than 4 KiB are collectively called huge pages (HugePages).
Suppose a program needs 64 GiB of memory. With a 4 KiB page size that is 64 GiB / 4 KiB = 16777216 (16*1024*1024) pages, which overflows the TLB and causes TLB misses; with a 2 MiB page size it is only 64 GiB / 2 MiB = 32768 (32*1024) pages.
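These page counts are easy to sanity-check with shell arithmetic:

```shell
# number of 4 KiB pages needed to map 64 GiB
echo $(( 64 * 1024 * 1024 * 1024 / (4 * 1024) ))         # 16777216
# number of 2 MiB pages needed to map 64 GiB
echo $(( 64 * 1024 * 1024 * 1024 / (2 * 1024 * 1024) ))  # 32768
```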
2. Using HugePages on Linux
Check HugePages:
# cat /proc/meminfo | grep Huge
AnonHugePages: 96256 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB # 2Mi huge pages
Hugetlb: 0 kB
Adjust the number of HugePages:
# echo 128 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
Check the HugePages info again:
# cat /proc/meminfo | grep Huge
AnonHugePages: 100352 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 128
HugePages_Free: 128
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 17039360 kB
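Note that writing to nr_hugepages takes effect immediately but does not survive a reboot, and the allocation can fall short if physical memory is already fragmented (compare the value you wrote with HugePages_Total afterwards). A sketch of a persistent setup, assuming a sysctl.d-based distribution; the file name is illustrative:

```shell
# reserve 128 x 2MiB huge pages on every boot
echo "vm.nr_hugepages = 128" > /etc/sysctl.d/90-hugepages.conf
sysctl --system

# 1GiB pages usually have to be reserved on the kernel command line instead, e.g.:
#   default_hugepagesz=1G hugepagesz=1G hugepages=4
```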
3. Using HugePages in Kubernetes
In Kubernetes, HugePages are configured on the Pod. See the official documentation, Manage HugePages; this section largely summarizes that page.
3.1 Prerequisites
For a node to report its HugePages capacity, the node must pre-allocate HugePages; each node can pre-allocate huge pages of multiple sizes.
3.2 API
Huge pages can be consumed via container-level resource requirements using the resource name hugepages-<size>, where <size> is the most compact binary notation using integer values supported on a particular node. For example, if a node supports 2048KiB and 1048576KiB page sizes, it exposes the schedulable resources hugepages-2Mi and hugepages-1Gi. Unlike CPU or ordinary memory, HugePages do not support overcommit. Note that when requesting HugePages resources, you must also request ordinary memory or CPU.
A single Pod spec may consume huge pages of different sizes. In that case it must use the medium: HugePages-<hugepagesize> notation for all of its volume mounts:
apiVersion: v1
kind: Pod
metadata:
  name: huge-pages-example
spec:
  containers:
  - name: example
    image: fedora:latest
    command:
    - sleep
    - inf
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    - mountPath: /hugepages-1Gi
      name: hugepage-1gi
    resources:
      limits:
        hugepages-2Mi: 100Mi
        hugepages-1Gi: 2Gi
        memory: 100Mi
      requests:
        memory: 100Mi
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi
  - name: hugepage-1gi
    emptyDir:
      medium: HugePages-1Gi
A Pod may use medium: HugePages only when it requests huge pages of a single size:
apiVersion: v1
kind: Pod
metadata:
  name: huge-pages-example
spec:
  containers:
  - name: example
    image: fedora:latest
    command:
    - sleep
    - inf
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      limits:
        hugepages-2Mi: 100Mi
        memory: 100Mi
      requests:
        memory: 100Mi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
Notes:
- For HugePages, the request must equal the limit
- HugePages are isolated at the container level, so each container can have its own HugePages configuration
- HugePages can back EmptyDir volumes, but an EmptyDir volume must not consume more huge-page memory than the Pod requests
- Applications that use HugePages through shmget() with SHM_HUGETLB must run with a supplementary group that matches /proc/sys/vm/hugetlb_shm_group
- HugePages usage can be limited with ResourceQuota, just like other compute resources such as cpu or memory, using the hugepages-<size> token
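As a sketch of the last point, a ResourceQuota capping 2Mi huge-page requests in a namespace might look like this (the object name and the 1Gi cap are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: hugepages-quota # illustrative name
spec:
  hard:
    requests.hugepages-2Mi: 1Gi # total 2Mi huge-page requests in this namespace may not exceed 1Gi
```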
4. Using HugePages in KubeVirt
The following is based on kubevirt@0.26.0.
In KubeVirt, HugePages for a virtual machine are configured on the vmi object:
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: vmi-test
  namespace: public
spec:
  domain:
    cpu:
      cores: 67 # note this value; it is arbitrary here, but it matters later
      model: host-passthrough
      sockets: 1
      threads: 1
    devices:
      inputs:
      - bus: usb
        name: input0
        type: tablet
      interfaces:
      - masquerade: {}
        name: default
    features:
      acpi:
        enabled: true
    ioThreadsPolicy: auto
    machine:
      type: q35
    memory:
      hugepages:
        pageSize: 2Mi # use 2Mi HugePages
    resources:
      limits:
        cpu: "4"
        memory: 4Gi # equal to the request memory
      requests:
        cpu: "2"
        memory: 4Gi # equal to the limit memory
  networks:
  - name: default
    pod: {}
After creating the vmi, the generated Pod looks like this (only the relevant fields are kept):
apiVersion: v1
kind: Pod
metadata:
  name: virt-launcher-test-vmi-hv8sv
  namespace: public
spec:
  containers:
  - name: compute
    resources:
      limits:
        cpu: "4"
        hugepages-2Mi: 4Gi # the 2Mi huge-page size was set to the vmi request/limit memory
        memory: "723591169" # ???
      requests:
        cpu: "2"
        hugepages-2Mi: 4Gi # the 2Mi huge-page size was set to the vmi request/limit memory
        memory: "723591169" # ???
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepages
  volumes:
  - emptyDir:
      medium: HugePages
    name: hugepages
In the Pod YAML above, the cpu and hugepages-2Mi values under request/limit are easy to understand, but the request/limit memory is set to 723591169. Where does that value come from? To answer that, we need to look at the KubeVirt source code:
// pkg/virt-controller/services/template.go
func (t *templateService) RenderLaunchManifest(vmi *v1.VirtualMachineInstance) (*k8sv1.Pod, error) {
    /*...*/
    // Get memory overhead
    memoryOverhead := getMemoryOverhead(vmi.Spec.Domain)
    /*...*/
    // Consider hugepages resource for pod scheduling
    if vmi.Spec.Domain.Memory != nil && vmi.Spec.Domain.Memory.Hugepages != nil {
        hugepageType := k8sv1.ResourceName(k8sv1.ResourceHugePagesPrefix + vmi.Spec.Domain.Memory.Hugepages.PageSize)
        resources.Requests[hugepageType] = resources.Requests[k8sv1.ResourceMemory]
        resources.Limits[hugepageType] = resources.Requests[k8sv1.ResourceMemory]

        // Configure hugepages mount on a pod
        volumeMounts = append(volumeMounts, k8sv1.VolumeMount{
            Name:      "hugepages",
            MountPath: filepath.Join("/dev/hugepages"),
        })
        volumes = append(volumes, k8sv1.Volume{
            Name: "hugepages",
            VolumeSource: k8sv1.VolumeSource{
                EmptyDir: &k8sv1.EmptyDirVolumeSource{
                    Medium: k8sv1.StorageMediumHugePages,
                },
            },
        })

        // With HugePages enabled, limit/request memory is set to memoryOverhead
        resources.Requests[k8sv1.ResourceMemory] = *memoryOverhead
        if _, ok := resources.Limits[k8sv1.ResourceMemory]; ok {
            resources.Limits[k8sv1.ResourceMemory] = *memoryOverhead
        }
    } else {
        // Add overhead memory
        memoryRequest := resources.Requests[k8sv1.ResourceMemory]
        if !vmi.Spec.Domain.Resources.OvercommitGuestOverhead {
            memoryRequest.Add(*memoryOverhead)
        }
        resources.Requests[k8sv1.ResourceMemory] = memoryRequest
        if memoryLimit, ok := resources.Limits[k8sv1.ResourceMemory]; ok {
            memoryLimit.Add(*memoryOverhead)
            resources.Limits[k8sv1.ResourceMemory] = memoryLimit
        }
    }
    /*...*/
}
As the code shows, when HugePages are enabled, KubeVirt sets the Pod's limit/request memory to memoryOverhead, which is computed by the following function:
// getMemoryOverhead computes the estimation of total
// memory needed for the domain to operate properly.
// This includes the memory needed for the guest and memory
// for Qemu and OS overhead.
//
// The return value is overhead memory quantity
//
// Note: This is the best estimation we were able to come up with
// and is still not 100% accurate
func getMemoryOverhead(domain v1.DomainSpec) *resource.Quantity {
    vmiMemoryReq := domain.Resources.Requests.Memory()
    overhead := resource.NewScaledQuantity(0, resource.Kilo)

    // Add the memory needed for pagetables (one bit for every 512b of RAM size)
    pagetableMemory := resource.NewScaledQuantity(vmiMemoryReq.ScaledValue(resource.Kilo), resource.Kilo)
    pagetableMemory.Set(pagetableMemory.Value() / 512)
    overhead.Add(*pagetableMemory)

    // Add fixed overhead for shared libraries and such
    // TODO account for the overhead of kubevirt components running in the pod
    overhead.Add(resource.MustParse("128M"))

    // Add CPU table overhead (8 MiB per vCPU and 8 MiB per IO thread)
    // overhead per vcpu in MiB
    coresMemory := resource.MustParse("8Mi")
    if domain.CPU != nil {
        value := coresMemory.Value() * int64(domain.CPU.Cores)
        coresMemory = *resource.NewQuantity(value, coresMemory.Format)
    }
    overhead.Add(coresMemory)

    // static overhead for IOThread
    overhead.Add(resource.MustParse("8Mi"))

    // Add video RAM overhead
    if domain.Devices.AutoattachGraphicsDevice == nil || *domain.Devices.AutoattachGraphicsDevice == true {
        overhead.Add(resource.MustParse("16Mi"))
    }

    return overhead
}
getMemoryOverhead shows that the overhead is ordinary memory reserved for the domain. Note that when vmi.spec.domain.cpu.cores is set, the overhead includes 8Mi * cores. In the vmi YAML above, request.memory is 4Gi and cpu.cores is 67, so per getMemoryOverhead: overhead = 4Gi/512 + 128M + 8Mi*67 + 8Mi + 16Mi (the 16Mi video-RAM term applies because AutoattachGraphicsDevice is unset). After unifying the units this comes out to 723591169 bytes, roughly 0.7 GiB; the odd trailing byte comes from the page-table term being rounded through a kilobyte-scaled Quantity. This is exactly the limit/request memory seen in the Pod YAML.
This overhead calculation means that when you create a vmi, even one that declares only HugePages, KubeVirt still gives the Pod some ordinary memory on top. Keep this in mind when doing fine-grained resource scheduling; otherwise Pods may fail to schedule because too little ordinary memory was reserved on the node.
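As a sanity check, the calculation can be reproduced with shell arithmetic. The page-table term mimics the rounding the Go code appears to perform: the 4Gi request is first rounded up to whole kilobytes via ScaledValue(resource.Kilo) before the division by 512, which is where the odd trailing byte comes from:

```shell
pagetable=$(( (4 * 1024 * 1024 * 1024 + 999) / 1000 * 1000 / 512 ))  # 4Gi rounded up to kB, then /512
fixed=$(( 128 * 1000 * 1000 ))     # resource.MustParse("128M") is decimal megabytes
cores=$(( 8 * 1024 * 1024 * 67 ))  # 8Mi per vCPU, 67 cores
iothread=$(( 8 * 1024 * 1024 ))    # static IOThread overhead
video=$(( 16 * 1024 * 1024 ))      # graphics device is auto-attached by default
echo $(( pagetable + fixed + cores + iothread + video ))  # 723591169
```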
Also published on the WeChat official account 卡巴斯; you are welcome to follow it.