NSX-T Edge Node upgrade from 3.1.3.0 to 3.2.0 fails with an error
KB: https://docs.vmware.com/cn/VMware-NSX-T-Data-Center/3.1/installation/GUID-22F87CA8-01A9-4F2E-B7DB-9350CA60EA4E.html
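When upgrading the first Edge node to 3.2.0, a pile of error messages appeared right from the start.
This happened in my NSX-T environment, on the first Edge node. The upgrade reached about 30% and then failed with the error message shown below.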
Edge
3.2.0.0.0.19067070/Edge/nub/VMware-NSX-edge-3.2.0.0.0.19067089.nub switch OS task failed on edge TransportNode
b0630e4a-cac5-48e2-8cd6-7e112822595d: clientType EDGE , target edge
fabric node id b0630e4a-cac5-48e2-8cd6-7e112822595d, return status
switch_os execution failed with msg: An unexpected exception occurred:
CommandFailedError: Command ['chroot', '/os_bak',
'/opt/vmware/nsx-edge/bin/config.py', '--update-only'] returned
non-zero code 1: b"lspci: Unable to load libkmod resources: error
-12\nlspci: Unable to load libkmod resources: error -12\nlspci: Unable to load libkmod resources: error -12\nlspci: Unable to load libkmod
resources: error -12\nlspci: Unable to load libkmod resources: error
-12\nSystem has not been booted with systemd as init system (PID 1). Can't operate.\nERROR: Unable to get maintenance mode
information\nNsxRpcClient encountered an error: [Errno 2] No such file
or directory\nWARNING: Exception reading InbandMgmtInterfaceMsg from
nestdb, Command '['/opt/vmware/nsx-nestdb/bin/nestdb-cli', '--json',
'--cmd', 'get', 'InbandMgmtInterfaceMsg']' returned non-zero exit
status 1.\nERROR: NSX Edge configuration has failed. 1G hugepage
support required\n"
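The cause is a CPU compatibility issue. Some EVC modes hide the PDPE1GB CPU feature (1 GB hugepages) from the VM, even though it has been supported since Nehalem. The Haswell EVC mode exposes it. In my case the only way to give the VM the feature was to disable EVC or to force the feature on with the featMask advanced option. The EVC documentation does mention that not every feature is exposed at every EVC level, but this is still a little puzzling, because most enterprise CPUs of the previous three architectures support PDPE1GB in most cases.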
Solution: power off the Edge VM and add featMask.vm.cpuid.pdpe1gb = Val:1 to its advanced options.
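To confirm that the advanced option was actually saved, one possibility is to look for it in the VM's .vmx file. The following is only a sketch: the function name `vmx_has_pdpe1gb_mask` is made up for illustration, the path to the .vmx file is something you have to supply yourself, and it assumes the option is stored in the usual `key = "value"` form.

```shell
# Hypothetical helper: succeeds if the given .vmx file contains the
# featMask.vm.cpuid.pdpe1gb = "Val:1" entry (assumed key = "value" layout).
vmx_has_pdpe1gb_mask() {
    vmx_file="$1"
    # -E extended regex, -i case-insensitive key match, -q no output
    grep -Eiq '^[[:space:]]*featMask\.vm\.cpuid\.pdpe1gb[[:space:]]*=[[:space:]]*"Val:1"' "$vmx_file"
}
```

Usage, with a placeholder path: `vmx_has_pdpe1gb_mask /path/to/edge.vmx && echo "mask is set"`.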
Log in as "root" on each Edge node and run cat /proc/cpuinfo to check whether pdpe1gb is listed in the CPU flags. On the affected node, pdpe1gb was missing.
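Reading the full cat /proc/cpuinfo output by eye is tedious on a node with many vCPUs. A small sketch of the same check, assuming a standard x86 Linux /proc/cpuinfo with one "flags" line per logical CPU (the function name `has_pdpe1gb` is only an illustration, not an NSX tool):

```shell
# Report whether every logical CPU in a cpuinfo-style file advertises pdpe1gb.
# Usage: has_pdpe1gb [path]   (path defaults to /proc/cpuinfo)
has_pdpe1gb() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # One "flags" line exists per logical CPU; count all of them,
    # then count those that list pdpe1gb as a whole word.
    total=$(grep -c '^flags' "$cpuinfo" || true)
    with_flag=$(grep '^flags' "$cpuinfo" | grep -cw 'pdpe1gb' || true)
    echo "pdpe1gb present on ${with_flag:-0} of ${total:-0} CPUs"
    # Succeed only if at least one CPU exists and all of them have the flag.
    [ "${total:-0}" -gt 0 ] && [ "${with_flag:-0}" -eq "${total:-0}" ]
}
```

Run `has_pdpe1gb` as root on the Edge node before and after the change. A non-zero exit status means at least one vCPU still lacks the flag.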
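Power the VM back on and run the upgrade again. Problem solved!
Alternatively, simply disable EVC on the cluster that hosts the Edge nodes.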