cAdvisor Data Collection: A Source Code Analysis

AloneDrifters

Last modified: 2023-02-25 15:40:11

Category: Cloud Learning. Tags: kubernetes

First published: 2022-09-29 16:06:03

Article link: https://blog.csdn.net/qq_53609683/article/details/126001459


Introduction

cAdvisor (Container Advisor) monitors a node's resources and containers and collects data about them, such as CPU, memory, and network usage.

manager

manager is cAdvisor's resource manager. It holds a map of containerData, an InMemoryCache, sysFs (the interface to the system file system), watchers, an httpClient, and so on.

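type manager struct {
   // Operations on, and information about, every monitored container
   containers               map[namespacedContainerName]*containerData
   // Data cached in memory, mainly container-related information
   memoryCache              *memory.InMemoryCache
   // Information about the host machine's file systems
   fsInfo                   fs.FsInfo
   // File system interface
   sysFs                    sysfs.SysFs
   // Information about the node
   machineInfo              info.MachineInfo
   // The container that cAdvisor itself runs in
   cadvisorContainer        string
   // Wrapper around event-related operations
   eventHandler             events.EventManager
   // Start time of the manager
   startupTime              time.Time
   // Interval at which container information is collected and updated
   maxHousekeepingInterval  time.Duration
   ...
}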

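InMemoryCache is an in-memory store. Its containerCacheMap maps each container to its cached data, storageDriver controls where the data is persisted, and maxAge is how long data is kept in memory.

sysFs is the interface to the system file system and is used to read the corresponding file system information.

maxHousekeepingInterval is the interval at which container information is collected and updated.

All of the above is in place once cAdvisor has started.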

Manager interface

The Manager interface defines the methods for obtaining container, process, and node data.

Container info

// Get the info data of the specified container
DockerContainer(containerName string, query *info.ContainerInfoRequest) (info.ContainerInfo, error)

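1. Parameters

containerName: the container name
ContainerInfoRequest: specifies how much data about the container the caller wants, that is, the maximum number of stats to return plus the start and end times

2. The returned ContainerInfo contains

ContainerReference: the struct that uniquely identifies the container (container ID, name, aliases, namespace)

Subcontainers []ContainerReference: the direct subcontainers of the current container

ContainerSpec: the container's specification, such as creation time, CPU and memory settings, and the image in use

Stats []*ContainerStats: the historical statistics collected from the container (CPU, memory, IO, file system)

3. Implementation

3.1. DockerContainer looks up the containerData by container name (in the manager's map[namespacedContainerName]*containerData)

container, err := m.getDockerContainer(containerName)

containerData provides the actual operations on a container together with its containerInfo; operations on the data in InMemoryCache also go through it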

type containerData struct {
   oomEvents                uint64
   handler                  container.ContainerHandler
   info                     containerInfo
   memoryCache              *memory.InMemoryCache
   housekeepingInterval     time.Duration
   maxHousekeepingInterval  time.Duration
   allowDynamicHousekeeping bool
   // Updates the statistics of the perf_event cgroup controller
   perfCollector stats.Collector
   // Updates the statistics of the resctrl controller
   resctrlCollector stats.Collector
   ...
}

3.2. Build the ContainerInfo that matches the query (containerData already holds a containerInfo)

m.containerDataToContainerInfo(container, query)  (the arguments are the containerData and the ContainerInfoRequest)

3.2.1. First call containerData's GetInfo(…) method

cinfo, err := cont.GetInfo(true)

The ContainerReference, subcontainers, and ContainerSpec of the containerInfo held by containerData are copied into the ContainerInfo to be returned.
Inside GetInfo():

// The spec and subcontainer information are refreshed when more than 5s have passed
if cd.clock.Since(cd.infoLastUpdatedTime) > 5*time.Second || shouldUpdateSubcontainers {
   err := cd.updateSpec()
   ...
}
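
The refresh rule in this excerpt can be illustrated in isolation. The following is only a minimal sketch, not cAdvisor's code: shouldRefresh and specRefreshInterval are hypothetical names, and the age of the cached data is passed in directly instead of being read from a clock.

```go
package main

import (
	"fmt"
	"time"
)

// specRefreshInterval mirrors the 5-second threshold used in GetInfo().
const specRefreshInterval = 5 * time.Second

// shouldRefresh reports whether cached spec data must be re-read: either the
// data is older than the interval, or the caller forces a refresh.
func shouldRefresh(age time.Duration, force bool) bool {
	return age > specRefreshInterval || force
}

func main() {
	fmt.Println(shouldRefresh(2*time.Second, false)) // fresh enough
	fmt.Println(shouldRefresh(6*time.Second, false)) // stale
	fmt.Println(shouldRefresh(2*time.Second, true))  // forced
}
```

The comparison is strict, so data that is exactly 5 seconds old is still served from the cache.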

3.2.2. The manager's getAdjustedSpec(info) replaces the default value of spec.Memory.Limit with an actual value

spec.Memory.Limit = uint64(m.machineInfo.MemoryCapacity)

3.2.3. Finally, fetch the stats

// memoryCache is an InMemoryCache; its containerCacheMap is the map of cached container data
stats, err := m.memoryCache.RecentStats(cinfo.Name, query.Start, query.End, query.NumStats)

Concretely, the containerCache is looked up in containerCacheMap[name], and its RecentStats(start, end, maxStats) returns the stats
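
The two-level lookup (cache map first, then a time-window query on the per-container cache) might look roughly like the sketch below. It is an illustration under assumptions, not the cAdvisor implementation: the types are hypothetical stand-ins, timestamps are plain Unix seconds, and a bound of 0 is taken to mean "unbounded".

```go
package main

import "fmt"

// sample is a hypothetical stand-in for one ContainerStats entry.
type sample struct {
	Timestamp int64 // Unix seconds, for brevity
	CPUUsage  uint64
}

// containerCache holds the samples of one container in chronological order.
type containerCache struct {
	samples []sample
}

// recentStats returns at most maxStats of the newest samples in [start, end].
// A bound of 0 is treated as unbounded; maxStats <= 0 means no cap.
func (c *containerCache) recentStats(start, end int64, maxStats int) []sample {
	var out []sample
	for _, s := range c.samples {
		if start != 0 && s.Timestamp < start {
			continue
		}
		if end != 0 && s.Timestamp > end {
			continue
		}
		out = append(out, s)
	}
	if maxStats > 0 && len(out) > maxStats {
		out = out[len(out)-maxStats:]
	}
	return out
}

// inMemoryCache maps a container name to that container's cache.
type inMemoryCache struct {
	containerCacheMap map[string]*containerCache
}

func (m *inMemoryCache) recentStats(name string, start, end int64, maxStats int) ([]sample, error) {
	cc, ok := m.containerCacheMap[name]
	if !ok {
		return nil, fmt.Errorf("no cached data for container %q", name)
	}
	return cc.recentStats(start, end, maxStats), nil
}

func main() {
	cc := &containerCache{}
	for i := int64(0); i < 5; i++ {
		cc.samples = append(cc.samples, sample{Timestamp: 100 + i, CPUUsage: uint64(i)})
	}
	cache := &inMemoryCache{containerCacheMap: map[string]*containerCache{"/docker/abc": cc}}
	stats, err := cache.recentStats("/docker/abc", 101, 0, 2)
	fmt.Println(stats, err)
}
```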

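4. The manager periodically reads the relevant files on Linux. All of the data comes from cgroups, whose directory is /sys/fs/cgroup. That directory contains the cgroup subsystems. A subsystem can limit a resource (CPU, memory, PIDs, disk IO) and collect usage data for it. For example, a group that exceeds its CPU limit is throttled back to roughly the configured value, and a group that exceeds its memory limit can be killed, depending on the settings in its control files.

(cgroup concepts:

1. Task: a task is a process

2. Control group (cgroup): resources are controlled per control group; the limits are configured on the group, and the tasks file of a cgroup can list multiple tasks

3. Hierarchy: describes how control groups are nested; each node of the hierarchy tree is a cgroup, and a child node inherits the control settings of its parent

4. Subsystem: a resource controller; each subsystem controls its own resource. Once a subsystem is attached to a hierarchy, it controls every node of that hierarchy. For example, a directory created under a subsystem automatically gets the subsystem's control files; that directory is a node, that is, a control group, so creating it adds a node to the hierarchy

)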
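
The relationship between tasks, hierarchies, subsystems, and control groups is visible in /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. The sketch below parses one such line; it is not taken from cAdvisor, and membership and parseCgroupLine are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// membership describes one line of /proc/<pid>/cgroup: the task belongs to
// the cgroup at Path in the hierarchy that the listed controllers are attached to.
type membership struct {
	HierarchyID string
	Controllers []string
	Path        string
}

func parseCgroupLine(line string) (membership, error) {
	parts := strings.SplitN(strings.TrimSpace(line), ":", 3)
	if len(parts) != 3 {
		return membership{}, fmt.Errorf("malformed cgroup line %q", line)
	}
	var controllers []string
	if parts[1] != "" { // the controller list is empty for the cgroup v2 hierarchy
		controllers = strings.Split(parts[1], ",")
	}
	return membership{HierarchyID: parts[0], Controllers: controllers, Path: parts[2]}, nil
}

func main() {
	for _, line := range []string{"4:cpu,cpuacct:/docker/abc", "0::/user.slice"} {
		m, err := parseCgroupLine(line)
		fmt.Printf("%+v %v\n", m, err)
	}
}
```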

5. Updating the ContainerSpec
-----containerData calls updateSpec()

// updateSpec() calls the following two methods
spec, err := cd.handler.GetSpec()
customMetrics, err := cd.collectorManager.GetSpec() // custom metrics

The data is provided by cgroups. The cgroup files that are read include:

cpu files: cpu.weight, cpu.max, cpu.shares, cpu.cfs_period_us, cpu.cfs_quota_us, etc.

cpuacct files: cpuacct.usage, cpuacct.stat, cpuacct.usage_percpu

cpuset files: cpuset.cpus.effective, cpuset.cpus, etc.

memory files: memory.max, memory.high, memory.swap.max, memory.limit_in_bytes, memory.memsw.limit_in_bytes, memory.soft_limit_in_bytes, etc.

blkio files: blkio.time, blkio.avg_queue_size, blkio.dequeue, blkio.io_service_bytes, etc.

…

GetSpec() calls getSpecInternal(), which uses cgroupPaths and the corresponding system files to obtain the machine info (machineInfoFactory.GetMachineInfo()) and the CPU, CPU mask, memory, hugepage, and process data

func getSpecInternal(cgroupPaths map[string]string, machineInfoFactory info.MachineInfoFactory, hasNetwork, hasFilesystem, cgroup2UnifiedMode bool) (info.ContainerSpec, error)

Implementation of getSpecInternal() (data read from cgroups, non-cgroup2UnifiedMode case)

(1) Get machine info

mi, err := machineInfoFactory.GetMachineInfo()
if err != nil { return spec, err }

(2) Get cpu info

spec.Cpu.Limit = readUInt64(cpuRoot, "cpu.shares")
spec.Cpu.Period = readUInt64(cpuRoot, "cpu.cfs_period_us")
// Parse the quota string as a decimal number when it is neither empty nor -1
quota := readString(cpuRoot, "cpu.cfs_quota_us")
if quota != "" && quota != "-1" {
   val, err := strconv.ParseUint(quota, 10, 64)
   if err != nil {
      klog.Errorf("GetSpec: Failed to parse CPUQuota from %q: %s", path.Join(cpuRoot, "cpu.cfs_quota_us"), err)
   } else {
      spec.Cpu.Quota = val
   }
}
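
To see what the quota and period values mean together, the sketch below repeats the "empty or -1 means no quota" check and then divides quota by period, which gives the number of CPUs' worth of time the group may use per period. parseQuota and cpuLimit are hypothetical helpers written for this illustration, not cAdvisor functions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseQuota converts the content of cpu.cfs_quota_us into a number.
// ok is false when the file is empty, holds -1 (no limit), or cannot be parsed.
func parseQuota(raw string) (quota uint64, ok bool) {
	raw = strings.TrimSpace(raw)
	if raw == "" || raw == "-1" {
		return 0, false
	}
	val, err := strconv.ParseUint(raw, 10, 64)
	if err != nil {
		return 0, false
	}
	return val, true
}

// cpuLimit returns quota/period, i.e. how many CPUs the group may consume.
func cpuLimit(quota, period uint64) float64 {
	if period == 0 {
		return 0
	}
	return float64(quota) / float64(period)
}

func main() {
	if quota, ok := parseQuota("50000\n"); ok {
		fmt.Println(cpuLimit(quota, 100000)) // half a CPU
	}
	_, limited := parseQuota("-1")
	fmt.Println(limited)
}
```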

(3) Get Cpu Mask info ( cgroup 中 cpuset )

mask = readString(cpusetRoot, "cpuset.cpus")
// mi is the MachineInfo obtained in (1)
spec.Cpu.Mask = utils.FixCpuMask(mask, mi.NumCores)
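
cpuset.cpus holds a list such as 0-3,7 (comma-separated CPU ids and inclusive ranges). As an illustration of that format only, the sketch below expands such a list into individual CPU ids; parseCPUList is a hypothetical helper and does not reproduce what utils.FixCpuMask does.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a cpuset list like "0-2,7" into [0 1 2 7].
func parseCPUList(list string) ([]int, error) {
	list = strings.TrimSpace(list)
	if list == "" {
		return nil, nil
	}
	var cpus []int
	for _, part := range strings.Split(list, ",") {
		bounds := strings.SplitN(part, "-", 2)
		first, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, fmt.Errorf("bad CPU id in %q: %v", part, err)
		}
		last := first
		if len(bounds) == 2 {
			if last, err = strconv.Atoi(bounds[1]); err != nil {
				return nil, fmt.Errorf("bad range end in %q: %v", part, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for c := first; c <= last; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	cpus, err := parseCPUList("0-2,7\n")
	fmt.Println(cpus, err)
}
```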

(4) Get Memory info

spec.Memory.Limit = readUInt64(memoryRoot, "memory.limit_in_bytes")
spec.Memory.SwapLimit = readUInt64(memoryRoot, "memory.memsw.limit_in_bytes")
spec.Memory.Reservation = readUInt64(memoryRoot, "memory.soft_limit_in_bytes")

(5) Get Hugepage info

hugepageRoot, ok := cgroupPaths["hugetlb"]
if ok {
   if utils.FileExists(hugepageRoot) {
      spec.HasHugetlb = true
   }
}

(6) Get pids info (the value is read directly from the pids path)

pidsRoot, ok := GetControllerPath(cgroupPaths, "pids", cgroup2UnifiedMode)
if ok {
   if utils.FileExists(pidsRoot) {
      spec.HasProcesses = true
      spec.Processes.Limit = readUInt64(pidsRoot, "pids.max")
   }
}

(7) Network, Filesystem, DiskIo

(hasNetwork and hasFilesystem are parameters of the method)

spec.HasNetwork = hasNetwork
spec.HasFilesystem = hasFilesystem

ioControllerName := "blkio"
if blkioRoot, ok := cgroupPaths[ioControllerName]; ok && utils.FileExists(blkioRoot) {
   spec.HasDiskIo = true
}

6. Getting the cgroup and network statistics of a given container

func (h *Handler) GetStats() (*info.ContainerStats, error)

Implementation:

/proc/cgroups: the cgroup subsystems supported by the current kernel

cgroupStats, err := h.cgroupManager.GetStats()
libcontainerStats := &libcontainer.Stats{
   CgroupStats: cgroupStats,
}
stats := newContainerStats(libcontainerStats, h.includedMetrics)

If the pid is known, the network statistics are read from /proc/<pid>/net/dev

...
// If we know the pid then get network stats from /proc/<pid>/net/dev
if h.pid > 0 {
   if h.includedMetrics.Has(container.NetworkUsageMetrics) {
      netStats, err := networkStatsFromProc(h.rootFs, h.pid)
      if err != nil {
         klog.V(4).Infof("Unable to get network stats from pid %d: %v", h.pid, err)
      } else {
         stats.Network.Interfaces = append(stats.Network.Interfaces, netStats...)
      }
   }
   if h.includedMetrics.Has(container.NetworkTcpUsageMetrics) {
      t, err := tcpStatsFromProc(h.rootFs, h.pid, "net/tcp")
      if err != nil {
         klog.V(4).Infof("Unable to get tcp stats from pid %d: %v", h.pid, err)
      } else {
         stats.Network.Tcp = t
...
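
For orientation, /proc/<pid>/net/dev starts with two header lines and then has one line per interface: the name, a colon, eight receive counters, and eight transmit counters. The sketch below extracts the byte and packet counters from such content. It is an assumption-laden illustration, not the body of networkStatsFromProc; ifaceStats and parseNetDev are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ifaceStats keeps a subset of the per-interface counters.
type ifaceStats struct {
	Name                                  string
	RxBytes, RxPackets, TxBytes, TxPackets uint64
}

// parseNetDev parses the content of a net/dev file. Header lines carry no
// colon and are skipped; counters 0-7 are receive, 8-15 are transmit.
func parseNetDev(content string) ([]ifaceStats, error) {
	var out []ifaceStats
	for _, line := range strings.Split(content, "\n") {
		idx := strings.Index(line, ":")
		if idx < 0 {
			continue
		}
		fields := strings.Fields(line[idx+1:])
		if len(fields) < 16 {
			return nil, fmt.Errorf("unexpected field count in %q", line)
		}
		s := ifaceStats{Name: strings.TrimSpace(line[:idx])}
		targets := map[int]*uint64{0: &s.RxBytes, 1: &s.RxPackets, 8: &s.TxBytes, 9: &s.TxPackets}
		for i, dst := range targets {
			v, err := strconv.ParseUint(fields[i], 10, 64)
			if err != nil {
				return nil, fmt.Errorf("bad counter in %q: %v", line, err)
			}
			*dst = v
		}
		out = append(out, s)
	}
	return out, nil
}

func main() {
	content := "Inter-|   Receive |  Transmit\n" +
		" face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed\n" +
		"  eth0: 100 2 0 0 0 0 0 0 300 4 0 0 0 0 0 0\n"
	stats, err := parseNetDev(content)
	fmt.Printf("%+v %v\n", stats, err)
}
```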

7. Getting fs-related information

func NewFsInfo(context Context) (FsInfo, error)

Implementation:

/proc/self/mountinfo: the mount information of the current process

...
fileReader, err := os.Open("/proc/self/mountinfo")
mounts, err := mount.GetMountsFromReader(fileReader, nil)
...
for _, mnt := range mounts {
   fsInfo.mounts[mnt.Mountpoint] = *mnt
}
...
fsInfo.addDockerImagesLabel(context, mounts)
fsInfo.addCrioImagesLabel(context, mounts)
fsInfo.addSystemRootLabel(mounts)
...
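
Each line of /proc/self/mountinfo lists the mount ID, parent ID, major:minor, root, mount point, and mount options, followed by optional fields, a lone "-" separator, and then the file system type, mount source, and super options. The sketch below pulls out the fields that a mount table keyed by mount point would need. It is a simplified illustration with hypothetical names (mountEntry, parseMountInfoLine), not the parser cAdvisor uses, and it does not unescape sequences such as \040 in paths.

```go
package main

import (
	"fmt"
	"strings"
)

// mountEntry keeps a subset of one mountinfo line.
type mountEntry struct {
	MountPoint string
	FsType     string
	Source     string
}

func parseMountInfoLine(line string) (mountEntry, error) {
	fields := strings.Fields(line)
	sep := -1
	for i, f := range fields {
		if f == "-" {
			sep = i
			break
		}
	}
	// Six fixed fields precede the separator; type and source follow it.
	if sep < 6 || len(fields) < sep+3 {
		return mountEntry{}, fmt.Errorf("malformed mountinfo line %q", line)
	}
	return mountEntry{MountPoint: fields[4], FsType: fields[sep+1], Source: fields[sep+2]}, nil
}

func main() {
	line := "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue"
	mounts := map[string]mountEntry{}
	if m, err := parseMountInfoLine(line); err == nil {
		mounts[m.MountPoint] = m
	}
	fmt.Printf("%+v\n", mounts)
}
```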

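8. When a docker container starts, a node is created for it under each cgroup subsystem

cpu: /sys/fs/cgroup/cpu/docker/<container ID>/ (limit files and the tasks file)
memory: /sys/fs/cgroup/memory/docker/<container ID>/ (limit files and the tasks file)
…

When a container has several processes, their IDs are all added to tasks. A container corresponds to one cgroup in each hierarchy, so collecting data from that cgroup, or limiting it, applies to every process in the container

9. cAdvisor is built around an event mechanism with an event watching layer (which watches events on the Linux system) and an event processing layer

Event watching layer: ContainerAdd events (the watchForNewContainers function) and OOM events (the watchForNewOoms function)

rawWatcher watches the system's cgroup root directories directly; when an event occurs, the event processing layer updates the data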
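
As a small illustration of the layout in item 8, the sketch below builds the per-container cgroup directory for a subsystem and parses the content of a tasks file, which holds one process ID per line. cgroupDir and parseTasks are hypothetical helpers; the path layout is the one given in this article and will differ on other cgroup drivers or on cgroup v2.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
	"strings"
)

// cgroupDir returns the directory of a docker container's cgroup in one subsystem.
func cgroupDir(subsystem, containerID string) string {
	return path.Join("/sys/fs/cgroup", subsystem, "docker", containerID)
}

// parseTasks converts the content of a tasks file into a list of process IDs.
func parseTasks(content string) ([]int, error) {
	var pids []int
	for _, field := range strings.Fields(content) {
		pid, err := strconv.Atoi(field)
		if err != nil {
			return nil, fmt.Errorf("bad pid %q: %v", field, err)
		}
		pids = append(pids, pid)
	}
	return pids, nil
}

func main() {
	fmt.Println(path.Join(cgroupDir("cpu", "abc123"), "tasks"))
	pids, err := parseTasks("4021\n4055\n")
	fmt.Println(pids, err)
}
```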
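
The split between a watching layer and a processing layer maps naturally onto a Go channel: one side reports events, the other applies them to the tracked state. The sketch below shows only that pattern. The event types, the process function, and the OOM counter map are hypothetical and much simpler than cAdvisor's own watcher and event handling code.

```go
package main

import "fmt"

type eventType int

const (
	containerAdd eventType = iota
	oomKill
)

type event struct {
	Type      eventType
	Container string
}

// process is the processing layer: it applies each event to the tracked
// state, here a map from container name to the number of OOM events seen.
func process(events <-chan event, oomCounts map[string]int) {
	for ev := range events {
		switch ev.Type {
		case containerAdd:
			if _, known := oomCounts[ev.Container]; !known {
				oomCounts[ev.Container] = 0
			}
		case oomKill:
			oomCounts[ev.Container]++
		}
	}
}

func main() {
	events := make(chan event)
	// Watching layer: a goroutine that reports what it observes.
	go func() {
		defer close(events)
		events <- event{Type: containerAdd, Container: "/docker/abc"}
		events <- event{Type: oomKill, Container: "/docker/abc"}
	}()
	state := map[string]int{}
	process(events, state)
	fmt.Println(state)
}
```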

Machine Info

GetMachineInfo() obtains information about the node machine

func GetCPUVendorID(procInfo []byte) string     // returns "vendor_id", read from /proc/cpuinfo

func GetPhysicalCores(procInfo []byte) int      // returns the number of CPU cores read from /proc/cpuinfo

func GetSockets(procInfo []byte) int       // returns the number of CPU sockets read from /proc/cpuinfo

func GetClockSpeed(procInfo []byte) (uint64, error)        // returns the CPU clock speed

func GetMachineMemoryCapacity() (uint64, error)      // returns the machine's total memory from /proc/meminfo

func GetMachineMemoryByType(edacPath string) (map[string]*info.MemoryInfo, error)  // returns the memory capacity and the number of DIMMs

func GetMachineSwapCapacity() (uint64, error)     // returns the machine's total swap space from /proc/meminfo

…

The data is obtained by reading files:

out, err := ioutil.ReadFile("/proc/cpuinfo") // read cpuinfo

clockSpeed, err := machine.GetClockSpeed(cpuinfo) // get the clock speed

out, err := ioutil.ReadFile("/proc/meminfo") // read meminfo

Files involved:

/proc/cpuinfo // NumCores, CpuFrequency, etc.

/var/lib/dbus/machine-id // machine_id

/proc/sys/kernel/random/boot_id // boot_id

/proc/diskstats // disk

/sys/block // block device information

/sys/class/net // netDevices

/sys/class/dmi/id/product_uuid // system UUID
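
As an example of how one of these files turns into a number, /proc/meminfo contains a line such as "MemTotal:  16384 kB". The sketch below finds that line and converts the value to bytes. parseMemTotal is a hypothetical helper for illustration and is not the body of GetMachineMemoryCapacity.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemTotal returns the MemTotal value of a meminfo file in bytes.
// The kernel reports the value in kB, so it is multiplied by 1024.
func parseMemTotal(meminfo string) (uint64, error) {
	for _, line := range strings.Split(meminfo, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, fmt.Errorf("bad MemTotal value %q: %v", fields[1], err)
			}
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("MemTotal not found")
}

func main() {
	total, err := parseMemTotal("MemTotal:       16384 kB\nMemFree:         1024 kB\n")
	fmt.Println(total, err)
}
```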

ProcessList

func (m *manager) GetProcessList(containerName string, options v2.RequestOptions) ([]v2.ProcessInfo, error)

Inside the function:

1. Get the requested containers as a map of containerData

// conts is a map[string]*containerData
conts, err := m.getRequestedContainers(containerName, options)

2. Iterate over the containerData entries and collect the ps information of every container

ps, err = cont.GetProcessList(m.cadvisorContainer, m.inHostNamespace)

2.1.

out, err := cd.getPsOutput(inHostNamespace, format)

In this function cAdvisor first checks whether it is itself running in a container. If so, it runs chroot /rootfs first (when cAdvisor runs in a container, the host's / is mounted at /rootfs through a volume) and then executes

ps -e -o user,pid,ppid,stime,pcpu,pmem,rss,vsz,stat,time,comm,psr,cgroup

The implementation:

func (cd *containerData) getPsOutput(inHostNamespace bool, format string) ([]byte, error) {
   args := []string{}
   command := "ps"
   if !inHostNamespace {
      command = "/usr/sbin/chroot"
      args = append(args, "/rootfs", "ps")
   }
   args = append(args, "-e", "-o", format)
   out, err := exec.Command(command, args...).Output()
   if err != nil {
      return nil, fmt.Errorf("failed to execute %q command: %v", command, err)
   }
   return out, err
}

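Otherwise ps is executed directly. When the process list is requested for the whole machine rather than for one container, containerName defaults to "/"

2.2. Next

func (cd *containerData) parseProcessList(cadvisorContainer string, inHostNamespace bool, out []byte) ([]v2.ProcessInfo, error) {

The out []byte from above is read line by line, parsed, and converted into ProcessInfo values that are returned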
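
The parsing step can be pictured with the sketch below, which splits ps output produced with the 13-column format shown earlier and keeps a few of the columns. It is a simplified illustration, not parseProcessList itself: processInfo and parsePs are hypothetical names, and the sketch assumes that no column (including the command name) contains spaces.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// processInfo keeps a subset of the ps columns.
type processInfo struct {
	User   string
	Pid    int
	Ppid   int
	Cmd    string
	Cgroup string
}

// parsePs parses output of:
// ps -e -o user,pid,ppid,stime,pcpu,pmem,rss,vsz,stat,time,comm,psr,cgroup
func parsePs(out string) ([]processInfo, error) {
	lines := strings.Split(strings.TrimSpace(out), "\n")
	var procs []processInfo
	for _, line := range lines[1:] { // the first line is the column header
		f := strings.Fields(line)
		if len(f) != 13 {
			return nil, fmt.Errorf("expected 13 columns, got %d in %q", len(f), line)
		}
		pid, err := strconv.Atoi(f[1])
		if err != nil {
			return nil, fmt.Errorf("bad pid %q: %v", f[1], err)
		}
		ppid, err := strconv.Atoi(f[2])
		if err != nil {
			return nil, fmt.Errorf("bad ppid %q: %v", f[2], err)
		}
		procs = append(procs, processInfo{User: f[0], Pid: pid, Ppid: ppid, Cmd: f[10], Cgroup: f[12]})
	}
	return procs, nil
}

func main() {
	out := "USER PID PPID STIME %CPU %MEM RSS VSZ STAT TIME COMMAND PSR CGROUP\n" +
		"root 1 0 10:00 0.0 0.1 1024 2048 Ss 00:00:01 systemd 0 -\n"
	procs, err := parsePs(out)
	fmt.Printf("%+v %v\n", procs, err)
}
```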
