Kubernetes 1.24.2 in Practice with Source Code Analysis (1)

Kubernetes 1.24.2 in Practice with Source Code Analysis

Chapter 1: Preparation

1.1 Introduction to Kubernetes and Its Core Object Concepts

Introduction to Kubernetes and Its Core Object Concepts - Alibaba Cloud Developer Community

Kubernetes architecture

 

 

Core objects

 

 

Deploying a k8s cluster in 10 minutes with kubeadm

Installing kubernetes_v1.23.1 with KuboardSpray | Kuboard

Deploying your first application on k8s

Basic concepts of Deployment

Adding a Service to the application, scaling and rolling updates

Installing Kuboard to explore the k8s cluster through its UI

Installing Kuboard v3 on Kubernetes 1.24.2

Installing Kuboard as a static pod

Installation commands

curl -fsSL https://addons.kuboard.cn/kuboard/kuboard-static-pod.sh -o kuboard.sh
sh kuboard.sh

Preparation for reading the k8s source code

vscode

Downloading the k8s 1.24.2 source code

Code repository locations of the k8s components

Chapter 2: kubectl's Execution Flow and Design Patterns When Creating a Pod

 

2.1 Deploying a simple nginx pod with kubectl

Walking through the flow and source code, starting from pod creation

Kubernetes source code analysis (1): the Pod creation flow in kubectl - Juejin

Write a YAML manifest for an nginx pod

Deploy the pod with kubectl

2.2 Using the command-line parsing library cobra

Main concepts in cobra

cobra has three important concepts: commands, arguments and flags. A command represents an action to execute, arguments are the parameters of that action, and flags are modifiers of the action. The general format for invoking a command-line program is:
APPNAME COMMAND ARG --FLAG
For example:

# server is the command, port is a flag
hugo server --port=1313
 
# clone is the command, URL is the argument, bare is a flag
git clone URL --bare
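
To make the three concepts concrete, here is a minimal, hypothetical cobra program (not from the kubectl source) that wires a command, a positional argument and a flag together:

package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	var port int // value bound to the --port flag

	// "serve" is the command, its positional argument is the site name,
	// and --port is a flag that modifies how the command runs.
	serveCmd := &cobra.Command{
		Use:  "serve [site]",
		Args: cobra.ExactArgs(1),
		Run: func(cmd *cobra.Command, args []string) {
			fmt.Printf("serving %s on port %d\n", args[0], port)
		},
	}
	serveCmd.Flags().IntVar(&port, "port", 1313, "port to listen on")

	rootCmd := &cobra.Command{Use: "app"}
	rootCmd.AddCommand(serveCmd)
	if err := rootCmd.Execute(); err != nil {
		os.Exit(1)
	}
}

Running it as `app serve mysite --port=8080` exercises the command, the argument and the flag in one call.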

The execution entry point of the kubectl create command lives under the cmd directory, as it does for every component

Code location: /home/gopath/src/k8s.io/kubernetes-1.24.2/cmd/kubectl/kubectl.go

package main

import (
	"k8s.io/component-base/cli"
	"k8s.io/kubectl/pkg/cmd"
	"k8s.io/kubectl/pkg/cmd/util"

	// Import to initialize client auth plugins.
	_ "k8s.io/client-go/plugin/pkg/client/auth"
)

func main() {
	command := cmd.NewDefaultKubectlCommand()
	if err := cli.RunNoErrOutput(command); err != nil {
		// Pretty-print the error and exit with an error.
		util.CheckErr(err)
	}
}

rand.Seed seeds the random number generator
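
For reference, in some kubectl versions main() begins with a seeding call like the sketch below (it does not appear in the 1.24.2 snippet above); it is shown only to illustrate what the note refers to, assuming the pre-Go-1.20 math/rand behaviour:

package main

import (
	"math/rand"
	"time"
)

// Seeding the global math/rand source makes any code that relies on it
// (for example client-side jitter or backoff) non-deterministic per run.
func init() {
	rand.Seed(time.Now().UnixNano())
}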

The kubectl library's cmd package is then called to create the command object

command := cmd.NewDefaultKubectlCommand()

D:\Workspace\Go\src\k8s.io\kubernetes@v0.24.2\staging\src\k8s.io\kubectl\pkg\cmd\cmd.go

staging\src\k8s.io\kubectl\pkg\cmd\cmd.go

github.com/spf13/cobra

The main features cobra provides are as follows:

*   Simple subcommand-based CLIs, e.g. app server, app fetch, etc.
*   Fully POSIX-compliant flag parsing
*   Nested subcommands
*   Global, local and cascading flags
*   Easy generation of applications and commands via cobra create appname and cobra add cmdname
*   Intelligent suggestions for mistyped commands, e.g. app srver suggests: did you mean app server?
*   Automatic help generation for commands and flags
*   Automatically generated detailed help, e.g. app help
*   Automatic recognition of the -h and --help flags
*   Automatically generated bash completion for the application
*   Automatically generated man pages for the application
*   Command aliases
*   Customizable help and usage messages
*   Optional tight integration with [viper](http://github.com/spf13/viper) for configuration

Creating a cobra application

go install github.com/spf13/cobra-cli@latest

mkdir my_cobra
cd my_cobra
// open the my_cobra project; after running go mod init you can see the module files
go mod init github.com/spf13/my_cobra
find
go run main.go
// edit root.go
        // Uncomment the following line if your bare application
        // has an action associated with it:
        // Run: func(cmd *cobra.Command, args []string) { },
        Run: func(cmd *cobra.Command, args []string) { 
                fmt.Println("my_cobra")
        },

// after building and running, it prints
go run main.go
[root@k8s-worker02 my_cobra]# go run main.go
# github.com/spf13/my_cobra/cmd
cmd/root.go:29:10: undefined: fmt
[root@k8s-worker02 my_cobra]# go run main.go
my_cobra
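
The first go run fails because fmt is used in root.go without being imported. A trimmed-down sketch of what cmd/root.go looks like after the fix (the generated file contains more comments and flag setup) is:

package cmd

import (
	"fmt"

	"github.com/spf13/cobra"
)

// rootCmd is the base command; Run now prints the application name.
var rootCmd = &cobra.Command{
	Use: "my_cobra",
	Run: func(cmd *cobra.Command, args []string) {
		fmt.Println("my_cobra")
	},
}

// Execute is called by main.main().
func Execute() {
	cobra.CheckErr(rootCmd.Execute())
}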

// use the cobra generator to scaffold the application
cobra-cli init
// besides init, which scaffolds the application, cobra-cli add generates code files for subcommands; for example, the commands below add code files for two subcommands, image and container:
cobra-cli add image
cobra-cli add container

[root@k8s-worker02 my_cobra]# find
.
./go.mod
./main.go
./cmd
./cmd/root.go
./cmd/image.go
./LICENSE
./go.sum
[root@k8s-worker02 my_cobra]# cobra-cli add container
container created at /home/gopath/src/my_cobra
[root@k8s-worker02 my_cobra]# go run main.go image
image called
[root@k8s-worker02 my_cobra]# go run main.go container
container called

You can see that what gets executed is the Run method of the corresponding xxxCmd

// containerCmd represents the container command
var containerCmd = &cobra.Command{
        Use:   "container",
        Short: "A brief description of your command",
        Long: `A longer description that spans multiple lines and likely contains examples
and usage of using your command. For example:

Cobra is a CLI library for Go that empowers applications.
This application is a tool to generate the needed files
to quickly create a Cobra application.`,
        Run: func(cmd *cobra.Command, args []string) {
                fmt.Println("container called")
        },
}

Copy cmd/container.go to cmd/version.go and add the version information

package cmd
import (
        "fmt"

        "github.com/spf13/cobra"
)

// versionCmd represents the version command
var versionCmd = &cobra.Command{
        Use:   "version",
        Short: "A brief description of your command",
        Long: `A longer description that spans multiple lines and likely contains examples
and usage of using your command. For example:

Cobra is a CLI library for Go that empowers applications.
This application is a tool to generate the needed files
to quickly create a Cobra application.`,
        Run: func(cmd *cobra.Command, args []string) {
                fmt.Println("my_cobra version is v1.0")
        },
}
func init() {
        rootCmd.AddCommand(versionCmd)
}

 

Add a MinimumNArgs validation

Create a new cmd/times.go

package cmd

import (
    "fmt"
    "strings"
    "github.com/spf13/cobra"
)

// timesCmd represents the times command
var echoTimes int
var timesCmd = &cobra.Command{
    Use:        "times [string to echo]",
    Short:      "Echo anything to the screen more times",
    Long:        `echo things multiple times back to the user by providing a count and a string`,
    Args: cobra.MinimumNArgs(1),
    Run:  func(cmd *cobra.Command, args []string) {
         for i := 0; i < echoTimes; i++ {
            fmt.Println("Echo: " + strings.Join(args, " "))
         }
    },  
}
func init() {
        rootCmd.AddCommand(timesCmd)
        timesCmd.Flags().IntVarP(&echoTimes, "times", "t", 1, "times to echo the input")
}

Because we set Args: cobra.MinimumNArgs(1) on timesCmd, the times subcommand must be given at least one argument, otherwise it reports an error. With an argument it works as expected:

go run main.go times -t=4 k8s

[root@k8s-worker02 my_cobra]# go run main.go times -t=4 k8s
Echo: k8s
Echo: k8s
Echo: k8s
Echo: k8s

Modify rootCmd, adding the pre/post run hooks:

        PersistentPreRun: func(cmd *cobra.Command, args []string) {
                fmt.Printf("[step_1]PersistentPreRun with args: %v\n", args)
        },
        PreRun: func(cmd *cobra.Command, args []string) {
                fmt.Printf("[step_2]PreRun with args: %v\n", args)
        },
        Run: func(cmd *cobra.Command, args []string) {
                fmt.Printf("[step_3]my_cobra version is v1.0: %v\n", args")
        },
        PostRun: func(cmd *cobra.Command, args []string) {
                fmt.Printf("[step_4]PostRun with args: %v\n", args)
        },
        PersistentPostRun: func(cmd *cobra.Command, args []string) {
                fmt.Printf("[step_5]PersistentPostRun with args: %v\n", args)
        },

[root@k8s-worker02 my_cobra]# go run main.go 
[step_1]PersistentPreRun with args: []
[step_2]PreRun with args: []
[step_3]my_cobra version is v1.0: []
[step_4]PostRun with args: []
[step_5]PersistentPostRun with args: []

2.3 Setting up pprof in the kubectl command line to capture flame graphs

The cmd entry point

D:\Workspace\Go\src\k8s.io\kubernetes@v0.24.2\staging\src\k8s.io\kubectl\pkg\cmd\cmd.go

// NewDefaultKubectlCommand creates the `kubectl` command with default arguments
func NewDefaultKubectlCommand() *cobra.Command {
	return NewDefaultKubectlCommandWithArgs(KubectlOptions{
		PluginHandler: NewDefaultPluginHandler(plugin.ValidPluginFilenamePrefixes),
		Arguments:     os.Args,
		ConfigFlags:   defaultConfigFlags,
		IOStreams:     genericclioptions.IOStreams{In: os.Stdin, Out: os.Stdout, ErrOut: os.Stderr},
	})
}

Analysis of the underlying function NewKubectlCommand

func NewKubectlCommand(o KubectlOptions) *cobra.Command {}

Creating rootCmd with cobra

	// Parent command to which all subcommands are added.
	cmds := &cobra.Command{
		Use:   "kubectl",
		Short: i18n.T("kubectl controls the Kubernetes cluster manager"),
		Long: templates.LongDesc(`
      kubectl controls the Kubernetes cluster manager.

      Find more information at:
            https://kubernetes.io/docs/reference/kubectl/overview/`),
		Run: runHelp,
		// Hook before and after Run initialize and write profiles to disk,
		// respectively.
		PersistentPreRunE: func(*cobra.Command, []string) error {
			rest.SetDefaultWarningHandler(warningHandler)
			return initProfiling()
		},
		PersistentPostRunE: func(*cobra.Command, []string) error {
			if err := flushProfiling(); err != nil {
				return err
			}
			if warningsAsErrors {
				count := warningHandler.WarningCount()
				switch count {
				case 0:
					// no warnings
				case 1:
					return fmt.Errorf("%d warning received", count)
				default:
					return fmt.Errorf("%d warnings received", count)
				}
			}
			return nil
		},
	}

Together with the addProfilingFlags(flags) call later on, this registers the pprof flags

PersistentPreRunE sets up the pprof collection

Code location:

staging\src\k8s.io\kubectl\pkg\cmd\profiling.go

There are two flags:

--profile selects which kind of profile pprof collects: cpu, block, etc.

--profile-output is the file the pprof result is written to

The initProfiling code:

func addProfilingFlags(flags *pflag.FlagSet) {
	flags.StringVar(&profileName, "profile", "none", "Name of profile to capture. One of (none|cpu|heap|goroutine|threadcreate|block|mutex)")
	flags.StringVar(&profileOutput, "profile-output", "profile.pprof", "Name of the file to write the profile to")
}

func initProfiling() error {
	var (
		f   *os.File
		err error
	)
	switch profileName {
	case "none":
		return nil
	case "cpu":
		f, err = os.Create(profileOutput)
		if err != nil {
			return err
		}
		err = pprof.StartCPUProfile(f)
		if err != nil {
			return err
		}
	// Block and mutex profiles need a call to Set{Block,Mutex}ProfileRate to
	// output anything. We choose to sample all events.
	case "block":
		runtime.SetBlockProfileRate(1)
	case "mutex":
		runtime.SetMutexProfileFraction(1)
	default:
		// Check the profile name is valid.
		if profile := pprof.Lookup(profileName); profile == nil {
			return fmt.Errorf("unknown profile '%s'", profileName)
		}
	}

	// If the command is interrupted before the end (ctrl-c), flush the
	// profiling files
	c := make(chan os.Signal, 1)
	signal.Notify(c, os.Interrupt)
	go func() {
		<-c
		f.Close()
		flushProfiling()
		os.Exit(0)
	}()

	return nil
}

PersistentPostRunE then flushes the pprof results to disk

		PersistentPostRunE: func(*cobra.Command, []string) error {
			if err := flushProfiling(); err != nil {
				return err
			}
			if warningsAsErrors {
				count := warningHandler.WarningCount()
				switch count {
				case 0:
					// no warnings
				case 1:
					return fmt.Errorf("%d warning received", count)
				default:
					return fmt.Errorf("%d warnings received", count)
				}
			}
			return nil
		},

The corresponding flushProfiling:

func flushProfiling() error {
	switch profileName {
	case "none":
		return nil
	case "cpu":
		pprof.StopCPUProfile()
	case "heap":
		runtime.GC()
		fallthrough
	default:
		profile := pprof.Lookup(profileName)
		if profile == nil {
			return nil
		}
		f, err := os.Create(profileOutput)
		if err != nil {
			return err
		}
		defer f.Close()
		profile.WriteTo(f, 0)
	}

	return nil
}

Run kubectl with pprof CPU profiling enabled

# run the command
kubectl get node --profile=cpu --profile-output=cpu.pprof
# check the result file
ll cpu.pprof
# generate an SVG
go tool pprof -svg cpu.pprof > kubectl_get_node_cpu.svg

kubectl get node --profile=goroutine --profile-output=goroutine.pprof
go tool pprof -text goroutine.pprof

The resulting CPU flame graph SVG

2.4 kubectl's seven command groups

kubectl architecture diagram

The seven command groups are created with the cmd factory function f

Basic Commands (Beginner)

Basic Commands (Intermediate)

Deploy Commands

Cluster Management Commands

Troubleshooting and Debugging Commands

Advanced Commands

Settings Commands

Setting up the flags and the flag-name replacement (normalization) functions

	flags := cmds.PersistentFlags()

	addProfilingFlags(flags)

	flags.BoolVar(&warningsAsErrors, "warnings-as-errors", warningsAsErrors, "Treat warnings received from the server as errors and exit with a non-zero exit code")

Setting up the kubeconfig-related command-line flags

	kubeConfigFlags := o.ConfigFlags
	if kubeConfigFlags == nil {
		kubeConfigFlags = defaultConfigFlags
	}
	kubeConfigFlags.AddFlags(flags)
	matchVersionKubeConfigFlags := cmdutil.NewMatchVersionFlags(kubeConfigFlags)
	matchVersionKubeConfigFlags.AddFlags(flags)

Create the cmd factory function f, which mainly wraps the client used to talk to kube-apiserver

All later subcommands are created with this f

	f := cmdutil.NewFactory(matchVersionKubeConfigFlags)

Create the proxy subcommand

	proxyCmd := proxy.NewCmdProxy(f, o.IOStreams)
	proxyCmd.PreRun = func(cmd *cobra.Command, args []string) {
		kubeConfigFlags.WrapConfigFn = nil
	}

Creating the seven command groups

1. Basic Commands (Beginner)

Code:

		{
			Message: "Basic Commands (Beginner):",
			Commands: []*cobra.Command{
				create.NewCmdCreate(f, o.IOStreams),
				expose.NewCmdExposeService(f, o.IOStreams),
				run.NewCmdRun(f, o.IOStreams),
				set.NewCmdSet(f, o.IOStreams),
			},
		},

Running kubectl on the command line

The corresponding output

Meaning:

create - create a resource

expose - expose a resource as a Service

run - run an image

set - set specific features on objects

2. Basic Commands (Intermediate)

		{
			Message: "Basic Commands (Intermediate):",
			Commands: []*cobra.Command{
				explain.NewCmdExplain("kubectl", f, o.IOStreams),
				getCmd,
				edit.NewCmdEdit(f, o.IOStreams),
				delete.NewCmdDelete(f, o.IOStreams),
			},
		},

The printed help output

Meaning:

explain - get documentation for a resource

get - display resources

edit - edit a resource

delete - delete resources

3. Deploy Commands

		{
			Message: "Deploy Commands:",
			Commands: []*cobra.Command{
				rollout.NewCmdRollout(f, o.IOStreams),
				scale.NewCmdScale(f, o.IOStreams),
				autoscale.NewCmdAutoscale(f, o.IOStreams),
			},
		},

Meaning:

rollout - manage rolling updates

scale - scale resources

autoscale - automatically scale resources

4. Cluster Management Commands

		{
			Message: "Cluster Management Commands:",
			Commands: []*cobra.Command{
				certificates.NewCmdCertificate(f, o.IOStreams),
				clusterinfo.NewCmdClusterInfo(f, o.IOStreams),
				top.NewCmdTop(f, o.IOStreams),
				drain.NewCmdCordon(f, o.IOStreams),
				drain.NewCmdUncordon(f, o.IOStreams),
				drain.NewCmdDrain(f, o.IOStreams),
				taint.NewCmdTaint(f, o.IOStreams),
			},
		},

Meaning:

certificate - manage certificates

cluster-info - display cluster information

top - display resource consumption

cordon - mark a node as unschedulable

uncordon - mark a node as schedulable

drain - evict pods from a node

taint - set taints on nodes

5. Troubleshooting and Debugging Commands

		{
			Message: "Troubleshooting and Debugging Commands:",
			Commands: []*cobra.Command{
				describe.NewCmdDescribe("kubectl", f, o.IOStreams),
				logs.NewCmdLogs(f, o.IOStreams),
				attach.NewCmdAttach(f, o.IOStreams),
				cmdexec.NewCmdExec(f, o.IOStreams),
				portforward.NewCmdPortForward(f, o.IOStreams),
				proxyCmd,
				cp.NewCmdCp(f, o.IOStreams),
				auth.NewCmdAuth(f, o.IOStreams),
				debug.NewCmdDebug(f, o.IOStreams),
			},
		},

Output

Meaning:

describe - show the details of a resource

logs - print the logs of a container in a pod

attach - attach to a running container

exec - execute a command in a container

port-forward - forward ports

proxy - run a proxy

cp - copy files

auth - inspect authorization

debug - debug resources

6. Advanced Commands

Code:

		{
			Message: "Advanced Commands:",
			Commands: []*cobra.Command{
				diff.NewCmdDiff(f, o.IOStreams),
				apply.NewCmdApply("kubectl", f, o.IOStreams),
				patch.NewCmdPatch(f, o.IOStreams),
				replace.NewCmdReplace(f, o.IOStreams),
				wait.NewCmdWait(f, o.IOStreams),
				kustomize.NewCmdKustomize(o.IOStreams),
			},
		},

Output

Meaning:

diff - diff the live version against the version that would be applied

apply - apply a change or configuration

patch - update fields of a resource

replace - replace a resource

wait - wait for a specific condition on resources

kustomize - build a kustomization target from a directory or remote URL

7. Settings Commands

Code:

		{
			Message: "Settings Commands:",
			Commands: []*cobra.Command{
				label.NewCmdLabel(f, o.IOStreams),
				annotate.NewCmdAnnotate("kubectl", f, o.IOStreams),
				completion.NewCmdCompletion(o.IOStreams.Out, ""),
			},
		},

Output

Meaning:

label - update labels

annotate - update annotations

completion - set up shell completion

Key takeaways of this section

The cmd factory function f mainly wraps the client used to talk to kube-apiserver

The seven command groups are created with the factory function f

2.5 Execution flow of the create command

kubectl create architecture diagram

The create flow:

NewCmdCreate registers the cobra Run function

The Run function calls RunCreate, which builds the resource Builder object

The Visit method is then called to create the resources

Under the hood a RESTClient talks to the k8s API

The create flow: NewCmdCreate

Code entry point: staging\src\k8s.io\kubectl\pkg\cmd\create\create.go

Create the Create options object

o := NewCreateOptions(ioStreams)

Initialize the cmd

	cmd := &cobra.Command{
		Use:                   "create -f FILENAME",
		DisableFlagsInUseLine: true,
		Short:                 i18n.T("Create a resource from a file or from stdin"),
		Long:                  createLong,
		Example:               createExample,
		Run: func(cmd *cobra.Command, args []string) {
			if cmdutil.IsFilenameSliceEmpty(o.FilenameOptions.Filenames, o.FilenameOptions.Kustomize) {
				ioStreams.ErrOut.Write([]byte("Error: must specify one of -f and -k\n\n"))
				defaultRunFunc := cmdutil.DefaultSubCommandRun(ioStreams.ErrOut)
				defaultRunFunc(cmd, args)
				return
			}
			cmdutil.CheckErr(o.Complete(f, cmd))
			cmdutil.CheckErr(o.ValidateArgs(cmd, args))
			cmdutil.CheckErr(o.RunCreate(f, cmd))
		},
	}

Set up the flags

They are bound to the corresponding fields of o

	// bind flag structs
	o.RecordFlags.AddFlags(cmd)

	usage := "to use to create the resource"
	cmdutil.AddFilenameOptionFlags(cmd, &o.FilenameOptions, usage)
	cmdutil.AddValidateFlags(cmd)
	cmd.Flags().BoolVar(&o.EditBeforeCreate, "edit", o.EditBeforeCreate, "Edit the API resource before creating")
	cmd.Flags().Bool("windows-line-endings", runtime.GOOS == "windows",
		"Only relevant if --edit=true. Defaults to the line ending native to your platform.")
	cmdutil.AddApplyAnnotationFlags(cmd)
	cmdutil.AddDryRunFlag(cmd)
	cmdutil.AddLabelSelectorFlagVar(cmd, &o.Selector)
	cmd.Flags().StringVar(&o.Raw, "raw", o.Raw, "Raw URI to POST to the server.  Uses the transport specified by the kubeconfig file.")
	cmdutil.AddFieldManagerFlagVar(cmd, &o.fieldManager, "kubectl-create")

	o.PrintFlags.AddFlags(cmd)

Register the create subcommands

	// create subcommands
	cmd.AddCommand(NewCmdCreateNamespace(f, ioStreams))
	cmd.AddCommand(NewCmdCreateQuota(f, ioStreams))
	cmd.AddCommand(NewCmdCreateSecret(f, ioStreams))
	cmd.AddCommand(NewCmdCreateConfigMap(f, ioStreams))
	cmd.AddCommand(NewCmdCreateServiceAccount(f, ioStreams))
	cmd.AddCommand(NewCmdCreateService(f, ioStreams))
	cmd.AddCommand(NewCmdCreateDeployment(f, ioStreams))
	cmd.AddCommand(NewCmdCreateClusterRole(f, ioStreams))
	cmd.AddCommand(NewCmdCreateClusterRoleBinding(f, ioStreams))
	cmd.AddCommand(NewCmdCreateRole(f, ioStreams))
	cmd.AddCommand(NewCmdCreateRoleBinding(f, ioStreams))
	cmd.AddCommand(NewCmdCreatePodDisruptionBudget(f, ioStreams))
	cmd.AddCommand(NewCmdCreatePriorityClass(f, ioStreams))
	cmd.AddCommand(NewCmdCreateJob(f, ioStreams))
	cmd.AddCommand(NewCmdCreateCronJob(f, ioStreams))
	cmd.AddCommand(NewCmdCreateIngress(f, ioStreams))
	cmd.AddCommand(NewCmdCreateToken(f, ioStreams))

The core cmd.Run function

Validate the file arguments

			if cmdutil.IsFilenameSliceEmpty(o.FilenameOptions.Filenames, o.FilenameOptions.Kustomize) {
				ioStreams.ErrOut.Write([]byte("Error: must specify one of -f and -k\n\n"))
				defaultRunFunc := cmdutil.DefaultSubCommandRun(ioStreams.ErrOut)
				defaultRunFunc(cmd, args)
				return
			}

Complete and fill in the required fields

			cmdutil.CheckErr(o.Complete(f, cmd))

Validate the arguments

			cmdutil.CheckErr(o.ValidateArgs(cmd, args))

The core RunCreate

			cmdutil.CheckErr(o.RunCreate(f, cmd))

If a raw apiserver URI is configured, the request is sent directly

	if len(o.Raw) > 0 {
		restClient, err := f.RESTClient()
		if err != nil {
			return err
		}
		return rawhttp.RawPost(restClient, o.IOStreams, o.Raw, o.FilenameOptions.Filenames[0])
	}

If edit-before-create is configured, RunEditOnCreate is executed

	if o.EditBeforeCreate {
		return RunEditOnCreate(f, o.PrintFlags, o.RecordFlags, o.IOStreams, cmd, &o.FilenameOptions, o.fieldManager)
	}

Whether validation runs is determined by the validate setting in the options

--validate=true means the configuration is validated against a schema before it is sent

	cmdNamespace, enforceNamespace, err := f.ToRawKubeConfigLoader().Namespace()
	if err != nil {
		return err
	}

Build the builder object (the builder pattern)

	r := f.NewBuilder().
		Unstructured().
		Schema(schema).
		ContinueOnError().
		NamespaceParam(cmdNamespace).DefaultNamespace().
		FilenameParam(enforceNamespace, &o.FilenameOptions).
		LabelSelectorParam(o.Selector).
		Flatten().
		Do()
	err = r.Err()
	if err != nil {
		return err
	}

FilenameParam reads the configuration files

Besides plain local files, it also supports stdin and files fetched over http/https; each source is stored as a Visitor

Code location: staging\src\k8s.io\cli-runtime\pkg\resource\builder.go

// FilenameParam groups input in two categories: URLs and files (files, directories, STDIN)
// If enforceNamespace is false, namespaces in the specs will be allowed to
// override the default namespace. If it is true, namespaces that don't match
// will cause an error.
// If ContinueOnError() is set prior to this method, objects on the path that are not
// recognized will be ignored (but logged at V(2)).
func (b *Builder) FilenameParam(enforceNamespace bool, filenameOptions *FilenameOptions) *Builder {
	if errs := filenameOptions.validate(); len(errs) > 0 {
		b.errs = append(b.errs, errs...)
		return b
	}
	recursive := filenameOptions.Recursive
	paths := filenameOptions.Filenames
	for _, s := range paths {
		switch {
		case s == "-":
			b.Stdin()
		case strings.Index(s, "http://") == 0 || strings.Index(s, "https://") == 0:
			url, err := url.Parse(s)
			if err != nil {
				b.errs = append(b.errs, fmt.Errorf("the URL passed to filename %q is not valid: %v", s, err))
				continue
			}
			b.URL(defaultHttpGetAttempts, url)
		default:
			matches, err := expandIfFilePattern(s)
			if err != nil {
				b.errs = append(b.errs, err)
				continue
			}
			if !recursive && len(matches) == 1 {
				b.singleItemImplied = true
			}
			b.Path(recursive, matches...)
		}
	}
	if filenameOptions.Kustomize != "" {
		b.paths = append(
			b.paths,
			&KustomizeVisitor{
				mapper:  b.mapper,
				dirPath: filenameOptions.Kustomize,
				schema:  b.schema,
				fSys:    filesys.MakeFsOnDisk(),
			})
	}

	if enforceNamespace {
		b.RequireNamespace()
	}

	return b
}

Call the Visit function to create the resources

	err = r.Visit(func(info *resource.Info, err error) error {
		if err != nil {
			return err
		}
		if err := util.CreateOrUpdateAnnotation(cmdutil.GetFlagBool(cmd, cmdutil.ApplyAnnotationsFlag), info.Object, scheme.DefaultJSONEncoder()); err != nil {
			return cmdutil.AddSourceToErr("creating", info.Source, err)
		}

		if err := o.Recorder.Record(info.Object); err != nil {
			klog.V(4).Infof("error recording current command: %v", err)
		}

		if o.DryRunStrategy != cmdutil.DryRunClient {
			if o.DryRunStrategy == cmdutil.DryRunServer {
				if err := o.DryRunVerifier.HasSupport(info.Mapping.GroupVersionKind); err != nil {
					return cmdutil.AddSourceToErr("creating", info.Source, err)
				}
			}
			obj, err := resource.
				NewHelper(info.Client, info.Mapping).
				DryRun(o.DryRunStrategy == cmdutil.DryRunServer).
				WithFieldManager(o.fieldManager).
				WithFieldValidation(o.ValidationDirective).
				Create(info.Namespace, true, info.Object)
			if err != nil {
				return cmdutil.AddSourceToErr("creating", info.Source, err)
			}
			info.Refresh(obj, true)
		}

Tracing the Create call: under the hood it calls createResource to create the resource

Code location: D:\Workspace\Go\src\k8s.io\kubernetes@v0.24.2\staging\src\k8s.io\cli-runtime\pkg\resource\helper.go

func (m *Helper) createResource(c RESTClient, resource, namespace string, obj runtime.Object, options *metav1.CreateOptions) (runtime.Object, error) {
	return c.Post().
		NamespaceIfScoped(namespace, m.NamespaceScoped).
		Resource(resource).
		VersionedParams(options, metav1.ParameterCodec).
		Body(obj).
		Do(context.TODO()).
		Get()
}

Under the hood it uses the REST client's Post()

Code location: staging\src\k8s.io\cli-runtime\pkg\resource\interfaces.go

// RESTClient is a client helper for dealing with RESTful resources
// in a generic way.
type RESTClient interface {
	Get() *rest.Request
	Post() *rest.Request
	Patch(types.PatchType) *rest.Request
	Delete() *rest.Request
	Put() *rest.Request
}

Key takeaways of this section

1. NewCmdCreate registers the cobra Run function

2. RunCreate builds the resource Builder object

3. The Visit method is called to create the resources

4. Under the hood a RESTClient talks to the k8s API

2.6 The builder design pattern in the create command

Key takeaways of this section

The builder design pattern

Advantages

Disadvantages

The builder pattern in kubectl

The builder design pattern

The Builder pattern separates the construction of a complex object from its representation,

so that the same construction process can create different objects.

It breaks a complex object down into several simpler parts and builds it step by step.

It separates the variable from the invariant: the set of parts that make up the product is fixed, but each part can be chosen flexibly.

It is mostly used for constructing complex objects.

Advantages

Good encapsulation: construction is separated from representation.

Good extensibility: the concrete builders are independent of one another, which helps decouple the system.

The client does not need to know the internal details of the product, and the builder can refine the construction process step by step without affecting other modules, which makes detail-level risk easier to control.

Disadvantages

The products must be composed of the same kinds of parts, which limits where the pattern applies.

If the product's internals are complex and change, the builders must be modified in step, so later maintenance costs are higher.

The builder pattern in kubectl

The Builder object in kubectl

Trait 1: it constructs a complex object with a very large number of fields

Trait 2: the initial constructor returns a pointer to the object being built

Trait 3: every method returns a pointer to the builder object

Trait 1: it constructs a complex object with a very large number of fields

kubectl's Builder object; as you can see, it has a great many fields

Constructing it through a single init function would require a huge number of parameters,

and the parameters are not fixed: different objects are built depending on what the user passes in.

Code location: staging\src\k8s.io\cli-runtime\pkg\resource\builder.go

// Builder provides convenience functions for taking arguments and parameters
// from the command line and converting them to a list of resources to iterate
// over using the Visitor interface.
type Builder struct {
	categoryExpanderFn CategoryExpanderFunc

	// mapper is set explicitly by resource builders
	mapper *mapper

	// clientConfigFn is a function to produce a client, *if* you need one
	clientConfigFn ClientConfigFunc

	restMapperFn RESTMapperFunc

	// objectTyper is statically determinant per-command invocation based on your internal or unstructured choice
	// it does not ever need to rely upon discovery.
	objectTyper runtime.ObjectTyper

	// codecFactory describes which codecs you want to use
	negotiatedSerializer runtime.NegotiatedSerializer

	// local indicates that we cannot make server calls
	local bool

	errs []error

	paths      []Visitor
	stream     bool
	stdinInUse bool
	dir        bool

	labelSelector     *string
	fieldSelector     *string
	selectAll         bool
	limitChunks       int64
	requestTransforms []RequestTransform

	resources   []string
	subresource string

	namespace    string
	allNamespace bool
	names        []string

	resourceTuples []resourceTuple

	defaultNamespace bool
	requireNamespace bool

	flatten bool
	latest  bool

	requireObject bool

	singleResourceType bool
	continueOnError    bool

	singleItemImplied bool

	schema ContentValidator

	// fakeClientFn is used for testing
	fakeClientFn FakeClientFunc
}

Trait 2: the initial constructor returns a pointer to the object being built

func NewBuilder(restClientGetter RESTClientGetter) *Builder {
	categoryExpanderFn := func() (restmapper.CategoryExpander, error) {
		discoveryClient, err := restClientGetter.ToDiscoveryClient()
		if err != nil {
			return nil, err
		}
		return restmapper.NewDiscoveryCategoryExpander(discoveryClient), err
	}

	return newBuilder(
		restClientGetter.ToRESTConfig,
		restClientGetter.ToRESTMapper,
		(&cachingCategoryExpanderFunc{delegate: categoryExpanderFn}).ToCategoryExpander,
	)
}

Trait 3: every method returns a pointer to the builder object

staging\src\k8s.io\kubectl\pkg\cmd\create\create.go

	r := f.NewBuilder().
		Unstructured().
		Schema(schema).
		ContinueOnError().
		NamespaceParam(cmdNamespace).DefaultNamespace().
		FilenameParam(enforceNamespace, &o.FilenameOptions).
		LabelSelectorParam(o.Selector).
		Flatten().
		Do()

At the call site this reads like a chained call: every method in the chain returns a pointer to the object being built

func (b *Builder) Schema(schema ContentValidator) *Builder {
    b.schema = schema
    return b
}
func (b *Builder) ContinueOnError() *Builder {
    b.continueOnError = true
    return b
}

Each method simply sets one of the attributes of the object being constructed.
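
As a standalone illustration of the same idea (hypothetical types, not from the kubectl source), a minimal builder in Go looks like this: the constructor returns a pointer, every setter returns that same pointer so calls can be chained, and a final method produces the result:

package main

import "fmt"

// Server is the complex "product" being built.
type Server struct {
	host    string
	port    int
	tls     bool
	timeout int
}

// ServerBuilder accumulates the chosen parts.
type ServerBuilder struct {
	s Server
}

// NewServerBuilder returns a pointer to the builder (trait 2).
func NewServerBuilder() *ServerBuilder {
	return &ServerBuilder{s: Server{host: "127.0.0.1", port: 8080}}
}

// Each setter returns the builder pointer so calls can be chained (trait 3).
func (b *ServerBuilder) Host(h string) *ServerBuilder { b.s.host = h; return b }
func (b *ServerBuilder) Port(p int) *ServerBuilder    { b.s.port = p; return b }
func (b *ServerBuilder) TLS() *ServerBuilder          { b.s.tls = true; return b }
func (b *ServerBuilder) Timeout(t int) *ServerBuilder { b.s.timeout = t; return b }

// Do finishes construction, the way kubectl's Builder.Do() returns a *Result.
func (b *ServerBuilder) Do() Server { return b.s }

func main() {
	srv := NewServerBuilder().Host("example.local").Port(443).TLS().Timeout(30).Do()
	fmt.Printf("%+v\n", srv)
}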

2.7 The visitor design pattern in the create command


Key takeaways of this section:

An introduction to the visitor pattern

How kubectl uses visitors

An introduction to the visitor pattern

The Visitor pattern is a behavioral design pattern that separates a data structure from the operations performed on it: it encapsulates operations that act on the elements of a data structure, so that new operations on those elements can be defined without changing the data structure itself.

The visitor pattern in kubectl

In kubectl, multiple Visitors access different parts of a single data structure.

In this setup the data structure is a bit like a database, and each Visitor acts like a small application on top of it.

The visitor pattern mainly fits the following scenarios:

(1) The data structure is stable, but the operations acting on it change frequently.

(2) The data structure needs to be separated from the operations on the data.

(3) Different element types need to be handled without branching on their concrete types.

Advantages of the visitor pattern

(1) It decouples the data structure from the operations on it, so the set of operations can evolve independently.

(2) New operations on the data set can be added by adding new visitors, which makes the program more extensible.

(3) Visitors can operate on elements of more than one concrete type.

(4) Responsibilities are cleanly separated between roles, in line with the single-responsibility principle.

Disadvantages of the visitor pattern

(1) New element types are hard to add: if the data structure changes easily and new data objects keep being added, every visitor class must gain operations for the new element types, which violates the open-closed principle.

(2) Changing concrete elements is hard: adding or removing attributes on a concrete element forces the corresponding visitor classes to change as well, and when there are many visitor classes the scope of the change becomes very large.

(3) It violates the dependency-inversion principle: to treat elements differently, visitors depend on concrete element types rather than on abstractions.
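
Before looking at kubectl's concrete types, here is a minimal standalone sketch of the pattern in Go (hypothetical names, deliberately mirroring kubectl's VisitorFunc style): the data side only knows how to walk its elements, while each operation is supplied from outside as a function.

package main

import "fmt"

// Info is the element being visited (a stand-in for kubectl's resource.Info).
type Info struct {
	Name      string
	Namespace string
}

// VisitorFunc is an operation applied to each element.
type VisitorFunc func(*Info) error

// Visitor is anything that can walk elements and apply a VisitorFunc.
type Visitor interface {
	Visit(VisitorFunc) error
}

// ListVisitor walks a fixed slice of Infos.
type ListVisitor struct {
	infos []*Info
}

func (l ListVisitor) Visit(fn VisitorFunc) error {
	for _, info := range l.infos {
		if err := fn(info); err != nil {
			return err
		}
	}
	return nil
}

// DecoratedVisitor wraps another Visitor and runs extra funcs first,
// the same shape as kubectl's DecoratedVisitor.
type DecoratedVisitor struct {
	delegate   Visitor
	decorators []VisitorFunc
}

func (d DecoratedVisitor) Visit(fn VisitorFunc) error {
	return d.delegate.Visit(func(info *Info) error {
		for _, dec := range d.decorators {
			if err := dec(info); err != nil {
				return err
			}
		}
		return fn(info)
	})
}

func main() {
	base := ListVisitor{infos: []*Info{{Name: "nginx"}, {Name: "redis"}}}
	setNamespace := func(info *Info) error { info.Namespace = "default"; return nil }

	v := DecoratedVisitor{delegate: base, decorators: []VisitorFunc{setNamespace}}
	_ = v.Visit(func(info *Info) error {
		fmt.Printf("creating %s/%s\n", info.Namespace, info.Name)
		return nil
	})
}

New operations (print, validate, create) are just new VisitorFuncs; the element type and the walking code never change.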

The visitor pattern in kubectl

In kubectl, multiple Visitors access different parts of a single data structure.

In this setup the data structure is a bit like a database, and each Visitor acts like a small application on top of it.

The Visitor interface and VisitorFunc definitions

Location: kubernetes/staging/src/k8s.io/cli-runtime/pkg/resource/interfaces.go

// Visitor lets clients walk a list of resources.
type Visitor interface {
	Visit(VisitorFunc) error
}

// VisitorFunc implements the Visitor interface for a matching function.
// If there was a problem walking a list of resources, the incoming error
// will describe the problem and the function can decide how to handle that error.
// A nil returned indicates to accept an error to continue loops even when errors happen.
// This is useful for ignoring certain kinds of errors or aggregating errors in some way.
type VisitorFunc func(*Info, error) error

The Result's Visit method

func (r *Result) Visit(fn VisitorFunc) error {
    if r.err != nil {
        return r.err
    }
    err := r.visitor.Visit(fn)
    return utilerrors.FilterOut(err, r.ignoreErrors...)
}

Each concrete visitor's Visit method takes a VisitorFunc fn as its parameter

// Visit in a FileVisitor is just taking care of opening/closing files
func (v *FileVisitor) Visit(fn VisitorFunc) error {
	var f *os.File
	if v.Path == constSTDINstr {
		f = os.Stdin
	} else {
		var err error
		f, err = os.Open(v.Path)
		if err != nil {
			return err
		}
		defer f.Close()
	}

	// TODO: Consider adding a flag to force to UTF16, apparently some
	// Windows tools don't write the BOM
	utf16bom := unicode.BOMOverride(unicode.UTF8.NewDecoder())
	v.StreamVisitor.Reader = transform.NewReader(f, utf16bom)

	return v.StreamVisitor.Visit(fn)
}

How kubectl create builds visitors through the Builder and runs them

FilenameParam parses the -f file arguments and creates a visitor

Location: kubernetes/staging/src/k8s.io/cli-runtime/pkg/resource/builder.go

validate checks the -f arguments

func (o *FilenameOptions) validate() []error {
	var errs []error
	if len(o.Filenames) > 0 && len(o.Kustomize) > 0 {
		errs = append(errs, fmt.Errorf("only one of -f or -k can be specified"))
	}
	if len(o.Kustomize) > 0 && o.Recursive {
		errs = append(errs, fmt.Errorf("the -k flag can't be used with -f or -R"))
	}
	return errs
}

-k means a Kustomize configuration is used

If both -f and -k are given, it errors with: only one of -f or -k can be specified

kubectl create -f rule.yaml -k rule.yaml
error: only one of -f or -k can be specified

-k cannot be combined with recursion (-R)

kubectl create -k rule.yaml -R
error: the -k flag can't be used with -f or -R

The paths are then parsed to resolve the files

	recursive := filenameOptions.Recursive
	paths := filenameOptions.Filenames
	for _, s := range paths {
		switch {
		case s == "-":
			b.Stdin()
		case strings.Index(s, "http://") == 0 || strings.Index(s, "https://") == 0:
			url, err := url.Parse(s)
			if err != nil {
				b.errs = append(b.errs, fmt.Errorf("the URL passed to filename %q is not valid: %v", s, err))
				continue
			}
			b.URL(defaultHttpGetAttempts, url)
		default:
			matches, err := expandIfFilePattern(s)
			if err != nil {
				b.errs = append(b.errs, err)
				continue
			}
			if !recursive && len(matches) == 1 {
				b.singleItemImplied = true
			}
			b.Path(recursive, matches...)
		}
	}

Iterate over the paths passed with -f:

if a path is -, input is read from stdin;

if it starts with http:// or https://, it is fetched from a remote HTTP endpoint via b.URL;

otherwise it is treated as a file and parsed via b.Path.

b.Path calls ExpandPathsToFileVisitors to produce the visitors

// ExpandPathsToFileVisitors will return a slice of FileVisitors that will handle files from the provided path.
// After FileVisitors open the files, they will pass an io.Reader to a StreamVisitor to do the reading. (stdin
// is also taken care of). Paths argument also accepts a single file, and will return a single visitor
func ExpandPathsToFileVisitors(mapper *mapper, paths string, recursive bool, extensions []string, schema ContentValidator) ([]Visitor, error) {
	var visitors []Visitor
	err := filepath.Walk(paths, func(path string, fi os.FileInfo, err error) error {
		if err != nil {
			return err
		}

		if fi.IsDir() {
			if path != paths && !recursive {
				return filepath.SkipDir
			}
			return nil
		}
		// Don't check extension if the filepath was passed explicitly
		if path != paths && ignoreFile(path, extensions) {
			return nil
		}

		visitor := &FileVisitor{
			Path:          path,
			StreamVisitor: NewStreamVisitor(nil, mapper, path, schema),
		}

		visitors = append(visitors, visitor)
		return nil
	})

	if err != nil {
		return nil, err
	}
	return visitors, nil
}

Under the hood a StreamVisitor is used, and the corresponding function is registered with the visitor

Location: D:\Workspace\Go\kubernetes\staging\src\k8s.io\cli-runtime\pkg\resource\visitor.go

// Visit implements Visitor over a stream. StreamVisitor is able to distinct multiple resources in one stream.
func (v *StreamVisitor) Visit(fn VisitorFunc) error {
	d := yaml.NewYAMLOrJSONDecoder(v.Reader, 4096)
	for {
		ext := runtime.RawExtension{}
		if err := d.Decode(&ext); err != nil {
			if err == io.EOF {
				return nil
			}
			return fmt.Errorf("error parsing %s: %v", v.Source, err)
		}
		// TODO: This needs to be able to handle object in other encodings and schemas.
		ext.Raw = bytes.TrimSpace(ext.Raw)
		if len(ext.Raw) == 0 || bytes.Equal(ext.Raw, []byte("null")) {
			continue
		}
		if err := ValidateSchema(ext.Raw, v.Schema); err != nil {
			return fmt.Errorf("error validating %q: %v", v.Source, err)
		}
		info, err := v.infoForData(ext.Raw, v.Source)
		if err != nil {
			if fnErr := fn(info, err); fnErr != nil {
				return fnErr
			}
			continue
		}
		if err := fn(info, nil); err != nil {
			return err
		}
	}
}

The file is parsed with a YAML-or-JSON decoder

ValidateSchema validates the fields in the file; for example, if we deliberately misspell spec as aspec:

kubectl apply -f rule.yaml
error: error validating "rule.yaml": error validating data: [ValidationError(PrometheusRule): Unknown field "aspec" in 

infoForData converts the decoded data into an Info object

It creates the Info; its Object field is the k8s object

Location: staging\src\k8s.io\cli-runtime\pkg\resource\mapper.go

m.decoder.Decode decodes the object and its GVK

object is the k8s object itself,

and GVK is short for Group/Version/Kind.

// InfoForData creates an Info object for the given data. An error is returned
// if any of the decoding or client lookup steps fail. Name and namespace will be
// set into Info if the mapping's MetadataAccessor can retrieve them.
func (m *mapper) infoForData(data []byte, source string) (*Info, error) {
	obj, gvk, err := m.decoder.Decode(data, nil, nil)
	if err != nil {
		return nil, fmt.Errorf("unable to decode %q: %v", source, err)
	}

	name, _ := metadataAccessor.Name(obj)
	namespace, _ := metadataAccessor.Namespace(obj)
	resourceVersion, _ := metadataAccessor.ResourceVersion(obj)

	ret := &Info{
		Source:          source,
		Namespace:       namespace,
		Name:            name,
		ResourceVersion: resourceVersion,

		Object: obj,
	}

	if m.localFn == nil || !m.localFn() {
		restMapper, err := m.restMapperFn()
		if err != nil {
			return nil, err
		}
		mapping, err := restMapper.RESTMapping(gvk.GroupKind(), gvk.Version)
		if err != nil {
			if _, ok := err.(*meta.NoKindMatchError); ok {
				return nil, fmt.Errorf("resource mapping not found for name: %q namespace: %q from %q: %v\nensure CRDs are installed first",
					name, namespace, source, err)
			}
			return nil, fmt.Errorf("unable to recognize %q: %v", source, err)
		}
		ret.Mapping = mapping

		client, err := m.clientFn(gvk.GroupVersion())
		if err != nil {
			return nil, fmt.Errorf("unable to connect to a server to handle %q: %v", mapping.Resource, err)
		}
		ret.Client = client
	}

	return ret, nil
}

The k8s Object explained

Object: the k8s object

Docs: https://kubernetes.io/zh/docs/concepts/overview/working-with-objects/kubernetes-objects/

Location: staging\src\k8s.io\apimachinery\pkg\runtime\interfaces.go

// Object interface must be supported by all API types registered with Scheme. Since objects in a scheme are
// expected to be serialized to the wire, the interface an Object must provide to the Scheme allows
// serializers to set the kind, version, and group the object is represented as. An Object may choose
// to return a no-op ObjectKindAccessor in cases where it is not expected to be serialized.
type Object interface {
	GetObjectKind() schema.ObjectKind
	DeepCopyObject() Object
}

Purpose

Kubernetes objects are persistent entities.

Kubernetes uses these entities to represent the state of the entire cluster. In particular, they describe:

which containerized applications are running (and on which nodes),

the resources available to those applications,

and the policies governing how those applications behave, such as restart, upgrade and fault-tolerance policies.

Operating on Kubernetes objects, whether creating, modifying or deleting them, requires the Kubernetes API.

Desired state

Kubernetes objects are "records of intent": once you create an object, the Kubernetes system continuously works to ensure it exists.

By creating an object you are effectively telling the Kubernetes system what you want your cluster's workload to look like; this is the cluster's desired state.

Object spec and status

Almost every Kubernetes object has two nested object fields that govern its configuration: the object spec and the object status.

For objects that have a spec, you must set it when you create the object, describing the characteristics you want the object to have: its desired state.

The status describes the object's current state and is set and updated by the Kubernetes system and its components. At any point in time the Kubernetes control plane actively manages the object's actual state so that it matches the desired state.

Required fields in the YAML

In the .yaml file for the Kubernetes object you want to create, the following fields must be set (a typed sketch follows below):

apiVersion - the version of the Kubernetes API used to create the object

kind - the kind of object you want to create

metadata - data that helps uniquely identify the object, including a name string, a UID, and an optional namespace
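
As a minimal sketch in Go (assuming client-go's typed packages; the field values are purely illustrative), the same required fields appear as TypeMeta and ObjectMeta on the typed object:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := corev1.Pod{
		// apiVersion and kind live in TypeMeta.
		TypeMeta: metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		// name / namespace / labels live in ObjectMeta (the metadata field).
		ObjectMeta: metav1.ObjectMeta{
			Name:      "nginx",
			Namespace: "default",
			Labels:    map[string]string{"app": "nginx"},
		},
		// spec describes the desired state.
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{Name: "nginx", Image: "nginx:1.21"},
			},
		},
	}
	fmt.Printf("%s/%s kind=%s\n", pod.Namespace, pod.Name, pod.Kind)
}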

Do creates a batch of visitors

// Do returns a Result object with a Visitor for the resources identified by the Builder.
// The visitor will respect the error behavior specified by ContinueOnError. Note that stream
// inputs are consumed by the first execution - use Infos() or Object() on the Result to capture a list
// for further iteration.
func (b *Builder) Do() *Result {
	r := b.visitorResult()
	r.mapper = b.Mapper()
	if r.err != nil {
		return r
	}
	if b.flatten {
		r.visitor = NewFlattenListVisitor(r.visitor, b.objectTyper, b.mapper)
	}
	helpers := []VisitorFunc{}
	if b.defaultNamespace {
		helpers = append(helpers, SetNamespace(b.namespace))
	}
	if b.requireNamespace {
		helpers = append(helpers, RequireNamespace(b.namespace))
	}
	helpers = append(helpers, FilterNamespace)
	if b.requireObject {
		helpers = append(helpers, RetrieveLazy)
	}
	if b.continueOnError {
		r.visitor = ContinueOnErrorVisitor{Visitor: r.visitor}
	}
	r.visitor = NewDecoratedVisitor(r.visitor, helpers...)
	return r
}

helpers is a set of VisitorFuncs

for example RequireNamespace, which validates the namespace

// RequireNamespace will either set a namespace if none is provided on the
// Info object, or if the namespace is set and does not match the provided
// value, returns an error. This is intended to guard against administrators
// accidentally operating on resources outside their namespace.
func RequireNamespace(namespace string) VisitorFunc {
	return func(info *Info, err error) error {
		if err != nil {
			return err
		}
		if !info.Namespaced() {
			return nil
		}
		if len(info.Namespace) == 0 {
			info.Namespace = namespace
			UpdateObjectNamespace(info, nil)
			return nil
		}
		if info.Namespace != namespace {
			return fmt.Errorf("the namespace from the provided object %q does not match the namespace %q. You must pass '--namespace=%s' to perform this operation.", info.Namespace, namespace, info.Namespace)
		}
		return nil
	}
}

A decorated visitor, DecoratedVisitor, is created

	if b.continueOnError {
		r.visitor = ContinueOnErrorVisitor{Visitor: r.visitor}
	}
	r.visitor = NewDecoratedVisitor(r.visitor, helpers...)

Its Visit method:

// Visit implements Visitor
func (v DecoratedVisitor) Visit(fn VisitorFunc) error {
	return v.visitor.Visit(func(info *Info, err error) error {
		if err != nil {
			return err
		}
		for i := range v.decorators {
			if err := v.decorators[i](info, nil); err != nil {
				return err
			}
		}
		return fn(info, nil)
	})
}

Invoking the visitors

Analysis of the visitor call chain

The outer call is result.Visit, with the inner func:

	err = r.Visit(func(info *resource.Info, err error) error {
		if err != nil {
			return err
		}
		if err := util.CreateOrUpdateAnnotation(cmdutil.GetFlagBool(cmd, cmdutil.ApplyAnnotationsFlag), info.Object, scheme.DefaultJSONEncoder()); err != nil {
			return cmdutil.AddSourceToErr("creating", info.Source, err)
		}

		if err := o.Recorder.Record(info.Object); err != nil {
			klog.V(4).Infof("error recording current command: %v", err)
		}

		if o.DryRunStrategy != cmdutil.DryRunClient {
			if o.DryRunStrategy == cmdutil.DryRunServer {
				if err := o.DryRunVerifier.HasSupport(info.Mapping.GroupVersionKind); err != nil {
					return cmdutil.AddSourceToErr("creating", info.Source, err)
				}
			}
			obj, err := resource.
				NewHelper(info.Client, info.Mapping).
				DryRun(o.DryRunStrategy == cmdutil.DryRunServer).
				WithFieldManager(o.fieldManager).
				WithFieldValidation(o.ValidationDirective).
				Create(info.Namespace, true, info.Object)
			if err != nil {
				return cmdutil.AddSourceToErr("creating", info.Source, err)
			}
			info.Refresh(obj, true)
		}

		count++

		return o.PrintObj(info.Object)
	})

The Visit method of the Visitor interface, on Result:

// Visit implements the Visitor interface on the items described in the Builder.
// Note that some visitor sources are not traversable more than once, or may
// return different results.  If you wish to operate on the same set of resources
// multiple times, use the Infos() method.
func (r *Result) Visit(fn VisitorFunc) error {
	if r.err != nil {
		return r.err
	}
	err := r.visitor.Visit(fn)
	return utilerrors.FilterOut(err, r.ignoreErrors...)
}

Ultimately this invokes the Visit methods of all the visitors registered earlier

Analysis of the outer VisitorFunc

If an error occurred, it is returned.

DryRunStrategy is the dry-run strategy:

the default None means no dry run;

client means a client-side dry run, no request is sent to the server;

server means a server-side dry run: the request is sent, but anything that would change state is not performed.

Finally Create creates the resource and o.PrintObj(info.Object) prints the result.

func(info *resource.Info, err error) error {
		if err != nil {
			return err
		}
		if err := util.CreateOrUpdateAnnotation(cmdutil.GetFlagBool(cmd, cmdutil.ApplyAnnotationsFlag), info.Object, scheme.DefaultJSONEncoder()); err != nil {
			return cmdutil.AddSourceToErr("creating", info.Source, err)
		}

		if err := o.Recorder.Record(info.Object); err != nil {
			klog.V(4).Infof("error recording current command: %v", err)
		}

		if o.DryRunStrategy != cmdutil.DryRunClient {
			if o.DryRunStrategy == cmdutil.DryRunServer {
				if err := o.DryRunVerifier.HasSupport(info.Mapping.GroupVersionKind); err != nil {
					return cmdutil.AddSourceToErr("creating", info.Source, err)
				}
			}
			obj, err := resource.
				NewHelper(info.Client, info.Mapping).
				DryRun(o.DryRunStrategy == cmdutil.DryRunServer).
				WithFieldManager(o.fieldManager).
				WithFieldValidation(o.ValidationDirective).
				Create(info.Namespace, true, info.Object)
			if err != nil {
				return cmdutil.AddSourceToErr("creating", info.Source, err)
			}
			info.Refresh(obj, true)
		}

		count++

		return o.PrintObj(info.Object)
	}

2.8 Summary of kubectl's responsibilities and objects

kubectl's responsibilities

Its main job is to process what the user submits (command-line arguments, YAML files, and so on),

organize that input into a data structure,

and send it to the API server.

How kubectl's code works

cobra gathers information from the command line and the YAML files;

the Builder pattern turns that input into a series of resources;

the Visitor pattern then iterates over these resources, parsing and validating the various resource objects;

finally a RESTClient sends the objects to kube-apiserver.

kubectl architecture diagram

The create flow

Core objects in kubectl

RESTClient: the RESTful client used to talk to the k8s API

Location: D:\Workspace\Go\kubernetes\staging\src\k8s.io\cli-runtime\pkg\resource\interfaces.go

type RESTClientGetter interface {
	ToRESTConfig() (*rest.Config, error)
	ToDiscoveryClient() (discovery.CachedDiscoveryInterface, error)
	ToRESTMapper() (meta.RESTMapper, error)
}

Object: the k8s object

Docs: https://kubernetes.io/zh/docs/concepts/overview/working-with-objects/kubernetes-objects/

staging\src\k8s.io\cli-runtime\pkg\resource\interfaces.go

Chapter 3: Permissions in the apiserver

3.1 Analysis of the apiserver startup flow

Key takeaways of this section:

The apiserver startup flow

CreateServerChain creates three servers:

CreateKubeAPIServer creates kubeAPIServer, the core API service, covering the common Pod/Deployment/Service resources

createAPIExtensionsServer creates apiExtensionsServer, the API extension service, mainly for CRDs

createAggregatorServer creates aggregatorServer, the service that handles aggregated APIs such as metrics

The apiserver startup flow

Entry point

Location: D:\Workspace\Go\kubernetes\cmd\kube-apiserver\apiserver.go

Initialize the apiserver cmd and run it

func main() {
	command := app.NewAPIServerCommand()
	code := cli.Run(command)
	os.Exit(code)
}

The cmd execution flow

We covered cobra's hook execution order earlier:

// The *Run functions are executed in the following order:
//   * PersistentPreRun()
//   * PreRun()
//   * Run()
//   * *PostRun()
//   * *PersistentPostRun()
// All functions get the same args, the arguments after the command name.
//

PersistentPreRunE: preparation

Set the WarningHandler

		PersistentPreRunE: func(*cobra.Command, []string) error {
			// silence client-go warnings.
			// kube-apiserver loopback clients should not log self-issued warnings.
			rest.SetDefaultWarningHandler(rest.NoWarnings{})
			return nil
		},

RunE analysis: preparation

Print the version information

verflag.PrintAndExitIfRequested()
...
// PrintAndExitIfRequested will check if the -version flag was passed
// and, if so, print the version and exit.
func PrintAndExitIfRequested() {
	if *versionFlag == VersionRaw {
		fmt.Printf("%#v\n", version.Get())
		os.Exit(0)
	} else if *versionFlag == VersionTrue {
		fmt.Printf("%s %s\n", programName, version.Get())
		os.Exit(0)
	}
}

Print the command-line flags

// PrintFlags logs the flags in the flagset
func PrintFlags(flags *pflag.FlagSet) {
	flags.VisitAll(func(flag *pflag.Flag) {
		klog.V(1).Infof("FLAG: --%s=%q", flag.Name, flag.Value)
	})
}

Check the insecure port

delete this check after insecure flags removed in v1.24

Complete sets the default values

			// set default options
			completedOptions, err := Complete(s)
			if err != nil {
				return err
			}

Validate the command-line options

			// validate options
			if errs := completedOptions.Validate(); len(errs) != 0 {
				return utilerrors.NewAggregate(errs)
			}

cmd\kube-apiserver\app\options\validation.go

// Validate checks ServerRunOptions and return a slice of found errs.
func (s *ServerRunOptions) Validate() []error {
	var errs []error
	if s.MasterCount <= 0 {
		errs = append(errs, fmt.Errorf("--apiserver-count should be a positive number, but value '%d' provided", s.MasterCount))
	}
	errs = append(errs, s.Etcd.Validate()...)
	errs = append(errs, validateClusterIPFlags(s)...)
	errs = append(errs, validateServiceNodePort(s)...)
	errs = append(errs, validateAPIPriorityAndFairness(s)...)
	errs = append(errs, s.SecureServing.Validate()...)
	errs = append(errs, s.Authentication.Validate()...)
	errs = append(errs, s.Authorization.Validate()...)
	errs = append(errs, s.Audit.Validate()...)
	errs = append(errs, s.Admission.Validate()...)
	errs = append(errs, s.APIEnablement.Validate(legacyscheme.Scheme, apiextensionsapiserver.Scheme, aggregatorscheme.Scheme)...)
	errs = append(errs, validateTokenRequest(s)...)
	errs = append(errs, s.Metrics.Validate()...)
	errs = append(errs, validateAPIServerIdentity(s)...)

	return errs
}

As an example, here is the etcd validation in src\k8s.io\apiserver\pkg\server\options\etcd.go:

func (s *EtcdOptions) Validate() []error {
	if s == nil {
		return nil
	}

	allErrors := []error{}
	if len(s.StorageConfig.Transport.ServerList) == 0 {
		allErrors = append(allErrors, fmt.Errorf("--etcd-servers must be specified"))
	}

	if s.StorageConfig.Type != storagebackend.StorageTypeUnset && !storageTypes.Has(s.StorageConfig.Type) {
		allErrors = append(allErrors, fmt.Errorf("--storage-backend invalid, allowed values: %s. If not specified, it will default to 'etcd3'", strings.Join(storageTypes.List(), ", ")))
	}

	for _, override := range s.EtcdServersOverrides {
		tokens := strings.Split(override, "#")
		if len(tokens) != 2 {
			allErrors = append(allErrors, fmt.Errorf("--etcd-servers-overrides invalid, must be of format: group/resource#servers, where servers are URLs, semicolon separated"))
			continue
		}

		apiresource := strings.Split(tokens[0], "/")
		if len(apiresource) != 2 {
			allErrors = append(allErrors, fmt.Errorf("--etcd-servers-overrides invalid, must be of format: group/resource#servers, where servers are URLs, semicolon separated"))
			continue
		}

	}

	return allErrors
}
# check the etcd-related flags of the running kube-apiserver
kubectl get pod -n kube-system
ps -ef |grep apiserver
ps -ef |grep apiserver |grep etcd

The real Run function

Run(completeOptions, genericapiserver.SetupSignalHandler())

completedOptions is the completed ServerRunOptions

The second parameter: the stop channel

From the definition of the underlying Run function you can see the second parameter is a read-only stop channel, stopCh <-chan struct{}

Analysis of the corresponding genericapiserver.SetupSignalHandler():

var onlyOneSignalHandler = make(chan struct{})
var shutdownHandler chan os.Signal

// SetupSignalHandler registered for SIGTERM and SIGINT. A stop channel is returned
// which is closed on one of these signals. If a second signal is caught, the program
// is terminated with exit code 1.
// Only one of SetupSignalContext and SetupSignalHandler should be called, and only can
// be called once.
func SetupSignalHandler() <-chan struct{} {
	return SetupSignalContext().Done()
}

// SetupSignalContext is same as SetupSignalHandler, but a context.Context is returned.
// Only one of SetupSignalContext and SetupSignalHandler should be called, and only can
// be called once.
func SetupSignalContext() context.Context {
	close(onlyOneSignalHandler) // panics when called twice

	shutdownHandler = make(chan os.Signal, 2)

	ctx, cancel := context.WithCancel(context.Background())
	signal.Notify(shutdownHandler, shutdownSignals...)
	go func() {
		<-shutdownHandler
		cancel()
		<-shutdownHandler
		os.Exit(1) // second signal. Exit directly.
	}()

	return ctx
}

From the above you can see this returns the context's Done() channel, which is simply a <-chan struct{}.
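
A consumer of such a stop channel simply blocks on it to trigger shutdown. A minimal standalone sketch of the same signal-to-channel idea (illustrative, not the kube-apiserver code):

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// setupSignalHandler returns a channel that is closed when SIGINT or SIGTERM arrives.
func setupSignalHandler() <-chan struct{} {
	stopCh := make(chan struct{})
	c := make(chan os.Signal, 2)
	signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-c
		close(stopCh)
	}()
	return stopCh
}

func main() {
	stopCh := setupSignalHandler()
	fmt.Println("running; press Ctrl-C to stop")
	<-stopCh // block until the channel is closed, then shut down
	fmt.Println("shutting down")
}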

CreateServerChain creates three servers:

CreateKubeAPIServer creates kubeAPIServer, the core API service, covering the common Pod/Deployment/Service resources

createAPIExtensionsServer creates apiExtensionsServer, the API extension service, mainly for CRDs

createAggregatorServer creates aggregatorServer, the service that handles aggregated APIs such as metrics

Then it runs.

This subsection only skims the run flow; we will look at the details later.

// Run runs the specified APIServer.  This should never exit.
func Run(completeOptions completedServerRunOptions, stopCh <-chan struct{}) error {
	// To help debugging, immediately log version
	klog.Infof("Version: %+v", version.Get())

	klog.InfoS("Golang settings", "GOGC", os.Getenv("GOGC"), "GOMAXPROCS", os.Getenv("GOMAXPROCS"), "GOTRACEBACK", os.Getenv("GOTRACEBACK"))

	server, err := CreateServerChain(completeOptions, stopCh)
	if err != nil {
		return err
	}

	prepared, err := server.PrepareRun()
	if err != nil {
		return err
	}

	return prepared.Run(stopCh)
}

3.2 Preparing the core API service's generic configuration genericConfig

Key takeaways of this section

Preparation steps for the generic configuration the core API service needs:

create proxyTransport, the struct used to communicate with nodes, which caches long-lived connections for efficiency;

create the clientset;

initialize the etcd storage.

Analysis of CreateKubeAPIServerConfig, which builds the required configuration

D:\Workspace\Go\src\github.com\kubernetes\kubernetes\cmd\kube-apiserver\app\server.go

Create proxyTransport, the struct used to communicate with nodes, caching long-lived connections for efficiency

proxyTransport := CreateProxyTransport()

A brief look at http.Transport

The Transport's main job is essentially to cache long-lived connections

so they can be reused across large numbers of HTTP requests,

which reduces the time spent establishing TCP (TLS) connections when sending requests.
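
A minimal sketch of the idea (illustrative values, not kube-apiserver's actual settings): a single http.Transport shared by a client keeps idle connections in a pool so later requests skip the TCP/TLS handshake.

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// One Transport, reused by every request made through this client.
	transport := &http.Transport{
		MaxIdleConns:        100,              // total idle connections kept in the pool
		MaxIdleConnsPerHost: 10,               // idle connections kept per host
		IdleConnTimeout:     90 * time.Second, // how long an idle connection is kept
	}
	client := &http.Client{Transport: transport, Timeout: 10 * time.Second}

	// Repeated requests to the same host reuse the pooled connection,
	// avoiding a new TCP/TLS handshake each time.
	for i := 0; i < 3; i++ {
		resp, err := client.Get("https://example.com")
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		resp.Body.Close()
		fmt.Println("status:", resp.Status)
	}
}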

Create the generic configuration genericConfig

	genericConfig, versionedInformers, serviceResolver, pluginInitializers, admissionPostStartHook, storageFactory, err := buildGenericConfig(s.ServerRunOptions, proxyTransport)

Below is a look at the many ApplyTo calls

Each ApplyTo has a corresponding AddFlags that registers the command-line flags

First, genericConfig is created

genericConfig = genericapiserver.NewConfig(legacyscheme.Codecs)

Take the ApplyTo that configures HTTPS serving as an example

	if lastErr = s.SecureServing.ApplyTo(&genericConfig.SecureServing, &genericConfig.LoopbackClientConfig); lastErr != nil {
		return
	}

Under the hood this calls SecureServingOptions.ApplyTo; the corresponding AddFlags method registers the command-line flags, located at

staging\src\k8s.io\apiserver\pkg\server\options\serving.go

func (s *SecureServingOptions) AddFlags(fs *pflag.FlagSet) {
	if s == nil {
		return
	}

	fs.IPVar(&s.BindAddress, "bind-address", s.BindAddress, ""+
		"The IP address on which to listen for the --secure-port port. The "+
		"associated interface(s) must be reachable by the rest of the cluster, and by CLI/web "+
		"clients. If blank or an unspecified address (0.0.0.0 or ::), all interfaces will be used.")

	desc := "The port on which to serve HTTPS with authentication and authorization."
	if s.Required {
		desc += " It cannot be switched off with 0."
	} else {
		desc += " If 0, don't serve HTTPS at all."
	}
	fs.IntVar(&s.BindPort, "secure-port", s.BindPort, desc)

Initialize the etcd storage

Create the storage factory config

	storageFactoryConfig := kubeapiserver.NewStorageFactoryConfig()
	storageFactoryConfig.APIResourceConfig = genericConfig.MergedResourceConfig
	completedStorageFactoryConfig, err := storageFactoryConfig.Complete(s.Etcd)
	if err != nil {
		lastErr = err
		return
	}

Initialize the storage factory

	storageFactory, lastErr = completedStorageFactoryConfig.New()
	if lastErr != nil {
		return
	}

Apply the storage factory to the server run config; later, handles for operating on etcd can be obtained through RESTOptionsGetter

	if lastErr = s.Etcd.ApplyWithStorageFactoryTo(storageFactory, genericConfig); lastErr != nil {
		return
	}
func (s *EtcdOptions) ApplyWithStorageFactoryTo(factory serverstorage.StorageFactory, c *server.Config) error {
	if err := s.addEtcdHealthEndpoint(c); err != nil {
		return err
	}

	// use the StorageObjectCountTracker interface instance from server.Config
	s.StorageConfig.StorageObjectCountTracker = c.StorageObjectCountTracker

	c.RESTOptionsGetter = &StorageFactoryRestOptionsFactory{Options: *s, StorageFactory: factory}
	return nil
}

addEtcdHealthEndpoint creates the etcd health check

func (s *EtcdOptions) addEtcdHealthEndpoint(c *server.Config) error {
	healthCheck, err := storagefactory.CreateHealthCheck(s.StorageConfig)
	if err != nil {
		return err
	}
	c.AddHealthChecks(healthz.NamedCheck("etcd", func(r *http.Request) error {
		return healthCheck()
	}))

	if s.EncryptionProviderConfigFilepath != "" {
		kmsPluginHealthzChecks, err := encryptionconfig.GetKMSPluginHealthzCheckers(s.EncryptionProviderConfigFilepath)
		if err != nil {
			return err
		}
		c.AddHealthChecks(kmsPluginHealthzChecks...)
	}

	return nil
}

From CreateHealthCheck you can see that only the etcd v3 API is supported

// CreateHealthCheck creates a healthcheck function based on given config.
func CreateHealthCheck(c storagebackend.Config) (func() error, error) {
	switch c.Type {
	case storagebackend.StorageTypeETCD2:
		return nil, fmt.Errorf("%s is no longer a supported storage backend", c.Type)
	case storagebackend.StorageTypeUnset, storagebackend.StorageTypeETCD3:
		return newETCD3HealthCheck(c)
	default:
		return nil, fmt.Errorf("unknown storage type: %s", c.Type)
	}
}

Configure protobufs for internal (loopback) communication and disable compression,

because the internal network is fast and there is no point burning CPU on compression and decompression just to save bandwidth.

	// Use protobufs for self-communication.
	// Since not every generic apiserver has to support protobufs, we
	// cannot default to it in generic apiserver and need to explicitly
	// set it in kube-apiserver.
	genericConfig.LoopbackClientConfig.ContentConfig.ContentType = "application/vnd.kubernetes.protobuf"
	// Disable compression for self-communication, since we are going to be
	// on a fast local network
	genericConfig.LoopbackClientConfig.DisableCompression = true

Create the clientset

	kubeClientConfig := genericConfig.LoopbackClientConfig
	clientgoExternalClient, err := clientgoclientset.NewForConfig(kubeClientConfig)
	if err != nil {
		lastErr = fmt.Errorf("failed to create real external clientset: %v", err)
		return
	}
	versionedInformers = clientgoinformers.NewSharedInformerFactory(clientgoExternalClient, 10*time.Minute)

versionedInformers is the client-go shared informer factory, used to list-and-watch k8s objects.
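
For reference, a shared informer factory built this way is typically used like the following minimal sketch (illustrative; it assumes a kubeconfig-based client rather than the apiserver's loopback config):

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from a local kubeconfig (the apiserver itself uses
	// its loopback config instead).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Shared informer factory with a 10-minute resync, like versionedInformers.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Printf("pod added: %s/%s\n", pod.Namespace, pod.Name)
		},
	})

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)            // start list-and-watch
	factory.WaitForCacheSync(stopCh) // wait for the initial list to finish
	select {}                        // keep running
}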

3.3 Authentication in the core API service

The purpose of Authentication

To verify who you are, confirming that "you are who you say you are"

It covers multiple mechanisms, such as client certificates, passwords, plain tokens, bootstrap tokens, and JWT tokens

Kubernetes uses authentication plugins to authenticate API requests with the following strategies:

- client certificates

- bearer tokens

- an authenticating proxy

- HTTP basic auth

Rules of the union authenticator

- if an authenticator returns an error, return immediately: authentication failed

- if an authenticator returns ok, authentication has passed and we return immediately without running the remaining authenticators

- if no authenticator returns ok, authentication failed

To verify who you are, confirming that "you are who you say you are"

It covers multiple mechanisms, such as client certificates, passwords, plain tokens, bootstrap tokens, and JWT tokens

Docs: https://kubernetes.io/zh/docs/reference/access-authn-authz/authentication/

Every Kubernetes cluster has two categories of users: service accounts managed by Kubernetes, and normal users.

Authentication therefore revolves around these two categories of users.

Authentication strategies

Kubernetes uses authentication plugins to authenticate API requests via client certificates, bearer tokens, an authenticating proxy, or HTTP basic auth.

When an HTTP request reaches the API server, the plugins associate the following attributes with the request:

- Username: a string identifying the end user. Common values are kube-admin or tom@example.com.

- UID: a string identifying the end user, intended to be more consistent and unique than the username.

- Groups: a set of strings indicating the named logical collections of users the user belongs to. Common values are system:masters or devops-team.

- Extra fields: a map of string keys to lists of string values, holding additional information that authorizers may find useful.

You can enable multiple authentication methods at once, and you typically use at least two:

- service account tokens for service accounts

- at least one other method to authenticate users

When several authentication modules are enabled in a cluster, the first module that successfully authenticates the request short-circuits the decision. The API server does not guarantee the order in which authenticators run.

The group system:authenticated is added to the group list of every authenticated user.

Integration with other authentication protocols (LDAP, SAML, Kerberos, alternate X509 schemes, and so on) can be achieved with an authenticating proxy or an authentication webhook.

Reading the code

D:\Workspace\Go\src\github.com\kubernetes\kubernetes\cmd\kube-apiserver\app\server.go

Earlier, when building the server, inside the generic configuration builder buildGenericConfig:

	// Authentication.ApplyTo requires already applied OpenAPIConfig and EgressSelector if present
	if lastErr = s.Authentication.ApplyTo(&genericConfig.Authentication, genericConfig.SecureServing, genericConfig.EgressSelector, genericConfig.OpenAPIConfig, genericConfig.OpenAPIV3Config, clientgoExternalClient, versionedInformers); lastErr != nil {
		return
	}

The real Authentication initialization

D:\Workspace\Go\src\github.com\kubernetes\kubernetes\pkg\kubeapiserver\options\authorization.go

	authInfo.Authenticator, openAPIConfig.SecurityDefinitions, err = authenticatorConfig.New()

The New code creates the authenticator instances, supporting multiple methods: request-header authentication, auth file, CA certificates, and bearer tokens

D:\Workspace\Go\src\github.com\kubernetes\kubernetes\pkg\kubeapiserver\authenticator\config.go

Core variable 1: tokenAuthenticators []authenticator.Token, representing bearer-token authentication

// Token checks a string value against a backing authentication store and
// returns a Response or an error if the token could not be checked.
type Token interface {
	AuthenticateToken(ctx context.Context, token string) (*Response, bool, error)
}

Each token authenticator is appended to the slice; finally a union object is created, and unionAuthTokenHandler.AuthenticateToken is what ends up being called

		// Union the token authenticators
		tokenAuth := tokenunion.New(tokenAuthenticators...)
// AuthenticateToken authenticates the token using a chain of authenticator.Token objects.
func (authHandler *unionAuthTokenHandler) AuthenticateToken(ctx context.Context, token string) (*authenticator.Response, bool, error) {
	var errlist []error
	for _, currAuthRequestHandler := range authHandler.Handlers {
		info, ok, err := currAuthRequestHandler.AuthenticateToken(ctx, token)
		if err != nil {
			if authHandler.FailOnError {
				return info, ok, err
			}
			errlist = append(errlist, err)
			continue
		}

		if ok {
			return info, ok, err
		}
	}

	return nil, false, utilerrors.NewAggregate(errlist)
}

Core variable 2: authenticator.Request, the interface for request authentication, whose AuthenticateRequest is the corresponding authentication method

// Request attempts to extract authentication information from a request and
// returns a Response or an error if the request could not be checked.
type Request interface {
	AuthenticateRequest(req *http.Request) (*Response, bool, error)
}

These are likewise appended to the slice, for example the x509 authenticator

	// X509 methods
	if config.ClientCAContentProvider != nil {
		certAuth := x509.NewDynamic(config.ClientCAContentProvider.VerifyOptions, x509.CommonNameUserConversion)
		authenticators = append(authenticators, certAuth)
	}

The unionAuthTokenHandler from above is also added to the chain

		authenticators = append(authenticators, bearertoken.New(tokenAuth), websocket.NewProtocolAuthenticator(tokenAuth))

Finally a union object, unionAuthRequestHandler, is created

authenticator := union.New(authenticators...)

最终调用得unionAuthRequestHandler.AuthenticateRequest方法遍历认证方法认证

// AuthenticateRequest authenticates the request using a chain of authenticator.Request objects.
func (authHandler *unionAuthRequestHandler) AuthenticateRequest(req *http.Request) (*authenticator.Response, bool, error) {
	var errlist []error
	for _, currAuthRequestHandler := range authHandler.Handlers {
		resp, ok, err := currAuthRequestHandler.AuthenticateRequest(req)
		if err != nil {
			if authHandler.FailOnError {
				return resp, ok, err
			}
			errlist = append(errlist, err)
			continue
		}

		if ok {
			return resp, ok, err
		}
	}

	return nil, false, utilerrors.NewAggregate(errlist)
}

Reading the code:

- if an authenticator returns an error (and FailOnError is set), the union returns immediately and authentication fails; otherwise the error is collected and the chain continues

- if an authenticator returns ok, authentication succeeded and the union returns right away without running the remaining authenticators

- if no authenticator returns ok, authentication fails

Key points of this section:

The purpose of Authentication

Kubernetes uses authentication plugins to authenticate API requests with the following strategies:

- client certificates

- bearer tokens

- an authenticating proxy

- HTTP basic auth

Rules of union authentication

- if an authenticator returns an error (with FailOnError set), the union returns immediately and authentication fails

- if an authenticator returns ok, authentication succeeded and the union returns without running the rest of the chain

- if no authenticator returns ok, authentication fails

3.4 Authorization in the API core service

Authorization

Authorization answers the question "are you allowed to do this?". Whether a request is allowed is decided by configured policies.

This section covers:

- the 4 authorization modules

- the authorization chain unionAuthzHandler


Authorization confirms "are you allowed to do this?", and whether you have the permission is decided by configured policies.

Kubernetes uses the API server to authorize API requests.

It evaluates all request attributes against all policies and decides to allow or deny the request.

Every part of an API request must be allowed by some policy in order to proceed, which means permissions are denied by default.

When multiple authorization modules are configured, Kubernetes uses each module in order. If any module approves or denies the request, that decision is returned immediately and no other module is consulted. If all modules have no opinion on the request, the request is denied. A denial returns HTTP status code 403.

Documentation: https://kubernetes.io/zh/docs/reference/access-authn-authz/authorization/

The 4 authorization modules

Documentation: https://kubernetes.io/zh/docs/reference/access-authn-authz/authorization/#authorization-modules

Node - a special-purpose authorizer that grants permissions to kubelets based on the Pods scheduled to run on them. See Node authorization for more information.

ABAC - attribute-based access control (ABAC) defines an access-control paradigm whereby access rights are granted to users through policies that combine attributes. The policies can use any type of attribute (user attributes, resource attributes, objects, environment attributes, and so on). See the ABAC mode for more information.

RBAC - role-based access control (RBAC) is a method of regulating access to computer or network resources based on the roles of individual users within an enterprise. In this context, a permission is the ability of a single user to perform a specific task, such as viewing, creating, or modifying a file. See the RBAC mode for more information.

- When enabled, RBAC uses the rbac.authorization.k8s.io API group to drive authorization decisions, allowing administrators to dynamically configure permission policies through the Kubernetes API.

- To enable RBAC, start the API server with --authorization-mode=RBAC.

Webhook - a webhook is an HTTP callback: an HTTP POST that is made when something happens, i.e. a simple event notification via HTTP POST. A web application implementing webhooks POSTs a message to a URL when something happens. See the Webhook mode for more information.

Code walkthrough

The entry point is again in buildGenericConfig, D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-apiserver\app\server.go:

	genericConfig.Authorization.Authorizer, genericConfig.RuleResolver, err = BuildAuthorizer(s, genericConfig.EgressSelector, versionedInformers)

The authorizer is again built through a New constructor, located at D:\Workspace\Go\src\k8s.io\kubernetes\pkg\kubeapiserver\authorizer\config.go:

authorizationConfig.New()

Analysis of the New constructor

Core variable 1: authorizers

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\authorization\authorizer\interfaces.go

// Authorizer makes an authorization decision based on information gained by making
// zero or more calls to methods of the Attributes interface.  It returns nil when an action is
// authorized, otherwise it returns an error.
type Authorizer interface {
	Authorize(ctx context.Context, a Attributes) (authorized Decision, reason string, err error)
}

This is the authorization interface; its Authorize method performs the check and returns:

Decision, the authorization result, which can be:

- DecisionDeny: denied

- DecisionAllow: allowed

- DecisionNoOpinion: no opinion

reason, the reason for a denial
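
To make the interface concrete, here is a minimal, hypothetical Authorizer that only allows read verbs and stays neutral on everything else. The type name and the policy are made up for illustration; real authorizers such as the node or RBAC authorizers follow the same shape.

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/apiserver/pkg/authorization/authorizer"
)

// readOnlyAuthorizer allows get/list/watch and has no opinion on other verbs,
// so the next authorizer in the union chain can still decide.
type readOnlyAuthorizer struct{}

func (readOnlyAuthorizer) Authorize(ctx context.Context, a authorizer.Attributes) (authorizer.Decision, string, error) {
	switch a.GetVerb() {
	case "get", "list", "watch":
		return authorizer.DecisionAllow, "read-only verbs are allowed", nil
	default:
		return authorizer.DecisionNoOpinion, "", nil
	}
}

func main() {
	attrs := authorizer.AttributesRecord{Verb: "list", Resource: "pods", ResourceRequest: true}
	decision, reason, _ := readOnlyAuthorizer{}.Authorize(context.Background(), attrs)
	fmt.Println(decision == authorizer.DecisionAllow, reason) // true "read-only verbs are allowed"
}
```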

Core variable 2: ruleResolvers

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\authorization\authorizer\interfaces.go

// RuleResolver provides a mechanism for resolving the list of rules that apply to a given user within a namespace.
type RuleResolver interface {
	// RulesFor get the list of cluster wide rules, the list of rules in the specific namespace, incomplete status and errors.
	RulesFor(user user.Info, namespace string) ([]ResourceRuleInfo, []NonResourceRuleInfo, bool, error)
}

This is the interface for resolving rules; its RulesFor method returns:

[]ResourceRuleInfo: the resource-type rules

[]NonResourceRuleInfo: the non-resource-type rules, e.g. nonResourceURLs: ["/metrics"]

New then iterates over the configured authorization modes and appends to the two slices above:

	for _, authorizationMode := range config.AuthorizationModes {
		// Keep cases in sync with constant list in k8s.io/kubernetes/pkg/kubeapiserver/authorizer/modes/modes.go.
		switch authorizationMode {
		case modes.ModeNode:
			node.RegisterMetrics()
			graph := node.NewGraph()
			node.AddGraphEventHandlers(
				graph,
				config.VersionedInformerFactory.Core().V1().Nodes(),
				config.VersionedInformerFactory.Core().V1().Pods(),
				config.VersionedInformerFactory.Core().V1().PersistentVolumes(),
				config.VersionedInformerFactory.Storage().V1().VolumeAttachments(),
			)
			nodeAuthorizer := node.NewAuthorizer(graph, nodeidentifier.NewDefaultNodeIdentifier(), bootstrappolicy.NodeRules())
			authorizers = append(authorizers, nodeAuthorizer)
			ruleResolvers = append(ruleResolvers, nodeAuthorizer)

		case modes.ModeAlwaysAllow:
			alwaysAllowAuthorizer := authorizerfactory.NewAlwaysAllowAuthorizer()
			authorizers = append(authorizers, alwaysAllowAuthorizer)
			ruleResolvers = append(ruleResolvers, alwaysAllowAuthorizer)
		case modes.ModeAlwaysDeny:
			alwaysDenyAuthorizer := authorizerfactory.NewAlwaysDenyAuthorizer()
			authorizers = append(authorizers, alwaysDenyAuthorizer)
			ruleResolvers = append(ruleResolvers, alwaysDenyAuthorizer)
		case modes.ModeABAC:
			abacAuthorizer, err := abac.NewFromFile(config.PolicyFile)
			if err != nil {
				return nil, nil, err
			}
			authorizers = append(authorizers, abacAuthorizer)
			ruleResolvers = append(ruleResolvers, abacAuthorizer)
		case modes.ModeWebhook:
			if config.WebhookRetryBackoff == nil {
				return nil, nil, errors.New("retry backoff parameters for authorization webhook has not been specified")
			}
			clientConfig, err := webhookutil.LoadKubeconfig(config.WebhookConfigFile, config.CustomDial)
			if err != nil {
				return nil, nil, err
			}
			webhookAuthorizer, err := webhook.New(clientConfig,
				config.WebhookVersion,
				config.WebhookCacheAuthorizedTTL,
				config.WebhookCacheUnauthorizedTTL,
				*config.WebhookRetryBackoff,
			)
			if err != nil {
				return nil, nil, err
			}
			authorizers = append(authorizers, webhookAuthorizer)
			ruleResolvers = append(ruleResolvers, webhookAuthorizer)
		case modes.ModeRBAC:
			rbacAuthorizer := rbac.New(
				&rbac.RoleGetter{Lister: config.VersionedInformerFactory.Rbac().V1().Roles().Lister()},
				&rbac.RoleBindingLister{Lister: config.VersionedInformerFactory.Rbac().V1().RoleBindings().Lister()},
				&rbac.ClusterRoleGetter{Lister: config.VersionedInformerFactory.Rbac().V1().ClusterRoles().Lister()},
				&rbac.ClusterRoleBindingLister{Lister: config.VersionedInformerFactory.Rbac().V1().ClusterRoleBindings().Lister()},
			)
			authorizers = append(authorizers, rbacAuthorizer)
			ruleResolvers = append(ruleResolvers, rbacAuthorizer)
		default:
			return nil, nil, fmt.Errorf("unknown authorization mode %s specified", authorizationMode)
		}
	}

Finally, New returns union objects over both slices, just like authentication does:

	return union.New(authorizers...), union.NewRuleResolvers(ruleResolvers...), nil

The union of authorizers: unionAuthzHandler

Location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\authorization\union\union.go

// New returns an authorizer that authorizes against a chain of authorizer.Authorizer objects
func New(authorizationHandlers ...authorizer.Authorizer) authorizer.Authorizer {
	return unionAuthzHandler(authorizationHandlers)
}

// Authorizes against a chain of authorizer.Authorizer objects and returns nil if successful and returns error if unsuccessful
func (authzHandler unionAuthzHandler) Authorize(ctx context.Context, a authorizer.Attributes) (authorizer.Decision, string, error) {
	var (
		errlist    []error
		reasonlist []string
	)

	for _, currAuthzHandler := range authzHandler {
		decision, reason, err := currAuthzHandler.Authorize(ctx, a)

		if err != nil {
			errlist = append(errlist, err)
		}
		if len(reason) != 0 {
			reasonlist = append(reasonlist, reason)
		}
		switch decision {
		case authorizer.DecisionAllow, authorizer.DecisionDeny:
			return decision, reason, err
		case authorizer.DecisionNoOpinion:
			// continue to the next authorizer
		}
	}

	return authorizer.DecisionNoOpinion, strings.Join(reasonlist, "\n"), utilerrors.NewAggregate(errlist)
}

unionAuthzHandler's Authorize method likewise iterates over the inner authorizers and calls their Authorize methods.

If any authorizer returns a decision of allow or deny, that decision is returned immediately.

Otherwise the decision is "no opinion" and the next Authorize in the chain is tried.
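
The short-circuit behaviour is easy to verify with the real union helper. This is a small, hypothetical demo (the fixedAuthorizer type is invented for illustration) showing that a NoOpinion authorizer falls through to the next one, while the first Allow/Deny decision wins:

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/apiserver/pkg/authorization/authorizer"
	"k8s.io/apiserver/pkg/authorization/union"
)

// fixedAuthorizer always returns the same decision; it exists only for the demo.
type fixedAuthorizer struct {
	decision authorizer.Decision
	reason   string
}

func (f fixedAuthorizer) Authorize(ctx context.Context, a authorizer.Attributes) (authorizer.Decision, string, error) {
	return f.decision, f.reason, nil
}

func main() {
	chain := union.New(
		fixedAuthorizer{authorizer.DecisionNoOpinion, ""},          // falls through
		fixedAuthorizer{authorizer.DecisionDeny, "denied by demo"}, // first real decision wins
		fixedAuthorizer{authorizer.DecisionAllow, "never reached"},
	)
	decision, reason, err := chain.Authorize(context.Background(), authorizer.AttributesRecord{Verb: "get"})
	fmt.Println(decision == authorizer.DecisionDeny, reason, err) // true "denied by demo" <nil>
}
```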

The union of ruleResolvers: unionAuthzRulesHandler

Location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\authorization\union\union.go

// unionAuthzRulesHandler authorizer against a chain of authorizer.RuleResolver
type unionAuthzRulesHandler []authorizer.RuleResolver

// NewRuleResolvers returns an authorizer that authorizes against a chain of authorizer.Authorizer objects
func NewRuleResolvers(authorizationHandlers ...authorizer.RuleResolver) authorizer.RuleResolver {
	return unionAuthzRulesHandler(authorizationHandlers)
}

// RulesFor against a chain of authorizer.RuleResolver objects and returns nil if successful and returns error if unsuccessful
func (authzHandler unionAuthzRulesHandler) RulesFor(user user.Info, namespace string) ([]authorizer.ResourceRuleInfo, []authorizer.NonResourceRuleInfo, bool, error) {
	var (
		errList              []error
		resourceRulesList    []authorizer.ResourceRuleInfo
		nonResourceRulesList []authorizer.NonResourceRuleInfo
	)
	incompleteStatus := false

	for _, currAuthzHandler := range authzHandler {
		resourceRules, nonResourceRules, incomplete, err := currAuthzHandler.RulesFor(user, namespace)

		if incomplete {
			incompleteStatus = true
		}
		if err != nil {
			errList = append(errList, err)
		}
		if len(resourceRules) > 0 {
			resourceRulesList = append(resourceRulesList, resourceRules...)
		}
		if len(nonResourceRules) > 0 {
			nonResourceRulesList = append(nonResourceRulesList, nonResourceRules...)
		}
	}

	return resourceRulesList, nonResourceRulesList, incompleteStatus, utilerrors.NewAggregate(errList)
}

unionAuthzRulesHandler's RulesFor method iterates over the inner resolvers, calls their RulesFor methods to obtain resourceRules and nonResourceRules, appends the results to resourceRulesList and nonResourceRulesList, and returns them.

Key points of this section:

- the purpose of Authorization

- the 4 authorization modules

- the authorization chain unionAuthzHandler

3.5 Node authorization

Authorization

Key points of this section:

Node authorization is a special-purpose authorization mode that specifically authorizes API requests made by kubelets.

The 4 rules, decoded:

- if the request does not come from a node, reject it

- if the nodeName cannot be identified, reject it

- if the request is for a configmap, pod, pv, pvc or secret, additional checks apply:

  - if the verb is not a read verb (get, or get/list/watch for secrets and configmaps), reject

  - if the requested object is not related to the node, reject

- for any other resource, match the request against the statically defined node rules

Node authorization

Documentation: https://kubernetes.io/zh/docs/reference/access-authn-authz/node/

Node authorization is a special-purpose authorization mode that specifically authorizes API requests made by kubelets.

Overview

The node authorizer allows a kubelet to perform API operations, including:

Read operations:

- services

- endpoints

- nodes

- pods

- secrets, configmaps, persistent volume claims and persistent volumes related to pods bound to the kubelet's node

Write operations:

- nodes and node status (enable the NodeRestriction admission plugin to limit a kubelet to modifying its own node)

- pods and pod status (enable the NodeRestriction admission plugin to limit a kubelet to modifying pods bound to itself)

Auth-related operations:

- read/write access to the certificatesigningrequests API used for TLS bootstrapping

- the ability to create tokenreviews and subjectaccessreviews for delegated authentication/authorization checks

Source code walkthrough

Location: D:\Workspace\Go\src\k8s.io\kubernetes\plugin\pkg\auth\authorizer\node\node_authorizer.go

func (r *NodeAuthorizer) Authorize(ctx context.Context, attrs authorizer.Attributes) (authorizer.Decision, string, error) {
	nodeName, isNode := r.identifier.NodeIdentity(attrs.GetUser())
	if !isNode {
		// reject requests from non-nodes
		return authorizer.DecisionNoOpinion, "", nil
	}
	if len(nodeName) == 0 {
		// reject requests from unidentifiable nodes
		klog.V(2).Infof("NODE DENY: unknown node for user %q", attrs.GetUser().GetName())
		return authorizer.DecisionNoOpinion, fmt.Sprintf("unknown node for user %q", attrs.GetUser().GetName()), nil
	}

	// subdivide access to specific resources
	if attrs.IsResourceRequest() {
		requestResource := schema.GroupResource{Group: attrs.GetAPIGroup(), Resource: attrs.GetResource()}
		switch requestResource {
		case secretResource:
			return r.authorizeReadNamespacedObject(nodeName, secretVertexType, attrs)
		case configMapResource:
			return r.authorizeReadNamespacedObject(nodeName, configMapVertexType, attrs)
		case pvcResource:
			if attrs.GetSubresource() == "status" {
				return r.authorizeStatusUpdate(nodeName, pvcVertexType, attrs)
			}
			return r.authorizeGet(nodeName, pvcVertexType, attrs)
		case pvResource:
			return r.authorizeGet(nodeName, pvVertexType, attrs)
		case vaResource:
			return r.authorizeGet(nodeName, vaVertexType, attrs)
		case svcAcctResource:
			return r.authorizeCreateToken(nodeName, serviceAccountVertexType, attrs)
		case leaseResource:
			return r.authorizeLease(nodeName, attrs)
		case csiNodeResource:
			return r.authorizeCSINode(nodeName, attrs)
		}

	}

	// Access to other resources is not subdivided, so just evaluate against the statically defined node rules
	if rbac.RulesAllow(attrs, r.nodeRules...) {
		return authorizer.DecisionAllow, "", nil
	}
	return authorizer.DecisionNoOpinion, "", nil
}

Decoding the rules

// NodeAuthorizer authorizes requests from kubelets, with the following logic:
// 1. If a request is not from a node (NodeIdentity() returns isNode=false), reject
// 2. If a specific node cannot be identified (NodeIdentity() returns nodeName=""), reject
// 3. If a request is for a secret, configmap, persistent volume or persistent volume claim, reject unless the verb is get, and the requested object is related to the requesting node:
//    node <- configmap
//    node <- pod
//    node <- pod <- secret
//    node <- pod <- configmap
//    node <- pod <- pvc
//    node <- pod <- pvc <- pv
//    node <- pod <- pvc <- pv <- secret
// 4. For other resources, authorize all nodes uniformly using statically defined rules

The first two rules are straightforward.

Decoding rule 3

If the requested resource is a secret, configmap, persistent volume or persistent volume claim, the verb must be a read verb and the object must be related to the requesting node.

Taking secretResource as an example, authorizeReadNamespacedObject is called:

        case secretResource:
                return r.authorizeReadNamespacedObject(nodeName, secretVertexType, attrs)

authorizeReadNamespacedObject: validating namespaced reads

authorizeReadNamespacedObject is a decorator: it first checks that the request is a read of a namespaced object, then calls the underlying authorize method:

// authorizeReadNamespacedObject authorizes "get", "list" and "watch" requests to single objects of a
// specified types if they are related to the specified node.
func (r *NodeAuthorizer) authorizeReadNamespacedObject(nodeName string, startingType vertexType, attrs authorizer.Attributes) (authorizer.Decision, string, error) {
	switch attrs.GetVerb() {
	case "get", "list", "watch":
		//ok
	default:
		klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
		return authorizer.DecisionNoOpinion, "can only read resources of this type", nil
	}

	if len(attrs.GetSubresource()) > 0 {
		klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
		return authorizer.DecisionNoOpinion, "cannot read subresource", nil
	}
	if len(attrs.GetNamespace()) == 0 {
		klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
		return authorizer.DecisionNoOpinion, "can only read namespaced object of this type", nil
	}
	return r.authorize(nodeName, startingType, attrs)
}

Reading the code:

- DecisionNoOpinion means "no opinion"; if this is the only authorizer, that effectively means the request is denied

- if the verb is a mutating one, reject

- if the request targets a subresource, reject

- if the request has no namespace, reject

- otherwise call the underlying authorize

The node authorizer's underlying authorize method:

- if the object name is missing, reject

- hasPathFrom checks whether the object is related (in the node graph) to the node

- if there is no relationship, reject

func (r *NodeAuthorizer) authorize(nodeName string, startingType vertexType, attrs authorizer.Attributes) (authorizer.Decision, string, error) {
	if len(attrs.GetName()) == 0 {
		klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
		return authorizer.DecisionNoOpinion, "No Object name found", nil
	}

	ok, err := r.hasPathFrom(nodeName, startingType, attrs.GetNamespace(), attrs.GetName())
	if err != nil {
		klog.V(2).InfoS("NODE DENY", "err", err)
		return authorizer.DecisionNoOpinion, fmt.Sprintf("no relationship found between node '%s' and this object", nodeName), nil
	}
	if !ok {
		klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
		return authorizer.DecisionNoOpinion, fmt.Sprintf("no relationship found between node '%s' and this object", nodeName), nil
	}
	return authorizer.DecisionAllow, "", nil
}

authorizeGet, used for pvResource:

- if the verb is not get, reject

- if the request has a subresource, reject

- otherwise call the underlying authorize

// authorizeGet authorizes "get" requests to objects of the specified type if they are related to the specified node
func (r *NodeAuthorizer) authorizeGet(nodeName string, startingType vertexType, attrs authorizer.Attributes) (authorizer.Decision, string, error) {
	if attrs.GetVerb() != "get" {
		klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
		return authorizer.DecisionNoOpinion, "can only get individual resources of this type", nil
	}
	if len(attrs.GetSubresource()) > 0 {
		klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
		return authorizer.DecisionNoOpinion, "cannot get subresource", nil
	}
	return r.authorize(nodeName, startingType, attrs)
}

Decoding rule 4

Rule 4 means that when a node requests any other resource, the request is evaluated against the statically defined node rules:

	// Access to other resources is not subdivided, so just evaluate against the statically defined node rules
	if rbac.RulesAllow(attrs, r.nodeRules...) {
		return authorizer.DecisionAllow, "", nil
	}
	return authorizer.DecisionNoOpinion, "", nil

Under the hood, rbac.RulesAllow iterates the rules and calls RuleAllows for each one:

D:\Workspace\Go\src\k8s.io\kubernetes\plugin\pkg\auth\authorizer\rbac\rbac.go

func RuleAllows(requestAttributes authorizer.Attributes, rule *rbacv1.PolicyRule) bool {
	if requestAttributes.IsResourceRequest() {
		combinedResource := requestAttributes.GetResource()
		if len(requestAttributes.GetSubresource()) > 0 {
			combinedResource = requestAttributes.GetResource() + "/" + requestAttributes.GetSubresource()
		}

		return rbacv1helpers.VerbMatches(rule, requestAttributes.GetVerb()) &&
			rbacv1helpers.APIGroupMatches(rule, requestAttributes.GetAPIGroup()) &&
			rbacv1helpers.ResourceMatches(rule, combinedResource, requestAttributes.GetSubresource()) &&
			rbacv1helpers.ResourceNameMatches(rule, requestAttributes.GetName())
	}

	return rbacv1helpers.VerbMatches(rule, requestAttributes.GetVerb()) &&
		rbacv1helpers.NonResourceURLMatches(rule, requestAttributes.GetPath())
}

At the bottom are two matchers.

VerbMatches

- if the rule's verb is *, allow

- if the requested verb matches one of the rule's verbs, allow

D:\Workspace\Go\src\k8s.io\kubernetes\pkg\apis\rbac\v1\evaluation_helpers.go

func VerbMatches(rule *rbacv1.PolicyRule, requestedVerb string) bool {
	for _, ruleVerb := range rule.Verbs {
		if ruleVerb == rbacv1.VerbAll {
			return true
		}
		if ruleVerb == requestedVerb {
			return true
		}
	}

	return false
}

NonResourceURLMatches

D:\Workspace\Go\src\k8s.io\kubernetes\pkg\apis\rbac\v1\evaluation_helpers.go

- if the rule's URL is *, allow

- if the requested URL equals a URL defined in the rule, allow

- if a rule URL ends with *, it is a wildcard: allow when the requested URL starts with the rule URL's prefix

func NonResourceURLMatches(rule *rbacv1.PolicyRule, requestedURL string) bool {
	for _, ruleURL := range rule.NonResourceURLs {
		if ruleURL == rbacv1.NonResourceAll {
			return true
		}
		if ruleURL == requestedURL {
			return true
		}
		if strings.HasSuffix(ruleURL, "*") && strings.HasPrefix(requestedURL, strings.TrimRight(ruleURL, "*")) {
			return true
		}
	}

	return false
}
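
As a quick check of the wildcard behaviour, here is a small, self-contained usage sketch. It builds an rbacv1.PolicyRule with a trailing-* nonResourceURL and re-implements the same prefix match locally (the local matches function mirrors the logic shown above for illustration; it is not the upstream helper, which lives inside the kubernetes repo itself):

```go
package main

import (
	"fmt"
	"strings"

	rbacv1 "k8s.io/api/rbac/v1"
)

// matches mirrors NonResourceURLMatches from evaluation_helpers.go for illustration.
func matches(rule *rbacv1.PolicyRule, requestedURL string) bool {
	for _, ruleURL := range rule.NonResourceURLs {
		if ruleURL == rbacv1.NonResourceAll || ruleURL == requestedURL {
			return true
		}
		if strings.HasSuffix(ruleURL, "*") && strings.HasPrefix(requestedURL, strings.TrimRight(ruleURL, "*")) {
			return true
		}
	}
	return false
}

func main() {
	rule := &rbacv1.PolicyRule{
		Verbs:           []string{"get"},
		NonResourceURLs: []string{"/metrics*"},
	}
	fmt.Println(matches(rule, "/metrics"))          // true: prefix match on the wildcard
	fmt.Println(matches(rule, "/metrics/cadvisor")) // true
	fmt.Println(matches(rule, "/healthz"))          // false
}
```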

Where the node rules are defined

Location: D:\Workspace\Go\src\k8s.io\kubernetes\plugin\pkg\auth\authorizer\rbac\bootstrappolicy\policy.go

// NodeRules returns node policy rules, it is slice of rbacv1.PolicyRule.
func NodeRules() []rbacv1.PolicyRule {
	nodePolicyRules := []rbacv1.PolicyRule{
		// Needed to check API access.  These creates are non-mutating
		rbacv1helpers.NewRule("create").Groups(authenticationGroup).Resources("tokenreviews").RuleOrDie(),
		rbacv1helpers.NewRule("create").Groups(authorizationGroup).Resources("subjectaccessreviews", "localsubjectaccessreviews").RuleOrDie(),

		// Needed to build serviceLister, to populate env vars for services
		rbacv1helpers.NewRule(Read...).Groups(legacyGroup).Resources("services").RuleOrDie(),

		// Nodes can register Node API objects and report status.
		// Use the NodeRestriction admission plugin to limit a node to creating/updating its own API object.
		rbacv1helpers.NewRule("create", "get", "list", "watch").Groups(legacyGroup).Resources("nodes").RuleOrDie(),
		rbacv1helpers.NewRule("update", "patch").Groups(legacyGroup).Resources("nodes/status").RuleOrDie(),
		rbacv1helpers.NewRule("update", "patch").Groups(legacyGroup).Resources("nodes").RuleOrDie(),

		// TODO: restrict to the bound node as creator in the NodeRestrictions admission plugin
		rbacv1helpers.NewRule("create", "update", "patch").Groups(legacyGroup).Resources("events").RuleOrDie(),

		// TODO: restrict to pods scheduled on the bound node once field selectors are supported by list/watch authorization
		rbacv1helpers.NewRule(Read...).Groups(legacyGroup).Resources("pods").RuleOrDie(),

		// Needed for the node to create/delete mirror pods.
		// Use the NodeRestriction admission plugin to limit a node to creating/deleting mirror pods bound to itself.
		rbacv1helpers.NewRule("create", "delete").Groups(legacyGroup).Resources("pods").RuleOrDie(),
		// Needed for the node to report status of pods it is running.
		// Use the NodeRestriction admission plugin to limit a node to updating status of pods bound to itself.
		rbacv1helpers.NewRule("update", "patch").Groups(legacyGroup).Resources("pods/status").RuleOrDie(),
		// Needed for the node to create pod evictions.
		// Use the NodeRestriction admission plugin to limit a node to creating evictions for pods bound to itself.
		rbacv1helpers.NewRule("create").Groups(legacyGroup).Resources("pods/eviction").RuleOrDie(),

		// Needed for imagepullsecrets, rbd/ceph and secret volumes, and secrets in envs
		// Needed for configmap volume and envs
		// Use the Node authorization mode to limit a node to get secrets/configmaps referenced by pods bound to itself.
		rbacv1helpers.NewRule("get", "list", "watch").Groups(legacyGroup).Resources("secrets", "configmaps").RuleOrDie(),
		// Needed for persistent volumes
		// Use the Node authorization mode to limit a node to get pv/pvc objects referenced by pods bound to itself.
		rbacv1helpers.NewRule("get").Groups(legacyGroup).Resources("persistentvolumeclaims", "persistentvolumes").RuleOrDie(),

		// TODO: add to the Node authorizer and restrict to endpoints referenced by pods or PVs bound to the node
		// Needed for glusterfs volumes
		rbacv1helpers.NewRule("get").Groups(legacyGroup).Resources("endpoints").RuleOrDie(),
		// Used to create a certificatesigningrequest for a node-specific client certificate, and watch
		// for it to be signed. This allows the kubelet to rotate it's own certificate.
		rbacv1helpers.NewRule("create", "get", "list", "watch").Groups(certificatesGroup).Resources("certificatesigningrequests").RuleOrDie(),

		// Leases
		rbacv1helpers.NewRule("get", "create", "update", "patch", "delete").Groups("coordination.k8s.io").Resources("leases").RuleOrDie(),

		// CSI
		rbacv1helpers.NewRule("get").Groups(storageGroup).Resources("volumeattachments").RuleOrDie(),

		// Use the Node authorization to limit a node to create tokens for service accounts running on that node
		// Use the NodeRestriction admission plugin to limit a node to create tokens bound to pods on that node
		rbacv1helpers.NewRule("create").Groups(legacyGroup).Resources("serviceaccounts/token").RuleOrDie(),
	}

	// Use the Node authorization mode to limit a node to update status of pvc objects referenced by pods bound to itself.
	// Use the NodeRestriction admission plugin to limit a node to just update the status stanza.
	pvcStatusPolicyRule := rbacv1helpers.NewRule("get", "update", "patch").Groups(legacyGroup).Resources("persistentvolumeclaims/status").RuleOrDie()
	nodePolicyRules = append(nodePolicyRules, pvcStatusPolicyRule)

	// CSI
	csiDriverRule := rbacv1helpers.NewRule("get", "watch", "list").Groups("storage.k8s.io").Resources("csidrivers").RuleOrDie()
	nodePolicyRules = append(nodePolicyRules, csiDriverRule)
	csiNodeInfoRule := rbacv1helpers.NewRule("get", "create", "update", "patch", "delete").Groups("storage.k8s.io").Resources("csinodes").RuleOrDie()
	nodePolicyRules = append(nodePolicyRules, csiNodeInfoRule)

	// RuntimeClass
	nodePolicyRules = append(nodePolicyRules, rbacv1helpers.NewRule("get", "list", "watch").Groups("node.k8s.io").Resources("runtimeclasses").RuleOrDie())
	return nodePolicyRules
}

Taking endpoints as an example, this rule means a node may get endpoints resources in the core API group:

				rbacv1helpers.NewRule("get").Groups(legacyGroup).Resources("endpoints").RuleOrDie(),

Key points of this section: same as the summary at the top of this section.

3.6 RBAC authorization

The rules in a role/clusterrole cover:

- resource objects

- non-resource objects

- apiGroups

- verbs

The code logic of RBAC authorization:

- list clusterRoleBindings via an informer, match the subjects against the user, fetch the bound role's rules via an informer, then call visit on each rule to match it

- list RoleBindings via an informer, match the subjects against the user and namespace, fetch the bound role's rules via an informer, then call visit on each rule to match it

Key points of this section:

- the relationship between the four RBAC objects
  - role, clusterrole
  - rolebinding, clusterrolebinding
- the rules in role/clusterrole
  - resource objects
  - non-resource objects
  - apiGroups
  - verbs
- the code logic of RBAC authorization
  - list clusterRoleBindings via an informer, match the subjects against the user, fetch the rules via an informer, and call visit on each rule
  - list RoleBindings via an informer, match the subjects against the user and namespace, fetch the rules via an informer, and call visit on each rule

The RBAC authorization model

Documentation: https://kubernetes.io/zh/docs/reference/access-authn-authz/rbac/

Introduction

Role-based access control (RBAC) is a method of regulating access to computer or network resources based on the roles of individual users within an organization.

RBAC authorization uses the rbac.authorization.k8s.io API group to drive authorization decisions, allowing you to dynamically configure policies through the Kubernetes API.

See the documentation for details.

Source code walkthrough

Entry point: D:\Workspace\Go\src\k8s.io\kubernetes\pkg\kubeapiserver\authorizer\config.go

		case modes.ModeRBAC:
			rbacAuthorizer := rbac.New(
				&rbac.RoleGetter{Lister: config.VersionedInformerFactory.Rbac().V1().Roles().Lister()},
				&rbac.RoleBindingLister{Lister: config.VersionedInformerFactory.Rbac().V1().RoleBindings().Lister()},
				&rbac.ClusterRoleGetter{Lister: config.VersionedInformerFactory.Rbac().V1().ClusterRoles().Lister()},
				&rbac.ClusterRoleBindingLister{Lister: config.VersionedInformerFactory.Rbac().V1().ClusterRoleBindings().Lister()},
			)
			authorizers = append(authorizers, rbacAuthorizer)
			ruleResolvers = append(ruleResolvers, rbacAuthorizer)

rbac.New is passed getters/listers for the 4 object kinds: Role, ClusterRole, RoleBinding and ClusterRoleBinding.

func New(roles rbacregistryvalidation.RoleGetter, roleBindings rbacregistryvalidation.RoleBindingLister, clusterRoles rbacregistryvalidation.ClusterRoleGetter, clusterRoleBindings rbacregistryvalidation.ClusterRoleBindingLister) *RBACAuthorizer {
	authorizer := &RBACAuthorizer{
		authorizationRuleResolver: rbacregistryvalidation.NewDefaultRuleResolver(
			roles, roleBindings, clusterRoles, clusterRoleBindings,
		),
	}
	return authorizer
}

It builds a DefaultRuleResolver and uses it to construct the RBACAuthorizer.

Analysis of RBACAuthorizer.Authorize

The key decision point is the allowed flag of the ruleCheckingVisitor: if it is true the request is allowed, otherwise it is not:

func (r *RBACAuthorizer) Authorize(ctx context.Context, requestAttributes authorizer.Attributes) (authorizer.Decision, string, error) {
	ruleCheckingVisitor := &authorizingVisitor{requestAttributes: requestAttributes}

	r.authorizationRuleResolver.VisitRulesFor(requestAttributes.GetUser(), requestAttributes.GetNamespace(), ruleCheckingVisitor.visit)
	if ruleCheckingVisitor.allowed {
		return authorizer.DecisionAllow, ruleCheckingVisitor.reason, nil
	}

	// Build a detailed log of the denial.
	// Make the whole block conditional so we don't do a lot of string-building we won't use.
	if klogV := klog.V(5); klogV.Enabled() {
		var operation string
		if requestAttributes.IsResourceRequest() {
			b := &bytes.Buffer{}
			b.WriteString(`"`)
			b.WriteString(requestAttributes.GetVerb())
			b.WriteString(`" resource "`)
			b.WriteString(requestAttributes.GetResource())
			if len(requestAttributes.GetAPIGroup()) > 0 {
				b.WriteString(`.`)
				b.WriteString(requestAttributes.GetAPIGroup())
			}
			if len(requestAttributes.GetSubresource()) > 0 {
				b.WriteString(`/`)
				b.WriteString(requestAttributes.GetSubresource())
			}
			b.WriteString(`"`)
			if len(requestAttributes.GetName()) > 0 {
				b.WriteString(` named "`)
				b.WriteString(requestAttributes.GetName())
				b.WriteString(`"`)
			}
			operation = b.String()
		} else {
			operation = fmt.Sprintf("%q nonResourceURL %q", requestAttributes.GetVerb(), requestAttributes.GetPath())
		}

		var scope string
		if ns := requestAttributes.GetNamespace(); len(ns) > 0 {
			scope = fmt.Sprintf("in namespace %q", ns)
		} else {
			scope = "cluster-wide"
		}

		klogV.Infof("RBAC: no rules authorize user %q with groups %q to %s %s", requestAttributes.GetUser().GetName(), requestAttributes.GetUser().GetGroups(), operation, scope)
	}

	reason := ""
	if len(ruleCheckingVisitor.errors) > 0 {
		reason = fmt.Sprintf("RBAC: %v", utilerrors.NewAggregate(ruleCheckingVisitor.errors))
	}
	return authorizer.DecisionNoOpinion, reason, nil
}

The allowed flag is only set in the visit method, and only when RuleAllows returns true:

func (v *authorizingVisitor) visit(source fmt.Stringer, rule *rbacv1.PolicyRule, err error) bool {
	if rule != nil && RuleAllows(v.requestAttributes, rule) {
		v.allowed = true
		v.reason = fmt.Sprintf("RBAC: allowed by %s", source.String())
		return false
	}
	if err != nil {
		v.errors = append(v.errors, err)
	}
	return true
}

VisitRulesFor calls visit on every rule:

	r.authorizationRuleResolver.VisitRulesFor(requestAttributes.GetUser(), requestAttributes.GetNamespace(), ruleCheckingVisitor.visit)

ClusterRoleBindings are checked first.

The flow starts by listing clusterRoleBindings through an informer; if that errors out the check fails, because the visitor is invoked with a nil rule and allowed can never be set to true:

	if clusterRoleBindings, err := r.clusterRoleBindingLister.ListClusterRoleBindings(); err != nil {
		if !visitor(nil, nil, err) {
			return
		}
	}

Each binding's subjects are then compared against the incoming user:

		for _, clusterRoleBinding := range clusterRoleBindings {
			subjectIndex, applies := appliesTo(user, clusterRoleBinding.Subjects, "")
			if !applies {
				continue
			}

The appliesToUser comparison function switches on the subject kind:

- for a normal user, compare the name

- for a group, check whether the user's groups contain the subject name

- for a ServiceAccount, compare with serviceaccount.MatchesUsername

func appliesToUser(user user.Info, subject rbacv1.Subject, namespace string) bool {
	switch subject.Kind {
	case rbacv1.UserKind:
		return user.GetName() == subject.Name

	case rbacv1.GroupKind:
		return has(user.GetGroups(), subject.Name)

	case rbacv1.ServiceAccountKind:
		// default the namespace to namespace we're working in if its available.  This allows rolebindings that reference
		// SAs in th local namespace to avoid having to qualify them.
		saNamespace := namespace
		if len(subject.Namespace) > 0 {
			saNamespace = subject.Namespace
		}
		if len(saNamespace) == 0 {
			return false
		}
		// use a more efficient comparison for RBAC checking
		return serviceaccount.MatchesUsername(saNamespace, subject.Name, user.GetName())
	default:
		return false
	}
}

serviceaccount.MatchesUsername compares the full service-account username system:serviceaccount:<namespace>:<name> piece by piece:

// MatchesUsername checks whether the provided username matches the namespace and name without
// allocating. Use this when checking a service account namespace and name against a known string.
func MatchesUsername(namespace, name string, username string) bool {
	if !strings.HasPrefix(username, ServiceAccountUsernamePrefix) {
		return false
	}
	username = username[len(ServiceAccountUsernamePrefix):]

	if !strings.HasPrefix(username, namespace) {
		return false
	}
	username = username[len(namespace):]

	if !strings.HasPrefix(username, ServiceAccountUsernameSeparator) {
		return false
	}
	username = username[len(ServiceAccountUsernameSeparator):]

	return username == name
}
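
A short usage sketch of the prefix-by-prefix comparison; the namespace/name/username values below are made up, and the import path is the apiserver staging package that exports MatchesUsername:

```go
package main

import (
	"fmt"

	"k8s.io/apiserver/pkg/authentication/serviceaccount"
)

func main() {
	// "system:serviceaccount:" + namespace + ":" + name is peeled off step by step.
	fmt.Println(serviceaccount.MatchesUsername("kube-system", "default",
		"system:serviceaccount:kube-system:default")) // true
	fmt.Println(serviceaccount.MatchesUsername("kube-system", "default",
		"system:serviceaccount:demo:default")) // false: namespace differs
	fmt.Println(serviceaccount.MatchesUsername("kube-system", "default",
		"tom@example.com")) // false: not a service-account username
}
```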

Next, the rules are fetched from the informer according to clusterRoleBinding.RoleRef:

			rules, err := r.GetRoleReferenceRules(clusterRoleBinding.RoleRef, "")

The rules are then iterated and visit is called with the matched clusterRoleBinding as the source:

			sourceDescriber.binding = clusterRoleBinding
			sourceDescriber.subject = &clusterRoleBinding.Subjects[subjectIndex]
			for i := range rules {
				if !visitor(sourceDescriber, &rules[i], nil) {
					return
				}
			}

Resource-type rules: compare the request's verb (along with group, resource and resource name, as shown in RuleAllows above) against the rule:

func VerbMatches(rule *rbacv1.PolicyRule, requestedVerb string) bool {
	for _, ruleVerb := range rule.Verbs {
		if ruleVerb == rbacv1.VerbAll {
			return true
		}
		if ruleVerb == requestedVerb {
			return true
		}
	}

	return false
}

Non-resource-type rules: compare the request's URL and verb against the rule:

func NonResourceURLMatches(rule *rbacv1.PolicyRule, requestedURL string) bool {
	for _, ruleURL := range rule.NonResourceURLs {
		if ruleURL == rbacv1.NonResourceAll {
			return true
		}
		if ruleURL == requestedURL {
			return true
		}
		if strings.HasSuffix(ruleURL, "*") && strings.HasPrefix(requestedURL, strings.TrimRight(ruleURL, "*")) {
			return true
		}
	}

	return false
}

RoleBindings are checked next.

The roleBinding list is obtained through an informer:

		if roleBindings, err := r.roleBindingLister.ListRoleBindings(namespace); err != nil {
			if !visitor(nil, nil, err) {
				return
			}

The subjects are compared in the same way:

- the same appliesTo function matches the subject, this time with the namespace

			for _, roleBinding := range roleBindings {
				subjectIndex, applies := appliesTo(user, roleBinding.Subjects, namespace)
				if !applies {
					continue
				}

The rules of the matched roleBinding are fetched via the informer:

				rules, err := r.GetRoleReferenceRules(roleBinding.RoleRef, namespace)
				if err != nil {
					if !visitor(nil, nil, err) {
						return
					}
					continue
				}

visit is called on each rule to match it; if a rule matches, allowed is set to true:

				sourceDescriber.binding = roleBinding
				sourceDescriber.subject = &roleBinding.Subjects[subjectIndex]
				for i := range rules {
					if !visitor(sourceDescriber, &rules[i], nil) {
						return
					}
				}

3.7 The audit feature and its source code

Audit overview

Auditing

Kubernetes auditing provides a security-relevant, chronological set of records documenting the actions taken by each user, by applications using the Kubernetes API, and by the control plane itself.

Auditing allows cluster administrators to answer the following questions:

- what happened?

- when did it happen?

- who initiated it?

- on what object(s) did it happen?

- where was it observed?

- from where was it initiated?

- what happened next?

Audit levels, from coarse to fine grained:

- None - events matching this rule are not recorded.

- Metadata - record request metadata (requesting user, timestamp, resource, verb, etc.) but not the request or response body.

- Request - record event metadata and the request body, but not the response body. Does not apply to non-resource requests.

- RequestResponse - record event metadata plus the request and response bodies. Does not apply to non-resource requests.

Introduction to auditing

Follow the documentation: https://kubernetes.io/zh/docs/tasks/debug-application-cluster/audit/

Reading the audit source code

Entry point: D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-apiserver\app\server.go, in buildGenericConfig:

	lastErr = s.Audit.ApplyTo(genericConfig)
	if lastErr != nil {
		return
	}

1. Load the audit policy from --audit-policy-file

You pass a file containing the policy to kube-apiserver with the --audit-policy-file flag. If the flag is not set, no events are recorded. The rules field must be present in the audit policy file; a policy with zero rules is treated as illegal.

	// 1. Build policy evaluator
	evaluator, err := o.newPolicyRuleEvaluator()
	if err != nil {
		return err
	}

2. Set up the log backend from --audit-log-path

	// 2. Build log backend
	var logBackend audit.Backend
	w, err := o.LogOptions.getWriter()
	if err != nil {
		return err
	}
	if w != nil {
		if evaluator == nil {
			klog.V(2).Info("No audit policy file provided, no events will be recorded for log backend")
		} else {
			logBackend = o.LogOptions.newBackend(w)
		}
	}

If --audit-log-path="-" is configured, events are written to standard output:

func (o *AuditLogOptions) getWriter() (io.Writer, error) {
	if !o.enabled() {
		return nil, nil
	}

	if o.Path == "-" {
		return os.Stdout, nil
	}

	if err := o.ensureLogFile(); err != nil {
		return nil, fmt.Errorf("ensureLogFile: %w", err)
	}

	return &lumberjack.Logger{
		Filename:   o.Path,
		MaxAge:     o.MaxAge,
		MaxBackups: o.MaxBackups,
		MaxSize:    o.MaxSize,
		Compress:   o.Compress,
	}, nil
}

ensureLogFile tries to open the log file once, to verify that it is usable.

The writer is backed by https://github.com/natefinch/lumberjack, a logging library with built-in rotation.

After obtaining the writer, the code checks whether an evaluator exists:

- if there is no evaluator, a hint is logged and no log backend is created

- otherwise the log backend is created around the writer w

	if w != nil {
		if evaluator == nil {
			klog.V(2).Info("No audit policy file provided, no events will be recorded for log backend")
		} else {
			logBackend = o.LogOptions.newBackend(w)
		}
	}

3. Build the webhook backend from the configuration

	// 3. Build webhook backend
	var webhookBackend audit.Backend
	if o.WebhookOptions.enabled() {
		if evaluator == nil {
			klog.V(2).Info("No audit policy file provided, no events will be recorded for webhook backend")
		} else {
			if c.EgressSelector != nil {
				var egressDialer utilnet.DialFunc
				egressDialer, err = c.EgressSelector.Lookup(egressselector.ControlPlane.AsNetworkContext())
				if err != nil {
					return err
				}
				webhookBackend, err = o.WebhookOptions.newUntruncatedBackend(egressDialer)
			} else {
				webhookBackend, err = o.WebhookOptions.newUntruncatedBackend(nil)
			}
			if err != nil {
				return err
			}
		}
	}

4. If a webhook backend exists, wrap it as the dynamicBackend

	// 4. Apply dynamic options.
	var dynamicBackend audit.Backend
	if webhookBackend != nil {
		// if only webhook is enabled wrap it in the truncate options
		dynamicBackend = o.WebhookOptions.TruncateOptions.wrapBackend(webhookBackend, groupVersion)
	}

5. Set the audit policy rule evaluator

	// 5. Set the policy rule evaluator
	c.AuditPolicyRuleEvaluator = evaluator

6. Union the logBackend and the dynamicBackend

	// 6. Join the log backend with the webhooks
	c.AuditBackend = appendBackend(logBackend, dynamicBackend)

func appendBackend(existing, newBackend audit.Backend) audit.Backend {
	if existing == nil {
		return newBackend
	}
	if newBackend == nil {
		return existing
	}
	return audit.Union(existing, newBackend)
}

7. The backend interfaces that eventually run

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\audit\types.go

type Sink interface {
	// ProcessEvents handles events. Per audit ID it might be that ProcessEvents is called up to three times.
	// Errors might be logged by the sink itself. If an error should be fatal, leading to an internal
	// error, ProcessEvents is supposed to panic. The event must not be mutated and is reused by the caller
	// after the call returns, i.e. the sink has to make a deepcopy to keep a copy around if necessary.
	// Returns true on success, may return false on error.
	ProcessEvents(events ...*auditinternal.Event) bool
}

type Backend interface {
	Sink

	// Run will initialize the backend. It must not block, but may run go routines in the background. If
	// stopCh is closed, it is supposed to stop them. Run will be called before the first call to ProcessEvents.
	Run(stopCh <-chan struct{}) error

	// Shutdown will synchronously shut down the backend while making sure that all pending
	// events are delivered. It can be assumed that this method is called after
	// the stopCh channel passed to the Run method has been closed.
	Shutdown()

	// Returns the backend PluginName.
	String() string
}
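
To illustrate the shape of these interfaces, here is a minimal, hypothetical backend that just prints a one-line summary per event to stdout. It is not an upstream plugin; it only shows which methods a backend has to provide.

```go
package main

import (
	"fmt"

	auditinternal "k8s.io/apiserver/pkg/apis/audit"
	"k8s.io/apiserver/pkg/audit"
)

// stdoutBackend is a toy audit backend: ProcessEvents prints a one-line summary,
// Run and Shutdown do nothing because there is no background work to manage.
type stdoutBackend struct{}

// Compile-time check that the toy type really satisfies audit.Backend.
var _ audit.Backend = stdoutBackend{}

func (stdoutBackend) ProcessEvents(events ...*auditinternal.Event) bool {
	for _, ev := range events {
		fmt.Printf("audit id=%s stage=%s verb=%s uri=%s\n", ev.AuditID, ev.Stage, ev.Verb, ev.RequestURI)
	}
	return true
}

func (stdoutBackend) Run(stopCh <-chan struct{}) error { return nil }
func (stdoutBackend) Shutdown()                        {}
func (stdoutBackend) String() string                   { return "stdout-demo" }

func main() {
	stdoutBackend{}.ProcessEvents(&auditinternal.Event{Verb: "get", RequestURI: "/api/v1/pods"})
}
```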

Eventually the backend's ProcessEvents method is called. Taking the log backend as an example, located at D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\plugin\pkg\audit\log\backend.go:

func (b *backend) ProcessEvents(events ...*auditinternal.Event) bool {
	success := true
	for _, ev := range events {
		success = b.logEvent(ev) && success
	}
	return success
}

func (b *backend) logEvent(ev *auditinternal.Event) bool {
	line := ""
	switch b.format {
	case FormatLegacy:
		line = audit.EventString(ev) + "\n"
	case FormatJson:
		bs, err := runtime.Encode(b.encoder, ev)
		if err != nil {
			audit.HandlePluginError(PluginName, err, ev)
			return false
		}
		line = string(bs[:])
	default:
		audit.HandlePluginError(PluginName, fmt.Errorf("log format %q is not in list of known formats (%s)",
			b.format, strings.Join(AllowedFormats, ",")), ev)
		return false
	}
	if _, err := fmt.Fprint(b.out, line); err != nil {
		audit.HandlePluginError(PluginName, err, ev)
		return false
	}
	return true
}

8. The handler invoked on the HTTP side

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\endpoints\filters\audit.go

// WithAudit decorates a http.Handler with audit logging information for all the
// requests coming to the server. Audit level is decided according to requests'
// attributes and audit policy. Logs are emitted to the audit sink to
// process events. If sink or audit policy is nil, no decoration takes place.
func WithAudit(handler http.Handler, sink audit.Sink, policy audit.PolicyRuleEvaluator, longRunningCheck request.LongRunningRequestCheck) http.Handler {
	if sink == nil || policy == nil {
		return handler
	}
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		auditContext, err := evaluatePolicyAndCreateAuditEvent(req, policy)
		if err != nil {
			utilruntime.HandleError(fmt.Errorf("failed to create audit event: %v", err))
			responsewriters.InternalError(w, req, errors.New("failed to create audit event"))
			return
		}

		ev := auditContext.Event
		if ev == nil || req.Context() == nil {
			handler.ServeHTTP(w, req)
			return
		}

		req = req.WithContext(audit.WithAuditContext(req.Context(), auditContext))

		ctx := req.Context()
		omitStages := auditContext.RequestAuditConfig.OmitStages

		ev.Stage = auditinternal.StageRequestReceived
		if processed := processAuditEvent(ctx, sink, ev, omitStages); !processed {
			audit.ApiserverAuditDroppedCounter.WithContext(ctx).Inc()
			responsewriters.InternalError(w, req, errors.New("failed to store audit event"))
			return
		}

		// intercept the status code
		var longRunningSink audit.Sink
		if longRunningCheck != nil {
			ri, _ := request.RequestInfoFrom(ctx)
			if longRunningCheck(req, ri) {
				longRunningSink = sink
			}
		}
		respWriter := decorateResponseWriter(ctx, w, ev, longRunningSink, omitStages)

		// send audit event when we leave this func, either via a panic or cleanly. In the case of long
		// running requests, this will be the second audit event.
		defer func() {
			if r := recover(); r != nil {
				defer panic(r)
				ev.Stage = auditinternal.StagePanic
				ev.ResponseStatus = &metav1.Status{
					Code:    http.StatusInternalServerError,
					Status:  metav1.StatusFailure,
					Reason:  metav1.StatusReasonInternalError,
					Message: fmt.Sprintf("APIServer panic'd: %v", r),
				}
				processAuditEvent(ctx, sink, ev, omitStages)
				return
			}

			// if no StageResponseStarted event was sent b/c neither a status code nor a body was sent, fake it here
			// But Audit-Id http header will only be sent when http.ResponseWriter.WriteHeader is called.
			fakedSuccessStatus := &metav1.Status{
				Code:    http.StatusOK,
				Status:  metav1.StatusSuccess,
				Message: "Connection closed early",
			}
			if ev.ResponseStatus == nil && longRunningSink != nil {
				ev.ResponseStatus = fakedSuccessStatus
				ev.Stage = auditinternal.StageResponseStarted
				processAuditEvent(ctx, longRunningSink, ev, omitStages)
			}

			ev.Stage = auditinternal.StageResponseComplete
			if ev.ResponseStatus == nil {
				ev.ResponseStatus = fakedSuccessStatus
			}
			processAuditEvent(ctx, sink, ev, omitStages)
		}()
		handler.ServeHTTP(respWriter, req)
	})
}
To watch the audit log file configured by --audit-log-path, e.g.:

tail -f /var/log/audit/audit.log

3.8 Admission controllers: function and source code

What are admission control plugins?

- An admission controller is a piece of code that intercepts requests to the API server after the request has been authenticated and authorized, but before the object is persisted.

- The admission process has two phases: in the first phase, mutating admission controllers run; in the second phase, validating admission controllers run. The controllers are compiled into the kube-apiserver binary and can only be configured by the cluster administrator.

- If any controller in either phase rejects the request, the whole request is rejected immediately and an error is returned to the end user.

Key points of this section:

The role of admission control plugins

- enabling advanced features

What are admission control plugins?

Documentation: https://kubernetes.io/zh/docs/reference/access-authn-authz/admission-controllers/

- An admission controller is a piece of code that intercepts requests to the API server after the request has been authenticated and authorized, but before the object is persisted.

- The admission process has two phases: in the first phase, mutating admission controllers run; in the second phase, validating admission controllers run. The controllers are compiled into the kube-apiserver binary and can only be configured by the cluster administrator.

- If any controller in either phase rejects the request, the whole request is rejected immediately and an error is returned to the end user.

Why are admission controllers needed?

- Many advanced features in Kubernetes require an admission controller to be enabled in order to properly support them.

- As a result, a Kubernetes API server that is not correctly configured with the right admission controllers is incomplete and will not support all the features you expect.

Classified by whether they can modify objects:

- Admission controllers may perform "validating" and/or "mutating" operations.

- Mutating controllers may modify the objects they admit; validating controllers may not.

Classified as static or dynamic:

- Static controllers have a single fixed function, e.g. AlwaysPullImages changes the image pull policy of every newly created Pod to Always.

- Dynamic admission is implemented by two special controllers: MutatingAdmissionWebhook and ValidatingAdmissionWebhook.

- They execute the mutating and validating admission webhooks that are configured in the API.

- In effect they let you call external HTTP services as admission control plugins.

Source code walkthrough

The entry point is in D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-apiserver\app\server.go:

	pluginInitializers, admissionPostStartHook, err = admissionConfig.New(proxyTransport, genericConfig.EgressSelector, serviceResolver, genericConfig.TracerProvider)

admissionConfig.New initializes the admission controller configuration. The New function sets up the plugin initializers admission needs, plus the post-start hook for webhook admission:

// New sets up the plugins and admission start hooks needed for admission
func (c *Config) New(proxyTransport *http.Transport, egressSelector *egressselector.EgressSelector, serviceResolver webhook.ServiceResolver, tp *trace.TracerProvider) ([]admission.PluginInitializer, genericapiserver.PostStartHookFunc, error) {
	webhookAuthResolverWrapper := webhook.NewDefaultAuthenticationInfoResolverWrapper(proxyTransport, egressSelector, c.LoopbackClientConfig, tp)
	webhookPluginInitializer := webhookinit.NewPluginInitializer(webhookAuthResolverWrapper, serviceResolver)

	var cloudConfig []byte
	if c.CloudConfigFile != "" {
		var err error
		cloudConfig, err = ioutil.ReadFile(c.CloudConfigFile)
		if err != nil {
			klog.Fatalf("Error reading from cloud configuration file %s: %#v", c.CloudConfigFile, err)
		}
	}
	clientset, err := kubernetes.NewForConfig(c.LoopbackClientConfig)
	if err != nil {
		return nil, nil, err
	}

	discoveryClient := cacheddiscovery.NewMemCacheClient(clientset.Discovery())
	discoveryRESTMapper := restmapper.NewDeferredDiscoveryRESTMapper(discoveryClient)
	kubePluginInitializer := NewPluginInitializer(
		cloudConfig,
		discoveryRESTMapper,
		quotainstall.NewQuotaConfigurationForAdmission(),
	)

	admissionPostStartHook := func(context genericapiserver.PostStartHookContext) error {
		discoveryRESTMapper.Reset()
		go utilwait.Until(discoveryRESTMapper.Reset, 30*time.Second, context.StopCh)
		return nil
	}

	return []admission.PluginInitializer{webhookPluginInitializer, kubePluginInitializer}, admissionPostStartHook, nil
}

The admission initializer used here is PluginInitializer, located at D:\Workspace\Go\src\k8s.io\kubernetes\pkg\kubeapiserver\admission\initializer.go.

It has a corresponding Initialize method whose job is to supply initialization data to each plugin:

// PluginInitializer is used for initialization of the Kubernetes specific admission plugins.
type PluginInitializer struct {
	cloudConfig        []byte
	restMapper         meta.RESTMapper
	quotaConfiguration quota.Configuration
}

// Initialize checks the initialization interfaces implemented by each plugin
// and provide the appropriate initialization data
func (i *PluginInitializer) Initialize(plugin admission.Interface) {
	if wants, ok := plugin.(WantsCloudConfig); ok {
		wants.SetCloudConfig(i.cloudConfig)
	}

	if wants, ok := plugin.(WantsRESTMapper); ok {
		wants.SetRESTMapper(i.restMapper)
	}

	if wants, ok := plugin.(initializer.WantsQuotaConfiguration); ok {
		wants.SetQuotaConfiguration(i.quotaConfiguration)
	}
}

It also initializes the quota admission configuration:

	kubePluginInitializer := NewPluginInitializer(
		cloudConfig,
		discoveryRESTMapper,
		quotainstall.NewQuotaConfigurationForAdmission(),
	)

It also builds a post-start hook that resets discoveryRESTMapper every 30 seconds, refreshing the internally cached discovery data:

	admissionPostStartHook := func(context genericapiserver.PostStartHookContext) error {
		discoveryRESTMapper.Reset()
		go utilwait.Until(discoveryRESTMapper.Reset, 30*time.Second, context.StopCh)
		return nil
	}

s.Admission.ApplyTo initializes admission control:

	err = s.Admission.ApplyTo(
		genericConfig,
		versionedInformers,
		kubeClientConfig,
		utilfeature.DefaultFeatureGate,
		pluginInitializers...)

The enabled and disabled plugin lists are computed from the explicitly passed controllers and the recommended plugin order:

	if a.PluginNames != nil {
		// pass PluginNames to generic AdmissionOptions
		a.GenericAdmission.EnablePlugins, a.GenericAdmission.DisablePlugins = computePluginNames(a.PluginNames, a.GenericAdmission.RecommendedPluginOrder)
	}

PluginNames holds what was passed via the deprecated --admission-control flag; a.GenericAdmission.RecommendedPluginOrder is the official full list, AllOrderedPlugins, located at D:\Workspace\Go\src\k8s.io\kubernetes\pkg\kubeapiserver\options\plugins.go.

computePluginNames takes the set difference to obtain the enabled and disabled lists:

// explicitly disable all plugins that are not in the enabled list
func computePluginNames(explicitlyEnabled []string, all []string) (enabled []string, disabled []string) {
	return explicitlyEnabled, sets.NewString(all...).Difference(sets.NewString(explicitlyEnabled...)).List()
}

Analysis of the underlying ApplyTo

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\options\admission.go

func (a *AdmissionOptions) ApplyTo(){}

It computes the list of plugins that will actually be enabled from the disabled, enabled and recommended plugin lists:

// enabledPluginNames makes use of RecommendedPluginOrder, DefaultOffPlugins,
// EnablePlugins, DisablePlugins fields
// to prepare a list of ordered plugin names that are enabled.
func (a *AdmissionOptions) enabledPluginNames() []string {
	allOffPlugins := append(a.DefaultOffPlugins.List(), a.DisablePlugins...)
	disabledPlugins := sets.NewString(allOffPlugins...)
	enabledPlugins := sets.NewString(a.EnablePlugins...)
	disabledPlugins = disabledPlugins.Difference(enabledPlugins)

	orderedPlugins := []string{}
	for _, plugin := range a.RecommendedPluginOrder {
		if !disabledPlugins.Has(plugin) {
			orderedPlugins = append(orderedPlugins, plugin)
		}
	}

	return orderedPlugins
}
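
A small, self-contained worked example of this set arithmetic, using made-up plugin names; it re-implements the ordering logic locally with the apimachinery sets helper rather than calling the unexported method:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

func main() {
	recommended := []string{"NamespaceLifecycle", "LimitRanger", "ServiceAccount", "PodSecurity", "ResourceQuota"}
	defaultOff := []string{"PodSecurity"}
	enable := []string{"PodSecurity"}
	disable := []string{"LimitRanger"}

	// disabled = (defaultOff + disable) - enable
	disabled := sets.NewString(append(append([]string{}, defaultOff...), disable...)...).
		Difference(sets.NewString(enable...))

	// Keep the recommended order, dropping anything disabled.
	ordered := []string{}
	for _, p := range recommended {
		if !disabled.Has(p) {
			ordered = append(ordered, p)
		}
	}
	fmt.Println(ordered) // [NamespaceLifecycle ServiceAccount PodSecurity ResourceQuota]
}
```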

The plugin configuration is read from the file given by --admission-control-config-file:

	pluginsConfigProvider, err := admission.ReadAdmissionConfiguration(pluginNames, a.ConfigFile, configScheme)
	if err != nil {
		return fmt.Errorf("failed to read plugin config: %v", err)
	}

Initialize the genericInitializer:

	clientset, err := kubernetes.NewForConfig(kubeAPIServerClientConfig)
	if err != nil {
		return err
	}
	genericInitializer := initializer.New(clientset, informers, c.Authorization.Authorizer, features)
	initializersChain := admission.PluginInitializers{}
	pluginInitializers = append(pluginInitializers, genericInitializer)
	initializersChain = append(initializersChain, pluginInitializers...)

NewFromPlugins instantiates all enabled admission plugins

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\admission\plugins.go

It iterates over the plugins and calls InitPlugin to create each instance:

for _, pluginName := range pluginNames {
		pluginConfig, err := configProvider.ConfigFor(pluginName)
		if err != nil {
			return nil, err
		}

		plugin, err := ps.InitPlugin(pluginName, pluginConfig, pluginInitializer)
		if err != nil {
			return nil, err
		}
		if plugin != nil {
			if decorator != nil {
				handlers = append(handlers, decorator.Decorate(plugin, pluginName))
			} else {
				handlers = append(handlers, plugin)
			}

			if _, ok := plugin.(MutationInterface); ok {
				mutationPlugins = append(mutationPlugins, pluginName)
			}
			if _, ok := plugin.(ValidationInterface); ok {
				validationPlugins = append(validationPlugins, pluginName)
			}
		}
	}

InitPlugin

- calls getPlugin to fetch the plugin instance from Plugins:

// InitPlugin creates an instance of the named interface.
func (ps *Plugins) InitPlugin(name string, config io.Reader, pluginInitializer PluginInitializer) (Interface, error) {
	if name == "" {
		klog.Info("No admission plugin specified.")
		return nil, nil
	}

	plugin, found, err := ps.getPlugin(name, config)
	if err != nil {
		return nil, fmt.Errorf("couldn't init admission plugin %q: %v", name, err)
	}
	if !found {
		return nil, fmt.Errorf("unknown admission plugin: %s", name)
	}

	pluginInitializer.Initialize(plugin)
	// ensure that plugins have been properly initialized
	if err := ValidateInitialization(plugin); err != nil {
		return nil, fmt.Errorf("failed to initialize admission plugin %q: %v", name, err)
	}

	return plugin, nil
}

getPlugin

// getPlugin creates an instance of the named plugin.  It returns `false` if the
// the name is not known. The error is returned only when the named provider was
// known but failed to initialize.  The config parameter specifies the io.Reader
// handler of the configuration file for the cloud provider, or nil for no configuration.
func (ps *Plugins) getPlugin(name string, config io.Reader) (Interface, bool, error) {
	ps.lock.Lock()
	defer ps.lock.Unlock()
	f, found := ps.registry[name]
	if !found {
		return nil, false, nil
	}

	config1, config2, err := splitStream(config)
	if err != nil {
		return nil, true, err
	}
	if !PluginEnabledFn(name, config1) {
		return nil, true, nil
	}

	ret, err := f(config2)
	return ret, true, err
}

The key step is looking up the plugin in the ps.registry map, whose values are factory functions:

// Factory is a function that returns an Interface for admission decisions.
// The config parameter provides an io.Reader handler to the factory in
// order to load specific configurations. If no configuration is provided
// the parameter is nil.
type Factory func(config io.Reader) (Interface, error)

type Plugins struct {
	lock     sync.Mutex
	registry map[string]Factory
}

Tracing back, these factory functions are registered in RegisterAllAdmissionPlugins:

D:\Workspace\Go\src\k8s.io\kubernetes\pkg\kubeapiserver\options\plugins.go

// RegisterAllAdmissionPlugins registers all admission plugins.
// The order of registration is irrelevant, see AllOrderedPlugins for execution order.
func RegisterAllAdmissionPlugins(plugins *admission.Plugins) {
	admit.Register(plugins) // DEPRECATED as no real meaning
	alwayspullimages.Register(plugins)
	antiaffinity.Register(plugins)
	defaulttolerationseconds.Register(plugins)
	defaultingressclass.Register(plugins)
	denyserviceexternalips.Register(plugins)
	deny.Register(plugins) // DEPRECATED as no real meaning
	eventratelimit.Register(plugins)
	extendedresourcetoleration.Register(plugins)
	gc.Register(plugins)
	imagepolicy.Register(plugins)
	limitranger.Register(plugins)
	autoprovision.Register(plugins)
	exists.Register(plugins)
	noderestriction.Register(plugins)
	nodetaint.Register(plugins)
	label.Register(plugins) // DEPRECATED, future PVs should not rely on labels for zone topology
	podnodeselector.Register(plugins)
	podtolerationrestriction.Register(plugins)
	runtimeclass.Register(plugins)
	resourcequota.Register(plugins)
	podsecurity.Register(plugins)
	podsecuritypolicy.Register(plugins)
	podpriority.Register(plugins)
	scdeny.Register(plugins)
	serviceaccount.Register(plugins)
	setdefault.Register(plugins)
	resize.Register(plugins)
	storageobjectinuseprotection.Register(plugins)
	certapproval.Register(plugins)
	certsigning.Register(plugins)
	certsubjectrestriction.Register(plugins)
}
On a running cluster you can check which admission flags the apiserver was started with:

ps -ef |grep apiserver |grep admission-control

Take alwayspullimages.Register(plugins) as an example.

- The corresponding factory function is:

D:\Workspace\Go\src\k8s.io\kubernetes\plugin\pkg\admission\alwayspullimages\admission.go

// Register registers a plugin
func Register(plugins *admission.Plugins) {
	plugins.Register(PluginName, func(config io.Reader) (admission.Interface, error) {
		return NewAlwaysPullImages(), nil
	})
}

which simply constructs an AlwaysPullImages handler:

// NewAlwaysPullImages creates a new always pull images admission control handler
func NewAlwaysPullImages() *AlwaysPullImages {
	return &AlwaysPullImages{
		Handler: admission.NewHandler(admission.Create, admission.Update),
	}
}

The plugin therefore has a mutating admission method, Admit, which modifies the object:

// Admit makes an admission decision based on the request attributes
func (a *AlwaysPullImages) Admit(ctx context.Context, attributes admission.Attributes, o admission.ObjectInterfaces) (err error) {
	// Ignore all calls to subresources or resources other than pods.
	if shouldIgnore(attributes) {
		return nil
	}
	pod, ok := attributes.GetObject().(*api.Pod)
	if !ok {
		return apierrors.NewBadRequest("Resource was marked with kind Pod but was unable to be converted")
	}

	pods.VisitContainersWithPath(&pod.Spec, field.NewPath("spec"), func(c *api.Container, _ *field.Path) bool {
		c.ImagePullPolicy = api.PullAlways
		return true
	})

	return nil
}

The code above sets the Pod's ImagePullPolicy to api.PullAlways.

Documentation: https://kubernetes.io/zh/docs/reference/access-authn-authz/admission-controllers/#alwayspullimages

- This admission controller changes the image pull policy of every newly created Pod to Always.

- This is useful in a multi-tenant cluster, so users can be assured that their private images can only be used by those who have the credentials to pull them.

- Without this admission controller, once an image has been pulled to a node, any Pod from any user can use it simply by knowing the image's name (assuming the Pod is scheduled onto the right node), without any authorization check against the image.

- With this admission controller enabled, images are always pulled before containers start, which means valid credentials are required.

There is also a corresponding validating method, Validate:

// Validate makes sure that all containers are set to always pull images
func (*AlwaysPullImages) Validate(ctx context.Context, attributes admission.Attributes, o admission.ObjectInterfaces) (err error) {
	if shouldIgnore(attributes) {
		return nil
	}

	pod, ok := attributes.GetObject().(*api.Pod)
	if !ok {
		return apierrors.NewBadRequest("Resource was marked with kind Pod but was unable to be converted")
	}

	var allErrs []error
	pods.VisitContainersWithPath(&pod.Spec, field.NewPath("spec"), func(c *api.Container, p *field.Path) bool {
		if c.ImagePullPolicy != api.PullAlways {
			allErrs = append(allErrs, admission.NewForbidden(attributes,
				field.NotSupported(p.Child("imagePullPolicy"), c.ImagePullPolicy, []string{string(api.PullAlways)}),
			))
		}
		return true
	})
	if len(allErrs) > 0 {
		return utilerrors.NewAggregate(allErrs)
	}

	return nil
}

Chapter 4: Writing a custom admission controller that injects an nginx sidecar

4.1 Requirements analysis for the custom admission controller

Goal: write an admission controller that automatically injects an nginx sidecar into Pods.

- write the admission controller and run it

- the end result is that every application Pod in the specified namespaces gets a simple nginx sidecar injected

How Istio auto-injects Envoy

https://istio.io/latest/img/service-mesh.svg

- The popular service mesh Istio uses the Kubernetes apiserver's mutating webhooks to automatically inject the Envoy sidecar container into Pods; see https://istio.io/docs/setup/kubernetes/sidecar-injection/.

- To take advantage of all of Istio's features, Pods in the mesh must run the Istio sidecar proxy.

- When enabled in a Pod's namespace, automatic injection uses an admission controller to inject the proxy configuration at Pod creation time, so your Pod ends up with Envoy running next to it.
Workflow

- Check that the admission webhook controllers are enabled in the cluster and configure them if needed.

- Write the mutating webhook code:
  - start a TLS HTTP server
  - implement the /mutate endpoint:
    - when a user creates/updates a Pod,
    - the apiserver calls this mutating webhook, which patches the Pod spec to add the nginx sidecar container (see the JSON patch sketch after this list),
    - and returns the patch to the apiserver, achieving the injection.

- Create the certificates and have them CA-signed.

- Create the MutatingWebhookConfiguration.

- Deploy a service and verify the injection result.
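
The mutation itself is expressed as a JSON patch in the AdmissionReview response. Below is a minimal, hypothetical sketch of how such a patch for appending sidecar containers could be built; the patchOperation struct and the addContainers helper are names invented for this example, not the project's final code.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// patchOperation is one entry of an RFC 6902 JSON patch.
type patchOperation struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// addContainers appends sidecar containers to /spec/containers of a Pod.
// When the target list is empty, the first "add" must create the whole array.
func addContainers(existing, added []corev1.Container, basePath string) []patchOperation {
	var patches []patchOperation
	first := len(existing) == 0
	for _, c := range added {
		if first {
			patches = append(patches, patchOperation{Op: "add", Path: basePath, Value: []corev1.Container{c}})
			first = false
			continue
		}
		patches = append(patches, patchOperation{Op: "add", Path: basePath + "/-", Value: c})
	}
	return patches
}

func main() {
	sidecar := corev1.Container{Name: "sidecar-nginx", Image: "nginx:1.12.2"}
	patches := addContainers([]corev1.Container{{Name: "app"}}, []corev1.Container{sidecar}, "/spec/containers")
	out, _ := json.Marshal(patches)
	fmt.Println(string(out)) // [{"op":"add","path":"/spec/containers/-","value":{...}}]
}
```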

4.2 Checking the cluster's admission configuration and other preparation

Preparation

- checking the cluster

- creating the new project kube-mutating-webhook-inject-pod

Key points of this section:

- checking the cluster

- creating the new project kube-mutating-webhook-inject-pod

Checking the cluster

Check whether the admission registration API is enabled in the cluster:

- run kubectl api-versions |grep admission

- if you see the following output, it is enabled:

kubectl api-versions |grep admission
admissionregistration.k8s.io/v1

Check that the MutatingAdmissionWebhook and ValidatingAdmissionWebhook admission plugins are enabled in the apiserver.

- On versions 1.20 and later they are enabled by default; you can confirm via the --enable-admission-plugins help text:

/usr/local/bin/kube-apiserver -h | grep enable-admission-plugins
    --admission-control strings             Admission is divided into two phases. In the first phase, only mutating admission plugins run. In the second phase, only validating admission plugins run.
    --enable-admission-plugins strings      admission plugins that should be enabled in addition to default enabled ones

Writing the webhook
- Create the new project kube-mutating-webhook-inject-pod:

go mod init kube-mutating-webhook-inject-pod

Designing the sidecar injection configuration file
- Since we are injecting a container, we need its configuration, so we reuse the containers section of the Kubernetes pod YAML.

- We also need to mount the injected container's configuration, so we reuse the volumes section of the pod YAML as well.
- Create config.yaml as follows:

```yaml
containers:
  - name: sidecar-nginx
    image: nginx:1.12.2
    imagePullPolicy: IfNotPresent
    ports:
      - containerPort: 80
    volumeMounts:
      - name: nginx-conf
        mountPath: /etc/nginx
volumes:
  - name: nginx-conf
    configMap:
      name: nginx-configmap
```

The corresponding Go code
Create pkg/webhook.go:

package main

import (
	corev1 "k8s.io/api/core/v1"
)

// Config holds the container and volume definitions to be injected into pods.
type Config struct {
	Containers []corev1.Container `yaml:"containers"`
	Volumes    []corev1.Volume    `yaml:"volumes"`
}

The function that parses the configuration file:

func loadConfig(configFile string) (*Config, error) {
	data, err := ioutil.ReadFile(configFile)
	if err != nil {
		return nil, err
	}
	glog.Infof("New configuration: sha256sum %x", sha256.Sum256(data))
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

编写webhook server的配置

// Webhook Server options
type webHookSvrOptions struct {
	port           int    // 监听https的端口
	certFile       string // https x509 证书路径
	keyFile        string // https x509 证书私钥路径
	sidecarCfgFile string // 注入sidecar容器的配置文件路径
}

在main中通过命令行传入默认值并解析

package main

import (
	"flag"

	"github.com/golang/glog"
)

func main() {
	var runOption webHookSvrOptions

	// get command line parameters
	flag.IntVar(&runOption.port, "port", 8443, "Webhook server port.")
	flag.StringVar(&runOption.certFile, "tlsCertFile", "/etc/webhook/certs/cert.pem", "File containing the x509 Certificate for HTTPS.")
	flag.StringVar(&runOption.keyFile, "tlsKeyFile", "/etc/webhook/certs/key.pem", "File containing the x509 private key to --tlsCertFile.")
	//flag.StringVar(&runOption.sidecarCfgFile,  "sidecarCfgFile", "/etc/webhook/config/sidecarconfig.yaml", "File containing the mutation configuration.")
	flag.StringVar(&runOption.sidecarCfgFile, "sidecarCfgFile", "config.yaml", "File containing the mutation configuration.")
	flag.Parse()

	sidecarConfig, err := loadConfig(runOption.sidecarCfgFile)
	if err != nil {
		glog.Errorf("Failed to load configuration: %v", err)
		return
	}
	glog.Infof("[sidecarConfig:%v]", sidecarConfig)
}

加载tls x509证书

    pair, err := tls.LoadX509KeyPair(runOption.certFile, runOption.keyFile)
	if err != nil {
	    glog.Errorf("Failed to load key pair: %v", err)
        return
	}

Define the webhook HTTP server type and construct it.

In webhook.go:

type webhookServer struct {
	sidecarConfig *Config      // sidecar injection configuration
	server        *http.Server // the underlying HTTP server
}

main中

	webhooksvr := &webhookServer{
		sidecarConfig: sidecarConfig,
		server: &http.Server{
			Addr:      fmt.Sprintf(":%v", runOption.port),
			TLSConfig: &tls.Config{Certificates: []tls.Certificate{pair}},
		},
	}

Add the webhookServer's mutate handler and wire up the path.

In webhook.go:

// serveMutate is the HTTP handler for the /mutate endpoint.
func (ws *webhookServer) serveMutate(w http.ResponseWriter, r *http.Request) {

}

In main.go:

	mux := http.NewServeMux()
	mux.HandleFunc("/mutate", webhooksvr.serveMutate)
	webhooksvr.server.Handler = mux

	// start webhook server in a new goroutine
	go func() {
		if err := webhooksvr.server.ListenAndServeTLS("", ""); err != nil {
			glog.Errorf("Failed to listen and serve webhook server: %v", err)
		}
	}()

This means requests to /mutate are handled by webhooksvr.serveMutate.
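For local debugging it can help to hit /mutate by hand before wiring everything into the cluster. Below is a minimal sketch, assuming the server runs locally on the default port 8443 with its self-signed certificate; the file name and the pod payload are made up for illustration and are not part of the project.

```go
// local_mutate_client.go — a hypothetical smoke-test client for the /mutate endpoint.
package main

import (
	"bytes"
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// A hand-written AdmissionReview request wrapping a tiny pod object.
	body := []byte(`{
	  "apiVersion": "admission.k8s.io/v1beta1",
	  "kind": "AdmissionReview",
	  "request": {
	    "uid": "test-uid",
	    "object": {
	      "apiVersion": "v1",
	      "kind": "Pod",
	      "metadata": {"name": "demo", "annotations": {"sidecar-injector-webhook.xiaoyi/need_inject": "true"}},
	      "spec": {"containers": [{"name": "app", "image": "alpine"}]}
	    }
	  }
	}`)

	// The webhook serves a self-signed certificate locally, so skip verification here only.
	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}}
	resp, err := client.Post("https://localhost:8443/mutate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(out)) // expect an AdmissionReview whose response carries a base64-encoded JSON patch
}
```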

Listen for shutdown signals in main:

	// listening for OS shutdown signals
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, syscall.SIGINT, syscall.SIGTERM)
	<-signalChan

	glog.Infof("Got OS shutdown signal, shutting down webhook server gracefully...")
	webhooksvr.server.Shutdown(context.Background())

This section implements the mutating admission webhook part of the overall admission flow.

代码仓库地址GitHub - yunixiangfeng/k8s-exercise

k8s-exercise/kube-mutating-webhook-inject-pod at main · yunixiangfeng/k8s-exercise · GitHub

4.3 Writing the mutatePod injection function

Writing serveMutate:

- validate the admission request parameters;

- decide from the annotations whether the sidecar needs to be injected;

- write the mutatePod injection function;

- write the patch-generating functions for the injected container and volume.

serveMutate
Basic request validation:
- in the serveMutate method,
- check that the body is not empty,
- and that the Content-Type request header is application/json.

// serveMutate is the webhookServer's mutate handler
func (ws *webhookServer) serveMutate(w http.ResponseWriter, r *http.Request) {
	var body []byte
	if r.Body != nil {
		if data, err := ioutil.ReadAll(r.Body); err == nil {
			body = data
		}
	}
	if len(body) == 0 {
		glog.Error("empty body")
		http.Error(w, "empty body", http.StatusBadRequest)
		return
	}
	// verify the content type is accurate
	contentType := r.Header.Get("Content-Type")
	if contentType != "application/json" {
		glog.Errorf("Content-Type=%s, expect application/json", contentType)
		http.Error(w, "invalid Content-Type, expect `application/json`", http.StatusUnsupportedMediaType)
		return
	}
}

准入控制请求参数校验
- 构造准入控制的审查对象包括请求和响应
- 然后使用UniversalDeserializer解析传入的申请
- 如果出错就设置响应为报错的信息
- 没出错就调用mutatePod生成响应

	// 构造准入控制器的响应
	var admissionResponse *v1beta1.AdmissionResponse
	// 构造准入控制的审查对象 包括请求和响应
	// 然后使用UniversalDeserializer解析传入的申请
	// 如果出错就设置响应为报错的信息
	// 没出错就调用mutatePod生成响应
	ar := v1beta1.AdmissionReview{}
	if _, _, err := deserializer.Decode(body, nil, &ar); err != nil {
		glog.Errorf("Can't decode body: %v", err)
		admissionResponse = &v1beta1.AdmissionResponse{
			Result: &metav1.Status{
				Message: err.Error(),
			},
		}
	} else {
		admissionResponse = ws.mutatePod(&ar)
	}

解析器使用UniversalDeserializer

D:\Workspace\Go\pkg\mod\k8s.io\apimachinery@v0.24.2\pkg\runtime\serializer\codec_factory.go

import (
	"crypto/sha256"
	"io/ioutil"
	"net/http"

	"github.com/golang/glog"
	"gopkg.in/yaml.v2"
	"k8s.io/api/admission/v1beta1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/serializer"
)

var (
	runtimeScheme = runtime.NewScheme()
	codecs        = serializer.NewCodecFactory(runtimeScheme)
	deserializer  = codecs.UniversalDeserializer()
	// (https://github.com/kubernetes/kubernetes/issues/57982)
	defaulter = runtime.ObjectDefaulter(runtimeScheme)
)

Writing the response
- Build the final response object admissionReview.
- Assign the response (and copy the request UID).
- Marshal it to JSON and write it with w.Write.

	// build the final admissionReview response object
	// assign the response
	// marshal to JSON and write it with w.Write
	admissionReview := v1beta1.AdmissionReview{}
	if admissionResponse != nil {
		admissionReview.Response = admissionResponse
		if ar.Request != nil {
			admissionReview.Response.UID = ar.Request.UID
		}
	}

	resp, err := json.Marshal(admissionReview)
	if err != nil {
		glog.Errorf("Can't encode response: %v", err)
		http.Error(w, fmt.Sprintf("could not encode response: %v", err),
			http.StatusInternalServerError)
		return
	}
	glog.Infof("Ready to write reponse ...")
	if _, err := w.Write(resp); err != nil {
		glog.Errorf("Can't write response: %v", err)
		http.Error(w, fmt.Sprintf("could not write response: %v", err), http.StatusInternalServerError)
	}

Writing the mutatePod injection function
- Parse the object in the request into a pod; return an error response on failure.

func (ws *webhookServer) mutatePod(ar *v1beta1.AdmissionReview) *v1beta1.AdmissionResponse {
	// 将请求中的对象解析为pod,如果出错就返回
	req := ar.Request
	var pod corev1.Pod
	if err := json.Unmarshal(req.Object.Raw, &pod); err != nil {
		glog.Errorf("Could not unmarshal raw object: %v", err)
		return &v1beta1.AdmissionResponse{
			Result: &metav1.Status{
				Message: err.Error(),
			},
		}
	}
	
}

Decide whether injection is needed:

	// decide whether this pod needs injection
	if !mutationRequired(ignoredNamespaces, &pod.ObjectMeta) {
		glog.Infof("Skipping mutation for %s/%s due to policy check", pod.Namespace, pod.Name)
		return &v1beta1.AdmissionResponse{
			Allowed: true,
		}
	}

The mutationRequired helper decides whether this pod should be injected:

1. if the pod is in a privileged system namespace, do not inject;
2. if the pod's annotations mark it as already injected, do not inject again;

3. if the pod's annotations do not opt in to injection, do not inject.

// mutationRequired decides whether this pod resource should be injected:
// 1. pods in the special system namespaces are never injected
// 2. pods already marked as injected are not injected again
// 3. pods that do not opt in via annotation are not injected
func mutationRequired(ignoredList []string, metadata *metav1.ObjectMeta) bool {
	// skip special kubernetes system namespaces
	for _, namespace := range ignoredList {
		if metadata.Namespace == namespace {
			glog.Infof("skip mutation for %v for it's in special namespace:%v", metadata.Name, metadata.Namespace)
			return false
		}
	}

	annotations := metadata.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}

	// if the annotation marks the pod as already injected, do not inject again
	status := annotations[admissionWebhookAnnotationStatusKey]
	if strings.ToLower(status) == "injected" {
		return false
	}
	// only inject when the pod explicitly opts in via the annotation
	switch strings.ToLower(annotations[admissionWebhookAnnotationInjectKey]) {
	case "true":
		return true
	default:
		return false
	}
}

The related constants:

const (
	// whether this pod wants injection; "true" means inject
	admissionWebhookAnnotationInjectKey = "sidecar-injector-webhook.xiaoyi/need_inject"
	// marks a pod as already injected; "injected" means it will not be injected again
	admissionWebhookAnnotationStatusKey = "sidecar-injector-webhook.xiaoyi/status"
)

// for safety, never inject sidecars into pods in these two namespaces
var ignoredNamespaces = []string{
	metav1.NamespaceSystem,
	metav1.NamespacePublic,
}
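A small table-driven test sketch (not part of the original project) can make the three rules above concrete. It assumes it sits in the same package as webhook.go and uses the corrected switch logic shown above.

```go
// webhook_test.go — illustrative test sketch for mutationRequired.
package main

import (
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestMutationRequired(t *testing.T) {
	cases := []struct {
		name string
		meta metav1.ObjectMeta
		want bool
	}{
		{"opt-in pod is injected", metav1.ObjectMeta{
			Namespace:   "nginx-injection",
			Annotations: map[string]string{admissionWebhookAnnotationInjectKey: "true"},
		}, true},
		{"kube-system is always skipped", metav1.ObjectMeta{
			Namespace:   metav1.NamespaceSystem,
			Annotations: map[string]string{admissionWebhookAnnotationInjectKey: "true"},
		}, false},
		{"already injected pod is skipped", metav1.ObjectMeta{
			Namespace: "nginx-injection",
			Annotations: map[string]string{
				admissionWebhookAnnotationInjectKey: "true",
				admissionWebhookAnnotationStatusKey: "injected",
			},
		}, false},
		{"pod without the annotation is skipped", metav1.ObjectMeta{Namespace: "nginx-injection"}, false},
	}
	for _, c := range cases {
		if got := mutationRequired(ignoredNamespaces, &c.meta); got != c.want {
			t.Errorf("%s: got %v, want %v", c.name, got, c.want)
		}
	}
}
```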

Apply default values (workaround)
https://github.com/kubernetes/kubernetes/pull/58025

defaulter = runtime.ObjectDefaulter(runtimeScheme)

func applyDefaultsWorkaround(containers []corev1.Container, volumes []corev1.Volume) {
	defaulter.Default(&corev1.Pod{
		Spec: corev1.PodSpec{
			Containers: containers,
			Volumes:    volumes,
		},
	})
}

Define patchOperation:

type patchOperation struct {
	Op    string      `json:"op"`              // the JSON patch operation
	Path  string      `json:"path"`            // the path to operate on
	Value interface{} `json:"value,omitempty"` // the value
}
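To see what this struct turns into on the wire, here is an illustrative, self-contained sketch (not part of the project; the type is repeated only so the snippet compiles on its own) that marshals two operations into the RFC 6902 JSON Patch the apiserver ultimately applies.

```go
// patch_example.go — shows the JSON produced by a slice of patchOperation values.
package main

import (
	"encoding/json"
	"fmt"
)

type patchOperation struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

func main() {
	patch := []patchOperation{
		// appending to an existing containers list uses the "/-" suffix
		{Op: "add", Path: "/spec/containers/-", Value: map[string]string{"name": "sidecar-nginx", "image": "nginx:1.12.2"}},
		// marking the pod as injected
		{Op: "add", Path: "/metadata/annotations", Value: map[string]string{"sidecar-injector-webhook.xiaoyi/status": "injected"}},
	}
	out, _ := json.MarshalIndent(patch, "", "  ")
	fmt.Println(string(out))
}
```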

The patch-generating function for containers:

// addContainer builds the patch operations that append the containers in `added` to `target`.
// If the target list is empty, the first patch creates the whole list at basePath;
// otherwise each element is appended by adding "/-" to the path.
func addContainer(target, added []corev1.Container, basePath string) (patch []patchOperation) {
	first := len(target) == 0
	var value interface{}
	for _, add := range added {
		value = add
		path := basePath
		if first {
			first = false
			value = []corev1.Container{add}
		} else {
			path = path + "/-"
		}
		patch = append(patch, patchOperation{
			Op:    "add",
			Path:  path,
			Value: value,
		})
	}
	return patch
}

The patch-generating function for volumes:

func addVolume(target, added []corev1.Volume, basePath string) (patch []patchOperation) {
	first := len(target) == 0
	var value interface{}
	for _, add := range added {
		value = add
		path := basePath
		if first {
			first = false
			value = []corev1.Volume{add}
		} else {
			path = path + "/-"
		}
		patch = append(patch, patchOperation{
			Op:    "add",
			Path:  path,
			Value: value,
		})
	}
	return patch
}

The patch that updates annotations:

func updateAnnotation(target map[string]string, added map[string]string) (patch []patchOperation) {
	for key, value := range added {
		if target == nil || target[key] == "" {
			target = map[string]string{}
			patch = append(patch, patchOperation{
				Op: "add",
				Path: "/metadata/annotations",
				Value: map[string]string{
					key: value,
				},
			})
		} else {
			patch = append(patch, patchOperation{
				Op: "replace",
				Path: "/metadata/annotations/" + key,
				Value: value,
			})	
		}
	}
	return patch	
}

The final patch assembly:

func createPatch(pod *corev1.Pod, sidecarConfig *Config, annotations map[string]string) ([]byte, error) {
	var patch []patchOperation

	patch = append(patch, addContainer(pod.Spec.Containers, sidecarConfig.Containers, "/spec/containers")...)
	patch = append(patch, addVolume(pod.Spec.Volumes, sidecarConfig.Volumes, "/spec/volumes")...)
	patch = append(patch, updateAnnotation(pod.Annotations, annotations)...)
	return json.Marshal(patch)
}

Calling createPatch to build the patch and the response, in the mutatePod method:

	// Workaround: https://github.com/kubernetes/kubernetes/issues/57982
	glog.Infof("[before applyDefaultsWorkaround][ws.sidecarConfig.Containers:%+v][ws.sidecarConfig.Volumes:%+v]", ws.sidecarConfig.Containers[0], ws.sidecarConfig.Volumes[0])
	applyDefaultsWorkaround(ws.sidecarConfig.Containers, ws.sidecarConfig.Volumes)
	glog.Infof("[after applyDefaultsWorkaround][ws.sidecarConfig.Containers:%+v][ws.sidecarConfig.Volumes:%+v]", ws.sidecarConfig.Containers[0], ws.sidecarConfig.Volumes[0])

	// build the annotation marking this pod as injected
	annotations := map[string]string{admissionWebhookAnnotationStatusKey: "injected"}
	patchBytes, err := createPatch(&pod, ws.sidecarConfig, annotations)
	if err != nil {
		return &v1beta1.AdmissionResponse{
			Result: &metav1.Status{
				Message: err.Error(),
			},
		}
	}
	glog.Infof("AdmissionResponse: patch=%v\n", string(patchBytes))
	return &v1beta1.AdmissionResponse{
		Allowed: true,
		Patch:   patchBytes,
		PatchType: func() *v1beta1.PatchType {
			pt := v1beta1.PatchTypeJSONPatch
			return &pt
		}(),
	}
}

Key points of this section:

Writing serveMutate:
- validating the admission request parameters;
- deciding from the annotation whether the sidecar needs to be injected;
- writing the mutatePod injection function;
- writing the patch-generating functions for the injected container and volume.

4.4 Building the image, deploying, and verifying sidecar injection

Key points of this section:

Create a CA-signed certificate by sending a CSR to the apiserver for approval.
Use the approved certificate to create the MutatingWebhookConfiguration.

Build the binary and the image
Makefile:

IMAGE_NAME ?= sidecar-injector

PWD := $(shell pwd)
BASE_DIR := $(shell basename $(PWD))

export GOPATH ?= $(GOPATH_DEFAULT)

IMAGE_TAG ?= $(shell date +v%Y%m%d)-$(shell git describe --match=$(git rev-parse --short=8 HEAD) --tags --always --dirty)
build:
	@echo "Building the $(IMAGE_NAME) binary..."
	@CGO_ENABLED=0 go build -o $(IMAGE_NAME) ./pkg/
build-linux:
	@echo "Building the $(IMAGE_NAME) binary for Docker (linux)..."
	@GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o $(IMAGE_NAME) ./pkg/
#################################
# image section
#################################

image: build-image

build-image: build-linux
	@echo "Building the docker image: $(IMAGE_NAME)..."
	@docker build -t $(IMAGE_NAME) -f Dockerfile .

.PHONY: all build image

dockerfile

FROM alpine:latest
# set environment variables
ENV SIDECAR_INJECTOR=/usr/local/bin/sidecar-injector \
    USER_UID=1001 \
    USER_NAME=sidecar-injector

COPY sidecar-injector /usr/local/bin/sidecar-injector

# set entrypoint
ENTRYPOINT ["/usr/local/bin/sidecar-injector"]

# switch to non-root user
USER ${USER_UID}

Package the code as kube-mutating-webhook-inject-pod.zip and copy it to a node of the Kubernetes cluster.

Build the image:
run make build-image

Export the image and import it on the other nodes:

docker save sidecar-injector > a.tar
scp a.tar k8s-worker02:~
ctr --namespace k8s.io images import a.tar

Deployment
Create the namespace nginx-injection; containers deployed into this namespace will get the nginx sidecar injected:

kubectl create ns nginx-injection

Create the namespace sidecar-injector, where our mutating webhook service will run:

kubectl create ns sidecar-injector

Creating a CA-signed certificate via the apiserver
01 Generate the certificate signing request config file csr.conf:

cat <<EOF > csr.conf
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = sidecar-injector-webhook-svc
DNS.2 = sidecar-injector-webhook-svc.sidecar-injector
DNS.3 = sidecar-injector-webhook-svc.sidecar-injector.svc
EOF

02 Generate an RSA private key with openssl genrsa:

openssl genrsa -out server-key.pem 2048

03 Generate the certificate signing request from the key and the config file:

openssl req -new -key server-key.pem -subj "/CN=sidecar-injector-webhook-svc.sidecar-injector.svc" -out server.csr -config csr.conf

Delete any previous CSR object:

kubectl delete csr sidecar-injector-webhook-svc.sidecar-injector

Create the CertificateSigningRequest:

cat <<EOF | kubectl create -f -
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  name: sidecar-injector-webhook-svc.sidecar-injector
spec:
  groups:
  - system:authenticated
  request: $(< server.csr base64 | tr -d '\n')
  usages:
  - digital signature
  - key encipherment
  - server auth
EOF

Check the CSR:

 kubectl get csr

Approve the CSR:

kubectl certificate approve sidecar-injector-webhook-svc.sidecar-injector
certificatesigningrequest.certificates.k8s.io/sidecar-injector-webhook-svc.sidecar-injector approved

Fetch the signed certificate:

serverCert=$(kubectl get csr sidecar-injector-webhook-svc.sidecar-injector -o jsonpath='{ .status.certificate}')
echo "${serverCert}" | openssl base64 -d -A -out server-cert.pem

Create a secret from the certificate and key:

kubectl create secret generic sidecar-injector-webhook-certs \
--from-file=key.pem=server-key.pem \
--from-file=cert.pem=server-cert.pem \
--dry-run=client -o yaml |
kubectl -n sidecar-injector apply -f -

Check the secret:

kubectl get secret -n sidecar-injector
NAME                            TYPE                                 DATA  AGE
default-token-hvgnl             kubernetes.io/service-account-token  3     25m
sidecar-injector-webhook-certs  Opaque                               2     25m

获取CA_BUNDLE并替换 mutatingwebhook中的CA_BUNDLE占位

CA_BUNDLE=$(kubectl config view --raw --minify --flatten -o jsonpath='{.clusters[].cluster.certificate-authority-data}')

if [ -z "${CA_BUNDLE}" ]; then
    CA_BUNDLE=$(kubectl get secrets -o jsonpath="{.items[?(@.metadata.annotations['kubernetes\.io/service-account\.name']=='default')].data.ca\.crt}")
fi

替换

cat deploy/mutating_webhook.yaml | sed -e "s|\${CA_BUNDLE}|${CA_BUNDLE}|g" >  deploy/mutatingwebhook-ca-bundle.yaml

检查结果

cat deploy/mutatingwebhook-ca-bundle.yaml

上述两个步骤可以直接运行脚本
脚本如下

chmod +x ./deploy/*.sh
./deploy/webhook-create-signed-cert.sh \
    --service sidecar-injector-webhook-svc \
    --secret sidecar-injector-webhook-certs \
    --namespace sidecar-injector

cat deploy/mutating_webhook.yaml | \
    deploy/webhook-patch-ca-bundle.sh > \
    deploy/mutatingwebhook-ca-bundle.yaml

Here we reuse the certificate-signing-request scripts from the Istio project: they send the request to the apiserver, obtain the signed certificate, and use the result to create the required secret object.
部署yaml

01先部署sidecar-injector
部署

kubectl create -f deploy/inject_configmap.yaml
kubectl create -f deploy/inject_deployment.yaml
kubectl create -f deploy/inject_service.yaml

检查

kubectl get pod -n sidecar-injector

kubectl get svc -n sidecar-injector

02 部署 mutatingwebhook

kubectl create -f deploy/mutatingwebhook-ca-bundle.yaml

检查

kubectl get MutatingWebhookConfiguration -A

03 部署nginx-sidecar 运行所需的configmap

kubectl create -f deploy/nginx_configmap.yaml

04 Create the namespace (if it does not exist yet) and label it with nginx-sidecar-injection=enabled:

kubectl create ns nginx-injection
kubectl label namespace nginx-injection nginx-sidecar-injection=enabled

The label nginx-sidecar-injection=enabled matches the namespace selector in the MutatingWebhookConfiguration:

namespaceSelector:
  matchLabels:
    nginx-sidecar-injection: enabled

Check the label; pods deployed into this namespace will be evaluated for sidecar injection:

kubectl get ns -L nginx-sidecar-injection

 05 Deploy a pod into nginx-injection
The annotation sidecar-injector-webhook.nginx.sidecar/need_inject: "true" indicates that injection is wanted:

apiVersion: v1
kind: Pod
metadata:
  namespace: nginx-injection
  name: test-alpine-inject01
  labels:
    role: myrole
  annotations:
    sidecar-injector-webhook.nginx.sidecar/need_inject: "true"
spec:
  containers:
    - image: alpine
      command:
        - /bin/sh
        - "-c"
        - "sleep 60m"
      imagePullPolicy: IfNotPresent
      name: alpine
  restartPolicy: Always

部署

kubectl create -f test_sleep_deployment.yaml

Check the result: the test-alpine-inject01 pod now has an nginx sidecar injected; curl the pod IP on port 80 and you will get a response from the nginx sidecar.

kubectl get pod -n nginx-injection -o wide

curl pod_ip

 06 Check the sidecar-injector logs

The apiserver calls the sidecar-injector, which decides to inject the sidecar into this pod.

07 Deploy a pod that does not want a sidecar

The annotation sidecar-injector-webhook.nginx.sidecar/need_inject: "false" explicitly opts out of injection:

apiVersion: v1
kind: Pod
metadata:
  namespace: nginx-injection
  name: test-alpine-inject02
  labels:
    role: myrole
  annotations:
    sidecar-injector-webhook.nginx.sidecar/need_inject: "false"
spec:
  containers:
    - image: alpine
      command:
        - /bin/sh
        - "-c"
        - "sleep 60m"
      imagePullPolicy: IfNotPresent
      name: alpine
  restartPolicy: Always

Check the result: test-alpine-inject02 runs only a single container.

 In the sidecar-injector logs you can see something like [skip mutation][reason=pod not need].

Chapter 5: The request-handling flow of the core API server

5.1 The startup flow of the core API server

Key points of this section:

- the generic GenericAPIServer New function;

- initialization of the core apiserver service;

- the final apiserver startup flow.

The generic GenericAPIServer New function

Earlier we analyzed how buildGenericConfig builds the configuration for the core API service.
Back in the CreateServerChain function, located at

D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-apiserver\app\server.go

we can see that it calls the create functions of three servers, passing in the corresponding configuration:

- apiExtensionsServer: the API extension server, mainly for CRDs;
- kubeAPIServer: the core API server, serving the common resources such as Pod/Deployment/Service;

- aggregatorServer: the API aggregation server, mainly for metrics.

The code is as follows:

	apiExtensionsServer, err := createAPIExtensionsServer(apiExtensionsConfig, genericapiserver.NewEmptyDelegateWithCustomHandler(notFoundHandler))
	if err != nil {
		return nil, err
	}

	kubeAPIServer, err := CreateKubeAPIServer(kubeAPIServerConfig, apiExtensionsServer.GenericAPIServer)
	if err != nil {
		return nil, err
	}

	// aggregator comes last in the chain
	aggregatorConfig, err := createAggregatorConfig(*kubeAPIServerConfig.GenericConfig, completedOptions.ServerRunOptions, kubeAPIServerConfig.ExtraConfig.VersionedInformers, serviceResolver, kubeAPIServerConfig.ExtraConfig.ProxyTransport, pluginInitializer)
	if err != nil {
		return nil, err
	}
	aggregatorServer, err := createAggregatorServer(aggregatorConfig, kubeAPIServer.GenericAPIServer, apiExtensionsServer.Informers)
	if err != nil {
		// we don't need special handling for innerStopCh because the aggregator server doesn't create any go routines
		return nil, err
	}

The create functions of all three servers call completedConfig.New:

- for kubeAPIServer:
s, err := c.GenericConfig.New("kube-apiserver", delegationTarget)
- for apiExtensionsServer:
genericServer, err := c.GenericConfig.New("apiextensions-apiserver", delegationTarget)
- and in createAggregatorServer:
genericServer, err := c.GenericConfig.New("kube-aggregator", delegationTarget)

completedConfig.New creates the generic GenericAPIServer.

Location:

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\config.go

New creates a new server that logically combines the handling chain with the passed-in delegation server.
The name is used to differentiate logging.

Initialize the handler:

	handlerChainBuilder := func(handler http.Handler) http.Handler {
		return c.BuildHandlerChainFunc(handler, c.Config)
	}
	apiServerHandler := NewAPIServerHandler(name, c.Serializer, handlerChainBuilder, delegationTarget.UnprotectedHandler())

用各种参数实例化一个GenericAPIServer

	s := &GenericAPIServer{
		discoveryAddresses:         c.DiscoveryAddresses,
		LoopbackClientConfig:       c.LoopbackClientConfig,
		legacyAPIGroupPrefixes:     c.LegacyAPIGroupPrefixes,
		admissionControl:           c.AdmissionControl,
		Serializer:                 c.Serializer,
		AuditBackend:               c.AuditBackend,
		Authorizer:                 c.Authorization.Authorizer,
		delegationTarget:           delegationTarget,
		EquivalentResourceRegistry: c.EquivalentResourceRegistry,
		HandlerChainWaitGroup:      c.HandlerChainWaitGroup,
		Handler:                    apiServerHandler,

		listedPathProvider: apiServerHandler,

		minRequestTimeout:     time.Duration(c.MinRequestTimeout) * time.Second,
		ShutdownTimeout:       c.RequestTimeout,
		ShutdownDelayDuration: c.ShutdownDelayDuration,
		SecureServingInfo:     c.SecureServing,
		ExternalAddress:       c.ExternalAddress,

		openAPIConfig:           c.OpenAPIConfig,
		openAPIV3Config:         c.OpenAPIV3Config,
		skipOpenAPIInstallation: c.SkipOpenAPIInstallation,

		postStartHooks:         map[string]postStartHookEntry{},
		preShutdownHooks:       map[string]preShutdownHookEntry{},
		disabledPostStartHooks: c.DisabledPostStartHooks,

		healthzChecks:    c.HealthzChecks,
		livezChecks:      c.LivezChecks,
		readyzChecks:     c.ReadyzChecks,
		livezGracePeriod: c.LivezGracePeriod,

		DiscoveryGroupManager: discovery.NewRootAPIsHandler(c.DiscoveryAddresses, c.Serializer),

		maxRequestBodyBytes: c.MaxRequestBodyBytes,
		livezClock:          clock.RealClock{},

		lifecycleSignals:       c.lifecycleSignals,
		ShutdownSendRetryAfter: c.ShutdownSendRetryAfter,

		APIServerID:           c.APIServerID,
		StorageVersionManager: c.StorageVersionManager,

		Version: c.Version,

		muxAndDiscoveryCompleteSignals: map[string]<-chan struct{}{},
	}

添加钩子函数

先从传入的server中获取

	// first add poststarthooks from delegated targets
	for k, v := range delegationTarget.PostStartHooks() {
		s.postStartHooks[k] = v
	}

	for k, v := range delegationTarget.PreShutdownHooks() {
		s.preShutdownHooks[k] = v
	}

Then take the post-start hooks that were preconfigured in the completedConfig:

	// add poststarthooks that were preconfigured.  Using the add method will give us an error if the same name has already been registered.
	for name, preconfiguredPostStartHook := range c.PostStartHooks {
		if err := s.AddPostStartHook(name, preconfiguredPostStartHook.hook); err != nil {
			return nil, err
		}
	}

比如在之前生成GenericConfig配置的admissionPostStartHook准入控制hook,位置

D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-apiserver\app\server.go

	if err := config.GenericConfig.AddPostStartHook("start-kube-apiserver-admission-initializer", admissionPostStartHook); err != nil {
		return nil, nil, nil, err
	}

对应的hook方法在admissionNew中,位置

D:\Workspace\Go\src\k8s.io\kubernetes\pkg\kubeapiserver\admission\config.go

	admissionPostStartHook := func(context genericapiserver.PostStartHookContext) error {
		discoveryRESTMapper.Reset()
		go utilwait.Until(discoveryRESTMapper.Reset, 30*time.Second, context.StopCh)
		return nil
	}
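The hook above uses utilwait.Until to reset the discovery RESTMapper every 30 seconds until the stop channel closes. A minimal standalone sketch of the same wait.Until pattern (the printed message and the 1-second period are chosen here only for demonstration):

```go
// wait_until_sketch.go — illustrates the periodic-until-stopped pattern used by the hook.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	stopCh := make(chan struct{})

	// run the function immediately and then every second until stopCh is closed
	go wait.Until(func() { fmt.Println("refresh discovery info") }, 1*time.Second, stopCh)

	time.Sleep(3500 * time.Millisecond)
	close(stopCh)
	time.Sleep(100 * time.Millisecond) // give the goroutine a moment to exit
}
```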

注册generic-apiserver-start-informers的hook

	genericApiServerHookName := "generic-apiserver-start-informers"
	if c.SharedInformerFactory != nil {
		if !s.isPostStartHookRegistered(genericApiServerHookName) {
			err := s.AddPostStartHook(genericApiServerHookName, func(context PostStartHookContext) error {
				c.SharedInformerFactory.Start(context.StopCh)
				return nil
			})
			if err != nil {
				return nil, err
			}
		}
		// TODO: Once we get rid of /healthz consider changing this to post-start-hook.
		err := s.AddReadyzChecks(healthz.NewInformerSyncHealthz(c.SharedInformerFactory))
		if err != nil {
			return nil, err
		}
	}

注册apiserver中的限流策略 hook
具体的内容在限流那章节中讲解

	const priorityAndFairnessConfigConsumerHookName = "priority-and-fairness-config-consumer"
	if s.isPostStartHookRegistered(priorityAndFairnessConfigConsumerHookName) {
	} else if c.FlowControl != nil {
		err := s.AddPostStartHook(priorityAndFairnessConfigConsumerHookName, func(context PostStartHookContext) error {
			go c.FlowControl.MaintainObservations(context.StopCh)
			go c.FlowControl.Run(context.StopCh)
			return nil
		})
		if err != nil {
			return nil, err
		}
		// TODO(yue9944882): plumb pre-shutdown-hook for request-management system?
	} else {
		klog.V(3).Infof("Not requested to run hook %s", priorityAndFairnessConfigConsumerHookName)
	}


	// Add PostStartHooks for maintaining the watermarks for the Priority-and-Fairness and the Max-in-Flight filters.
	if c.FlowControl != nil {
		const priorityAndFairnessFilterHookName = "priority-and-fairness-filter"
		if !s.isPostStartHookRegistered(priorityAndFairnessFilterHookName) {
			err := s.AddPostStartHook(priorityAndFairnessFilterHookName, func(context PostStartHookContext) error {
				genericfilters.StartPriorityAndFairnessWatermarkMaintenance(context.StopCh)
				return nil
			})
			if err != nil {
				return nil, err
			}
		}
	} else {
		const maxInFlightFilterHookName = "max-in-flight-filter"
		if !s.isPostStartHookRegistered(maxInFlightFilterHookName) {
			err := s.AddPostStartHook(maxInFlightFilterHookName, func(context PostStartHookContext) error {
				genericfilters.StartMaxInFlightWatermarkMaintenance(context.StopCh)
				return nil
			})
			if err != nil {
				return nil, err
			}
		}
	}

添加健康检查

	for _, delegateCheck := range delegationTarget.HealthzChecks() {
		skip := false
		for _, existingCheck := range c.HealthzChecks {
			if existingCheck.Name() == delegateCheck.Name() {
				skip = true
				break
			}
		}
		if skip {
			continue
		}
		s.AddHealthChecks(delegateCheck)
	}

By setting the liveness grace period to 0, these delegated health checks are expected to report the apiserver as unhealthy immediately if they fail.

位置

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\healthz.go

// AddHealthChecks adds HealthCheck(s) to health endpoints (healthz, livez, readyz) but
// configures the liveness grace period to be zero, which means we expect this health check
// to immediately indicate that the apiserver is unhealthy.
func (s *GenericAPIServer) AddHealthChecks(checks ...healthz.HealthChecker) error {
	// we opt for a delay of zero here, because this entrypoint adds generic health checks
	// and not health checks which are specifically related to kube-apiserver boot-sequences.
	return s.addHealthChecks(0, checks...)
}
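AddHealthChecks accepts any healthz.HealthChecker. Below is a sketch of a custom named check that could be passed to it; the check name and the readiness condition are invented here for illustration and are not part of the apiserver code.

```go
// healthcheck_sketch.go — a hypothetical custom health checker built with healthz.NamedCheck.
package main

import (
	"errors"
	"fmt"
	"net/http"

	"k8s.io/apiserver/pkg/server/healthz"
)

// newDependencyCheck reports healthy only while ready() returns true.
func newDependencyCheck(ready func() bool) healthz.HealthChecker {
	return healthz.NamedCheck("example-dependency", func(_ *http.Request) error {
		if !ready() {
			return errors.New("example dependency is not ready")
		}
		return nil
	})
}

func main() {
	check := newDependencyCheck(func() bool { return true })
	// in a real server this checker would be passed to s.AddHealthChecks(check)
	fmt.Println(check.Name(), check.Check(nil)) // example-dependency <nil>
}
```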

installAPI sets up the built-in API routes.
- Add the routes for / and /index.html:

	if c.EnableIndex {
		routes.Index{}.Install(s.listedPathProvider, s.Handler.NonGoRestfulMux)
	}

添加/debug/pprof 分析的路由规则,用于性能分析

	if c.EnableProfiling {
		routes.Profiling{}.Install(s.Handler.NonGoRestfulMux)
		if c.EnableContentionProfiling {
			goruntime.SetBlockProfileRate(1)
		}
		// so far, only logging related endpoints are considered valid to add for these debug flags.
		routes.DebugFlags{}.Install(s.Handler.NonGoRestfulMux, "v", routes.StringFlagPutHandler(logs.GlogSetter))
	}

添加/metrics 指标监控的路由规则 

	if c.EnableMetrics {
		if c.EnableProfiling {
			routes.MetricsWithReset{}.Install(s.Handler.NonGoRestfulMux)
		} else {
			routes.DefaultMetrics{}.Install(s.Handler.NonGoRestfulMux)
		}
	}

Add the /version route:

	routes.Version{Version: c.Version}.Install(s.Handler.GoRestfulContainer)

Enable API discovery:

	if c.EnableDiscovery {
		s.Handler.GoRestfulContainer.Add(s.DiscoveryGroupManager.WebService())
	}

apiserver 核心服务的初始化

位置D:\Workspace\Go\src\k8s.io\kubernetes\pkg\controlplane\instance.go

// New returns a new instance of Master from the given config.
// Certain config fields will be set to a default value if unset.
// Certain config fields must be specified, including:
//   KubeletClientConfig
func (c completedConfig) New(delegationTarget genericapiserver.DelegationTarget) (*Instance, error) { ... }

上面提到的初始化通用的server

	s, err := c.GenericConfig.New("kube-apiserver", delegationTarget)

并且用通用配置实例化master 实例

	m := &Instance{
		GenericAPIServer:          s,
		ClusterAuthenticationInfo: c.ExtraConfig.ClusterAuthenticationInfo,
	}

注册核心资源的api 

	// install legacy rest storage

	if err := m.InstallLegacyAPI(&c, c.GenericConfig.RESTOptionsGetter); err != nil {
		return nil, err
	}


注册api

	if err := m.InstallAPIs(c.ExtraConfig.APIResourceConfigSource, c.GenericConfig.RESTOptionsGetter, restStorageProviders...); err != nil {
		return nil, err
	}

最终的apiserver启动流程
回到Run函数通过CreateServerChain拿到创建的3个server,执行run即可

位置D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-apiserver\app\server.go

// Run runs the specified APIServer.  This should never exit.
func Run(completeOptions completedServerRunOptions, stopCh <-chan struct{}) error {
	// To help debugging, immediately log version
	klog.Infof("Version: %+v", version.Get())

	klog.InfoS("Golang settings", "GOGC", os.Getenv("GOGC"), "GOMAXPROCS", os.Getenv("GOMAXPROCS"), "GOTRACEBACK", os.Getenv("GOTRACEBACK"))

	server, err := CreateServerChain(completeOptions, stopCh)
	if err != nil {
		return err
	}

	prepared, err := server.PrepareRun()
	if err != nil {
		return err
	}

	return prepared.Run(stopCh)
}

preparedGenericAPIServer中的Run 

位置D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\genericapiserver.go 

	stoppedCh, listenerStoppedCh, err := s.NonBlockingRun(stopHttpServerCh, shutdownTimeout)

调用preparedGenericAPIServer的NonBlockingRun

	if s.SecureServingInfo != nil && s.Handler != nil {
		var err error
		stoppedCh, listenerStoppedCh, err = s.SecureServingInfo.Serve(s.Handler, shutdownTimeout, internalStopCh)
		if err != nil {
			close(internalStopCh)
			close(auditStopCh)
			return nil, nil, err
		}
	}

最终调用Serve运行secure http server,
位置D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\secure_serving.go

// Serve runs the secure http server. It fails only if certificates cannot be loaded or the initial listen call fails.
// The actual server loop (stoppable by closing stopCh) runs in a go routine, i.e. Serve does not block.
// It returns a stoppedCh that is closed when all non-hijacked active requests have been processed.
// It returns a listenerStoppedCh that is closed when the underlying http Server has stopped listening.
func (s *SecureServingInfo) Serve(handler http.Handler, shutdownTimeout time.Duration, stopCh <-chan struct{}) (<-chan struct{}, <-chan struct{}, error) {
	if s.Listener == nil {
		return nil, nil, fmt.Errorf("listener must not be nil")
	}

	tlsConfig, err := s.tlsConfig(stopCh)
	if err != nil {
		return nil, nil, err
	}

	secureServer := &http.Server{
		Addr:           s.Listener.Addr().String(),
		Handler:        handler,
		MaxHeaderBytes: 1 << 20,
		TLSConfig:      tlsConfig,

		IdleTimeout:       90 * time.Second, // matches http.DefaultTransport keep-alive timeout
		ReadHeaderTimeout: 32 * time.Second, // just shy of requestTimeoutUpperBound
	}

	// At least 99% of serialized resources in surveyed clusters were smaller than 256kb.
	// This should be big enough to accommodate most API POST requests in a single frame,
	// and small enough to allow a per connection buffer of this size multiplied by `MaxConcurrentStreams`.
	const resourceBody99Percentile = 256 * 1024

	http2Options := &http2.Server{
		IdleTimeout: 90 * time.Second, // matches http.DefaultTransport keep-alive timeout
	}

	// shrink the per-stream buffer and max framesize from the 1MB default while still accommodating most API POST requests in a single frame
	http2Options.MaxUploadBufferPerStream = resourceBody99Percentile
	http2Options.MaxReadFrameSize = resourceBody99Percentile

	// use the overridden concurrent streams setting or make the default of 250 explicit so we can size MaxUploadBufferPerConnection appropriately
	if s.HTTP2MaxStreamsPerConnection > 0 {
		http2Options.MaxConcurrentStreams = uint32(s.HTTP2MaxStreamsPerConnection)
	} else {
		http2Options.MaxConcurrentStreams = 250
	}

	// increase the connection buffer size from the 1MB default to handle the specified number of concurrent streams
	http2Options.MaxUploadBufferPerConnection = http2Options.MaxUploadBufferPerStream * int32(http2Options.MaxConcurrentStreams)

	if !s.DisableHTTP2 {
		// apply settings to the server
		if err := http2.ConfigureServer(secureServer, http2Options); err != nil {
			return nil, nil, fmt.Errorf("error configuring http2: %v", err)
		}
	}

	// use tlsHandshakeErrorWriter to handle messages of tls handshake error
	tlsErrorWriter := &tlsHandshakeErrorWriter{os.Stderr}
	tlsErrorLogger := log.New(tlsErrorWriter, "", 0)
	secureServer.ErrorLog = tlsErrorLogger

	klog.Infof("Serving securely on %s", secureServer.Addr)
	return RunServer(secureServer, s.Listener, shutdownTimeout, stopCh)
}

5.2 Initialization of the Scheme and the RESTStorage

Key points of this section:

Scheme defines how resources are serialized and deserialized and how resource types map to versions;

it can be thought of as a registry table.

Every Kubernetes resource must be registered in the Scheme before it can be used.

A RESTStorage defines how a resource is created, read, updated and deleted, i.e. how it talks to storage.
- The restStore created for each resource is put into the restStorageMap.

- The map key is the resource/subresource name and the value is the corresponding restStore.

InstallLegacyAPI

- The previous section showed that the core apiserver creates the RESTStorages during initialization

and uses them to initialize the core (legacy) API group.

Entry point: D:\Workspace\Go\src\k8s.io\kubernetes\pkg\controlplane\instance.go

// InstallLegacyAPI will install the legacy APIs for the restStorageProviders if they are enabled.
func (m *Instance) InstallLegacyAPI(c *completedConfig, restOptionsGetter generic.RESTOptionsGetter) error {
	legacyRESTStorageProvider := corerest.LegacyRESTStorageProvider{
		StorageFactory:              c.ExtraConfig.StorageFactory,
		ProxyTransport:              c.ExtraConfig.ProxyTransport,
		KubeletClientConfig:         c.ExtraConfig.KubeletClientConfig,
		EventTTL:                    c.ExtraConfig.EventTTL,
		ServiceIPRange:              c.ExtraConfig.ServiceIPRange,
		SecondaryServiceIPRange:     c.ExtraConfig.SecondaryServiceIPRange,
		ServiceNodePortRange:        c.ExtraConfig.ServiceNodePortRange,
		LoopbackClientConfig:        c.GenericConfig.LoopbackClientConfig,
		ServiceAccountIssuer:        c.ExtraConfig.ServiceAccountIssuer,
		ExtendExpiration:            c.ExtraConfig.ExtendExpiration,
		ServiceAccountMaxExpiration: c.ExtraConfig.ServiceAccountMaxExpiration,
		APIAudiences:                c.GenericConfig.Authentication.APIAudiences,
	}
	legacyRESTStorage, apiGroupInfo, err := legacyRESTStorageProvider.NewLegacyRESTStorage(c.ExtraConfig.APIResourceConfigSource, restOptionsGetter)
	if err != nil {
		return fmt.Errorf("error building core storage: %v", err)
	}
	if len(apiGroupInfo.VersionedResourcesStorageMap) == 0 { // if all core storage is disabled, return.
		return nil
	}

	controllerName := "bootstrap-controller"
	coreClient := corev1client.NewForConfigOrDie(c.GenericConfig.LoopbackClientConfig)
	bootstrapController, err := c.NewBootstrapController(legacyRESTStorage, coreClient, coreClient, coreClient, coreClient.RESTClient())
	if err != nil {
		return fmt.Errorf("error creating bootstrap controller: %v", err)
	}
	m.GenericAPIServer.AddPostStartHookOrDie(controllerName, bootstrapController.PostStartHook)
	m.GenericAPIServer.AddPreShutdownHookOrDie(controllerName, bootstrapController.PreShutdownHook)

	if err := m.GenericAPIServer.InstallLegacyAPIGroup(genericapiserver.DefaultLegacyAPIPrefix, &apiGroupInfo); err != nil {
		return fmt.Errorf("error in registering group versions: %v", err)
	}
	return nil
}

NewLegacyRESTStorage分析 

位置 D:\Workspace\Go\src\k8s.io\kubernetes\pkg\registry\core\rest\storage_core.go 

func (c LegacyRESTStorageProvider) NewLegacyRESTStorage(apiResourceConfigSource serverstorage.APIResourceConfigSource, restOptionsGetter generic.RESTOptionsGetter) (LegacyRESTStorage, genericapiserver.APIGroupInfo, error) {
	apiGroupInfo := genericapiserver.APIGroupInfo{
		PrioritizedVersions:          legacyscheme.Scheme.PrioritizedVersionsForGroup(""),
		VersionedResourcesStorageMap: map[string]map[string]rest.Storage{},
		Scheme:                       legacyscheme.Scheme,
		ParameterCodec:               legacyscheme.ParameterCodec,
		NegotiatedSerializer:         legacyscheme.Codecs,
	}

legacyscheme.Scheme is the default instance of the important Kubernetes Scheme struct.

Scheme and Kubernetes resources

Location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apimachinery\pkg\runtime\scheme.go

Scheme defines the serialization/deserialization of resources and the mapping between resource types and versions; think of it as a registry table.

Kubernetes resources

When creating a resource, operators usually only care about the kind (e.g. deployment) and instinctively ignore the group and version.

But in Kubernetes, identifying a resource by "deployment" alone is not precise,
because the system supports multiple Groups, each Group supports multiple Versions, and each Version supports multiple Resources.

Some resources also have their own subresources (SubResources); for example, the Deployment resource has a Status subresource.
The full form is <group>/<version>/<resource>/<subresource> (see the short example after this list).
Taking the common Deployment resource as an example, its full form is apps/v1/deployments/status:

- apps is the group;
- v1 is the version;
- deployments is the resource;
- status is the subresource.

To make resources easier to manage and to iterate on in an orderly way, resources have the concepts of Group and Version.

 

Group: the resource group, also called APIGroup in the Kubernetes API server.
Version: the resource version, also called APIVersions in the Kubernetes API server.
Resource: the resource, also called APIResource in the Kubernetes API server.
Kind: the resource kind, describing the kind of a Resource; it sits at the same level as Resource.
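A small sketch (not taken from the book's code) of how these coordinates look when expressed with the k8s.io/apimachinery schema package:

```go
// gvk_example.go — group/version/resource/kind expressed with the schema package.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// apps/v1 Deployment as a GroupVersionKind (the type view)
	gvk := schema.GroupVersionKind{Group: "apps", Version: "v1", Kind: "Deployment"}
	// apps/v1 deployments as a GroupVersionResource (the REST/storage view)
	gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}

	fmt.Println(gvk) // apps/v1, Kind=Deployment
	fmt.Println(gvr) // apps/v1, Resource=deployments

	// the legacy core group has an empty Group, e.g. v1 pods
	fmt.Println(schema.GroupVersionResource{Version: "v1", Resource: "pods"})
}
```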

What is the Scheme

Kubernetes has many resources, and each resource is a resource type.
These resource types need a unified mechanism for registration, storage, lookup and management.
All resource types in Kubernetes are registered in the Scheme registry, an in-memory registry with the following properties:
- it supports registering multiple resource types, including internal and external versions;
- it supports conversion between versions;
- it supports serialization/deserialization for different resources.

The Scheme registry supports two kinds of resource types (Type):

UnversionedType and KnownType, described below.
UnversionedType: an unversioned resource type.
This is a concept from early Kubernetes, mainly used for resource types that have no version;
objects of this type do not need conversion.
In current Kubernetes releases the unversioned type has been de-emphasized and almost all resource objects have versions,

but some types in the metav1 metadata still belong both to meta.k8s.io/v1 and to the UnversionedType category, for example:
- metav1.Status
- metav1.APIVersions
- metav1.APIGroupList
- metav1.APIGroup
- metav1.APIResourceList

KnownType: the most commonly used resource type in Kubernetes today,
also called a "versioned resource type". In the Scheme registry, UnversionedType objects are registered with the scheme.AddUnversionedTypes method,

and KnownType objects are registered with the scheme.AddKnownTypes method.

Scheme结构体定义

代码位置

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apimachinery\pkg\runtime\scheme.go

	s := &Scheme{
		gvkToType:                 map[schema.GroupVersionKind]reflect.Type{},
		typeToGVK:                 map[reflect.Type][]schema.GroupVersionKind{},
		unversionedTypes:          map[reflect.Type]schema.GroupVersionKind{},
		unversionedKinds:          map[string]reflect.Type{},
		fieldLabelConversionFuncs: map[schema.GroupVersionKind]FieldLabelConversionFunc{},
		defaulterFuncs:            map[reflect.Type]func(interface{}){},
		versionPriority:           map[string][]string{},
		schemeName:                naming.GetNameFromCallsite(internalPackages...),
	}

 The fields are:

- gvkToType: maps GVK to Type.
- typeToGVK: maps Type to GVK; one Type can correspond to one or more GVKs.
- unversionedTypes: maps UnversionedType to GVK.
- unversionedKinds: maps Kind (resource kind) name to UnversionedType.
The Scheme registry implements these mappings with Go maps,
which allows efficient forward and reverse lookups; retrieving the Type for a given GVK from the Scheme registry is O(1).

如何使用Scheme

获取scheme对象

var Scheme = runtime.NewScheme()

Define the registration method AddToScheme.
A new Scheme registry is created with runtime.NewScheme. There are two ways to register resource types into it:

- register KnownType objects with the scheme.AddKnownTypes method;

- register UnversionedType objects with the scheme.AddUnversionedTypes method.

实例代码

func init() {
	metav1.AddToGroupVersion(Scheme, schema.GroupVersion{Version: "v1"})
	AddToScheme(Scheme)
}

获取解码对象 

var Codecs = serializer.NewCodecFactory(Scheme)
var ParameterCodec = runtime.NewParameterCodec(Scheme)

A concrete example
Take the webhook mutation admission controller we wrote earlier for sidecar injection:

- runtimeScheme initializes the registry;

- codecs and deserializer are the encoding/decoding objects built on top of it;

- finally, deserializer.Decode is called to decode the request body into a v1beta1.AdmissionReview resource (see the sketch below).
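A minimal end-to-end sketch of that chain, assuming the module has k8s.io/api and k8s.io/apimachinery v0.24.x on its path; the JSON body is a made-up example:

```go
// decode_example.go — Scheme -> CodecFactory -> UniversalDeserializer -> Decode.
package main

import (
	"fmt"

	admissionv1beta1 "k8s.io/api/admission/v1beta1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/serializer"
)

func main() {
	// 1. create the registry and register the admission types into it
	scheme := runtime.NewScheme()
	_ = admissionv1beta1.AddToScheme(scheme)

	// 2. build codecs and a universal deserializer on top of the registry
	codecs := serializer.NewCodecFactory(scheme)
	deserializer := codecs.UniversalDeserializer()

	// 3. decode a request body into an AdmissionReview
	body := []byte(`{"apiVersion":"admission.k8s.io/v1beta1","kind":"AdmissionReview","request":{"uid":"abc"}}`)
	ar := admissionv1beta1.AdmissionReview{}
	if _, _, err := deserializer.Decode(body, nil, &ar); err != nil {
		panic(err)
	}
	fmt.Println(ar.Request.UID) // abc
}
```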

接着回到NewLegacyRESTStorage分析

创建api group info对象

这里就是用了我们上面提到的scheme 

	apiGroupInfo := genericapiserver.APIGroupInfo{
		PrioritizedVersions:          legacyscheme.Scheme.PrioritizedVersionsForGroup(""),
		VersionedResourcesStorageMap: map[string]map[string]rest.Storage{},
		Scheme:                       legacyscheme.Scheme,
		ParameterCodec:               legacyscheme.ParameterCodec,
		NegotiatedSerializer:         legacyscheme.Codecs,
	}

创建LegacyRESTStorage

    restStorage := LegacyRESTStorage{}

使用各种资源的NewREST创建RESTStorage,以configmap为例

	configMapStorage, err := configmapstore.NewREST(restOptionsGetter)
	if err != nil {
		return LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err
	}

A RESTStorage defines how a resource is created, read, updated and deleted, i.e. how it talks to storage.

The configmap NewREST:

位置D:\Workspace\Go\src\k8s.io\kubernetes\pkg\registry\core\configmap\storage\storage.go

// REST implements a RESTStorage for ConfigMap
type REST struct {
	*genericregistry.Store
}

// NewREST returns a RESTStorage object that will work with ConfigMap objects.
func NewREST(optsGetter generic.RESTOptionsGetter) (*REST, error) {
	store := &genericregistry.Store{
		NewFunc:                  func() runtime.Object { return &api.ConfigMap{} },
		NewListFunc:              func() runtime.Object { return &api.ConfigMapList{} },
		PredicateFunc:            configmap.Matcher,
		DefaultQualifiedResource: api.Resource("configmaps"),

		CreateStrategy: configmap.Strategy,
		UpdateStrategy: configmap.Strategy,
		DeleteStrategy: configmap.Strategy,

		TableConvertor: printerstorage.TableConvertor{TableGenerator: printers.NewTableGenerator().With(printersinternal.AddHandlers)},
	}
	options := &generic.StoreOptions{
		RESTOptions: optsGetter,
		AttrFunc:    configmap.GetAttrs,
		TriggerFunc: map[string]storage.IndexerFunc{"metadata.name": configmap.NameTriggerFunc},
	}
	if err := store.CompleteWithOptions(options); err != nil {
		return nil, err
	}
	return &REST{store}, nil
}

- NewFunc returns the object used when getting a single resource.
- NewListFunc returns the object used when listing resources.
- PredicateFunc returns a matcher for the given label and field selectors; an object matches if it satisfies them.
- DefaultQualifiedResource is the plural name of the resource.
- CreateStrategy is the create strategy.
- UpdateStrategy is the update strategy.
- DeleteStrategy is the delete strategy.
- TableConvertor controls how the resource is printed as a table.
- options are the store options, validated and completed with store.CompleteWithOptions(options):

// CompleteWithOptions updates the store with the provided options and
// defaults common fields.
func (e *Store) CompleteWithOptions(options *generic.StoreOptions) error {
	if e.DefaultQualifiedResource.Empty() {
		return fmt.Errorf("store %#v must have a non-empty qualified resource", e)
	}
	if e.NewFunc == nil {
		return fmt.Errorf("store for %s must have NewFunc set", e.DefaultQualifiedResource.String())
	}
	if e.NewListFunc == nil {
		return fmt.Errorf("store for %s must have NewListFunc set", e.DefaultQualifiedResource.String())
	}
	if (e.KeyRootFunc == nil) != (e.KeyFunc == nil) {
		return fmt.Errorf("store for %s must set both KeyRootFunc and KeyFunc or neither", e.DefaultQualifiedResource.String())
	}

	if e.TableConvertor == nil {
		return fmt.Errorf("store for %s must set TableConvertor; rest.NewDefaultTableConvertor(e.DefaultQualifiedResource) can be used to output just name/creation time", e.DefaultQualifiedResource.String())
	}

pod的newStore

位置D:\Workspace\Go\src\k8s.io\kubernetes\pkg\registry\core\pod\storage\storage.go

// NewStorage returns a RESTStorage object that will work against pods.
func NewStorage(optsGetter generic.RESTOptionsGetter, k client.ConnectionInfoGetter, proxyTransport http.RoundTripper, podDisruptionBudgetClient policyclient.PodDisruptionBudgetsGetter) (PodStorage, error) {

	store := &genericregistry.Store{
		NewFunc:                  func() runtime.Object { return &api.Pod{} },
		NewListFunc:              func() runtime.Object { return &api.PodList{} },
		PredicateFunc:            registrypod.MatchPod,
		DefaultQualifiedResource: api.Resource("pods"),

		CreateStrategy:      registrypod.Strategy,
		UpdateStrategy:      registrypod.Strategy,
		DeleteStrategy:      registrypod.Strategy,
		ResetFieldsStrategy: registrypod.Strategy,
		ReturnDeletedObject: true,

		TableConvertor: printerstorage.TableConvertor{TableGenerator: printers.NewTableGenerator().With(printersinternal.AddHandlers)},
	}
	options := &generic.StoreOptions{
		RESTOptions: optsGetter,
		AttrFunc:    registrypod.GetAttrs,
		TriggerFunc: map[string]storage.IndexerFunc{"spec.nodeName": registrypod.NodeNameTriggerFunc},
		Indexers:    registrypod.Indexers(),
	}
	if err := store.CompleteWithOptions(options); err != nil {
		return PodStorage{}, err
	}

	statusStore := *store
	statusStore.UpdateStrategy = registrypod.StatusStrategy
	statusStore.ResetFieldsStrategy = registrypod.StatusStrategy
	ephemeralContainersStore := *store
	ephemeralContainersStore.UpdateStrategy = registrypod.EphemeralContainersStrategy

	bindingREST := &BindingREST{store: store}
	return PodStorage{
		Pod:                 &REST{store, proxyTransport},
		Binding:             &BindingREST{store: store},
		LegacyBinding:       &LegacyBindingREST{bindingREST},
		Eviction:            newEvictionStorage(store, podDisruptionBudgetClient),
		Status:              &StatusREST{store: &statusStore},
		EphemeralContainers: &EphemeralContainersREST{store: &ephemeralContainersStore},
		Log:                 &podrest.LogREST{Store: store, KubeletConn: k},
		Proxy:               &podrest.ProxyREST{Store: store, ProxyTransport: proxyTransport},
		Exec:                &podrest.ExecREST{Store: store, KubeletConn: k},
		Attach:              &podrest.AttachREST{Store: store, KubeletConn: k},
		PortForward:         &podrest.PortForwardREST{Store: store, KubeletConn: k},
	}, nil
}

podStore 返回的是PodStorage,和其它资源不同的是下面会有很多subresource 子资源的restStore

// PodStorage includes storage for pods and all sub resources
type PodStorage struct {
	Pod                 *REST
	Binding             *BindingREST
	LegacyBinding       *LegacyBindingREST
	Eviction            *EvictionREST
	Status              *StatusREST
	EphemeralContainers *EphemeralContainersREST
	Log                 *podrest.LogREST
	Proxy               *podrest.ProxyREST
	Exec                *podrest.ExecREST
	Attach              *podrest.AttachREST
	PortForward         *podrest.PortForwardREST
}

Back in NewLegacyRESTStorage:

- the restStores created above for each resource are put into the storage map;

- the map key is the resource/subresource name and the value is the corresponding Storage.

	storage := map[string]rest.Storage{}
	if resource := "pods"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = podStorage.Pod
		storage[resource+"/attach"] = podStorage.Attach
		storage[resource+"/status"] = podStorage.Status
		storage[resource+"/log"] = podStorage.Log
		storage[resource+"/exec"] = podStorage.Exec
		storage[resource+"/portforward"] = podStorage.PortForward
		storage[resource+"/proxy"] = podStorage.Proxy
		storage[resource+"/binding"] = podStorage.Binding
		if podStorage.Eviction != nil {
			storage[resource+"/eviction"] = podStorage.Eviction
		}
		if utilfeature.DefaultFeatureGate.Enabled(features.EphemeralContainers) {
			storage[resource+"/ephemeralcontainers"] = podStorage.EphemeralContainers
		}

	}
	if resource := "bindings"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = podStorage.LegacyBinding
	}

	if resource := "podtemplates"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = podTemplateStorage
	}

	if resource := "replicationcontrollers"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = controllerStorage.Controller
		storage[resource+"/status"] = controllerStorage.Status
		if legacyscheme.Scheme.IsVersionRegistered(schema.GroupVersion{Group: "autoscaling", Version: "v1"}) {
			storage[resource+"/scale"] = controllerStorage.Scale
		}
	}

	if resource := "services"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = serviceRESTStorage
		storage[resource+"/proxy"] = serviceRESTProxy
		storage[resource+"/status"] = serviceStatusStorage
	}

	if resource := "endpoints"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = endpointsStorage
	}

	if resource := "nodes"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = nodeStorage.Node
		storage[resource+"/proxy"] = nodeStorage.Proxy
		storage[resource+"/status"] = nodeStorage.Status
	}

	if resource := "events"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = eventStorage
	}

	if resource := "limitranges"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = limitRangeStorage
	}

	if resource := "resourcequotas"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = resourceQuotaStorage
		storage[resource+"/status"] = resourceQuotaStatusStorage
	}

	if resource := "namespaces"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = namespaceStorage
		storage[resource+"/status"] = namespaceStatusStorage
		storage[resource+"/finalize"] = namespaceFinalizeStorage
	}

	if resource := "secrets"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = secretStorage
	}

	if resource := "serviceaccounts"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = serviceAccountStorage
		if serviceAccountStorage.Token != nil {
			storage[resource+"/token"] = serviceAccountStorage.Token
		}
	}

	if resource := "persistentvolumes"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = persistentVolumeStorage
		storage[resource+"/status"] = persistentVolumeStatusStorage
	}

	if resource := "persistentvolumeclaims"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = persistentVolumeClaimStorage
		storage[resource+"/status"] = persistentVolumeClaimStatusStorage
	}

	if resource := "configmaps"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = configMapStorage
	}

	if resource := "componentstatuses"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
		storage[resource] = componentstatus.NewStorage(componentStatusStorage{c.StorageFactory}.serversToValidate)
	}

Finally the storage map above is put into apiGroupInfo.VersionedResourcesStorageMap.

This is a two-level map: the first-level key is the version, the second-level key is the resource (or resource/subresource) name, and the value is the corresponding resource storage (see the sketch below).

    if len(storage) > 0 {

        apiGroupInfo.VersionedResourcesStorageMap["v1"] = storage

    }
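A purely illustrative analogy of that two-level layout (plain strings stand in for the real rest.Storage values, which are interfaces backed by genericregistry.Store):

```go
// storage_map_sketch.go — the shape of VersionedResourcesStorageMap, with strings as placeholders.
package main

import "fmt"

func main() {
	// version -> resource (or resource/subresource) -> storage
	versionedResourcesStorageMap := map[string]map[string]string{
		"v1": {
			"pods":        "podStorage.Pod",
			"pods/status": "podStorage.Status",
			"pods/log":    "podStorage.Log",
			"configmaps":  "configMapStorage",
		},
	}

	// a request for the v1 pods/status endpoint is routed to its storage
	fmt.Println(versionedResourcesStorageMap["v1"]["pods/status"]) // podStorage.Status
}
```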

5.3 How the apiserver persists Pod data

How kube-apiserver saves the data when creating a Pod

The pod restStorage:

- The previous section showed that every resource has a corresponding Storage, which defines how it talks to storage.
- For pods, for example:

	store := &genericregistry.Store{
		NewFunc:                  func() runtime.Object { return &api.Pod{} },
		NewListFunc:              func() runtime.Object { return &api.PodList{} },
		PredicateFunc:            registrypod.MatchPod,
		DefaultQualifiedResource: api.Resource("pods"),

		CreateStrategy:      registrypod.Strategy,
		UpdateStrategy:      registrypod.Strategy,
		DeleteStrategy:      registrypod.Strategy,
		ResetFieldsStrategy: registrypod.Strategy,
		ReturnDeletedObject: true,

		TableConvertor: printerstorage.TableConvertor{TableGenerator: printers.NewTableGenerator().With(printersinternal.AddHandlers)},
	}

How the apiserver saves data when a pod is created

- The pods resource maps to the raw store above.
- The REST store is backed by genericregistry.Store; let's analyze its Create method.

// REST implements a RESTStorage for pods
type REST struct {
	*genericregistry.Store
	proxyTransport http.RoundTripper
}

位置D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\registry\generic\registry\store.go

create方法解析

先调用BeginCreate 

	if e.BeginCreate != nil {
		fn, err := e.BeginCreate(ctx, obj, options)
		if err != nil {
			return nil, err
		}
		finishCreate = fn
		defer func() {
			finishCreate(ctx, false)
		}()
	}

然后是BeforeCreate

if err := rest.BeforeCreate(e.CreateStrategy, ctx, obj); err != nil {return nil, err}

在BeforeCreate又会调用 strategy的PrepareForCreate

strategy.PrepareForCreate(ctx,obj) 

那么对应pod中就在D:\Workspace\Go\src\k8s.io\kubernetes\pkg\registry\core\pod\strategy.go

// PrepareForCreate clears fields that are not allowed to be set by end users on creation.
func (podStrategy) PrepareForCreate(ctx context.Context, obj runtime.Object) {
	pod := obj.(*api.Pod)
	pod.Status = api.PodStatus{
		Phase:    api.PodPending,
		QOSClass: qos.GetPodQOS(pod),
	}

	podutil.DropDisabledPodFields(pod, nil)

	applySeccompVersionSkew(pod)
}

pod PrepareForCreate解读

先把pod状态设置为pending

	pod.Status = api.PodStatus{
		Phase:    api.PodPending,

去掉一些字段

	podutil.DropDisabledPodFields(pod, nil)

	applySeccompVersionSkew(pod)

。通过 GetPodQOS获取pod的 qos

GetPodQOS explained

QoS in Kubernetes: allocating a node's limited resources sensibly

Overview
QoS (Quality of Service) is a control mechanism that gives different priorities to different users or data flows,
or guarantees a certain level of performance according to the application's requirements. Kubernetes has three QoS classes:
- Guaranteed: the pod's requests equal its limits;
- Burstable: requests are lower than limits and not everything is unset;
- BestEffort: neither requests nor limits are set.

Their priority increases in this order:
BestEffort -> Burstable -> Guaranteed
The practical differences between the QoS classes:
first, the scheduler schedules only on the request values;
second, when the system hits OOM, processes with different OOMScores are treated differently: BestEffort pods are killed first, then, if the system is still under OOM, Burstable pods, and finally Guaranteed pods.
Requests and limits
To limit container resources in Kubernetes, the YAML file has cpu and memory requests and limits.

Roughly speaking, scheduling is based on requests and runtime enforcement is based on limits.

For example, the configuration below requests 100m CPU with a 1000m limit, and 100Mi memory with a 250Mi limit:

resources:
  requests:
    cpu: 100m
    memory: 100Mi
  limits:
    cpu: 1000m
    memory: 250Mi

- API Priority and Fairness: https://kubernetes.io/zh/docs/concepts/cluster-administration/flow-control/

Code walkthrough
Location: D:\Workspace\Go\src\k8s.io\kubernetes\pkg\apis\core\helper\qos\qos.go

首先遍历pod中的容器处理 resource.request

	for _, container := range allContainers {
		// process requests
		for name, quantity := range container.Resources.Requests {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				delta := quantity.DeepCopy()
				if _, exists := requests[name]; !exists {
					requests[name] = delta
				} else {
					delta.Add(requests[name])
					requests[name] = delta
				}
			}
		}
		// process limits
		qosLimitsFound := sets.NewString()
		for name, quantity := range container.Resources.Limits {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				qosLimitsFound.Insert(string(name))
				delta := quantity.DeepCopy()
				if _, exists := limits[name]; !exists {
					limits[name] = delta
				} else {
					delta.Add(limits[name])
					limits[name] = delta
				}
			}
		}

		if !qosLimitsFound.HasAll(string(core.ResourceMemory), string(core.ResourceCPU)) {
			isGuaranteed = false
		}
	}

然后遍历处理 resource.limit

		// process limits
		qosLimitsFound := sets.NewString()
		for name, quantity := range container.Resources.Limits {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				qosLimitsFound.Insert(string(name))
				delta := quantity.DeepCopy()
				if _, exists := limits[name]; !exists {
					limits[name] = delta
				} else {
					delta.Add(limits[name])
					limits[name] = delta
				}
			}
		}

Classification rules
If neither requests nor limits are set, the pod is BestEffort:

	if len(requests) == 0 && len(limits) == 0 {
		return core.PodQOSBestEffort
	}

If requests equal limits for the supported resources, the pod is Guaranteed:

	if isGuaranteed &&
		len(requests) == len(limits) {
		return core.PodQOSGuaranteed
	}

- Otherwise the pod is Burstable (a simplified standalone sketch of these rules follows below).
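The following is a simplified, standalone sketch of the classification rules described above. It works on plain request/limit maps instead of the real PodSpec, ignores init containers and unsupported resource names, and elides the request==limit value comparison, so it is not the actual GetPodQOS implementation.

```go
// qos_sketch.go — simplified QoS classification for illustration only.
package main

import "fmt"

type resources struct {
	requests map[string]string // e.g. {"cpu": "100m", "memory": "100Mi"}
	limits   map[string]string
}

func (r resources) limitsCoverCPUAndMemory() bool {
	return r.limits["cpu"] != "" && r.limits["memory"] != ""
}

func qosClass(containers []resources) string {
	requestsSeen, limitsSeen := map[string]bool{}, map[string]bool{}
	guaranteed := true
	for _, c := range containers {
		for name := range c.requests {
			requestsSeen[name] = true
		}
		for name := range c.limits {
			limitsSeen[name] = true
		}
		// Guaranteed requires every container to set both cpu and memory limits
		// (and requests equal to limits; that value check is elided in this sketch).
		if !c.limitsCoverCPUAndMemory() {
			guaranteed = false
		}
	}
	if len(requestsSeen) == 0 && len(limitsSeen) == 0 {
		return "BestEffort"
	}
	if guaranteed && len(requestsSeen) == len(limitsSeen) {
		return "Guaranteed"
	}
	return "Burstable"
}

func main() {
	fmt.Println(qosClass([]resources{{}}))                                          // BestEffort
	fmt.Println(qosClass([]resources{{requests: map[string]string{"cpu": "100m"}}})) // Burstable
	fmt.Println(qosClass([]resources{{
		requests: map[string]string{"cpu": "1", "memory": "1Gi"},
		limits:   map[string]string{"cpu": "1", "memory": "1Gi"},
	}})) // Guaranteed
}
```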

然后是真正的操作存储的create 

	if err := e.Storage.Create(ctx, key, obj, out, ttl, dryrun.IsDryRun(options.DryRun)); err != nil {

Storage调用的是 DryRunnableStorage的create

位置D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\registry\generic\registry\dryrun.go

func (s *DryRunnableStorage) Create(ctx context.Context, key string, obj, out runtime.Object, ttl uint64, dryRun bool) error {
	if dryRun {
		if err := s.Storage.Get(ctx, key, storage.GetOptions{}, out); err == nil {
			return storage.NewKeyExistsError(key, 0)
		}
		return s.copyInto(obj, out)
	}
	return s.Storage.Create(ctx, key, obj, out, ttl)
}

如果是dryRun就是空跑,不存储在etcd中,只是将资源的结果返回

etcdv3 的create

位置D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\storage\etcd3\store.go

// Create implements storage.Interface.Create.
func (s *store) Create(ctx context.Context, key string, obj, out runtime.Object, ttl uint64) error {
	if version, err := s.versioner.ObjectResourceVersion(obj); err == nil && version != 0 {
		return errors.New("resourceVersion should not be set on objects to be created")
	}
	if err := s.versioner.PrepareObjectForStorage(obj); err != nil {
		return fmt.Errorf("PrepareObjectForStorage failed: %v", err)
	}
	data, err := runtime.Encode(s.codec, obj)
	if err != nil {
		return err
	}
	key = path.Join(s.pathPrefix, key)

	opts, err := s.ttlOpts(ctx, int64(ttl))
	if err != nil {
		return err
	}

	newData, err := s.transformer.TransformToStorage(ctx, data, authenticatedDataString(key))
	if err != nil {
		return storage.NewInternalError(err.Error())
	}

	startTime := time.Now()
	txnResp, err := s.client.KV.Txn(ctx).If(
		notFound(key),
	).Then(
		clientv3.OpPut(key, string(newData), opts...),
	).Commit()
	metrics.RecordEtcdRequestLatency("create", getTypeName(obj), startTime)
	if err != nil {
		return err
	}
	if !txnResp.Succeeded {
		return storage.NewKeyExistsError(key, 0)
	}

	if out != nil {
		putResp := txnResp.Responses[0].GetResponsePut()
		return decode(s.codec, s.versioner, data, out, putResp.Header.Revision)
	}
	return nil
}

Wrap-up

- If AfterCreate and Decorator hooks are set, invoke them

	if e.AfterCreate != nil {
		e.AfterCreate(out, options)
	}
	if e.Decorator != nil {
		e.Decorator(out)
	}

Key takeaways of this section:

- How kube-apiserver persists data when creating a pod

- Architecture diagram

5.4 Source code walkthrough of rate limiting in the apiserver

k8s supports several rate-limiting mechanisms

> To keep bursts of traffic from hurting apiserver availability, k8s supports several rate-limiting mechanisms, including:
- MaxInFlightLimit, server-wide limiting
- Client-side limiting
- EventRateLimit, limiting event requests
- APF, finer-grained limiting configuration

MaxInFlightLimit

By default the apiserver can cap its maximum concurrency (cluster-wide, with read-only and mutating requests counted separately)

--max-requests-inflight covers read-only requests

--max-mutating-requests-inflight covers mutating requests
- This provides a simple form of rate limiting

Source code walkthrough

- Entry point: the hook added in GenericAPIServer.New
Location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\config.go

// Add PostStartHooks for maintaining the watermarks for the Priority-and-Fairness and the Max-in-Flight filters.
	if c.FlowControl != nil {
		const priorityAndFairnessFilterHookName = "priority-and-fairness-filter"
		if !s.isPostStartHookRegistered(priorityAndFairnessFilterHookName) {
			err := s.AddPostStartHook(priorityAndFairnessFilterHookName, func(context PostStartHookContext) error {
				genericfilters.StartPriorityAndFairnessWatermarkMaintenance(context.StopCh)
				return nil
			})
			if err != nil {
				return nil, err
			}
		}
	} else {
		const maxInFlightFilterHookName = "max-in-flight-filter"
		if !s.isPostStartHookRegistered(maxInFlightFilterHookName) {
			err := s.AddPostStartHook(maxInFlightFilterHookName, func(context PostStartHookContext) error {
				genericfilters.StartMaxInFlightWatermarkMaintenance(context.StopCh)
				return nil
			})
			if err != nil {
				return nil, err
			}
		}
	}

In other words, when FlowControl is nil, APF is not enabled and the apiserver's overall concurrency is bounded by the kube-apiserver flags --max-requests-inflight and --max-mutating-requests-inflight

The function that starts the watermark metric observation

// startWatermarkMaintenance starts the goroutines to observe and maintain the specified watermark.
func startWatermarkMaintenance(watermark *requestWatermark, stopCh <-chan struct{}) {
	// Periodically update the inflight usage metric.
	go wait.Until(func() {
		watermark.lock.Lock()
		readOnlyWatermark := watermark.readOnlyWatermark
		mutatingWatermark := watermark.mutatingWatermark
		watermark.readOnlyWatermark = 0
		watermark.mutatingWatermark = 0
		watermark.lock.Unlock()

		metrics.UpdateInflightRequestMetrics(watermark.phase, readOnlyWatermark, mutatingWatermark)
	}, inflightUsageMetricUpdatePeriod, stopCh)

	// Periodically observe the watermarks. This is done to ensure that they do not fall too far behind. When they do
	// fall too far behind, then there is a long delay in responding to the next request received while the observer
	// catches back up.
	go wait.Until(func() {
		watermark.readOnlyObserver.Add(0)
		watermark.mutatingObserver.Add(0)
	}, observationMaintenancePeriod, stopCh)
}

WithMaxInFlightLimit is the rate-limiting handler

It is wired in at D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\config.go

	if c.FlowControl != nil {
		requestWorkEstimator := flowcontrolrequest.NewWorkEstimator(c.StorageObjectCountTracker.Get, c.FlowControl.GetInterestedWatchCount)
		handler = filterlatency.TrackCompleted(handler)
		handler = genericfilters.WithPriorityAndFairness(handler, c.LongRunningFunc, c.FlowControl, requestWorkEstimator)
		handler = filterlatency.TrackStarted(handler, "priorityandfairness")
	} else {
		handler = genericfilters.WithMaxInFlightLimit(handler, c.MaxRequestsInFlight, c.MaxMutatingRequestsInFlight, c.LongRunningFunc)
	}

Walkthrough

If both limits are 0, rate limiting is not enabled at all

	if nonMutatingLimit == 0 && mutatingLimit == 0 {
		return handler
	}

Build the limiting channels: buffered bool channels whose capacity equals the configured limit

	var nonMutatingChan chan bool
	var mutatingChan chan bool
	if nonMutatingLimit != 0 {
		nonMutatingChan = make(chan bool, nonMutatingLimit)
		watermark.readOnlyObserver.SetDenominator(float64(nonMutatingLimit))
	}
	if mutatingLimit != 0 {
		mutatingChan = make(chan bool, mutatingLimit)
		watermark.mutatingObserver.SetDenominator(float64(mutatingLimit))
	}

Check whether this is a long-running request

		// Skip tracking long running events.
		if longRunningRequestCheck != nil && longRunningRequestCheck(r, requestInfo) {
			handler.ServeHTTP(w, r)
			return
		}

BasicLongRunningRequestCheck decides whether the request is long-running, such as a watch or a pprof debug request; those requests are exempt from the limit. Location: D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\apiserver\pkg\server\filters\longrunning.go

// BasicLongRunningRequestCheck returns true if the given request has one of the specified verbs or one of the specified subresources, or is a profiler request.
func BasicLongRunningRequestCheck(longRunningVerbs, longRunningSubresources sets.String) apirequest.LongRunningRequestCheck {
	return func(r *http.Request, requestInfo *apirequest.RequestInfo) bool {
		if longRunningVerbs.Has(requestInfo.Verb) {
			return true
		}
		if requestInfo.IsResourceRequest && longRunningSubresources.Has(requestInfo.Subresource) {
			return true
		}
		if !requestInfo.IsResourceRequest && strings.HasPrefix(requestInfo.Path, "/debug/pprof/") {
			return true
		}
		return false
	}
}

Check whether the request is read-only or mutating, which decides which channel to throttle on

		var c chan bool
		isMutatingRequest := !nonMutatingRequestVerbs.Has(requestInfo.Verb)
		if isMutatingRequest {
			c = mutatingChan
		} else {
			c = nonMutatingChan
		}

If the channel still has free slots, the queue-depth bookkeeping is updated:
a select tries to write true into c; if the write succeeds the budget is not exhausted,
and the corresponding watermark metrics are recorded

			select {
			case c <- true:
				// We note the concurrency level both while the
				// request is being served and after it is done being
				// served, because both states contribute to the
				// sampled stats on concurrency.
				if isMutatingRequest {
					watermark.recordMutating(len(c))
				} else {
					watermark.recordReadOnly(len(c))
				}
				defer func() {
					<-c
					if isMutatingRequest {
						watermark.recordMutating(len(c))
					} else {
						watermark.recordReadOnly(len(c))
					}
				}()
				handler.ServeHTTP(w, r)

			default:
				// at this point we're about to return a 429, BUT not all actors should be rate limited.  A system:master is so powerful
				// that they should always get an answer.  It's a super-admin or a loopback connection.
				if currUser, ok := apirequest.UserFrom(ctx); ok {
					for _, group := range currUser.GetGroups() {
						if group == user.SystemPrivilegedGroup {
							handler.ServeHTTP(w, r)
							return
						}
					}
				}
				// We need to split this data between buckets used for throttling.
				metrics.RecordDroppedRequest(r, requestInfo, metrics.APIServerComponent, isMutatingRequest)
				metrics.RecordRequestTermination(r, requestInfo, metrics.APIServerComponent, http.StatusTooManyRequests)
				tooManyRequests(r, w)
			}

The default branch means the budget is exhausted, but if the requesting user's groups include system:masters the request is served anyway,

because the apiserver considers requests from this group too important to be rate limited

				if currUser, ok := apirequest.UserFrom(ctx); ok {
					for _, group := range currUser.GetGroups() {
						if group == user.SystemPrivilegedGroup {
							handler.ServeHTTP(w, r)
							return
						}
					}
				}

group=system:masters corresponds to the cluster-admin ClusterRole

When the budget is exhausted and the request is not from system:masters, HTTP 429 is returned

- HTTP 429 means "too many requests, please retry later"

and the Retry-After response header is set to 1 (a standalone sketch of this channel-based limiter follows the code below)

				// We need to split this data between buckets used for throttling.
				metrics.RecordDroppedRequest(r, requestInfo, metrics.APIServerComponent, isMutatingRequest)
				metrics.RecordRequestTermination(r, requestInfo, metrics.APIServerComponent, http.StatusTooManyRequests)
				tooManyRequests(r, w)
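The snippet below is a standalone sketch of the same technique outside the apiserver code: a buffered channel acts as a semaphore in front of an http.Handler, and requests beyond the budget get 429 with Retry-After, mirroring the behavior described above. The handler and the port are invented for the example.

package main

import "net/http"

// withMaxInFlight wraps next with a concurrency budget of limit requests.
func withMaxInFlight(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit) // channel capacity == allowed concurrency
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // a slot is free: serve the request
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default: // budget exhausted: reject instead of queueing
			w.Header().Set("Retry-After", "1")
			http.Error(w, "too many requests", http.StatusTooManyRequests)
		}
	})
}

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", withMaxInFlight(100, hello))
}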

Client-side rate limiting

For example, client-go defaults to a QPS of 5. This is purely client-side throttling: each caller can only limit itself,

so cluster administrators cannot control user behavior with it.
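For reference, client-side throttling is configured on the rest.Config before the clientset is built. The sketch below raises the defaults (QPS 5, Burst 10); the kubeconfig path is just an assumption for the example.

package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	// Raise the client-side rate limit for this client only.
	cfg.QPS = 50
	cfg.Burst = 100

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_ = clientset // use the clientset as usual; all of its requests share this limiter
}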

EventRateLimit

EventRateLimit has been available since 1.13 and only limits event requests.
It is integrated into the apiserver as an internal webhook (admission plugin)
and can limit event operations per user, per namespace, per server, etc.
Read it together with the docs

https://kubernetes.io/zh/docs/reference/access-authn-authz/admission-controllers/#eventratelimit
How it works

See the design proposal for details: each eventratelimit configuration uses its own token-bucket rate limiter.

For every event operation, each matching limiter is asked for a token; if one is available the request is allowed, otherwise 429 is returned.
Pros
Simple to implement, and allows a certain amount of concurrency
Supports limits at the server/namespace/user level
Cons

Only covers events, and as a webhook it can only intercept mutating requests
- All namespaces share the same limit; there is no notion of priority

APF: finer-grained limiting

- API Priority and Fairness (APF) is the replacement for MaxInFlightLimit; see the design proposal

- API Priority and Fairness (alpha since 1.18, beta since 1.20) classifies and isolates requests at a finer granularity (by user, by namespace). It supports bursts and uses fair-queuing techniques to dispatch requests from queues so that no flow is starved.

APF is configured through two resources: PriorityLevelConfigurations define the isolation classes and the concurrency budget they may use, and can also tune queueing behavior; FlowSchemas classify every incoming request and match it to a PriorityLevelConfiguration.

It can restrict specific requests on specific resources for a user, a group, or globally, for example limiting PUT/PATCH requests to services in the default namespace.
Pros
Fairly comprehensive: supports priorities, allow-lists, and so on
- Supports fine-grained limits at the server/namespace/user/resource level

Cons
Complex, non-obvious configuration that requires a solid understanding of how APF works
- Relatively new, with little production validation
Docs
https://kubernetes.io/zh/docs/concepts/cluster-administration/flow-control/

5.5 Summary of the apiserver's key objects and functionality

apiserver summary

The apiserver starts three servers.
All three are built on the common GenericAPIServer and expose RESTful APIs via go-restful. kube-apiserver puts every request through three layers of checks: Authentication, Authorization and Admission.

Once the checks pass, the request is routed to the handler of the corresponding resource, which mainly pre-processes and persists the data. kube-apiserver's underlying storage is etcd v3, abstracted as a RESTStorage so that requests map one-to-one onto storage operations.

The three servers inside the apiserver

- apiExtensionsServer: the API extension server, mainly for CRDs

- kubeAPIServer: the core API server, covering common resources such as Pod/Deployment/Service

- aggregatorServer: the API aggregation server, mainly for metrics

The purpose of Authentication

Authorization: permission checks

Admission: admission control

A custom admission controller that injects an nginx-sidecar container

How kube-apiserver persists data when creating a pod

What is a Scheme

The k8s system has a large number of resources, and every kind of resource is a resource type.
These types need a unified mechanism for registration, storage, lookup and management.
All resource types in k8s are registered in the Scheme registry, an in-memory resource registry with the following properties (a small usage sketch follows this list):
- it supports registering multiple resource types, both internal and external versions;
- it supports conversion between versions;
- it supports serialization/deserialization for the different resources.
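A minimal sketch of what registering types in a Scheme looks like with apimachinery; this is generic apimachinery usage, not the apiserver's own registration code.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// An in-memory registry mapping Go types to group/version/kind and back.
	scheme := runtime.NewScheme()
	gv := schema.GroupVersion{Group: "", Version: "v1"}
	scheme.AddKnownTypes(gv, &corev1.Pod{}, &corev1.PodList{})
	metav1.AddToGroupVersion(scheme, gv)

	// Ask the scheme which kinds are registered for the Pod type.
	kinds, _, err := scheme.ObjectKinds(&corev1.Pod{})
	if err != nil {
		panic(err)
	}
	fmt.Println(kinds) // [/v1, Kind=Pod]
}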

Chapter 6: How kube-scheduler schedules pods

6.1 The kube-scheduler startup flow

Key takeaways of this section:

- Understand the kube-scheduler startup flow

- Understand how to use a clientset

kube-scheduler entry point

D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\scheduler.go

	command := app.NewSchedulerCommand()
	code := cli.Run(command)
	os.Exit(code)

runCommand entry point

D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\server.go

// runCommand runs the scheduler.
func runCommand(cmd *cobra.Command, opts *options.Options, registryOptions ...Option) error {
	verflag.PrintAndExitIfRequested()

	// Activate logging as soon as possible, after that
	// show flags with the final logging configuration.
	if err := opts.Logs.ValidateAndApply(utilfeature.DefaultFeatureGate); err != nil {
		fmt.Fprintf(os.Stderr, "%v\n", err)
		os.Exit(1)
	}
	cliflag.PrintFlags(cmd.Flags())

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go func() {
		stopCh := server.SetupSignalHandler()
		<-stopCh
		cancel()
	}()

	cc, sched, err := Setup(ctx, opts, registryOptions...)
	if err != nil {
		return err
	}

	return Run(ctx, cc, sched)
}

Setup returns a completed config and a scheduler object

opts.Config: building the initial configuration

ApplyTo applies the options to the config

	c := &schedulerappconfig.Config{}
	if err := o.ApplyTo(c); err != nil {
		return nil, err
	}

Initialize the kube config from the file passed via --kubeconfig

	// Prepare kube config.
	kubeConfig, err := createKubeConfig(c.ComponentConfig.ClientConnection, o.Master)
	if err != nil {
		return nil, err
	}

Create the kube clients from the kubeconfig; this returns clientset objects

	// Prepare kube clients.
	client, eventClient, err := createClients(kubeConfig)
	if err != nil {
		return nil, err
	}

createClients walkthrough

- A Clientset contains a set of clients built on rest.Interface, as shown below

	cs.admissionregistrationV1, err = admissionregistrationv1.NewForConfigAndClient(&configShallowCopy, httpClient)
	if err != nil {
		return nil, err
	}
	cs.admissionregistrationV1beta1, err = admissionregistrationv1beta1.NewForConfigAndClient(&configShallowCopy, httpClient)
	if err != nil {
		return nil, err
	}
	cs.internalV1alpha1, err = internalv1alpha1.NewForConfigAndClient(&configShallowCopy, httpClient)
	if err != nil {
		return nil, err
	}
	cs.appsV1, err = appsv1.NewForConfigAndClient(&configShallowCopy, httpClient)
	if err != nil {
		return nil, err
	}

The Clientset that is ultimately returned

// Clientset contains the clients for groups. Each group has exactly one
// version included in a Clientset.
type Clientset struct {
	*discovery.DiscoveryClient
	admissionregistrationV1      *admissionregistrationv1.AdmissionregistrationV1Client
	admissionregistrationV1beta1 *admissionregistrationv1beta1.AdmissionregistrationV1beta1Client
	internalV1alpha1             *internalv1alpha1.InternalV1alpha1Client
	appsV1                       *appsv1.AppsV1Client
	appsV1beta1                  *appsv1beta1.AppsV1beta1Client
	appsV1beta2                  *appsv1beta2.AppsV1beta2Client
	authenticationV1             *authenticationv1.AuthenticationV1Client
	authenticationV1beta1        *authenticationv1beta1.AuthenticationV1beta1Client
	authorizationV1              *authorizationv1.AuthorizationV1Client
	authorizationV1beta1         *authorizationv1beta1.AuthorizationV1beta1Client
	autoscalingV1                *autoscalingv1.AutoscalingV1Client
	autoscalingV2                *autoscalingv2.AutoscalingV2Client
	autoscalingV2beta1           *autoscalingv2beta1.AutoscalingV2beta1Client
	autoscalingV2beta2           *autoscalingv2beta2.AutoscalingV2beta2Client
	batchV1                      *batchv1.BatchV1Client
	batchV1beta1                 *batchv1beta1.BatchV1beta1Client
	certificatesV1               *certificatesv1.CertificatesV1Client
	certificatesV1beta1          *certificatesv1beta1.CertificatesV1beta1Client
	coordinationV1beta1          *coordinationv1beta1.CoordinationV1beta1Client
	coordinationV1               *coordinationv1.CoordinationV1Client
	coreV1                       *corev1.CoreV1Client
	discoveryV1                  *discoveryv1.DiscoveryV1Client
	discoveryV1beta1             *discoveryv1beta1.DiscoveryV1beta1Client
	eventsV1                     *eventsv1.EventsV1Client
	eventsV1beta1                *eventsv1beta1.EventsV1beta1Client
	extensionsV1beta1            *extensionsv1beta1.ExtensionsV1beta1Client
	flowcontrolV1alpha1          *flowcontrolv1alpha1.FlowcontrolV1alpha1Client
	flowcontrolV1beta1           *flowcontrolv1beta1.FlowcontrolV1beta1Client
	flowcontrolV1beta2           *flowcontrolv1beta2.FlowcontrolV1beta2Client
	networkingV1                 *networkingv1.NetworkingV1Client
	networkingV1beta1            *networkingv1beta1.NetworkingV1beta1Client
	nodeV1                       *nodev1.NodeV1Client
	nodeV1alpha1                 *nodev1alpha1.NodeV1alpha1Client
	nodeV1beta1                  *nodev1beta1.NodeV1beta1Client
	policyV1                     *policyv1.PolicyV1Client
	policyV1beta1                *policyv1beta1.PolicyV1beta1Client
	rbacV1                       *rbacv1.RbacV1Client
	rbacV1beta1                  *rbacv1beta1.RbacV1beta1Client
	rbacV1alpha1                 *rbacv1alpha1.RbacV1alpha1Client
	schedulingV1alpha1           *schedulingv1alpha1.SchedulingV1alpha1Client
	schedulingV1beta1            *schedulingv1beta1.SchedulingV1beta1Client
	schedulingV1                 *schedulingv1.SchedulingV1Client
	storageV1beta1               *storagev1beta1.StorageV1beta1Client
	storageV1                    *storagev1.StorageV1Client
	storageV1alpha1              *storagev1alpha1.StorageV1alpha1Client
}

Using the clientset
Later code can use it wherever it needs to list objects, for example fetching nodes as in the earlier ink8s-pod-metrics example:
nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
and fetching pods:
pods, err := clientset.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{})

Both the node and pod calls above go through the clientset's CoreV1(); a complete runnable version follows.
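Here is a self-contained version of those two calls; the kubeconfig path is an assumption for the example.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// List nodes and pods through the CoreV1 client group.
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	pods, err := clientset.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d nodes, %d pods in kube-system\n", len(nodes.Items), len(pods.Items))
}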

Configuring the leader-election lock

By default the scheduler starts with --leader-elect=true, meaning it first wins the election before running the main loop, so that multiple replicas can be deployed for high availability

	// Set up leader election if enabled.
	var leaderElectionConfig *leaderelection.LeaderElectionConfig
	if c.ComponentConfig.LeaderElection.LeaderElect {
		// Use the scheduler name in the first profile to record leader election.
		schedulerName := corev1.DefaultSchedulerName
		if len(c.ComponentConfig.Profiles) != 0 {
			schedulerName = c.ComponentConfig.Profiles[0].SchedulerName
		}
		coreRecorder := c.EventBroadcaster.DeprecatedNewLegacyRecorder(schedulerName)
		leaderElectionConfig, err = makeLeaderElectionConfig(c.ComponentConfig.LeaderElection, kubeConfig, coreRecorder)
		if err != nil {
			return nil, err
		}
	}

Create the informer factory

	c.InformerFactory = scheduler.NewInformerFactory(client, 0)

Then Setup's call to scheduler.New creates the scheduler object

	// Create the scheduler.
	sched, err := scheduler.New(cc.Client,
		cc.InformerFactory,
		cc.DynInformerFactory,
		recorderFactory,
		ctx.Done(),
		scheduler.WithComponentConfigVersion(cc.ComponentConfig.TypeMeta.APIVersion),
		scheduler.WithKubeConfig(cc.KubeConfig),
		scheduler.WithProfiles(cc.ComponentConfig.Profiles...),
		scheduler.WithPercentageOfNodesToScore(cc.ComponentConfig.PercentageOfNodesToScore),
		scheduler.WithFrameworkOutOfTreeRegistry(outOfTreeRegistry),
		scheduler.WithPodMaxBackoffSeconds(cc.ComponentConfig.PodMaxBackoffSeconds),
		scheduler.WithPodInitialBackoffSeconds(cc.ComponentConfig.PodInitialBackoffSeconds),
		scheduler.WithPodMaxInUnschedulablePodsDuration(cc.PodMaxInUnschedulablePodsDuration),
		scheduler.WithExtenders(cc.ComponentConfig.Extenders...),
		scheduler.WithParallelism(cc.ComponentConfig.Parallelism),
		scheduler.WithBuildFrameworkCapturer(func(profile kubeschedulerconfig.KubeSchedulerProfile) {
			// Profiles are processed during Framework instantiation to set default plugins and configurations. Capturing them for logging
			completedProfiles = append(completedProfiles, profile)
		}),
	)

Run: running the scheduling loop

Register the configuration with configz

It is stored in a global map
and can be fetched over HTTPS at the /configz path; the code is as follows

	// Configz registration.
	if cz, err := configz.New("componentconfig"); err == nil {
		cz.Set(cc.ComponentConfig)
	} else {
		return fmt.Errorf("unable to register configz: %s", err)
	}

Fetching /configz with curl
Modify the ClusterRole bound to the prometheus service account we used earlier,

adding "/configz" to its nonResourceURLs

vim rbac.yaml

kubectl apply -f rbac.yaml

kubectl get sa prometheus -n kube-system

After applying, fetch the token with curl and access the endpoint

Start the event broadcaster

- Event is a core k8s resource; a later section covers it in detail

	// Prepare the event broadcaster.
	cc.EventBroadcaster.StartRecordingToSink(ctx.Done())

Initialize the healthz checks

	// Setup healthz checks.
	var checks []healthz.HealthChecker
	if cc.ComponentConfig.LeaderElection.LeaderElect {
		checks = append(checks, cc.LeaderElection.WatchDog)
	}

The channel that signals the election result

waitingForLeader is the channel that signals the election result. It is closed in two places:
- when this instance wins the election below;
- when leader election is not enabled at all.
isLeader is a func that reports whether the current instance is the leader: if waitingForLeader has been closed, the current instance is (or will become) the leader

	waitingForLeader := make(chan struct{})
	isLeader := func() bool {
		select {
		case _, ok := <-waitingForLeader:
			// if channel is closed, we are leading
			return !ok
		default:
			// channel is open, we are waiting for a leader
			return false
		}
	}

How isLeader is used

If this instance is not the leader, the /metrics/resources handler returns nothing; in other words, non-leader instances do not export these metrics

func installMetricHandler(pathRecorderMux *mux.PathRecorderMux, informers informers.SharedInformerFactory, isLeader func() bool) {
	configz.InstallHandler(pathRecorderMux)
	pathRecorderMux.Handle("/metrics", legacyregistry.HandlerWithReset())

	resourceMetricsHandler := resources.Handler(informers.Core().V1().Pods().Lister())
	pathRecorderMux.HandleFunc("/metrics/resources", func(w http.ResponseWriter, req *http.Request) {
		if !isLeader() {
			return
		}
		resourceMetricsHandler.ServeHTTP(w, req)
	})
}

buildHandlerChain builds the HTTP handler chain

It wraps the handler with authorization, authentication, request-info, cache-control, logging and panic-recovery handlers, one after another

// buildHandlerChain wraps the given handler with the standard filters.
func buildHandlerChain(handler http.Handler, authn authenticator.Request, authz authorizer.Authorizer) http.Handler {
	requestInfoResolver := &apirequest.RequestInfoFactory{}
	failedHandler := genericapifilters.Unauthorized(scheme.Codecs)

	handler = genericapifilters.WithAuthorization(handler, authz, scheme.Codecs)
	handler = genericapifilters.WithAuthentication(handler, authn, failedHandler, nil)
	handler = genericapifilters.WithRequestInfo(handler, requestInfoResolver)
	handler = genericapifilters.WithCacheControl(handler)
	handler = genericfilters.WithHTTPLogging(handler)
	handler = genericfilters.WithPanicRecovery(handler, requestInfoResolver)

	return handler
}

cc.InformerFactory.Start starts all informers

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\informers\factory.go

// Start initializes all requested informers.
func (f *sharedInformerFactory) Start(stopCh <-chan struct{}) {
	f.lock.Lock()
	defer f.lock.Unlock()

	for informerType, informer := range f.informers {
		if !f.startedInformers[informerType] {
			go informer.Run(stopCh)
			f.startedInformers[informerType] = true
		}
	}
}

WaitForCacheSync means that before scheduling begins, the informers first sync the resources into the local cache

	// Wait for all caches to sync before scheduling.
	cc.InformerFactory.WaitForCacheSync(ctx.Done())

Location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\informers\factory.go

// WaitForCacheSync waits for all started informers' cache were synced.
func (f *sharedInformerFactory) WaitForCacheSync(stopCh <-chan struct{}) map[reflect.Type]bool {
	informers := func() map[reflect.Type]cache.SharedIndexInformer {
		f.lock.Lock()
		defer f.lock.Unlock()

		informers := map[reflect.Type]cache.SharedIndexInformer{}
		for informerType, informer := range f.informers {
			if f.startedInformers[informerType] {
				informers[informerType] = informer
			}
		}
		return informers
	}()

	res := map[reflect.Type]bool{}
	for informType, informer := range informers {
		res[informType] = cache.WaitForCacheSync(stopCh, informer.HasSynced)
	}
	return res
}

Start the LeaderElection flow

- If this instance is elected leader, it runs sched.Run

	// If leader election is enabled, runCommand via LeaderElector until done and exit.
	if cc.LeaderElection != nil {
		cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				close(waitingForLeader)
				sched.Run(ctx)
			},
			OnStoppedLeading: func() {
				select {
				case <-ctx.Done():
					// We were asked to terminate. Exit 0.
					klog.InfoS("Requested to terminate, exiting")
					os.Exit(0)
				default:
					// We lost the lock.
					klog.ErrorS(nil, "Leaderelection lost")
					klog.FlushAndExit(klog.ExitFlushTimeout, 1)
				}
			},
		}
		leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
		if err != nil {
			return fmt.Errorf("couldn't create leader elector: %v", err)
		}

		leaderElector.Run(ctx)

		return fmt.Errorf("lost lease")
	}

6.2 The leaderelection mechanism in kube-scheduler

The k8s leader-election lock
- leaderelection implements a distributed lock on top of the atomicity of k8s API operations; instances elect a leader by continuously competing for the lock
- Only the process elected leader runs the actual business logic, a pattern that is very common in k8s.

Why elect a leader
- In Kubernetes, kube-scheduler and kube-controller-manager are usually deployed with multiple replicas for high availability,
- but only one instance actually does the work at any time.
- leaderelection guarantees that the working instance is the leader,
- and that when the leader dies, a new leader is elected from the other instances so the component keeps running.

Source code walkthrough

Leader election is enabled by the --leader-elect=true flag

Location: D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\options\options.go

	if c.ComponentConfig.LeaderElection.LeaderElect {
		// Use the scheduler name in the first profile to record leader election.
		schedulerName := corev1.DefaultSchedulerName
		if len(c.ComponentConfig.Profiles) != 0 {
			schedulerName = c.ComponentConfig.Profiles[0].SchedulerName
		}
		coreRecorder := c.EventBroadcaster.DeprecatedNewLegacyRecorder(schedulerName)
		leaderElectionConfig, err = makeLeaderElectionConfig(c.ComponentConfig.LeaderElection, kubeConfig, coreRecorder)
		if err != nil {
			return nil, err
		}
	}

Initializing the lock configuration

makeLeaderElectionConfig builds the leader-election configuration

Location: D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\options\options.go

The identity is the hostname plus a uuid

The lock is a resource lock (resourcelock)

The default resourcelock settings can be retrieved via /configz

The code is as follows

// makeLeaderElectionConfig builds a leader election configuration. It will
// create a new resource lock associated with the configuration.
func makeLeaderElectionConfig(config componentbaseconfig.LeaderElectionConfiguration, kubeConfig *restclient.Config, recorder record.EventRecorder) (*leaderelection.LeaderElectionConfig, error) {
	hostname, err := os.Hostname()
	if err != nil {
		return nil, fmt.Errorf("unable to get hostname: %v", err)
	}
	// add a uniquifier so that two processes on the same host don't accidentally both become active
	id := hostname + "_" + string(uuid.NewUUID())

	rl, err := resourcelock.NewFromKubeconfig(config.ResourceLock,
		config.ResourceNamespace,
		config.ResourceName,
		resourcelock.ResourceLockConfig{
			Identity:      id,
			EventRecorder: recorder,
		},
		kubeConfig,
		config.RenewDeadline.Duration)
	if err != nil {
		return nil, fmt.Errorf("couldn't create resource lock: %v", err)
	}

	return &leaderelection.LeaderElectionConfig{
		Lock:            rl,
		LeaseDuration:   config.LeaseDuration.Duration,
		RenewDeadline:   config.RenewDeadline.Duration,
		RetryPeriod:     config.RetryPeriod.Duration,
		WatchDog:        leaderelection.NewLeaderHealthzAdaptor(time.Second * 20),
		Name:            "kube-scheduler",
		ReleaseOnCancel: true,
	}, nil
}

Initializing the resourcelock

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\leaderelection\resourcelock\interface.go

// Manufacture will create a lock of a given type according to the input parameters
func New(lockType string, ns string, name string, coreClient corev1.CoreV1Interface, coordinationClient coordinationv1.CoordinationV1Interface, rlc ResourceLockConfig) (Interface, error) {
	endpointsLock := &endpointsLock{
		EndpointsMeta: metav1.ObjectMeta{
			Namespace: ns,
			Name:      name,
		},
		Client:     coreClient,
		LockConfig: rlc,
	}
	configmapLock := &configMapLock{
		ConfigMapMeta: metav1.ObjectMeta{
			Namespace: ns,
			Name:      name,
		},
		Client:     coreClient,
		LockConfig: rlc,
	}
	leaseLock := &LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: ns,
			Name:      name,
		},
		Client:     coordinationClient,
		LockConfig: rlc,
	}
	switch lockType {
	case endpointsResourceLock:
		return nil, fmt.Errorf("endpoints lock is removed, migrate to %s", EndpointsLeasesResourceLock)
	case configMapsResourceLock:
		return nil, fmt.Errorf("configmaps lock is removed, migrate to %s", ConfigMapsLeasesResourceLock)
	case LeasesResourceLock:
		return leaseLock, nil
	case EndpointsLeasesResourceLock:
		return &MultiLock{
			Primary:   endpointsLock,
			Secondary: leaseLock,
		}, nil
	case ConfigMapsLeasesResourceLock:
		return &MultiLock{
			Primary:   configmapLock,
			Secondary: leaseLock,
		}, nil
	default:
		return nil, fmt.Errorf("Invalid lock-type %s", lockType)
	}
}

Running the lock acquisition in the scheduler

- In the scheduler's Run function, located at D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\server.go

	// If leader election is enabled, runCommand via LeaderElector until done and exit.
	if cc.LeaderElection != nil {
		cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				close(waitingForLeader)
				sched.Run(ctx)
			},
			OnStoppedLeading: func() {
				select {
				case <-ctx.Done():
					// We were asked to terminate. Exit 0.
					klog.InfoS("Requested to terminate, exiting")
					os.Exit(0)
				default:
					// We lost the lock.
					klog.ErrorS(nil, "Leaderelection lost")
					klog.FlushAndExit(klog.ExitFlushTimeout, 1)
				}
			},
		}
		leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
		if err != nil {
			return fmt.Errorf("couldn't create leader elector: %v", err)
		}

		leaderElector.Run(ctx)

		return fmt.Errorf("lost lease")
	}

Underneath, leaderElector.Run is called to start acquiring the lock

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\leaderelection\leaderelection.go

// Run starts the leader election loop. Run will not return
// before leader election loop is stopped by ctx or it has
// stopped holding the leader lease
func (le *LeaderElector) Run(ctx context.Context) {
	defer runtime.HandleCrash()
	defer func() {
		le.config.Callbacks.OnStoppedLeading()
	}()

	if !le.acquire(ctx) {
		return // ctx signalled done
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	go le.config.Callbacks.OnStartedLeading(ctx)
	le.renew(ctx)
}

acquire performs the actual lock acquisition

acquire polls tryAcquireOrRenew and returns true as soon as the lock is acquired,

or false if ctx signals shutdown

// acquire loops calling tryAcquireOrRenew and returns true immediately when tryAcquireOrRenew succeeds.
// Returns false if ctx signals done.
func (le *LeaderElector) acquire(ctx context.Context) bool {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	succeeded := false
	desc := le.config.Lock.Describe()
	klog.Infof("attempting to acquire leader lease %v...", desc)
	wait.JitterUntil(func() {
		succeeded = le.tryAcquireOrRenew(ctx)
		le.maybeReportTransition()
		if !succeeded {
			klog.V(4).Infof("failed to acquire lease %v", desc)
			return
		}
		le.config.Lock.RecordEvent("became leader")
		le.metrics.leaderOn(le.config.Name)
		klog.Infof("successfully acquired lease %v", desc)
		cancel()
	}, le.config.RetryPeriod, JitterFactor, true, ctx.Done())
	return succeeded
}

tryAcquireOrRenew walkthrough

- First fetch the existing lock record (through the apiserver, backed by etcd)

If the error is IsNotFound, create the lock resource and take the lock

	// 1. obtain or create the ElectionRecord
	oldLeaderElectionRecord, oldLeaderElectionRawRecord, err := le.config.Lock.Get(ctx)
	if err != nil {
		if !errors.IsNotFound(err) {
			klog.Errorf("error retrieving resource lock %v: %v", le.config.Lock.Describe(), err)
			return false
		}
		if err = le.config.Lock.Create(ctx, leaderElectionRecord); err != nil {
			klog.Errorf("error initially creating leader election record: %v", err)
			return false
		}

		le.setObservedRecord(&leaderElectionRecord)

		return true
	}

Compare the locally cached record with the remote lock object and refresh the cache if they differ

	// 2. Record obtained, check the Identity & Time
	if !bytes.Equal(le.observedRawRecord, oldLeaderElectionRawRecord) {
		le.setObservedRecord(oldLeaderElectionRecord)

		le.observedRawRecord = oldLeaderElectionRawRecord
	}

Check whether the lock is unexpired and held by another instance; if so, give up for this round

	if len(oldLeaderElectionRecord.HolderIdentity) > 0 &&
		le.observedTime.Add(le.config.LeaseDuration).After(now.Time) &&
		!le.IsLeader() {
		klog.V(4).Infof("lock is held by %v and has not yet expired", oldLeaderElectionRecord.HolderIdentity)
		return false
	}

At this point we are about to become (or remain) the leader, with two cases

le.IsLeader() means we were already the leader last time, so the record carries over unchanged

the else branch means we are becoming leader for the first time, so LeaderTransitions is incremented by 1

	// 3. We're going to try to update. The leaderElectionRecord is set to it's default
	// here. Let's correct it before updating.
	if le.IsLeader() {
		leaderElectionRecord.AcquireTime = oldLeaderElectionRecord.AcquireTime
		leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions
	} else {
		leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions + 1
	}

Update the lock resource; if it changed between the Get and the Update, this update fails

	// update the lock itself
	if err = le.config.Lock.Update(ctx, leaderElectionRecord); err != nil {
		klog.Errorf("Failed to update lock: %v", err)
		return false
	}

	le.setObservedRecord(&leaderElectionRecord)
	return true

What happens if the Update above runs concurrently with other writers?
le.config.Lock.Get() returns the lock object, which carries a resourceVersion field identifying the internal version of the resource object; every update bumps its value.
If an update carries a resourceVersion, the apiserver verifies that the current resourceVersion matches the specified one, ensuring no other update happened during this update cycle, which is what makes the update atomic. (A sketch of the usual client-side conflict-retry pattern follows the ResourceVersion definition below.)

resourceVersion lives in ObjectMeta; location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apimachinery\pkg\apis\meta\v1\types.go

	// An opaque value that represents the internal version of this object that can
	// be used by clients to determine when objects have changed. May be used for optimistic
	// concurrency, change detection, and the watch operation on a resource or set of resources.
	// Clients must treat these values as opaque and passed unmodified back to the server.
	// They may only be valid for a particular resource or set of resources.
	//
	// Populated by the system.
	// Read-only.
	// Value must be treated as opaque by clients and .
	// More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
	// +optional
	ResourceVersion string `json:"resourceVersion,omitempty" protobuf:"bytes,6,opt,name=resourceVersion"`
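As a hedged illustration of how clients usually cope with resourceVersion conflicts, the sketch below re-reads the object and retries the update whenever the apiserver rejects the write with a conflict. The ConfigMap and label are invented for the example, but retry.RetryOnConflict is the standard client-go helper.

package retryexample

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// AddLabel updates a ConfigMap with optimistic concurrency: the Get fetches the
// latest resourceVersion, the Update sends it back, and a 409 Conflict re-runs the closure.
func AddLabel(cs kubernetes.Interface, ns, name, key, value string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		cm, err := cs.CoreV1().ConfigMaps(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if cm.Labels == nil {
			cm.Labels = map[string]string{}
		}
		cm.Labels[key] = value
		_, err = cs.CoreV1().ConfigMaps(ns).Update(context.TODO(), cm, metav1.UpdateOptions{})
		return err // a conflict here triggers another round
	})
}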

Inspecting the kube-scheduler Lease object

kubectl get lease -n kube-system
[root@k8s-master01 k8s-leaderelection]# kubectl get lease -n kube-system
NAME                      HOLDER                                              AGE
kube-controller-manager   k8s-master01_def02578-36f9-4a43-b700-66dd407ff612   161d
kube-scheduler            k8s-master01_29da2906-54c1-4db1-9146-4bf8919b4cda   161d

This is a single-master environment.
Both kube-scheduler and kube-controller-manager use a Lease for leader election,

in the kube-system namespace.
- HOLDER shows which instance currently holds the lock, in the form hostname + uuid

Trying out leaderelection in code

Create a new project, k8s-leaderelection

PS D:\Workspace\Go\src\k8s-leaderelection> go mod init k8s-leaderelection
go: creating new go.mod: module k8s-leaderelection

Create leaderelection.go

package main

import (
	"context"
	"flag"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/google/uuid"
	metav1 "k8s.io/apimechinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog"
)

// buildConfig builds a rest.Config: use the kubeconfig file if one is given,
// otherwise fall back to InClusterConfig (the in-cluster service account).
func buildConfig(kubeconfig string) (*rest.Config, error) {
	if kubeconfig != "" {
		cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
		if err != nil {
			return nil, err
		}
		return cfg, nil
	}
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	klog.InitFlags(nil)

	var kubeconfig string
	var leaseLockName string
	var leaseLockNamespace string
	var id string

	flag.StringVar(&kubeconfig, "kubeconfig", "", "absolute path to the kubeconfig file")
	// unique holder identity
	flag.StringVar(&id, "id", uuid.New().String(), "the holder identity name")
	// name of the lease-lock resource
	flag.StringVar(&leaseLockName, "lease-lock-name", "", "the lease lock resource name")
	// namespace of the lease-lock resource
	flag.StringVar(&leaseLockNamespace, "lease-lock-namespace", "", "the lease lock resource namespace")
	flag.Parse()

	if leaseLockName == "" {
		klog.Fatal("unable to get lease lock resource name (missing lease-lock-name flag).")
	}
	if leaseLockNamespace == "" {
		klog.Fatal("unable to get lease lock resource namespace (missing lease-lock-namespace flag).")
	}

	// leader election uses the Kubernetes API by writing to a
	// lock object, which can be a LeaseLock object (preferred),
	// a ConfigMap, or an Endpoints (deprecated) object.
	// Conflicting writes are detected and each client handles those actions
	// independently.
	config, err := buildConfig(kubeconfig)
	if err != nil {
		klog.Fatal(err)
	}

	// create the clientset
	client := clientset.NewForConfigOrDie(config)

	run := func(ctx context.Context) {
		// complete your controller loop here
		klog.Info("Controller loop...")

		select {}
	}

	// use a Go context so we can tell the leaderelection code when we
	// want to step down
	// context used to stop the leader-election loop
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// listen for termination signals
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
	go func() {
		<-ch
		klog.Info("Received termination, signaling shutdown")
		cancel()
	}()

	// we use the Lease lock type since edits to Leases are less common
	// and fewer objects in the cluster watch "all Leases"
	// specify the lock object; a Lease resource is used here, but configmap, endpoints,
	// or a multilock (a combination of several) are also supported
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      leaseLockName,
			Namespace: leaseLockNamespace,
		},
		Client: client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: id,
		},
	}
	// start the leader election code loop
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock: lock,
		// IMPORTANT: you MUST ensure that any code you have that
		// is protected by the lease must terminate **before**
		// you call cancel. Otherwise, you could have a background
		// loop still running and another process could
		// get elected before your background loop finished, violating
		// the stated goal of the lease.
		ReleaseOnCancel: true,
		LeaseDuration:   60 * time.Second, // lease duration
		RenewDeadline:   15 * time.Second, // deadline for the leader to renew the lease
		RetryPeriod:     5 * time.Second,  // retry period for non-leader candidates
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// business logic to run once we become leader
				// we're notified when we start - this is where you would
				// usually put your code
				run(ctx)
			},
			OnStoppedLeading: func() {
				// the process exits here
				// we can do cleanup here
				klog.Infof("leader lost: %s", id)
				os.Exit(0)
			},
			OnNewLeader: func(identity string) {
				// called whenever a new leader is elected
				// we're notified when a new leader is elected
				if identity == id {
					// I just got the lock
					return
				}
				klog.Infof("new leader elected: %s", identity)
			},
		},
	})
}

Walking through it

The kubeconfig, the lease name, the id and the other parameters are passed in on the command line

buildConfig produces the restConfig
clientset.NewForConfigOrDie creates the clientset

A resourcelock.LeaseLock is instantiated, using the Lease resource type
- leaderelection.RunOrDie starts the election loop
- the timing parameters:
- LeaseDuration: 60*time.Second // lease duration
- RenewDeadline: 15*time.Second // deadline for renewing the lease
- RetryPeriod: 5*time.Second // retry period for non-leader candidates

The callbacks:
- OnStartedLeading is what runs when we become leader, usually the business logic; here it is the mostly empty run
- OnStoppedLeading means the process exits
- OnNewLeader runs whenever a new leader is elected

Build and run

go build
- Run it, starting the member with id=1 first

You can see that this process acquired the lock and was elected leader; get the lease to confirm

Start a second member with id=2; member 1 remains the leader

Now stop member 1; member 2 acquires the lock

The k8s leaderelection lock, recapped

leaderelection implements a distributed lock on top of the atomicity of k8s API operations; instances elect a leader by continuously competing for the lock

Only the process elected leader runs the actual business logic, which is very common in k8s

kube-scheduler uses a Lease-type resource lock for election, and only schedules after winning the election

The point is to allow multiple replicas to be deployed for high availability

6.3 k8s events and the event broadcaster in kube-scheduler

What are k8s events

k8s events are objects that show you what is happening inside the cluster

- for example, which decisions the scheduler made

or why certain pods were evicted from a node

Which components can produce events

All core components and extensions (operators) can create events through the apiserver
- many k8s components produce events

How to get event data

Fetch them directly with get events

or via describe on a resource

- for example, create a pod with a deliberately misspelled image repository name

After creating it, describe the pod to get its events; you can see the image pull failure events

The event management machinery has three parts (see the sketch below)

EventRecorder: the event producer; k8s components call its methods to generate events

EventBroadcaster: the event broadcaster, which consumes events produced by the EventRecorder and fans them out to broadcasterWatchers;
broadcasterWatcher: defines how events are handled, for example uploading them to the apiserver;
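A hedged, minimal sketch of the three roles working together, using the legacy core/v1 event API in client-go (the newer events/v1 API follows the same pattern). It assumes a clientset and a pod object already exist; the component name and reason are invented for the example.

package eventsdemo

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// EmitDemoEvent wires broadcaster -> sink watcher -> apiserver and emits one event.
func EmitDemoEvent(clientset kubernetes.Interface, pod *corev1.Pod) {
	// EventBroadcaster: fans events out to its watchers.
	broadcaster := record.NewBroadcaster()
	defer broadcaster.Shutdown()

	// broadcasterWatcher: uploads received events to the apiserver.
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: clientset.CoreV1().Events(""),
	})

	// EventRecorder: the producer a component calls to generate events.
	recorder := broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "demo-controller"})
	recorder.Eventf(pod, corev1.EventTypeNormal, "DemoReason", "saw pod %s/%s", pod.Namespace, pod.Name)
}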

How events are stored

The volume of events is huge, so they are only kept for a short time.
Events are usually staged through the apiserver into etcd (ideally a dedicated etcd cluster for events, separate from the one that stores cluster data).
To keep disks from filling up, a retention policy is enforced: events are deleted one hour after their last occurrence.

What events are good for

The picture below (from the internet) shows how k8s cluster monitoring can be built on top of events

Events in kube-scheduler

Initializing the EventBroadcaster

In the Config initialization, location: D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\options\options.go

	c.EventBroadcaster = events.NewEventBroadcasterAdapter(eventClient)

NewEventBroadcasterAdapter

// NewEventBroadcasterAdapter creates a wrapper around new and legacy broadcasters to simplify
// migration of individual components to the new Event API.
func NewEventBroadcasterAdapter(client clientset.Interface) EventBroadcasterAdapter {
	eventClient := &eventBroadcasterAdapterImpl{}
	if _, err := client.Discovery().ServerResourcesForGroupVersion(eventsv1.SchemeGroupVersion.String()); err == nil {
		eventClient.eventsv1Client = client.EventsV1()
		eventClient.eventsv1Broadcaster = NewBroadcaster(&EventSinkImpl{Interface: eventClient.eventsv1Client})
	}
	// Even though there can soon exist cases when coreBroadcaster won't really be needed,
	// we create it unconditionally because its overhead is minor and will simplify using usage
	// patterns of this library in all components.
	eventClient.coreClient = client.CoreV1()
	eventClient.coreBroadcaster = record.NewBroadcaster()
	return eventClient
}

Underneath it uses client-go's eventBroadcasterImpl; location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\record\event.go

type eventBroadcasterImpl struct {
	*watch.Broadcaster
	sleepDuration time.Duration
	options       CorrelatorOptions
}

Initializing the eventRecorder

In the scheduler's Setup function, location: D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\server.go

	recorderFactory := getRecorderFactory(&cc)

recorderFactory is the factory function that produces eventRecorders

Ultimately it uses client-go's recorderImpl; location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\events\event_recorder.go

type recorderImpl struct {
	scheme              *runtime.Scheme
	reportingController string
	reportingInstance   string
	*watch.Broadcaster
	clock clock.Clock
}

Starting the event broadcaster

- In the scheduler's Run, location: D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\server.go

	// Prepare the event broadcaster.
	cc.EventBroadcaster.StartRecordingToSink(ctx.Done())

StartRecordingToSink in client-go

// StartRecordingToSink starts sending events received from the specified eventBroadcaster to the given sink.
func (e *eventBroadcasterAdapterImpl) StartRecordingToSink(stopCh <-chan struct{}) {
	if e.eventsv1Broadcaster != nil && e.eventsv1Client != nil {
		e.eventsv1Broadcaster.StartRecordingToSink(stopCh)
	}
	if e.coreBroadcaster != nil && e.coreClient != nil {
		e.coreBroadcaster.StartRecordingToSink(&typedv1core.EventSinkImpl{Interface: e.coreClient.Events("")})
	}
}

startRecordingEvents walkthrough

D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\tools\events\event_broadcaster.go

func (e *eventBroadcasterImpl) startRecordingEvents(stopCh <-chan struct{}) {
	eventHandler := func(obj runtime.Object) {
		event, ok := obj.(*eventsv1.Event)
		if !ok {
			klog.Errorf("unexpected type, expected eventsv1.Event")
			return
		}
		e.recordToSink(event, clock.RealClock{})
	}
	stopWatcher := e.StartEventWatcher(eventHandler)
	go func() {
		<-stopCh
		stopWatcher()
	}()
}

It sets up an eventHandler that calls recordToSink to write to the backing store,

while StartEventWatcher consumes events from the ResultChan queue and hands them to that eventHandler

// StartEventWatcher starts sending events received from this EventBroadcaster to the given event handler function.
// The return value is used to stop recording
func (e *eventBroadcasterImpl) StartEventWatcher(eventHandler func(event runtime.Object)) func() {
	watcher := e.Watch()
	go func() {
		defer utilruntime.HandleCrash()
		for {
			watchEvent, ok := <-watcher.ResultChan()
			if !ok {
				return
			}
			eventHandler(watchEvent.Object)
		}
	}()
	return watcher.Stop
}

The recordToSink send path

getKey builds the event's identity key, which is used as the cache key

func getKey(event *eventsv1.Event) eventKey {
	key := eventKey{
		action:              event.Action,
		reason:              event.Reason,
		reportingController: event.ReportingController,
		regarding:           event.Regarding,
	}
	if event.Related != nil {
		key.related = *event.Related
	}
	return key
}

Event.Series records how many times this event has occurred and when it was last observed

Look the key up in eventCache; if an entry is found and its Series exists, just bump the count and the timestamp.

			isomorphicEvent, isIsomorphic := e.eventCache[eventKey]
			if isIsomorphic {
				if isomorphicEvent.Series != nil {
					isomorphicEvent.Series.Count++
					isomorphicEvent.Series.LastObservedTime = metav1.MicroTime{Time: clock.Now()}
					return nil
				}

Otherwise create a new Series and return the event to be sent

				isomorphicEvent.Series = &eventsv1.EventSeries{
					Count:            1,
					LastObservedTime: metav1.MicroTime{Time: clock.Now()},
				}
				return isomorphicEvent

Then send the resulting evToRecord and update the cache

		if evToRecord != nil {
			recordedEvent := e.attemptRecording(evToRecord)
			if recordedEvent != nil {
				recordedEventKey := getKey(recordedEvent)
				e.mu.Lock()
				defer e.mu.Unlock()
				e.eventCache[recordedEventKey] = recordedEvent
			}
		}

attemptRecording sends with retries; underneath it calls the sink's methods. Location: D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\tools\events\interfaces.go

// EventSink knows how to store events (client-go implements it.)
// EventSink must respect the namespace that will be embedded in 'event'.
// It is assumed that EventSink will return the same sorts of errors as
// client-go's REST client.
type EventSink interface {
	Create(event *eventsv1.Event) (*eventsv1.Event, error)
	Update(event *eventsv1.Event) (*eventsv1.Event, error)
	Patch(oldEvent *eventsv1.Event, data []byte) (*eventsv1.Event, error)
}

What Event.Series is for
Just as a repeated "cannot connect to mysql" error produces many identical log lines,
the same k8s event can occur many times, so de-duplication and noise reduction are worthwhile. A unique key is built from event.Action, event.Reason, event.ReportingController (the producing source) and the regarding object; if the cache already has an entry for that key, only the occurrence count and the last-observed time of that event are updated (a standalone sketch follows).
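A standalone sketch of the same de-duplication idea, outside client-go: identical events (same action/reason/controller/object) share a single cache entry whose series only counts repetitions and remembers the latest timestamp. The type and field names are made up for the example.

package eventcache

import "time"

type eventKey struct {
	action, reason, controller, object string
}

type series struct {
	count    int
	lastSeen time.Time
}

type cache map[eventKey]*series

// observe records one occurrence of the event identified by k.
func (c cache) observe(k eventKey) *series {
	if s, ok := c[k]; ok { // an isomorphic event was seen before: just bump the counter
		s.count++
		s.lastSeen = time.Now()
		return s
	}
	s := &series{count: 1, lastSeen: time.Now()}
	c[k] = s // first occurrence: create a new entry (k8s also sends it to the sink here)
	return s
}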

Events produced via the eventRecorder in kube-scheduler

The eventRecorder object

It lives in the scheduler's frameworkImpl; location: D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\framework\runtime\framework.go

// EventRecorder returns an event recorder.
func (f *frameworkImpl) EventRecorder() events.EventRecorder {
	return f.eventRecorder
}

Where the scheduler handles a pod scheduling failure

D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\scheduler.go

// handleSchedulingFailure records an event for the pod that indicates the
// pod has failed to schedule. Also, update the pod condition and nominated node name if set.
func (sched *Scheduler) handleSchedulingFailure(fwk framework.Framework, podInfo *framework.QueuedPodInfo, err error, reason string, nominatingInfo *framework.NominatingInfo) {
	sched.Error(podInfo, err)

	// Update the scheduling queue with the nominated pod information. Without
	// this, there would be a race condition between the next scheduling cycle
	// and the time the scheduler receives a Pod Update for the nominated pod.
	// Here we check for nil only for tests.
	if sched.SchedulingQueue != nil {
		sched.SchedulingQueue.AddNominatedPod(podInfo.PodInfo, nominatingInfo)
	}

	pod := podInfo.Pod
	msg := truncateMessage(err.Error())
	fwk.EventRecorder().Eventf(pod, nil, v1.EventTypeWarning, "FailedScheduling", "Scheduling", msg)
	if err := updatePod(sched.client, pod, &v1.PodCondition{
		Type:    v1.PodScheduled,
		Status:  v1.ConditionFalse,
		Reason:  reason,
		Message: err.Error(),
	}, nominatingInfo); err != nil {
		klog.ErrorS(err, "Error updating pod", "pod", klog.KObj(pod))
	}
}

Write a pod that deliberately fails to schedule and inspect the resulting events

Here the pod requires a node with disktype=ssd

After creating it, fetch the events; the scheduling-failure event is visible
kubectl get event

Analyzing the kube-scheduler code that records the event

The Eventf parameters and fields

regarding: which resource the event is about; here the pod is passed

related: other related resources; here nil

eventtype: warning or normal; here v1.EventTypeWarning

reason: the reason, here FailedScheduling

action: which action was being performed, here Scheduling

note: the detailed message, here the error message msg

Finally, label the node with disktype=ssd so the pod schedules normally

You can see that

a successfully scheduled pod also produces an event

Looking up the Eventf call for the success path, location: D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\schedule_one.go

func (sched *Scheduler) finishBinding(fwk framework.Framework, assumed *v1.Pod, targetNode string, err error) {
	if finErr := sched.Cache.FinishBinding(assumed); finErr != nil {
		klog.ErrorS(finErr, "Scheduler cache FinishBinding failed")
	}
	if err != nil {
		klog.V(1).InfoS("Failed to bind pod", "pod", klog.KObj(assumed))
		return
	}

	fwk.EventRecorder().Eventf(assumed, nil, v1.EventTypeNormal, "Scheduled", "Binding", "Successfully assigned %v/%v to %v", assumed.Namespace, assumed.Name, targetNode)
}

Key takeaways of this section

- k8s events are objects that show what is happening inside the cluster
Many components produce events; the data is uploaded to the apiserver and stored temporarily in etcd, and because the volume is large there is a deletion policy

The event management machinery has three parts

EventRecorder: the event producer; k8s components call its methods to generate events;

EventBroadcaster: the event broadcaster, which consumes events produced by the EventRecorder and fans them out to BroadcasterWatchers

6.4 The k8s informer mechanism

What the informer mechanism is for

The informer mechanism guarantees message timeliness, reliability and ordering without relying on any middleware,
and reduces the communication pressure the k8s components put on etcd and the k8s APIServer.
- The informer framework makes it easy for every sub-module and extension to get at the resource information in k8s.

The main objects of the informer mechanism

- Reflector: talks directly to the k8s apiserver and implements the list-watch mechanism internally
- DeltaFIFO: the update queue
- Informer: a code abstraction over the resource we want to watch

- Indexer: client-go's local store for resource objects, with built-in indexing

The informer framework

The architecture diagram has two parts:
the yellow boxes are what developers have to implement themselves,

while the rest is already provided by client-go and can be used directly


The main objects of the informer mechanism, in more detail

Reflector: talks directly to the k8s apiserver and implements list-watch internally
- list-watch is what watches for resource changes
- one list-watch corresponds to exactly one resource
- that resource can be a built-in k8s resource or a custom resource
- when a change (create, delete, update) arrives, the resource is pushed into the DeltaFIFO queue

- the Reflector keeps a long-lived connection to the apiserver

DeltaFIFO: the update queue
- FIFO is a queue with the usual queue operations (Add, Update, Delete, List, Pop, Close, and so on)

- Delta is a store of resource objects that also records how each object is to be consumed, e.g. Added, Updated, Deleted, Sync

Informer: a code abstraction over the resource we want to watch
- it pops items off the DeltaFIFO queue,
- stores them into the local cache, the Indexer (step 5 in the diagram),
- and at the same time dispatches them to the custom controller for event handling (step 6 in the diagram)

Indexer: client-go's local store for resource objects, with built-in indexing
- objects consumed from the DeltaFIFO are stored into the Indexer

- the Indexer stays fully consistent with the data in the etcd cluster

- so client-go can read locally, reducing the load on the Kubernetes API and the etcd cluster

Code that uses an informer

Create a new project, k8s-informer

go mod init k8s-informer

informer.go

package main

import (
	"context"
	"flag"
	"log"
	"os"
	"os/signal"
	"path/filepath"
	"syscall"
	"time"

	v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
	"k8s.io/klog"
)

func main() {
	var kubeconfig *string
	// On Windows this reads the config under C:\Users\xxx\.kube\config
	// On Linux it reads ~/.kube/config
	if home := homedir.HomeDir(); home != "" {
		kubeconfig = flag.String("kubeconfig", filepath.Join(home, ".kube", "config"), "(optional) absolute path to the kubeconfig file")
	} else {
		kubeconfig = flag.String("kubeconfig", "", "absolute path to the kubeconfig file")
	}

	flag.Parse()

	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// listen for termination signals
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
	go func() {
		<-ch
		klog.Info("Received termination,signaling shutdown")
		cancel()
	}()

	// resync every minute; a resync periodically re-runs the List operation
	sharedInformers := informers.NewSharedInformerFactory(clientset, time.Minute)

	informer := sharedInformers.Core().V1().Pods().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			mObj := obj.(v1.Object)
			log.Printf("New Pod Added to store: %s", mObj.GetName())
		},
		UpdateFunc: func(oldObj, newobj interface{}) {
			oObj := oldObj.(v1.Object)
			nObj := newobj.(v1.Object)
			log.Printf("%s Pod Updated to %s", oObj.GetName(), nObj.GetName())
		},
		DeleteFunc: func(obj interface{}) {
			mObj := obj.(v1.Object)
			log.Printf("pod Deleted from Store: %s", mObj.GetName())
		},
	})
	informer.Run(ctx.Done())
}

Walking through it

First build the restclient.Config from the kubeconfig:
config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
then create the clientset that talks to the apiserver:
clientset, err := kubernetes.NewForConfig(config)
Listen for termination signals and create a cancellable context

Use the SharedInformerFactory to create the shared informer, passing a resync period of one minute, which means a List is executed every minute

Then create the informer for the pod resource

Add the EventHandler and run it

- AddFunc is the callback for newly created resources

- UpdateFunc is the callback for updated resources

- DeleteFunc is the callback for deleted resources, as in the code above

Build and run

go build
./informer

What it looks like

After starting, the full list is pulled

Add a new pod

On the informer side

Modify the pod we just created and add labels
the pod's yaml

The informer's update log

Each resync also triggers an update

6.5 Reading the informer code in kube-scheduler

Reading the source in kube-scheduler

Initializing the sharedInformer

- The entry point is in the kube-scheduler config, location:
D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\options\options.go

resync=0 is passed here, meaning there is no periodic list; instead there is one initial full list followed by incremental updates

	c.InformerFactory = scheduler.NewInformerFactory(client, 0)

	informerFactory := informers.NewSharedInformerFactory(cs, resyncPeriod)

- NewSharedInformerFactory eventually calls NewSharedInformerFactoryWithOptions to initialize a sharedInformerFactory; location: D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\informers\factory.go

// NewSharedInformerFactoryWithOptions constructs a new instance of a SharedInformerFactory with additional options.
func NewSharedInformerFactoryWithOptions(client kubernetes.Interface, defaultResync time.Duration, options ...SharedInformerOption) SharedInformerFactory {
	factory := &sharedInformerFactory{
		client:           client,
		namespace:        v1.NamespaceAll,
		defaultResync:    defaultResync,
		informers:        make(map[reflect.Type]cache.SharedIndexInformer),
		startedInformers: make(map[reflect.Type]bool),
		customResync:     make(map[reflect.Type]time.Duration),
	}

	// Apply all options
	for _, opt := range options {
		factory = opt(factory)
	}

	return factory
}

Why is it called a sharedInformer

The sharedInformerFactory fields are as follows

type sharedInformerFactory struct {
	client           kubernetes.Interface
	namespace        string
	tweakListOptions internalinterfaces.TweakListOptionsFunc
	lock             sync.Mutex
	defaultResync    time.Duration
	customResync     map[reflect.Type]time.Duration

	informers map[reflect.Type]cache.SharedIndexInformer
	// startedInformers is used for tracking which informers have been started.
	// This allows Start() to be called multiple times safely.
	startedInformers map[reflect.Type]bool
}

The most important field is the informers map, which tracks the informer for each resource type.

If every consumer created its own informer for a resource, there would be many informers per resource and that would be inefficient; instead each resource type maps to exactly one shared informer, and a single sharedInformerFactory internally maintains the informers for many resource types. A quick way to see the sharing is the sketch below.
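Asking the factory for the Pod informer twice returns the same instance, so every consumer shares one watch connection and one local cache. The kubeconfig path is an assumption for the example.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(clientset, time.Minute)
	a := factory.Core().V1().Pods().Informer()
	b := factory.Core().V1().Pods().Informer()
	fmt.Println(a == b) // true: one informer per resource type, shared by all callers
}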

Initializing the pod informer

D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\scheduler.go

// NewInformerFactory creates a SharedInformerFactory and initializes a scheduler specific
// in-place podInformer.
func NewInformerFactory(cs clientset.Interface, resyncPeriod time.Duration) informers.SharedInformerFactory {
	informerFactory := informers.NewSharedInformerFactory(cs, resyncPeriod)
	informerFactory.InformerFor(&v1.Pod{}, newPodInformer)
	return informerFactory
}

We can see that informerFactory's InformerFor is used to create the pod informer object

The concrete InformerFor is sharedInformerFactory's InformerFor, located at

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\informers\factory.go

// InternalInformerFor returns the SharedIndexInformer for obj using an internal
// client.
func (f *sharedInformerFactory) InformerFor(obj runtime.Object, newFunc internalinterfaces.NewInformerFunc) cache.SharedIndexInformer {
	f.lock.Lock()
	defer f.lock.Unlock()

	informerType := reflect.TypeOf(obj)
	informer, exists := f.informers[informerType]
	if exists {
		return informer
	}

	resyncPeriod, exists := f.customResync[informerType]
	if !exists {
		resyncPeriod = f.defaultResync
	}

	informer = newFunc(f.client, resyncPeriod)
	f.informers[informerType] = informer

	return informer
}

InformerFor walkthrough
It looks up the informer in the informers map by the reflect type of obj.
If one is found it is returned; otherwise a new one is created with the supplied newFunc and the map is updated.
Here the newFunc is newPodInformer, location: D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\scheduler.go

// newPodInformer creates a shared index informer that returns only non-terminal pods.
func newPodInformer(cs clientset.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer {
	selector := fmt.Sprintf("status.phase!=%v,status.phase!=%v", v1.PodSucceeded, v1.PodFailed)
	tweakListOptions := func(options *metav1.ListOptions) {
		options.FieldSelector = selector
	}
	return coreinformers.NewFilteredPodInformer(cs, metav1.NamespaceAll, resyncPeriod, nil, tweakListOptions)
}

The underlying pod informer constructor lives in client-go, location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\informers\core\v1\pod.go

There you can see the corresponding ListFunc and WatchFunc

// NewFilteredPodInformer constructs a new informer for Pod type.
// Always prefer using an informer factory to get a shared informer instead of getting an independent
// one. This reduces memory footprint and number of connections to the server.
func NewFilteredPodInformer(client kubernetes.Interface, namespace string, resyncPeriod time.Duration, indexers cache.Indexers, tweakListOptions internalinterfaces.TweakListOptionsFunc) cache.SharedIndexInformer {
	return cache.NewSharedIndexInformer(
		&cache.ListWatch{
			ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
				if tweakListOptions != nil {
					tweakListOptions(&options)
				}
				return client.CoreV1().Pods(namespace).List(context.TODO(), options)
			},
			WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
				if tweakListOptions != nil {
					tweakListOptions(&options)
				}
				return client.CoreV1().Pods(namespace).Watch(context.TODO(), options)
			},
		},
		&corev1.Pod{},
		resyncPeriod,
		indexers,
	)
}

Creating the Indexer

As mentioned above, NewSharedIndexInformer creates a new indexer store. Location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\cache\store.go

// NewIndexer returns an Indexer implemented simply with a map and a lock.
func NewIndexer(keyFunc KeyFunc, indexers Indexers) Indexer {
	return &cache{
		cacheStorage: NewThreadSafeStore(indexers, Indices{}),
		keyFunc:      keyFunc,
	}
}

The underlying data structure is threadSafeMap; location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\cache\thread_safe_store.go

// threadSafeMap implements ThreadSafeStore
type threadSafeMap struct {
	lock  sync.RWMutex
	items map[string]interface{}

	// indexers maps a name to an IndexFunc
	indexers Indexers
	// indices maps a name to an Index
	indices Indices
}

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\cache\index.go

// Index maps the indexed value to a set of keys in the store that match on that value
type Index map[string]sets.String

// Indexers maps a name to an IndexFunc
type Indexers map[string]IndexFunc

// Indices maps a name to an Index
type Indices map[string]Index

- The indices data structure looks like a three-level map whose keys are all strings.
threadSafeMap.items stores the actual resource objects, while indices is the index that speeds up lookups.

The keyFunc passed into NewIndexer

- is MetaNamespaceKeyFunc, i.e. the object's namespace/name (a standalone Indexer example follows the code below)

D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\cache\store.go

// MetaNamespaceKeyFunc is a convenient default KeyFunc which knows how to make
// keys for API objects which implement meta.Interface.
// The key uses the format <namespace>/<name> unless <namespace> is empty, then
// it's just <name>.
//
// TODO: replace key-as-string with a key-as-struct so that this
// packing/unpacking won't be necessary.
func MetaNamespaceKeyFunc(obj interface{}) (string, error) {
	if key, ok := obj.(ExplicitKey); ok {
		return string(key), nil
	}
	meta, err := meta.Accessor(obj)
	if err != nil {
		return "", fmt.Errorf("object has no meta: %v", err)
	}
	if len(meta.GetNamespace()) > 0 {
		return meta.GetNamespace() + "/" + meta.GetName(), nil
	}
	return meta.GetName(), nil
}
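
A quick sketch of the resulting key format (the object names are made up):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "nginx-pod", Namespace: "default"}}
	key, _ := cache.MetaNamespaceKeyFunc(pod)
	fmt.Println(key) // default/nginx-pod

	// Cluster-scoped objects have no namespace, so the key is just <name>.
	node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "k8s-worker02"}}
	key, _ = cache.MetaNamespaceKeyFunc(node)
	fmt.Println(key) // k8s-worker02
}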

Registering event handlers

- In the scheduler's New, addAllEventHandlers is called.

These handlers represent what each consumer does with the data: the informer itself updates its store, while a consumer such as the scheduler uses the events to schedule pods.

Location: D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\scheduler.go

    addAllEventHandlers(sched, informerFactory, dynInformerFactory, unionedGVKs(clusterEventMap))

Callbacks for scheduled pods are registered with the corresponding AddFunc, UpdateFunc and DeleteFunc:

	// scheduled pod cache
	informerFactory.Core().V1().Pods().Informer().AddEventHandler(
		cache.FilteringResourceEventHandler{
			FilterFunc: func(obj interface{}) bool {
				switch t := obj.(type) {
				case *v1.Pod:
					return assignedPod(t)
				case cache.DeletedFinalStateUnknown:
					if _, ok := t.Obj.(*v1.Pod); ok {
						// The carried object may be stale, so we don't use it to check if
						// it's assigned or not. Attempting to cleanup anyways.
						return true
					}
					utilruntime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, sched))
					return false
				default:
					utilruntime.HandleError(fmt.Errorf("unable to handle object in %T: %T", sched, obj))
					return false
				}
			},
			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    sched.addPodToCache,
				UpdateFunc: sched.updatePodInCache,
				DeleteFunc: sched.deletePodFromCache,
			},
		},
	)
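
Outside the scheduler, registering handlers follows the same pattern. The sketch below (handler bodies are placeholders, fake clientset again) wires AddFunc/UpdateFunc/DeleteFunc onto the shared pod informer, then starts the factory and waits for the initial sync, mirroring what the scheduler does with cc.InformerFactory further down.

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	client := fake.NewSimpleClientset()
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			fmt.Println("add:", obj.(*corev1.Pod).Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			fmt.Println("update:", newObj.(*corev1.Pod).Name)
		},
		DeleteFunc: func(obj interface{}) {
			fmt.Println("delete") // obj may be a cache.DeletedFinalStateUnknown tombstone
		},
	})

	stopCh := make(chan struct{})
	defer close(stopCh)

	// The equivalent of cc.InformerFactory.Start / WaitForCacheSync in the scheduler.
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}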

Starting the informers

The entry point is in the kube-scheduler's run, at D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\server.go

	// Start all informers.
	cc.InformerFactory.Start(ctx.Done())
	// DynInformerFactory can be nil in tests.
	if cc.DynInformerFactory != nil {
		cc.DynInformerFactory.Start(ctx.Done())
	}

	// Wait for all caches to sync before scheduling.
	cc.InformerFactory.WaitForCacheSync(ctx.Done())
	// DynInformerFactory can be nil in tests.
	if cc.DynInformerFactory != nil {
		cc.DynInformerFactory.WaitForCacheSync(ctx.Done())
	}

The corresponding Start lives on sharedInformerFactory: it iterates over the informers map and starts each informer, using the startedInformers map to make sure each one is only started once:

// Start initializes all requested informers.
func (f *sharedInformerFactory) Start(stopCh <-chan struct{}) {
	f.lock.Lock()
	defer f.lock.Unlock()

	for informerType, informer := range f.informers {
		if !f.startedInformers[informerType] {
			go informer.Run(stopCh)
			f.startedInformers[informerType] = true
		}
	}
}

How sharedIndexInformer.Run works

Create a new DeltaFIFO queue:

	fifo := NewDeltaFIFOWithOptions(DeltaFIFOOptions{
		KnownObjects:          s.indexer,
		EmitDeltaTypeReplaced: true,
	})

Create a new controller:

	func() {
		s.startedLock.Lock()
		defer s.startedLock.Unlock()

		s.controller = New(cfg)
		s.controller.(*controller).clock = s.clock
		s.started = true
	}()

The processor starts its listeners:
wg.StartWithChannel(processorStopCh, s.processor.run)
Under the hood this calls processorListener.run.

D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\tools\cache\shared_informer.go

processorListener.run is where the callbacks registered through the event handlers are actually invoked:

func (p *processorListener) run() {
	// this call blocks until the channel is closed.  When a panic happens during the notification
	// we will catch it, **the offending item will be skipped!**, and after a short delay (one second)
	// the next notification will be attempted.  This is usually better than the alternative of never
	// delivering again.
	stopCh := make(chan struct{})
	wait.Until(func() {
		for next := range p.nextCh {
			switch notification := next.(type) {
			case updateNotification:
				p.handler.OnUpdate(notification.oldObj, notification.newObj)
			case addNotification:
				p.handler.OnAdd(notification.newObj)
			case deleteNotification:
				p.handler.OnDelete(notification.oldObj)
			default:
				utilruntime.HandleError(fmt.Errorf("unrecognized notification: %T", next))
			}
		}
		// the only way to get here is if the p.nextCh is empty and closed
		close(stopCh)
	}, 1*time.Second, stopCh)
}
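
The handler that run dispatches to is simply the cache.ResourceEventHandler interface; cache.ResourceEventHandlerFuncs adapts the three callback fields to OnAdd/OnUpdate/OnDelete (signatures as in client-go v0.24). A tiny illustration with made-up payloads:

package main

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

func main() {
	var handler cache.ResourceEventHandler = cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("OnAdd ->", obj) },
		UpdateFunc: func(oldObj, newObj interface{}) { fmt.Println("OnUpdate ->", newObj) },
		DeleteFunc: func(obj interface{}) { fmt.Println("OnDelete ->", obj) },
	}

	// processorListener.run makes exactly these calls for each notification type.
	handler.OnAdd("nginx-pod")
	handler.OnUpdate("nginx-pod", "nginx-pod (updated)")
	handler.OnDelete("nginx-pod")
}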

Running the controller

D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\tools\cache\controller.go

// Run begins processing items, and will continue until a value is sent down stopCh or it is closed.
// It's an error to call Run more than once.
// Run blocks; call via go.
func (c *controller) Run(stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	go func() {
		<-stopCh
		c.config.Queue.Close()
	}()
	r := NewReflector(
		c.config.ListerWatcher,
		c.config.ObjectType,
		c.config.Queue,
		c.config.FullResyncPeriod,
	)
	r.ShouldResync = c.config.ShouldResync
	r.WatchListPageSize = c.config.WatchListPageSize
	r.clock = c.clock
	if c.config.WatchErrorHandler != nil {
		r.watchErrorHandler = c.config.WatchErrorHandler
	}

	c.reflectorMutex.Lock()
	c.reflector = r
	c.reflectorMutex.Unlock()

	var wg wait.Group

	wg.StartWithChannel(stopCh, r.Run)

	wait.Until(c.processLoop, time.Second, stopCh)
	wg.Wait()
}

- Inside Run a new reflector is created; r.Run is the producer, pushing data into the Queue.

How reflector.Run produces events

- ListAndWatch calls watchHandler.
- As the name suggests, watchHandler handles each watched event by calling the corresponding store operation; in it you can see how add, update and delete events are processed. Location:

D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\tools\cache\reflector.go

			switch event.Type {
			case watch.Added:
				err := r.store.Add(event.Object)
				if err != nil {
					utilruntime.HandleError(fmt.Errorf("%s: unable to add watch event object (%#v) to store: %v", r.name, event.Object, err))
				}
			case watch.Modified:
				err := r.store.Update(event.Object)
				if err != nil {
					utilruntime.HandleError(fmt.Errorf("%s: unable to update watch event object (%#v) to store: %v", r.name, event.Object, err))
				}
			case watch.Deleted:
				// TODO: Will any consumers need access to the "last known
				// state", which is passed in event.Object? If so, may need
				// to change this.
				err := r.store.Delete(event.Object)
				if err != nil {
					utilruntime.HandleError(fmt.Errorf("%s: unable to delete watch event object (%#v) from store: %v", r.name, event.Object, err))
				}

The consumer side

 D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\tools\cache\shared_informer.go

func (s *sharedIndexInformer) HandleDeltas(obj interface{}) error {
	s.blockDeltas.Lock()
	defer s.blockDeltas.Unlock()

	if deltas, ok := obj.(Deltas); ok {
		return processDeltas(s, s.indexer, s.transform, deltas)
	}
	return errors.New("object given as Process argument is not Deltas")
}

The process is:
- Check in the indexer whether the object already exists: update it if it does, add it if it does not.

- At the same time, call the distribute function to fan the notification out to the listeners.

Key takeaways for this section:

- The overall framework of the informer mechanism

6.6 How kube-scheduler uses the informer mechanism to schedule pods

Pod scheduling works asynchronously through a queue, the SchedulingQueue:

- When a matching pod event is observed, the pod is put into the queue.

- A consumer takes pods off the queue and schedules them.

Scheduling a single pod has three main steps:
- Run the Predict and Priority phases, calling the respective algorithm plugins to pick the best Node.
- Assume the Pod is scheduled onto that Node and save this to the cache.

- Validate with extenders and plugins; if that passes, bind the pod.

A quick look back at the Scheduler struct

Location: D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\scheduler.go

- Two fields in it are directly related to pod scheduling:

// Scheduler watches for new unscheduled pods. It attempts to find
// nodes that they fit on and writes bindings back to the api server.
type Scheduler struct {
	// NextPod returns the next Pod that needs to be scheduled
	NextPod func() *framework.QueuedPodInfo

	// SchedulingQueue holds pods to be scheduled
	SchedulingQueue internalqueue.SchedulingQueue // the queue of pods waiting to be scheduled; let's look at what this queue is
}

Initializing the SchedulingQueue

The podQueue is created in the create function, at D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\scheduler.go

	podQueue := internalqueue.NewSchedulingQueue(
		profiles[options.profiles[0].SchedulerName].QueueSortFunc(),
		informerFactory,
		internalqueue.WithPodInitialBackoffDuration(time.Duration(options.podInitialBackoffSeconds)*time.Second),
		internalqueue.WithPodMaxBackoffDuration(time.Duration(options.podMaxBackoffSeconds)*time.Second),
		internalqueue.WithPodNominator(nominator),
		internalqueue.WithClusterEventMap(clusterEventMap),
		internalqueue.WithPodMaxInUnschedulablePodsDuration(options.podMaxInUnschedulablePodsDuration),
	)

You can see this is a priority queue:

// NewSchedulingQueue initializes a priority queue as a new scheduling queue.
func NewSchedulingQueue(
	lessFn framework.LessFunc,
	informerFactory informers.SharedInformerFactory,
	opts ...Option) SchedulingQueue {
	return NewPriorityQueue(lessFn, informerFactory, opts...)
}
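
The lessFn passed in decides which pod pops first. As a rough sketch of what the default PrioritySort queue-sort plugin does (higher spec.priority wins, earlier enqueue time breaks ties), using only core API types rather than the scheduler's internal ones; the pod names and timestamps are made up:

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// queuedPod is a stand-in for framework.QueuedPodInfo in this sketch.
type queuedPod struct {
	pod       *corev1.Pod
	timestamp time.Time
}

func podPriority(p *corev1.Pod) int32 {
	if p.Spec.Priority != nil {
		return *p.Spec.Priority
	}
	return 0
}

// less mirrors the default queue-sort logic: higher priority first,
// then whichever pod has been waiting longer.
func less(a, b queuedPod) bool {
	pa, pb := podPriority(a.pod), podPriority(b.pod)
	if pa != pb {
		return pa > pb
	}
	return a.timestamp.Before(b.timestamp)
}

func main() {
	high := int32(1000)
	critical := queuedPod{
		pod:       &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "prometheus-0"}, Spec: corev1.PodSpec{Priority: &high}},
		timestamp: time.Now(),
	}
	normal := queuedPod{
		pod:       &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "nginx-pod"}},
		timestamp: time.Now().Add(-time.Minute),
	}
	fmt.Println(less(critical, normal)) // true: prometheus-0 pops first despite arriving later
}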

Why a priority queue?

Because some pods are more important than others and need to be scheduled first.
See the scheduling priority documentation.

Getting the cluster's default scheduling priority classes

An example of pod scheduling priority: the configuration in the prometheus statefulset discussed earlier.

Initializing NextPod

- It simply pops one entry off the podQueue:

// MakeNextPodFunc returns a function to retrieve the next pod from a given
// scheduling queue
func MakeNextPodFunc(queue SchedulingQueue) func() *framework.QueuedPodInfo {
	return func() *framework.QueuedPodInfo {
		podInfo, err := queue.Pop()
		if err == nil {
			klog.V(4).InfoS("About to try and schedule pod", "pod", klog.KObj(podInfo.Pod))
			for plugin := range podInfo.UnschedulablePlugins {
				metrics.UnschedulableReason(plugin, podInfo.Pod.Spec.SchedulerName).Dec()
			}
			return podInfo
		}
		klog.ErrorS(err, "Error while retrieving next pod from scheduling queue")
		return nil
	}
}
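
MakeNextPodFunc is the consumer side of a classic produce/consume loop. The same pop-and-process pattern can be sketched with client-go's generic workqueue (the scheduler actually uses its own priority queue; the item key here is made up):

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	queue := workqueue.New()

	// Producer side: event handlers push pod keys in.
	queue.Add("default/nginx-pod")

	// Consumer side: the equivalent of NextPod() -> scheduleOne().
	item, shutdown := queue.Get()
	if !shutdown {
		fmt.Println("scheduling", item)
		queue.Done(item) // mark processing finished
	}

	queue.ShutDown()
}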

When does pod information get pushed into the SchedulingQueue?

- Recall that the scheduler's New registers callbacks:
    addAllEventHandlers(sched, informerFactory, dynInformerFactory, unionedGVKs(clusterEventMap))

The callback below filters out pods that have not yet been scheduled:

D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\eventhandlers.go 

	// scheduled pod cache
	informerFactory.Core().V1().Pods().Informer().AddEventHandler(
		cache.FilteringResourceEventHandler{
			FilterFunc: func(obj interface{}) bool {
				switch t := obj.(type) {
				case *v1.Pod:
					return assignedPod(t)
				case cache.DeletedFinalStateUnknown:
					if _, ok := t.Obj.(*v1.Pod); ok {
						// The carried object may be stale, so we don't use it to check if
						// it's assigned or not. Attempting to cleanup anyways.
						return true
					}
					utilruntime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, sched))
					return false
				default:
					utilruntime.HandleError(fmt.Errorf("unable to handle object in %T: %T", sched, obj))
					return false
				}
			},
			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    sched.addPodToCache,
				UpdateFunc: sched.updatePodInCache,
				DeleteFunc: sched.deletePodFromCache,
			},
		},
	)

FilterFunc is a filter function; assignedPod returning true means the pod already carries node information, i.e. it has been scheduled onto a node.

// assignedPod selects pods that are assigned (scheduled and running).
func assignedPod(pod *v1.Pod) bool {
	return len(pod.Spec.NodeName) != 0
}

The add event triggers sched.addPodToCache; the nginx-pod we created earlier goes through here once it is assigned.

In addPodToCache below you can see the call to SchedulingQueue.AssignedPodAdded, which feeds the pod event into the scheduling queue:

func (sched *Scheduler) addPodToCache(obj interface{}) {
	pod, ok := obj.(*v1.Pod)
	if !ok {
		klog.ErrorS(nil, "Cannot convert to *v1.Pod", "obj", obj)
		return
	}
	klog.V(3).InfoS("Add event for scheduled pod", "pod", klog.KObj(pod))

	if err := sched.Cache.AddPod(pod); err != nil {
		klog.ErrorS(err, "Scheduler cache AddPod failed", "pod", klog.KObj(pod))
	}

	sched.SchedulingQueue.AssignedPodAdded(pod)
}

At this point we have seen both how a created pod is pushed into the queue and how it is popped off.

Performing the scheduling

- Tracking down where NextPod is called leads us to scheduleOne:

// scheduleOne does the entire scheduling workflow for a single pod. It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne(ctx context.Context) {
	podInfo := sched.NextPod()
}

Tracing further up, we can see that when the scheduler starts and wins leader election, the OnStartedLeading callback calls sched.Run, which drives the scheduling:

	// If leader election is enabled, runCommand via LeaderElector until done and exit.
	if cc.LeaderElection != nil {
		cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				close(waitingForLeader)
				sched.Run(ctx)
			},
			OnStoppedLeading: func() {
				select {
				case <-ctx.Done():
					// We were asked to terminate. Exit 0.
					klog.InfoS("Requested to terminate, exiting")
					os.Exit(0)
				default:
					// We lost the lock.
					klog.ErrorS(nil, "Leaderelection lost")
					klog.FlushAndExit(klog.ExitFlushTimeout, 1)
				}
			},
		}
		leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
		if err != nil {
			return fmt.Errorf("couldn't create leader elector: %v", err)
		}

		leaderElector.Run(ctx)

		return fmt.Errorf("lost lease")
	}

scheduleOne walkthrough

podInfo is the pod object taken from the queue; its validity is checked first:

	podInfo := sched.NextPod()
	// pod could be nil when schedulerQueue is closed
	if podInfo == nil || podInfo.Pod == nil {
		return
	}
	pod := podInfo.Pod

Then the profile matching the pod's pod.Spec.SchedulerName is looked up:

	fwk, err := sched.frameworkForPod(pod)
	if err != nil {
		// This shouldn't happen, because we only accept for scheduling the pods
		// which specify a scheduler name that matches one of the profiles.
		klog.ErrorS(err, "Error occurred")
		return
	}

The scheduling algorithm then produces a result:

    scheduleResult, err := sched.SchedulePod(schedulingCycleCtx, fwk, state, pod)

Next, assume is called to validate the result of the scheduling algorithm:

	// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
	// This allows us to keep scheduling without waiting on binding to occur.
	assumedPodInfo := podInfo.DeepCopy()
	assumedPod := assumedPodInfo.Pod
	// assume modifies `assumedPod` by setting NodeName=scheduleResult.SuggestedHost
	err = sched.assume(assumedPod, scheduleResult.SuggestedHost)
	if err != nil {
		metrics.PodScheduleError(fwk.ProfileName(), metrics.SinceInSeconds(start))
		// This is most probably result of a BUG in retrying logic.
		// We report an error here so that pod scheduling can be retried.
		// This relies on the fact that Error will check if the pod has been bound
		// to a node and if so will not add it back to the unscheduled pods queue
		// (otherwise this would cause an infinite loop).
		sched.handleSchedulingFailure(fwk, assumedPodInfo, err, SchedulerError, clearNominatedNode)
		return
	}

The go func below performs the binding asynchronously:

	// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
	err := sched.bind(bindingCycleCtx, fwk, assumedPod, scheduleResult.SuggestedHost, state)

After a successful bind, a few metrics are recorded:

	metrics.PodScheduled(fwk.ProfileName(), metrics.SinceInSeconds(start))
	metrics.PodSchedulingAttempts.Observe(float64(podInfo.Attempts))
	metrics.PodSchedulingDuration.WithLabelValues(getAttemptsLabel(podInfo)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))

For example, the average scheduling latency can be computed as:

scheduler_pod_scheduling_duration_seconds_sum / scheduler_pod_scheduling_duration_seconds_count

A closer look at schedulePod

D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\schedule_one.go

It first snapshots the current cluster state; if the snapshot contains zero nodes, it returns ErrNoNodesAvailable:

// schedulePod tries to schedule the given pod to one of the nodes in the node list.
// If it succeeds, it will return the name of the node.
// If it fails, it will return a FitError with reasons.
func (sched *Scheduler) schedulePod(ctx context.Context, fwk framework.Framework, state *framework.CycleState, pod *v1.Pod) (result ScheduleResult, err error) {
	trace := utiltrace.New("Scheduling", utiltrace.Field{Key: "namespace", Value: pod.Namespace}, utiltrace.Field{Key: "name", Value: pod.Name})
	defer trace.LogIfLong(100 * time.Millisecond)

	if err := sched.Cache.UpdateSnapshot(sched.nodeInfoSnapshot); err != nil {
		return result, err
	}
	trace.Step("Snapshotting scheduler cache and node infos done")

	if sched.nodeInfoSnapshot.NumNodes() == 0 {
		return result, ErrNoNodesAvailable
	}


Predict phase: find all nodes that satisfy the scheduling constraints (feasibleNodes); nodes that do not are filtered out immediately:

	feasibleNodes, diagnosis, err := sched.findNodesThatFitPod(ctx, fwk, state, pod)
	if err != nil {
		return result, err
	}
	trace.Step("Computing predicates done")

	if len(feasibleNodes) == 0 {
		return result, &framework.FitError{
			Pod:         pod,
			NumAllNodes: sched.nodeInfoSnapshot.NumNodes(),
			Diagnosis:   diagnosis,
		}
	}

If the Predict phase finds only one feasible node, it is used directly:

	// When only one node after predicate, just use it.
	if len(feasibleNodes) == 1 {
		return ScheduleResult{
			SuggestedHost:  feasibleNodes[0].Name,
			EvaluatedNodes: 1 + len(diagnosis.NodeToStatusMap),
			FeasibleNodes:  1,
		}, nil
	}

Priority phase: score the feasible nodes and pick the highest-scoring, i.e. the best, node:

	priorityList, err := prioritizeNodes(ctx, sched.Extenders, fwk, state, pod, feasibleNodes)
	if err != nil {
		return result, err
	}

	host, err := selectHost(priorityList)
	trace.Step("Prioritizing done")

	return ScheduleResult{
		SuggestedHost:  host,
		EvaluatedNodes: len(feasibleNodes) + len(diagnosis.NodeToStatusMap),
		FeasibleNodes:  len(feasibleNodes),
	}, err

Predict and Priority

- Predict and Priority are the two key steps for choosing a node; under the hood they call the various algorithm plugins.

- The NodeName matching mentioned earlier belongs to the Predict phase.

How the assume step works

- The chosen host is written into the pod's spec.nodeName, assuming the pod is placed on that node.

- The scheduler cache's AssumePod is then called as a check; if it returns an error, the validation fails.

// assume signals to the cache that a pod is already in the cache, so that binding can be asynchronous.
// assume modifies `assumed`.
func (sched *Scheduler) assume(assumed *v1.Pod, host string) error {
	// Optimistically assume that the binding will succeed and send it to apiserver
	// in the background.
	// If the binding fails, scheduler will release resources allocated to assumed pod
	// immediately.
	assumed.Spec.NodeName = host

	if err := sched.Cache.AssumePod(assumed); err != nil {
		klog.ErrorS(err, "Scheduler cache AssumePod failed")
		return err
	}
	// if "assumed" is a nominated pod, we should remove it from internal cache
	if sched.SchedulingQueue != nil {
		sched.SchedulingQueue.DeleteNominatedPodIfExists(assumed)
	}

	return nil
}

Inside AssumePod

- The pod is looked up in the cache by its pod key (UID); normally it should not be there yet:

func (cache *cacheImpl) AssumePod(pod *v1.Pod) error {
	key, err := framework.GetPodKey(pod)
	if err != nil {
		return err
	}

	cache.mu.Lock()
	defer cache.mu.Unlock()
	if _, ok := cache.podStates[key]; ok {
		return fmt.Errorf("pod %v is in the cache, so can't be assumed", key)
	}

	return cache.addPod(pod, true)
}

cache.addPod(pod) then records the pod's information on the node entry:

// Assumes that lock is already acquired.
func (cache *cacheImpl) addPod(pod *v1.Pod, assumePod bool) error {
	key, err := framework.GetPodKey(pod)
	if err != nil {
		return err
	}
	n, ok := cache.nodes[pod.Spec.NodeName]
	if !ok {
		n = newNodeInfoListItem(framework.NewNodeInfo())
		cache.nodes[pod.Spec.NodeName] = n
	}
	n.info.AddPod(pod)
	cache.moveNodeInfoToHead(pod.Spec.NodeName)
	ps := &podState{
		pod: pod,
	}
	cache.podStates[key] = ps
	if assumePod {
		cache.assumedPods.Insert(key)
	}
	return nil
}

AddPodInfo updates the node's information, accounting for the newly added pod:

// AddPodInfo adds pod information to this NodeInfo.
// Consider using this instead of AddPod if a PodInfo is already computed.
func (n *NodeInfo) AddPodInfo(podInfo *PodInfo) {
	res, non0CPU, non0Mem := calculateResource(podInfo.Pod)
	n.Requested.MilliCPU += res.MilliCPU
	n.Requested.Memory += res.Memory
	n.Requested.EphemeralStorage += res.EphemeralStorage
	if n.Requested.ScalarResources == nil && len(res.ScalarResources) > 0 {
		n.Requested.ScalarResources = map[v1.ResourceName]int64{}
	}
	for rName, rQuant := range res.ScalarResources {
		n.Requested.ScalarResources[rName] += rQuant
	}
	n.NonZeroRequested.MilliCPU += non0CPU
	n.NonZeroRequested.Memory += non0Mem
	n.Pods = append(n.Pods, podInfo)
	if podWithAffinity(podInfo.Pod) {
		n.PodsWithAffinity = append(n.PodsWithAffinity, podInfo)
	}
	if podWithRequiredAntiAffinity(podInfo.Pod) {
		n.PodsWithRequiredAntiAffinity = append(n.PodsWithRequiredAntiAffinity, podInfo)
	}

	// Consume ports when pods added.
	n.updateUsedPorts(podInfo.Pod, true)
	n.updatePVCRefCounts(podInfo.Pod, true)

	n.Generation = nextGeneration()
}
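
calculateResource is internal to the scheduler; as a simplified sketch of the idea, the helper below just sums the CPU and memory requests of the pod's containers (ignoring init containers and pod overhead, with made-up request values):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// requestedResources roughly mirrors what AddPodInfo accumulates into n.Requested:
// the sum of all container requests of the pod.
func requestedResources(pod *corev1.Pod) (milliCPU, memBytes int64) {
	for _, c := range pod.Spec.Containers {
		if cpu, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
			milliCPU += cpu.MilliValue()
		}
		if mem, ok := c.Resources.Requests[corev1.ResourceMemory]; ok {
			memBytes += mem.Value()
		}
	}
	return milliCPU, memBytes
}

func main() {
	pod := &corev1.Pod{
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name: "nginx",
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse("250m"),
						corev1.ResourceMemory: resource.MustParse("128Mi"),
					},
				},
			}},
		},
	}
	cpu, mem := requestedResources(pod)
	fmt.Printf("requested: %dm CPU, %d bytes memory\n", cpu, mem)
}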

The bind operation

The assumed (validated) pod is bound to the node:

// bind binds a pod to a given node defined in a binding object.
// The precedence for binding is: (1) extenders and (2) framework plugins.
// We expect this to run asynchronously, so we handle binding metrics internally.
func (sched *Scheduler) bind(ctx context.Context, fwk framework.Framework, assumed *v1.Pod, targetNode string, state *framework.CycleState) (err error) {
	defer func() {
		sched.finishBinding(fwk, assumed, targetNode, err)
	}()

	bound, err := sched.extendersBinding(assumed, targetNode)
	if bound {
		return err
	}
	bindStatus := fwk.RunBindPlugins(ctx, state, assumed, targetNode)
	if bindStatus.IsSuccess() {
		return nil
	}
	if bindStatus.Code() == framework.Error {
		return bindStatus.AsError()
	}
	return fmt.Errorf("bind status: %s, %v", bindStatus.Code().String(), bindStatus.Message())
}

The underlying request, when an extender is the binder:

// Bind delegates the action of binding a pod to a node to the extender.
func (h *HTTPExtender) Bind(binding *v1.Binding) error {
	var result extenderv1.ExtenderBindingResult
	if !h.IsBinder() {
		// This shouldn't happen as this extender wouldn't have become a Binder.
		return fmt.Errorf("unexpected empty bindVerb in extender")
	}
	req := &extenderv1.ExtenderBindingArgs{
		PodName:      binding.Name,
		PodNamespace: binding.Namespace,
		PodUID:       binding.UID,
		Node:         binding.Target.Name,
	}
	if err := h.send(h.bindVerb, req, &result); err != nil {
		return err
	}
	if result.Error != "" {
		return fmt.Errorf(result.Error)
	}
	return nil
}
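
When no extender acts as the binder, the framework's DefaultBinder plugin ends up posting a v1.Binding to the API server. A sketch of that final request issued directly with client-go (pod and node names are illustrative, and the fake clientset stands in for a real cluster):

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func main() {
	client := fake.NewSimpleClientset(&corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "nginx-pod", Namespace: "default"},
	})

	// Roughly the API call behind binding:
	// POST /api/v1/namespaces/default/pods/nginx-pod/binding
	binding := &corev1.Binding{
		ObjectMeta: metav1.ObjectMeta{Name: "nginx-pod", Namespace: "default"},
		Target:     corev1.ObjectReference{Kind: "Node", Name: "k8s-worker02"},
	}
	err := client.CoreV1().Pods("default").Bind(context.TODO(), binding, metav1.CreateOptions{})
	fmt.Println("bind error:", err)
}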

Walking through the simplest filter: node name

First open the scheduler's plugin directory,
D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\framework\plugins

You will see a set of filter-like directories; among them find node name, at D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\framework\plugins\nodename\node_name.go

// Filter invoked at the filter extension point.
func (pl *NodeName) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if nodeInfo.Node() == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	if !Fits(pod, nodeInfo) {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrReason)
	}
	return nil
}

// Fits actually checks if the pod fits the node.
func Fits(pod *v1.Pod, nodeInfo *framework.NodeInfo) bool {
	return len(pod.Spec.NodeName) == 0 || pod.Spec.NodeName == nodeInfo.Node().Name
}

// New initializes a new plugin and returns it.
func New(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
	return &NodeName{}, nil
}

The New function here looks like the registration hook.

Tracing upward, you can see it being registered in the registry, at D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\framework\plugins\registry.go

// NewInTreeRegistry builds the registry with all the in-tree plugins.
// A scheduler that runs out of tree plugins can register additional plugins
// through the WithFrameworkOutOfTreeRegistry option.
func NewInTreeRegistry() runtime.Registry {
	fts := plfeature.Features{
		EnablePodDisruptionBudget:           feature.DefaultFeatureGate.Enabled(features.PodDisruptionBudget),
		EnableReadWriteOncePod:              feature.DefaultFeatureGate.Enabled(features.ReadWriteOncePod),
		EnableVolumeCapacityPriority:        feature.DefaultFeatureGate.Enabled(features.VolumeCapacityPriority),
		EnableMinDomainsInPodTopologySpread: feature.DefaultFeatureGate.Enabled(features.MinDomainsInPodTopologySpread),
	}

	return runtime.Registry{
		selectorspread.Name:                  selectorspread.New,
		imagelocality.Name:                   imagelocality.New,
		tainttoleration.Name:                 tainttoleration.New,
		nodename.Name:                        nodename.New,

Going up one more level, NewInTreeRegistry is called from the Scheduler's New:
registry := frameworkplugins.NewInTreeRegistry()

Back to the node name Filter function

- Filter calls Fits to check whether the pod's spec.nodeName matches the target node's name.

Tracing how Filter gets called, we find RunFilterPlugins iterating over the plugins, at D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\framework\runtime\framework.go

// RunFilterPlugins runs the set of configured Filter plugins for pod on
// the given node. If any of these plugins doesn't return "Success", the
// given node is not suitable for running pod.
// Meanwhile, the failure message and status are set for the given node.
func (f *frameworkImpl) RunFilterPlugins(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	nodeInfo *framework.NodeInfo,
) framework.PluginToStatus {
	statuses := make(framework.PluginToStatus)
	for _, pl := range f.filterPlugins {
		pluginStatus := f.runFilterPlugin(ctx, pl, state, pod, nodeInfo)
		if !pluginStatus.IsSuccess() {
			if !pluginStatus.IsUnschedulable() {
				// Filter plugins are not supposed to return any status other than
				// Success or Unschedulable.
				errStatus := framework.AsStatus(fmt.Errorf("running %q filter plugin: %w", pl.Name(), pluginStatus.AsError())).WithFailedPlugin(pl.Name())
				return map[string]*framework.Status{pl.Name(): errStatus}
			}
			pluginStatus.SetFailedPlugin(pl.Name())
			statuses[pl.Name()] = pluginStatus
		}
	}

	return statuses
}
func (f *frameworkImpl) runFilterPlugin(ctx context.Context, pl framework.FilterPlugin, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if !state.ShouldRecordPluginMetrics() {
		return pl.Filter(ctx, state, pod, nodeInfo)
	}
	startTime := time.Now()
	status := pl.Filter(ctx, state, pod, nodeInfo)
	f.metricsRecorder.observePluginDurationAsync(Filter, pl.Name(), status, metrics.SinceInSeconds(startTime))
	return status
}

Finally, the trail ends at findNodesThatPassFilters, which calls RunFilterPluginsWithNominatedPods; the call site is:

	feasibleNodes, err := sched.findNodesThatPassFilters(ctx, fwk, state, pod, diagnosis, nodes)

// findNodesThatPassFilters finds the nodes that fit the filter plugins.
func (sched *Scheduler) findNodesThatPassFilters(
	ctx context.Context,
	fwk framework.Framework,
	state *framework.CycleState,
	pod *v1.Pod,
	diagnosis framework.Diagnosis,
	nodes []*framework.NodeInfo) ([]*v1.Node, error) {
	// ...
	status := fwk.RunFilterPluginsWithNominatedPods(ctx, state, pod, nodeInfo)
	// ...
}
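
To tie the plugin mechanism together, here is a minimal sketch of a custom Filter plugin shaped like NodeName. It assumes you build against the k8s.io/kubernetes module (the framework package shown throughout this section); the plugin name and the label key are made up:

package labelgate

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// LabelGate is a hypothetical plugin that only lets a pod land on nodes
// carrying a specific label (the key name is invented for this sketch).
type LabelGate struct{}

const Name = "LabelGate"

var _ framework.FilterPlugin = &LabelGate{}

func (pl *LabelGate) Name() string { return Name }

// Filter is invoked at the filter extension point, exactly like NodeName.Filter.
func (pl *LabelGate) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	if _, ok := node.Labels["example.com/schedulable"]; !ok {
		return framework.NewStatus(framework.Unschedulable, "node is missing example.com/schedulable label")
	}
	return nil
}

// New would be registered in a runtime.Registry, just as nodename.New is in NewInTreeRegistry.
func New(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
	return &LabelGate{}, nil
}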

Key takeaways for this section:

- How kube-scheduler consumes informer events through the SchedulingQueue, and the Predict -> Priority -> Assume -> Bind flow for a single pod

  
