Learning Kubernetes Together: Design Overview

Kubernetes Design Overview


Overview

Kubernetes builds a clustered container scheduling service on top of Docker (a cgroup-based container technology). Users can ask the cluster to run a set of containers, and the system automatically picks worker nodes to run them on; we think of this process as "scheduling" rather than "orchestration". Kubernetes also provides ways for containers to discover and communicate with each other, as well as ways to manage groups of containers that cooperate in either a tightly coupled or loosely coupled fashion.


Key Concepts



While Docker itself works with individual containers, Kubernetes provides higher-level organizational constructs in support of common cluster-level usage patterns, currently focused on service applications.

Pods

A pod (as in a pod of whales or pea pod) is a relatively tightly coupled group of containers that are scheduled onto the same physical node. In addition to defining the containers that run in the pod, the containers in the pod all use the same network namespace/IP (and port space), and define a set of shared storage volumes. Pods facilitate data sharing and IPC among their constituents, serve as units of scheduling, deployment, and horizontal scaling/replication, and share fate. In the future, they may also share resources (LPC2013).

While pods can be used to vertically integrate application stacks, their primary motivation is to support co-located, co-managed helper programs (see the sketch after this list), such as:

  • content management systems, file and data loaders, local cache managers, etc.
  • log and checkpoint backup, compression, rotation, snapshotting, etc.
  • data change watchers, log tailers, logging and monitoring adapters, event publishers, etc.
  • proxies, bridges, and adapters
  • controllers, managers, configurators, and updaters
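
As a concrete illustration of the helper-program pattern, here is a minimal sketch of a pod that pairs a web server with a log-tailing helper; the two containers share a storage volume so the helper can read the server's logs. The field names and image references are illustrative assumptions, not necessarily the exact container-manifest schema of this Kubernetes version.

    # Illustrative pod sketch: an app container plus a co-located log tailer.
    # Field names are approximate; the real schema is the container manifest format.
    version: v1beta1
    id: frontend-with-log-tailer
    containers:
      - name: web-server
        image: example.com/frontend:1.0      # hypothetical image
        ports:
          - containerPort: 8080
        volumeMounts:
          - name: logs
            mountPath: /var/log/app
      - name: log-tailer
        image: example.com/log-tailer:1.0    # hypothetical helper image
        volumeMounts:
          - name: logs
            mountPath: /var/log/app
    volumes:
      - name: logs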

Labels

Loosely coupled cooperating pods are organized using key/value labels.

Each pod can have a set of key/value labels set on it, with at most one label with a particular key.

Individual labels are used to specify identifying metadata, and to convey the semantic purposes/roles of pods of containers. Examples of typical pod label keys include environment (e.g., with values dev, qa, or production), service, tier (e.g., with values frontend or backend), partition, and track (e.g., with values daily or weekly), but you are free to develop your own conventions.

Via a "label selector" the user can identify a set of pods. The label selector is the core grouping primitive in Kubernetes. It could be used to identify service replicas or shards, worker pool members, or peers in a distributed application.

Kubernetes currently supports two objects that use label selectors to keep track of their members, services and replicationControllers:

  • service: A service is a configuration unit for the proxies that run on every worker node. It is named and points to one or more Pods.
  • replicationController: A replication controller takes a template and ensures that there is a specified number of "replicas" of that template running at any one time. If there are too many, it'll kill some. If there are too few, it'll start more.

The set of pods that a service targets is defined with a label selector. Similarly, the population of pods that a replicationController is monitoring is also defined with a label selector.

Pods may be removed from these sets by changing their labels. This flexibility may be used to remove pods from service for debugging, data recovery, etc.

For management convenience and consistency, services and replicationControllers may themselves have labels and would generally carry the labels their corresponding pods have in common.

Sets identified by labels and label selectors could be overlapping (think Venn diagrams). For instance, a service might point to all pods with tier in (frontend), environment in (prod). Now say you have 10 replicated pods that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a replicationController (with replicas set to 9) for the bulk of the replicas with labels tier=frontend, environment=prod, track=stable and another replicationController (with replicas set to 1) for the canary with labels tier=frontend, environment=prod, track=canary. Now the service is covering both the canary and non-canary pods. But you can mess with the replicationControllers separately to test things out, monitor the results, etc.
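
A sketch of that canary setup might look like the following. The replica counts and labels come from the example above; the field names (kind, id, replicas, selector) are illustrative assumptions rather than the exact replicationController/service schema.

    # Stable replicas (9 of the 10 frontend pods):
    kind: replicationController
    id: frontend-stable
    replicas: 9
    selector: {tier: frontend, environment: prod, track: stable}
    # ...pod template carrying the same labels goes here...
    ---
    # Canary replica (1 of the 10):
    kind: replicationController
    id: frontend-canary
    replicas: 1
    selector: {tier: frontend, environment: prod, track: canary}
    # ...pod template carrying the same labels goes here...
    ---
    # Service covering both stable and canary pods: its selector omits "track".
    kind: service
    id: frontend
    selector: {tier: frontend, environment: prod}

Because the service's selector omits track, it keeps routing to both groups, while each replicationController can be adjusted independently.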

Note that the superset described in the previous example is also heterogeneous. In long-lived, highly available, horizontally scaled, distributed, continuously evolving service applications, heterogeneity is inevitable, due to canaries, incremental rollouts, live reconfiguration, simultaneous updates and auto-scaling, hardware upgrades, and so on.

Pods may belong to multiple sets simultaneously, which enables representation of service substructure and/or superstructure. In particular, labels are intended to facilitate the creation of non-hierarchical, multi-dimensional deployment structures. They are useful for a variety of management purposes (e.g., configuration, deployment) and for application introspection and analysis (e.g., logging, monitoring, alerting, analytics). Without the ability to form sets by intersecting labels, many implicitly related, overlapping flat sets would need to be created, for each subset and/or superset desired, which would lose semantic information and be difficult to keep consistent. Purely hierarchically nested sets wouldn't readily support slicing sets across different dimensions.

Since labels can be set at pod creation time, no separate set add/remove operations are necessary, which makes them easier to use than manual set management. Additionally, since labels are directly attached to pods and label selectors are fairly simple, it's easy for users and for clients and tools to determine what sets they belong to. OTOH, with sets formed by just explicitly enumerating members, one would (conceptually) need to search all sets to determine which ones a pod belonged to.

The Kubernetes Node

When looking at the architecture of the system, we'll break it down to services that run on the worker node and services that play a "master" role.

The Kubernetes node has the services necessary to run Docker containers and be managed from the master systems.

The Kubernetes node design is an extension of the Container-optimized Google Compute Engine image. Over time the plan is for these images/nodes to merge and be the same thing used in different ways.

Each node runs Docker, of course. Docker takes care of the details of downloading images and running containers.

Kubelet

The second component on the node is called the kubelet. The Kubelet is the logical successor (and rewritten in Go) of the Container Agent that is part of the Compute Engine image.

The Kubelet works in terms of a container manifest. A container manifest (defined here) is a YAML file that describes a pod. The Kubelet takes a set of manifests that are provided in various mechanisms and ensures that the containers described in those manifests are started and continue running.
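
As a rough illustration, a minimal container manifest might look like the following; the field names approximate the general shape of a manifest and are not a definitive copy of the schema referenced above.

    # Illustrative container manifest for the Kubelet (approximate field names):
    version: v1beta1
    id: redis-example
    containers:
      - name: redis
        image: dockerfile/redis        # hypothetical image reference
        ports:
          - containerPort: 6379
            hostPort: 6379
    volumes: []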

There are 4 ways that a container manifest can be provided to the Kubelet:

  • File: Path passed as a flag on the command line. This file is rechecked every 20 seconds (configurable with a flag).
  • HTTP endpoint: An HTTP endpoint passed as a parameter on the command line. This endpoint is checked every 20 seconds (also configurable with a flag).
  • etcd server: The Kubelet will reach out and do a watch on an etcd server. The etcd path that is watched is /registry/hosts/$(hostname -f). As this is a watch, changes are noticed and acted upon very quickly.
  • HTTP server: The kubelet can also listen for HTTP and respond to a simple API (underspec'd currently) to submit a new manifest.

Kubernetes Proxy

Each node also runs a simple network proxy. This reflects services as defined in the Kubernetes API on each node and can do simple TCP stream forwarding or round robin TCP forwarding across a set of backends.

The Kubernetes Master

The Kubernetes master is split into a set of components. These work together to provide a unified view of the cluster.

etcd

All persistent master state is stored in an instance of etcd. This provides a great way to store configuration data reliably. With watch support, coordinating components can be notified very quickly of changes.

Kubernetes API Server

This server serves up the main Kubernetes API.

It validates and configures data for 3 types of objects: pods, services, and replicationControllers.

Beyond just servicing REST operations, validating them and storing them in etcd, the API Server does two other things:

  • Schedules pods to worker nodes. Right now the scheduler is very simple.
  • Synchronizes pod information (where they are, what ports they are exposing) with the service configuration.

Kubernetes Controller Manager Server

The replicationController type described above isn't strictly necessary for Kubernetes to be useful. It is really a service that is layered on top of the simple pod API. To enforce this layering, the logic for the replicationController is actually broken out into another server. This server watches etcd for changes to replicationController objects and then uses the public Kubernetes API to implement the replication algorithm.

Network Model

Kubernetes expands the default Docker networking model. The goal is to have each pod have an IP in a shared networking namespace that has full communication with other physical computers and containers across the network. In this way, it becomes much less necessary to map ports.

For the Google Compute Engine cluster configuration scripts, advanced routing is set up so that each VM has an extra 256 IP addresses that get routed to it. This is in addition to the 'main' IP address assigned to the VM that is NAT-ed for Internet access. The networking bridge (called cbr0 to differentiate it from docker0) is set up outside of Docker proper and only does NAT for egress network traffic that isn't aimed at the virtual network.

Ports mapped in from the 'main IP' (and hence the internet if the right firewall rules are set up) are proxied in user mode by Docker. In the future, this should be done with iptables by either the Kubelet or Docker: Issue #15.

Release Process

Right now "building" or "releasing" Kubernetes consists of some scripts (in release/) to create a tar of the necessary data and then uploading it to Google Cloud Storage. In the future we will generate Docker images for the bulk of the above described components: Issue #19.

GCE Cluster Configuration

The scripts and data in the cluster/ directory automate creating a set of Google Compute Engine VMs and installing all of the Kubernetes components. There is a single master node and a set of worker nodes (called minions).

config-default.sh has a set of tweakable definitions/parameters for the cluster.

The heavy lifting of configuring the VMs is done by SaltStack.

The bootstrapping works like this:

  1. The kube-up.sh script uses the GCE startup-script mechanism for both the master node and the minion nodes.
    • For the minion, this simply configures and installs SaltStack. The network range that this minion is assigned is baked into the startup-script for that minion.
    • For the master, the release files are downloaded from GCS and unpacked. Various parts (specifically the SaltStack configuration) are installed in the right places.
  2. SaltStack then installs the necessary servers on each node.
    • All Go code is currently downloaded to each machine and compiled at install time.
    • The custom networking bridge is configured on each minion before Docker is installed.
    • Configuration (like telling the apiserver the hostnames of the minions) is dynamically created during the SaltStack install.
  3. After the VMs are started, the kube-up.sh script will call curl every 2 seconds until the apiserver starts responding.

kube-down.sh can be used to tear the entire cluster down. If you build a new release and want to update your cluster, you can use kube-push.sh to update and apply (highstate in salt parlance) the salt config.

Cluster Security

As there is no security currently built into the apiserver, the salt configuration will install nginx. nginx is configured to serve HTTPS with a self-signed certificate. HTTP basic auth is used from the client to nginx. nginx then forwards the request on to the apiserver over plain old HTTP. Because a self-signed certificate is used, access to the server should be safe from eavesdropping but is subject to "man in the middle" attacks. Access via the browser will result in warnings and tools like curl will require an "--insecure" flag.

All communication within the cluster (worker nodes to the master, for instance) occurs on the internal virtual network and should be safe from eavesdropping.

The password is generated randomly as part of the kube-up.sh script and stored in ~/.kubernetes_auth.



Original: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/DESIGN.md
