The Definitive Guide to ChaosBlade

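By simulating call latency, service unavailability, fully loaded machine resources, and similar faults, check whether the failing node or instance is automatically isolated and taken offline, whether traffic is rescheduled correctly, and whether the contingency plan works, while observing whether the overall system QPS or RT is affected. On that basis, gradually widen the set of faulty nodes to verify that upstream rate limiting, degradation, and circuit breaking take effect. Finally, keep adding faulty nodes until service requests time out; this gives an estimate of the system's fault-tolerance red line and a measure of its fault-tolerance capability.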
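The widening procedure described above can be sketched as a simple loop. This is only an illustrative sketch: estimate_red_line and the is_healthy probe are hypothetical names, not part of ChaosBlade. In a real drill the probe would inject faults into the given number of nodes and report whether requests still succeed within the timeout.

```python
from typing import Callable


def estimate_red_line(total_nodes: int,
                      is_healthy: Callable[[int], bool]) -> int:
    """Return the largest number of faulty nodes the system tolerated.

    is_healthy(n) is a hypothetical probe: it should make n nodes faulty
    and return True while requests are still served without timing out.
    """
    tolerated = 0
    # Widen the fault scope one node at a time, as the text recommends.
    for faulty in range(1, total_nodes + 1):
        if not is_healthy(faulty):
            break  # requests started to time out: the red line is crossed
        tolerated = faulty
    return tolerated
```

Increasing the scope one node at a time keeps the blast radius small; the first count at which the probe fails marks the red line.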

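2) Verify that the container orchestration configuration is reasonable

By simulating killed service Pods, killed nodes, and increased Pod resource load, observe service availability and verify that the replica configuration, the resource limit configuration, and the containers deployed in each Pod are reasonable.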

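3) Test whether the PaaS layer is robust

Simulate load on upper-layer resources to verify that the scheduling system is effective; simulate an unavailable distributed storage dependency to verify the system's fault tolerance; simulate an unavailable scheduling node to test whether scheduled tasks migrate automatically to available nodes; simulate primary or standby node failure to test whether primary/standby switchover works.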

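4) Verify the timeliness of monitoring and alerting

By injecting faults into the system, verify that monitoring metrics are accurate, that the monitoring dimensions are complete, that alert thresholds are reasonable, that alerts fire quickly, that alerts reach the right recipients, and that notification channels work, thereby improving the accuracy and timeliness of monitoring and alerting.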

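5) Emergency response capability for locating and resolving problems

Through surprise fault drills that inject faults into the system at random, assess how well the people involved respond to incidents and whether the escalation and handling processes are reasonable. Learning by doing in this way trains people to locate and resolve problems.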

III. ChaosBlade Operation Guide

(1) Get the latest ChaosBlade release package. The currently supported platforms are linux/amd64 and darwin/amd64; download the package for your platform.

wget https://github.com/chaosblade-io/chaosblade/releases/download/v0.3.0/chaosblade-0.3.0.linux-amd64.tar.gz

No compilation is needed; just extract the archive after downloading. The extracted directory looks like this:

├── bin
│   ├── chaos_burncpu
│   ├── chaos_burnio
│   ├── chaos_changedns
│   ├── chaos_delaynetwork
│   ├── chaos_dropnetwork
│   ├── chaos_filldisk
│   ├── chaos_killprocess
│   ├── chaos_lossnetwork
│   ├── chaos_stopprocess
│   ├── cplus-chaosblade.spec.yaml
│   ├── jvm.spec.yaml
│   └── tools.jar
├── blade
└── lib
    ├── cplus
    │   ├── chaosblade-exec-cplus.jar
    │   └── script
    │       ├── shell_break_and_return_attach.sh
    │       ├── shell_break_and_return.sh
    │       ├── shell_check_process_duplicate.sh
    │       ├── shell_check_process_id.sh
    │       ├── shell_initialization.sh
    │       ├── shell_modify_variable_attch.sh
    │       ├── shell_modify_variable.sh
    │       ├── shell_remove_process.sh
    │       ├── shell_response_delay_attach.sh
    │       └── shell_response_delay.sh
    └── sandbox
        ├── bin
        │   └── sandbox.sh
        ├── cfg
        │   ├── sandbox-logback.xml
        │   ├── sandbox.properties
        │   └── version
        ├── example
        │   └── sandbox-debug-module.jar
        ├── install-local.sh
        ├── lib
        │   ├── sandbox-agent.jar
        │   ├── sandbox-core.jar
        │   └── sandbox-spy.jar
        ├── module
        │   ├── chaosblade-java-agent-0.2.0.jar
        │   └── sandbox-mgr-module.jar
        └── provider
            └── sandbox-mgr-provider.jar

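Here blade is the executable, that is, the ChaosBlade CLI used to run chaos experiments. Run ./blade help to see the supported commands.

The blade commands are as follows:

prepare: alias p. Prepares for a chaos experiment; for example, to run experiments against a Java application you must first attach the Java agent. To target an application named business, run blade p jvm --process business on the target host. If the attach succeeds, it returns a uid, which is used to query status or to revoke the attachment.

revoke: alias r. Revokes an earlier preparation, for example detaching the Java agent. The command is blade revoke UID.

create: alias c. Creates a chaos experiment, that is, performs the fault injection. The command is blade create [TARGET] [ACTION] [FLAGS]. For example, to delay a Dubbo consumer's calls to the xxx.xxx.Service interface by 3 s, run blade create dubbo delay --consumer --time 3000 --service xxx.xxx.Service. If the injection succeeds, it returns the experiment uid, which is used to query status and to destroy the experiment.

destroy: alias d. Destroys an earlier chaos experiment, for example the Dubbo delay experiment above. The command is blade destroy UID.

status: alias s. Queries the status of a preparation or an experiment. The command is blade status UID or blade status --type create.

Help for any of the commands above is available via blade help [COMMAND].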
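When experiments are driven from a script, the blade create [TARGET] [ACTION] [FLAGS] form above can be assembled programmatically. The following is a minimal sketch, not part of ChaosBlade: the helper names build_create_cmd and build_destroy_cmd and the default ./blade path are assumptions made for illustration, and only the command layout comes from the text.

```python
from typing import Dict, List, Optional


def build_create_cmd(target: str, action: str,
                     flags: Optional[Dict[str, object]] = None,
                     blade: str = "./blade") -> List[str]:
    """Return the argv list for `blade create TARGET ACTION [FLAGS]`."""
    argv = [blade, "create", target, action]
    for name, value in (flags or {}).items():
        if value is True:
            # Boolean switch such as --consumer takes no value.
            argv.append(f"--{name}")
        elif value is not False and value is not None:
            # Valued flag such as --time 3000.
            argv.extend([f"--{name}", str(value)])
    return argv


def build_destroy_cmd(uid: str, blade: str = "./blade") -> List[str]:
    """Return the argv list for `blade destroy UID`."""
    return [blade, "destroy", uid]


if __name__ == "__main__":
    cmd = build_create_cmd(
        "dubbo", "delay",
        {"consumer": True, "time": 3000, "service": "xxx.xxx.Service"},
    )
    print(" ".join(cmd))
```

Returning a list rather than a single string lets the result be passed to subprocess.run without shell quoting concerns.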

(2) Which experiments blade can run

Run blade create -h to see which experiments blade supports:

Create a chaos engineering experiment
 
Usage:
  blade create [command]
 
Aliases:
  create, c
 
Examples:
create dubbo delay --time 3000 --offset 100 --service com.example.Service --consumer
 
Available Commands:
  cplus       c++ experiment
  cpu         Cpu experiment
  disk        Disk experiment
  docker      Execute a docker experiment
  druid       Druid experiment
  dubbo       dubbo experiment
  http        http experiment
  jvm         method
  k8s         Kubernetes experiment
  mysql       mysql experiment
  network     Network experiment
  process     Process experiment
  rocketmq    Rocketmq experiment,can make message send or pull delay and exception
  script      Script chaos experiment
  servlet     java servlet experiment
 
Flags:
  -h, --help   help for create
 
Global Flags:
  -d, --debug   Set client to DEBUG mode
 
Use "blade create [command] --help" for more information about a command.
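Because the list of targets differs between versions, a script may want to check the help output before creating an experiment. The sketch below extracts the names under "Available Commands:" from help text shaped like the output above. The function name available_commands is hypothetical, and the parsing assumes the layout shown here, which may change between releases.

```python
from typing import List


def available_commands(help_text: str) -> List[str]:
    """Collect the command names listed under 'Available Commands:'."""
    commands: List[str] = []
    in_section = False
    for line in help_text.splitlines():
        if line.strip() == "Available Commands:":
            in_section = True
            continue
        if in_section:
            if not line.strip():
                break  # a blank line ends the section
            # The first word on each line is the command name.
            commands.append(line.split()[0])
    return commands
```

For the output shown above this would include cpu, disk, network, and the other listed targets.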

(3) Usage example

This example demonstrates a fault that drives CPU usage to 100%, using the blade create cpu fullload command. The usage of blade create cpu is as follows:

hidden@hidden:~/chaos/chaosblade-0.2.0$ ./blade create cpu -h
Cpu experiment, for example full load
 
Usage:
  blade create cpu [flags]
  blade create cpu [command]
 
Examples:
cpu fullload
 
Available Commands:
  fullload    cpu fullload
 
Flags:
      --cpu-count string   Cpu count
      --cpu-list string    CPUs in which to allow burning (0-3 or 1,3)
  -h, --help               help for cpu
 
Global Flags:
  -d, --debug   Set client to DEBUG mode
 
Use "blade create cpu [command] --help" for more information about a command.

Run the experiment:

hidden@hidden:~/chaos/chaosblade-0.2.0$ ./blade create cpu fullload
{"code":200,"success":true,"result":"d9e3879cb68416a2"}

Note the value d9e3879cb68416a2 in the result field above. It is the experiment UID and is needed to stop the experiment (./blade destroy UID).
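Since the UID is needed later, a script should capture it from the JSON reply rather than rely on copying it by hand. The sketch below parses a reply of the shape shown above; the names extract_uid and BladeError are hypothetical, and the failure branch assumes that error replies carry an error field, as the failed records later in this article do.

```python
import json


class BladeError(RuntimeError):
    """Raised when a blade reply reports success=false."""


def extract_uid(response_text: str) -> str:
    """Return the experiment UID from a `blade create` JSON reply."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        # Assumed shape of failures: {"code":..., "success":false, "error":"..."}
        raise BladeError(payload.get("error", "unknown error"))
    return payload["result"]


if __name__ == "__main__":
    reply = '{"code":200,"success":true,"result":"d9e3879cb68416a2"}'
    print(extract_uid(reply))
```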

Use iostat -c 1 1000 to check CPU utilization (the %idle column):

 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              98.75    0.00    1.25    0.00    0.00    0.00

CPU utilization can also be checked with sar, top, and similar commands.
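To verify the effect automatically instead of reading the terminal, the avg-cpu block printed by iostat can be parsed. This is a rough sketch that assumes the two-line layout shown above (a header line starting with avg-cpu: followed by one line of values); parse_avg_cpu is a hypothetical helper name.

```python
from typing import Dict


def parse_avg_cpu(iostat_text: str) -> Dict[str, float]:
    """Map each avg-cpu column (for example '%idle') to its value."""
    lines = iostat_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("avg-cpu:"):
            if i + 1 >= len(lines):
                break  # header without a value line
            names = line.split()[1:]  # drop the 'avg-cpu:' label
            values = [float(v) for v in lines[i + 1].split()]
            return dict(zip(names, values))
    raise ValueError("no avg-cpu block found")
```

A check such as parse_avg_cpu(output)["%idle"] < 5 could then confirm that the full-load fault is in effect; the threshold of 5 is only an example.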

At this point the fault has taken effect. Next, stop the experiment by running:

 hidden@hidden:~/chaos/chaosblade-0.2.0$ ./blade destroy d9e3879cb68416a2
    {"code":200,"success":true,"result":"command: cpu fullload --debug false --help false"}

Check the CPU again; the load is back to normal:

 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.25    0.00    0.50    2.00    0.00   97.25

That completes one CPU full-load fault drill. Readers can try the other commands on their own.
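The create, observe, destroy sequence above can be wrapped so that the destroy step always runs, even if the observation code raises an exception. The following is a sketch under stated assumptions rather than an official client: experiment and run_blade are hypothetical names, the ./blade path is assumed, and the runner parameter exists only so that the command execution can be replaced in tests.

```python
import json
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List

Runner = Callable[[List[str]], str]


def run_blade(argv: List[str]) -> str:
    """Run a blade command and return its stdout (assumes ./blade exists)."""
    return subprocess.run(argv, capture_output=True, text=True).stdout


@contextmanager
def experiment(create_args: List[str], runner: Runner = run_blade,
               blade: str = "./blade") -> Iterator[str]:
    """Create an experiment, yield its UID, and always destroy it."""
    reply = json.loads(runner([blade, "create", *create_args]))
    if not reply.get("success"):
        raise RuntimeError(reply.get("error", "create failed"))
    uid = reply["result"]
    try:
        yield uid
    finally:
        # Runs on normal exit and on exceptions, so the fault is removed.
        runner([blade, "destroy", uid])
```

A drill would then read: with experiment(["cpu", "fullload"]) as uid: followed by the observation code. Because the UID is held by the context manager, it cannot be lost between creation and cleanup.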

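(4) View the execution history

If you forget a uid and therefore cannot destroy the experiment, use the following command to view the history: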

[dev@hua1-dev ~]$ ./chaosblade/blade  status --type create

{
	"code": 200,
	"success": true,
	"result": [
		{
			"Uid": "77d533cdb8b61d07",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--debug false --help false",
			"Status": "Error",
			"Error": "{\"code\":604,\"success\":false,\"error\":\"mount: only root can do that\\n exit status 1\"} exit status 1",
			"CreateTime": "2019-12-02T11:01:27.059036062+08:00",
			"UpdateTime": "2019-12-02T11:01:27.112947513+08:00"
		},
		{
			"Uid": "fc3ff5dbcc3d8287",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--debug false --help false",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-02T11:01:37.653829453+08:00",
			"UpdateTime": "2019-12-02T11:02:41.072792303+08:00"
		},
		{
			"Uid": "ff941d81a1bfc583",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--debug false --help false",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T11:02:44.433088896+08:00",
			"UpdateTime": "2019-12-02T11:02:44.473296047+08:00"
		},
		{
			"Uid": "b1a2a18a9d7d2209",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--debug false --help false --timeout 120",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T11:03:19.26821251+08:00",
			"UpdateTime": "2019-12-02T11:03:19.306998571+08:00"
		},
		{
			"Uid": "ade8fd251c14c3ac",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--debug false --help false --mem-percent 20 --timeout 60",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-02T11:44:12.402871862+08:00",
			"UpdateTime": "2019-12-02T11:44:24.764721576+08:00"
		},
		{
			"Uid": "42f18ab2a9f647df",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--timeout 60 --debug false --help false --mem-percent 50",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-02T11:44:39.03004387+08:00",
			"UpdateTime": "2019-12-02T11:44:46.068383049+08:00"
		},
		{
			"Uid": "c4bd47a436c32f8a",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--cpu-count 4 --debug false --help false",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T17:57:21.848821251+08:00",
			"UpdateTime": "2019-12-02T17:57:22.954051573+08:00"
		},
		{
			"Uid": "b3c530b53ce081e7",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--debug false --help false --cpu-count 4",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T17:58:09.649468604+08:00",
			"UpdateTime": "2019-12-02T17:58:10.751782523+08:00"
		},
		{
			"Uid": "a8606090e6f3bf4d",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--debug false --help false --cpu-count 4",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T18:14:25.552265716+08:00",
			"UpdateTime": "2019-12-02T18:14:26.57991441+08:00"
		},
		{
			"Uid": "78d3d6004851c58e",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--cpu-count 4 --debug false --help false",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T18:15:38.276979861+08:00",
			"UpdateTime": "2019-12-02T18:15:39.361869497+08:00"
		},
		{
			"Uid": "d1fe9d0df56ffd38",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--debug false --help false --cpu-count 4",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T18:38:55.754252838+08:00",
			"UpdateTime": "2019-12-02T18:38:56.875084906+08:00"
		},
		{
			"Uid": "44e3083833a1d74a",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--cpu-count 4 --debug false --help false",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-02T18:47:57.880120218+08:00",
			"UpdateTime": "2019-12-02T18:53:37.707679493+08:00"
		},
		{
			"Uid": "bda3f35a7ca8ea16",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--debug false --help false --cpu-count 4",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T14:45:18.783440839+08:00",
			"UpdateTime": "2019-12-03T14:57:18.823532704+08:00"
		},
		{
			"Uid": "99a137ba58396e60",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T14:59:01.13288346+08:00",
			"UpdateTime": "2019-12-03T15:00:20.17680665+08:00"
		},
		{
			"Uid": "bcb4c1b17d445f55",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--size 20000 --debug false --help false",
			"Status": "Error",
			"Error": "dd: failed to open \"/chaos_filldisk.log.dat\": Permission denied\n exit status 1 exit status 1",
			"CreateTime": "2019-12-03T15:00:46.651558522+08:00",
			"UpdateTime": "2019-12-03T15:00:46.712107819+08:00"
		},
		{
			"Uid": "0f60263c7b830b58",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T15:00:52.562074748+08:00",
			"UpdateTime": "2019-12-03T15:03:45.571227777+08:00"
		},
		{
			"Uid": "07486dcb6b8e1804",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T15:03:59.885515096+08:00",
			"UpdateTime": "2019-12-03T15:05:44.380778022+08:00"
		},
		{
			"Uid": "5f2c5c0353470b66",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T15:09:05.204692014+08:00",
			"UpdateTime": "2019-12-03T15:11:25.112595984+08:00"
		},
		{
			"Uid": "33019d022f93a58e",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T15:12:05.644988155+08:00",
			"UpdateTime": "2019-12-03T15:15:46.775748998+08:00"
		},
		{
			"Uid": "ae888993f31e9aeb",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T15:16:17.115550065+08:00",
			"UpdateTime": "2019-12-03T15:20:50.007686126+08:00"
		},
		{
			"Uid": "d24dd9239902eb6f",
			"Command": "network",
			"SubCommand": "delay",
			"Flag": "--time 3 --debug false --help false --interface eth0 --local-port 6396 --remote-port 6396",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T16:17:19.098079716+08:00",
			"UpdateTime": "2019-12-03T16:19:00.02772809+08:00"
		},
		{
			"Uid": "6aa70b124dce79f3",
			"Command": "network",
			"SubCommand": "delay",
			"Flag": "--help false --interface eth0 --local-port 6396 --remote-port 6396 --time 3000 --debug false",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T16:21:33.527410693+08:00",
			"UpdateTime": "2019-12-03T16:54:47.535580171+08:00"
		},
		{
			"Uid": "5d838cdd1584c7f0",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--timeout 2 --debug false --help false --mem-percent 2",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-04T10:03:56.301575979+08:00",
			"UpdateTime": "2019-12-04T10:03:58.440572078+08:00"
		},
		{
			"Uid": "d87befe08c312ffe",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--debug false --help false --cpu-count 2",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-04T11:51:39.511920791+08:00",
			"UpdateTime": "2019-12-04T11:51:47.120728389+08:00"
		},
		{
			"Uid": "a510f7c62d4ddfef",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-04T16:59:11.557139226+08:00",
			"UpdateTime": "2019-12-04T17:01:11.52823494+08:00"
		},
		{
			"Uid": "a28f911f2f90e441",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--cpu-count 2 --debug false --help false",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-05T16:01:17.435926509+08:00",
			"UpdateTime": "2019-12-05T16:01:26.333131804+08:00"
		},
		{
			"Uid": "4778c8d168727f7a",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--help false --cpu-count 2 --debug false",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-05T16:11:24.526883118+08:00",
			"UpdateTime": "2019-12-05T16:11:32.166626915+08:00"
		},
		{
			"Uid": "1247625ff900e8b5",
			"Command": "disk",
			"SubCommand": "burn",
			"Flag": "--write false --debug false --help false --read true --size 20",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-05T16:42:52.76683927+08:00",
			"UpdateTime": "2019-12-05T16:42:58.872850884+08:00"
		},
		{
			"Uid": "6817b73edd7d1c36",
			"Command": "network",
			"SubCommand": "delay",
			"Flag": "--interface eth0 --time 2000 --debug false --help false",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-05T16:43:06.855722578+08:00",
			"UpdateTime": "2019-12-05T16:43:06.880694868+08:00"
		}
	]
}
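With a history this long, it helps to filter for the experiments that may still need cleaning up. The sketch below reads output of the shape shown above and returns the UIDs whose Status is Success. Treating Success as "created and not yet destroyed" is an assumption drawn from the records above; experiments created with --timeout may already have ended on their own. The name active_uids is hypothetical.

```python
import json
from typing import List


def active_uids(status_text: str) -> List[str]:
    """Return UIDs of records whose Status is 'Success'.

    Assumption: 'Success' marks an experiment that was created and has
    not been destroyed, as the history output above suggests.
    """
    payload = json.loads(status_text)
    if not payload.get("success"):
        raise RuntimeError(payload.get("error", "status query failed"))
    return [rec["Uid"] for rec in payload["result"]
            if rec["Status"] == "Success"]
```

Each returned UID can then be passed to ./blade destroy UID.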