Easy SRE: Automated Operations with CloudMonitor

Latest recommended article published on 2024-05-27 23:12:47

weixin_33788244

Tags: operations, json, database

Original link: https://yq.aliyun.com/articles/279141

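How SRE Defines Monitoring Actions

The monitoring system is one of the main means by which an SRE team tracks service quality and availability, so its design and strategy deserve careful discussion. The most common, traditional alerting strategy targets a specific condition or metric: as soon as the condition occurs or the metric crosses a threshold, an e-mail alert is sent. Such alerts are not very effective: a system that requires a human to read e-mail and analyze alerts in order to decide whether some action is needed is fundamentally flawed. A monitoring system should not rely on people to interpret information. The system should do the analysis automatically and notify a user only when that user needs to perform some action.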

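Monitoring cannot simply do nothing. There are three kinds of valid monitoring output:
Alerts

An alert means the recipient must take action immediately, either to resolve a problem that has already occurred or to avoid one that is about to occur.

Tickets

A ticket means the recipient should take action, but not immediately. The system cannot resolve the situation automatically, but nothing is harmed if a user performs the action within a few days.

Logs

Normally nobody needs to look at log information, but it is still collected for debugging and postmortem analysis. The expectation is that nobody reads logs proactively unless something else prompts them to do so.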
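
The three outputs above can be sketched as a small classification function. This is only an illustration: the `Output` enum, the `needs_human` and `urgent` flags, and the `classify` function are hypothetical names chosen for this sketch, not part of any monitoring product.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # nobody needs to look unless debugging


def classify(needs_human, urgent):
    """Map an event to one of the three monitoring outputs.

    Both flags are hypothetical inputs that the monitoring rules
    would have to supply.
    """
    if not needs_human:
        # Nothing for a person to do: keep the record for later analysis.
        return Output.LOG
    return Output.ALERT if urgent else Output.TICKET


if __name__ == "__main__":
    for flags in [(True, True), (True, False), (False, False)]:
        print(flags, classify(*flags).value)
```

The point of the sketch is that every event lands in exactly one bucket, so no event reaches a person unless there is something for that person to do.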

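Implementing This with CloudMonitor

Default alarms

The first capability the CloudMonitor alarm service provides is sending alarm notifications, and it offers a range of alarm configuration options:

Notification through multiple channels
Multiple alarm suppression strategies: channel silence, number of times the alarm condition must be met, and effective time window
Alarm settings at different levels: all resources, application groups, or a single instance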
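
How the three suppression strategies can interact is illustrated by the sketch below. It is not CloudMonitor's implementation: the class name `AlarmRule`, its parameters, and the order of the checks are assumptions made purely for illustration.

```python
from datetime import datetime, time, timedelta


class AlarmRule:
    """Toy alarm rule with three suppression checks (all names hypothetical)."""

    def __init__(self, threshold, required_breaches, silence, start, end):
        self.threshold = threshold                  # breach when value >= threshold
        self.required_breaches = required_breaches  # consecutive breaches needed
        self.silence = silence                      # minimum gap between notifications
        self.start, self.end = start, end           # daily effective window
        self.breaches = 0
        self.last_sent = None

    def observe(self, value, now):
        """Return True if a notification should be sent for this sample."""
        # Condition count: any healthy sample resets the consecutive counter.
        self.breaches = self.breaches + 1 if value >= self.threshold else 0
        if self.breaches < self.required_breaches:
            return False
        # Effective time: only notify inside the configured daily window.
        if not (self.start <= now.time() <= self.end):
            return False
        # Channel silence: stay quiet until the silence period has elapsed.
        if self.last_sent is not None and now - self.last_sent < self.silence:
            return False
        self.last_sent = now
        return True


if __name__ == "__main__":
    rule = AlarmRule(85, 3, timedelta(hours=1), time(9, 0), time(18, 0))
    t0 = datetime(2024, 1, 1, 10, 0)
    for i, value in enumerate([90, 91, 92, 93, 50, 95]):
        print(value, rule.observe(value, t0 + timedelta(minutes=i)))
```

In this sketch only the third sample produces a notification: the first two have not yet reached three consecutive breaches, and the fourth falls inside the one-hour silence period.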

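For detailed usage, refer to the CloudMonitor documentation.

Alarms can be set up in CloudMonitor in three ways: the CloudMonitor console, the OpenAPI, or the SDK.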

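A new problem

In SRE practice, setting alarm rules is never a one-off decision; the rules need long-term maintenance. An alert that is currently rare and hard to automate may soon become a frequently triggered problem, at which point it is best to have an ad hoc automation script to deal with it.
In fact, this is how SRE defines it:

There is no such thing as an alert that requires no action. If you encounter an alert that you believe requires no action, you need to use automation to fix that alert.

So the question is: in CloudMonitor, how do you implement the actions of automated repair, ticket creation, and logging?
CloudMonitor provides alarm callbacks (webhooks), which connect the CloudMonitor alarm service to your own business systems and open up more possibilities for operations management. The following sections explain how to use alarm callbacks in detail and provide a demo callback service you can use.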

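Using alarm callbacks from the CloudMonitor console

Setting an alarm callback

An alarm callback can be set when creating an alarm rule from any of the following entry points: Alarm Service, Host Monitoring, Log Monitoring, or Cloud Service Monitoring.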

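Notes:

Only HTTP POST is supported; an HTTP service that does not accept POST will produce an error.
Do not use the address of a well-known public website; doing so causes unknown errors.
The HTTP service must be reachable from the public internet.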

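Defining the HTTP service for the callback

Basic invocation

When CloudMonitor makes a callback, it POSTs the alarm parameters to your HTTP service with content-type: application/json, so your service must parse the request body as JSON. For the parameters passed in the callback, see the documentation on using alarm callbacks.

Code example: receiving the JSON callback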

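#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-
'''
Minimal HTTP server that receives alarm callbacks (webhooks) as JSON POSTs.
'''
# The module docstring above is required: getopts() reads __doc__ for --help.

VERSION = '0.1'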

import argparse
import BaseHTTPServer
import cgi
import logging
import os
import sys
import json


def make_request_handler_class(opts):
    '''
    Factory to make the request handler and add arguments to it.

    It exists to allow the handler to access the opts.path variable
    locally.
    '''
    class MyRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):
        '''
        Factory generated request handler class that contain
        additional class variables.
        '''
        m_opts = opts

        def do_POST(self):
            '''
            Handle POST requests.
            '''
            logging.debug('POST %s' % (self.path))

            # Use getheader() so a missing header does not raise KeyError.
            ctype, pdict = cgi.parse_header(self.headers.getheader('content-type', ''))
            result = {}
            if ctype == 'application/json':
                length = int(self.headers.getheader('content-length', 0))
                result = json.loads(self.rfile.read(length))
                logging.info('post json: %s' % result)

            # Get the "Back" link.
            back = self.path if self.path.find('?') < 0 else self.path[:self.path.find('?')]

            # Print out logging information about the path and args.
            logging.debug('TYPE %s' % (ctype))
            logging.debug('PATH %s' % (self.path))
            logging.debug('ARGS %d' % (len(result)))
            if len(result):
                i = 0
                for key in sorted(result):
                    logging.debug('ARG[%d] %s=%s' % (i, key, result[key]))
                    i += 1

            # Reply with 200 OK and a small JSON body so the caller
            # knows the callback was received.
            self.send_response(200)  # OK
            self.send_header('Content-type', 'application/json')
            self.end_headers()

            # Write a valid JSON acknowledgement (keys must be quoted).
            self.wfile.write(json.dumps({'message': 'called by cms success.'}))
    return MyRequestHandler

def err(msg):
    '''
    Report an error message and exit.
    '''
    print('ERROR: %s' % (msg))
    sys.exit(1)


def getopts():
    '''
    Get the command line options.
    '''

    # Get the help from the module documentation.
    this = os.path.basename(sys.argv[0])
    description = ('description:%s' % '\n  '.join(__doc__.split('\n')))
    epilog = ' '
    rawd = argparse.RawDescriptionHelpFormatter
    parser = argparse.ArgumentParser(formatter_class=rawd,
                                     description=description,
                                     epilog=epilog)

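    # Default to None so the server runs in the foreground unless -d is given;
    # a default of '.' would make "if opts.daemonize" always true in main().
    parser.add_argument('-d', '--daemonize',
                        action='store',
                        type=str,
                        default=None,
                        metavar='DIR',
                        help='daemonize this process, store the 3 run files (.log, .err, .pid) in DIR (default: run in the foreground)')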

    parser.add_argument('-H', '--host',
                        action='store',
                        type=str,
                        default='localhost',
                        help='hostname, default=%(default)s')

    parser.add_argument('-l', '--level',
                        action='store',
                        type=str,
                        default='info',
                        choices=['notset', 'debug', 'info', 'warning', 'error', 'critical',],
                        help='define the logging level, the default is %(default)s')

    parser.add_argument('--no-dirlist',
                        action='store_true',
                        help='disable directory listings')

    parser.add_argument('-p', '--port',
                        action='store',
                        type=int,
                        default=8080,
                        help='port, default=%(default)s')

    parser.add_argument('-r', '--rootdir',
                        action='store',
                        type=str,
                        default=os.path.abspath('.'),
                        help='web directory root that contains the HTML/CSS/JS files %(default)s')

    parser.add_argument('-v', '--verbose',
                        action='count',
                        help='level of verbosity')

    parser.add_argument('-V', '--version',
                        action='version',
                        version='%(prog)s - v' + VERSION)

    opts = parser.parse_args()
    opts.rootdir = os.path.abspath(opts.rootdir)
    if not os.path.isdir(opts.rootdir):
        err('Root directory does not exist: ' + opts.rootdir)
    if opts.port < 1 or opts.port > 65535:
        err('Port is out of range [1..65535]: %d' % (opts.port))
    return opts


def httpd(opts):
    '''
    HTTP server
    '''
    RequestHandlerClass = make_request_handler_class(opts)
    server = BaseHTTPServer.HTTPServer((opts.host, opts.port), RequestHandlerClass)
    logging.info('Server starting %s:%s (level=%s)' % (opts.host, opts.port, opts.level))
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    server.server_close()
    logging.info('Server stopping %s:%s' % (opts.host, opts.port))


def get_logging_level(opts):
    '''
    Get the logging levels specified on the command line.
    The level can only be set once.
    '''
    if opts.level == 'notset':
        return logging.NOTSET
    elif opts.level == 'debug':
        return logging.DEBUG
    elif opts.level == 'info':
        return logging.INFO
    elif opts.level == 'warning':
        return logging.WARNING
    elif opts.level == 'error':
        return logging.ERROR
    elif opts.level == 'critical':
        return logging.CRITICAL


def daemonize(opts):
    '''
    Daemonize this process.

    '''
    if os.path.exists(opts.daemonize) is False:
        err('directory does not exist: ' + opts.daemonize)

    if os.path.isdir(opts.daemonize) is False:
        err('not a directory: ' + opts.daemonize)

    bname = 'webserver-%s-%d' % (opts.host, opts.port)
    outfile = os.path.abspath(os.path.join(opts.daemonize, bname + '.log'))
    errfile = os.path.abspath(os.path.join(opts.daemonize, bname + '.err'))
    pidfile = os.path.abspath(os.path.join(opts.daemonize, bname + '.pid'))

    if os.path.exists(pidfile):
        err('pid file exists, cannot continue: ' + pidfile)
    if os.path.exists(outfile):
        os.unlink(outfile)
    if os.path.exists(errfile):
        os.unlink(errfile)

    if os.fork():
        sys.exit(0)  # exit the parent

    os.umask(0)
    os.setsid()
    if os.fork():
        sys.exit(0)  # exit the parent

    print('daemon pid %d' % (os.getpid()))

    sys.stdout.flush()
    sys.stderr.flush()

    # open() is equivalent to the Python 2-only file() builtin here.
    stdin = open('/dev/null', 'r')
    stdout = open(outfile, 'a+')
    stderr = open(errfile, 'a+', 0)  # 0 = unbuffered

    os.dup2(stdin.fileno(), sys.stdin.fileno())
    os.dup2(stdout.fileno(), sys.stdout.fileno())
    os.dup2(stderr.fileno(), sys.stderr.fileno())

    with open(pidfile, 'w') as ofp:
        ofp.write('%i' % (os.getpid()))


def main():
    ''' main entry '''
    opts = getopts()
    if opts.daemonize:
        daemonize(opts)
    logging.basicConfig(format='%(asctime)s [%(levelname)s] %(message)s', level=get_logging_level(opts))
    httpd(opts)


if __name__ == '__main__':
    main()  # this allows library functionality
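
The example above targets Python 2.7 (`BaseHTTPServer` and `file()` do not exist in Python 3). For readers on Python 3, a minimal sketch of the same POST handling with the standard library's `http.server` might look like the following. The names `build_reply`, `CallbackHandler`, and `serve` are illustrative, and the command-line options and daemonization of the original are deliberately left out.

```python
import json
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_reply(content_type, raw_body):
    """Parse a callback body and return (payload, reply_bytes)."""
    # Ignore parameters such as "; charset=utf-8" after the media type.
    ctype = content_type.split(';')[0].strip().lower()
    payload = {}
    if ctype == 'application/json' and raw_body:
        payload = json.loads(raw_body)
    reply = json.dumps({'message': 'called by cms success.'}).encode('utf-8')
    return payload, reply


class CallbackHandler(BaseHTTPRequestHandler):
    """Accept JSON POSTs from an alarm callback and acknowledge them."""

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        raw_body = self.rfile.read(length)
        payload, reply = build_reply(self.headers.get('Content-Type', ''), raw_body)
        logging.info('post json: %s', payload)

        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)


def serve(host='localhost', port=8080):
    """Run the callback server until interrupted with Ctrl-C."""
    server = HTTPServer((host, port), CallbackHandler)
    logging.info('Server starting %s:%s', host, port)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    server.server_close()

# To run it: logging.basicConfig(level=logging.INFO); serve('0.0.0.0', 8080)
```

Keeping the parsing in `build_reply` separates the HTTP plumbing from the payload handling, which makes the latter easy to test without opening a socket.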

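HTTP service callback validation

When you create an alarm rule with an alarm callback through the CloudMonitor console or the SDK, a validation callback is triggered once, so keep two points in mind:

The HTTP service being called back must be up and working before you create the alarm rule.
When the alarm rule is created, CloudMonitor sends a set of mock data to verify that the HTTP service can receive requests. Make sure your service performs no business processing for such a request; for example, skip processing when userId is test-userId.

## Mock data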
{"alertName":"test-alertName","alertState":"-1","curValue":"4","dimensions":"[{}]","expression":"$Maximum>=85","metricName":"test-metricName","metricProject":"test-metricProject","timestamp":"1507618020731","userId":"test-userId"}
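
One way to honor the second point is to check the `userId` field before doing any real work. The sketch below is a minimal illustration: `is_validation_request` and `handle_alarm` are hypothetical names, and only the `userId` value `test-userId` is taken from the mock data above.

```python
import json

MOCK_USER_ID = 'test-userId'  # value taken from the mock data shown above


def is_validation_request(payload):
    """Return True when the callback is the validation probe, not a real alarm."""
    return payload.get('userId') == MOCK_USER_ID


def handle_alarm(raw_body):
    """Parse a callback body and decide whether to run business logic.

    Returns 'skipped' for the validation probe and 'processed' otherwise;
    the actual business processing is left as a placeholder.
    """
    payload = json.loads(raw_body)
    if is_validation_request(payload):
        return 'skipped'
    # Placeholder: trigger auto-repair, open a ticket, or write a log here.
    return 'processed'


if __name__ == '__main__':
    mock = '{"alertName":"test-alertName","alertState":"-1","userId":"test-userId"}'
    print(handle_alarm(mock))
```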

Defining an alarm callback with the SDK

For how to use the SDK, see the Java SDK user guide.
Alarm callbacks are supported starting with SDK version 5.0.6.

<dependency>
  <groupId>com.aliyun</groupId>
  <artifactId>aliyun-java-sdk-cms</artifactId>
  <version>5.0.6</version>
</dependency>

Creating the alarm rule

import com.aliyuncs.DefaultAcsClient;
import com.aliyuncs.IAcsClient;
import com.aliyuncs.cms.model.v20170301.CreateAlarmRequest;
import com.aliyuncs.cms.model.v20170301.CreateAlarmResponse;
import com.aliyuncs.exceptions.ClientException;
import com.aliyuncs.profile.DefaultProfile;
import com.aliyuncs.profile.IClientProfile;
public class WebhookTest{
    // Client shared by the methods below; call init() before createAlarm().
    private IAcsClient client;

    public void init() throws ClientException {
        IClientProfile profile = DefaultProfile.getProfile("<RegionId>", "<AccessKey>", "<AccessKeySecret>");
        client = new DefaultAcsClient(profile);
    }
    public void createAlarm() throws Exception{
        CreateAlarmRequest request = new CreateAlarmRequest();
        request.setName("test_3");
        request.setNamespace("acs_ocs");
        request.setMetricName("UsedQps");
        request.setDimensions("[{userId:*****,instanceId:\"****\"}]");
        request.setPeriod(60);
        request.setStatistics("Average");
        request.setComparisonOperator(">=");
        request.setThreshold("0");
        request.setEvaluationCount(1);
        request.setContactGroups("[\"云账号报警联系人\"]");
        request.setWebhook("{url:\"http://*****\"}");
        request.setNotifyType(0);

        CreateAlarmResponse response = client.getAcsResponse(request);
    }
}