cloudwatch
by Yan Cui
How to auto-create CloudWatch Alarms for APIs with CloudWatch Events and Lambda
In a previous post, I discussed how to auto-subscribe a CloudWatch Log Group to a Lambda function using CloudWatch Events. The benefit of this is that we don’t need a manual process to ensure all Lambda logs are forwarded to our log aggregation service.
Whilst this is useful in its own right, it only scratches the surface of what we can do. CloudTrail and CloudWatch Events make it easy to automate many day-to-day operational steps, with the help of Lambda, of course.
I work with API Gateway and Lambda a lot. Whenever you create a new API, or make changes, there are several things you need to do:
- Enable Detailed Metrics for the deployment stage
- Set up a dashboard in CloudWatch, showing request count, latencies, and error counts
- Set up CloudWatch Alarms for P99 latencies and error counts
Because these are manual steps, they often get missed.
Have you ever forgotten to update the dashboard after adding a new endpoint to your API? And did you remember to set up a P99 latency alarm on this new endpoint? How about alarms on the number of 4xx or 5xx errors?
Most teams I’ve dealt with have some conventions around these, but they don’t have a way to enforce them. The result is that the convention is applied patchily and cannot be relied upon. I find that this approach doesn’t scale with the size of the team.
It works when you’re a small team. Everyone has a shared understanding, and the necessary discipline to follow the convention. When the team gets bigger, you need automation to help enforce these conventions.
Fortunately, we can automate away these manual steps using the same pattern. In the Monitoring unit of my course Production-Ready Serverless, I demonstrated how you can do this in 3 simple steps:
- CloudTrail captures the CreateDeployment request to API Gateway
- A CloudWatch Events pattern matches against this captured request
- A Lambda function enables detailed metrics and creates alarms for each endpoint
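To make the second step concrete, here is a rough sketch of what such an event pattern might look like, written as a Python dict (Python is my choice for illustration, not necessarily what the course uses). The `matches` helper is hypothetical and much simpler than the real CloudWatch Events matching rules; it is only there so you can try the pattern locally.

```python
# Assumed rule pattern: CloudTrail delivers API Gateway management calls to
# CloudWatch Events with this source and detail-type.
CREATE_DEPLOYMENT_PATTERN = {
    "source": ["aws.apigateway"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["apigateway.amazonaws.com"],
        "eventName": ["CreateDeployment"],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified local matcher (hypothetical helper).

    Every key in the pattern must be present in the event. A list in the
    pattern means "the event value must be one of these"; a dict means
    "recurse into the nested object". The real service supports more
    operators than this.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict):
                return False
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True
```

Only `CreateDeployment` calls get through, so the Lambda function is not invoked for unrelated API Gateway activity.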
If you use the Serverless framework, then you might have a function that looks like this:
A couple of things to note from the code above:
- I’m using the `serverless-iam-roles-per-function` plugin to give the function a tailored IAM role
- The function needs the `apigateway:PATCH` permission to enable detailed metrics
- The function needs the `apigateway:GET` permission to get the API name and REST endpoints
- The function needs the `cloudwatch:PutMetricAlarm` permission to create the alarms
- The environment variables specify SNS topics for the CloudWatch Alarms
The captured event looks like this:
We can find the `restApiId` and `stageName` inside the `detail.requestParameters` attribute. That’s all we need to figure out what endpoints are there, and so what alarms we need to create.
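As a minimal sketch, pulling those two values out of the event could look like the following. The function name is hypothetical, and the error handling is my own assumption: I am guessing you would want to bail out early when a deployment was created without a stage name.

```python
from typing import Tuple


def deployment_target(event: dict) -> Tuple[str, str]:
    """Return (restApiId, stageName) from a captured CreateDeployment event.

    Raises ValueError if either value is missing, since without them we
    cannot tell which stage to create alarms for.
    """
    params = event.get("detail", {}).get("requestParameters") or {}
    rest_api_id = params.get("restApiId")
    stage_name = params.get("stageName")
    if not rest_api_id or not stage_name:
        raise ValueError("event has no restApiId/stageName in requestParameters")
    return rest_api_id, stage_name
```

Calling it with a trimmed-down event such as `{"detail": {"requestParameters": {"restApiId": "abc123", "stageName": "dev"}}}` (made-up values) gives you the pair to pass on to the API Gateway calls.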
Inside the handler function, which you can find here, we perform a few steps:
- Enable detailed metrics with an `updateStage` call to API Gateway
- Get the list of REST endpoints with a `getResources` call to API Gateway
- Get the REST API name with a `getRestApi` call to API Gateway
- For each of the REST endpoints, create a P99 latency alarm in the `AWS/ApiGateway` namespace
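The first two steps can be sketched roughly as below. This is not the handler from the course; it is a Python approximation that only builds and reshapes data, so it runs without AWS credentials. I am assuming the boto3 equivalents `update_stage(restApiId=..., stageName=..., patchOperations=...)` and `get_resources(restApiId=...)["items"]`, and both helper names are hypothetical.

```python
from typing import Dict, List, Tuple


def detailed_metrics_patch() -> List[Dict[str, str]]:
    """Patch operations intended for API Gateway's UpdateStage call.

    The '/*/*' prefix applies the setting to every resource and method
    in the stage.
    """
    return [{"op": "replace", "path": "/*/*/metrics/enabled", "value": "true"}]


def list_endpoints(resources: List[dict]) -> List[Tuple[str, str]]:
    """Flatten GetResources items into (path, HTTP method) pairs.

    Items without a 'resourceMethods' map (pure path segments with no
    methods attached) produce no endpoints.
    """
    endpoints = []
    for item in resources:
        for method in sorted(item.get("resourceMethods", {})):
            endpoints.append((item["path"], method))
    return endpoints
```

Each `(path, method)` pair then becomes one alarm, with the API name from `getRestApi` filling in the `ApiName` dimension.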
Now, every time I create a new API, I will have CloudWatch Alarms to alert me when the 99th percentile latency for an endpoint goes over 1 second, for 5 minutes in a row.
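For reference, an alarm with those characteristics might be described like this. It is a sketch that assumes boto3's `put_metric_alarm(**params)` as the eventual call; the function name and the alarm naming scheme are made up for illustration. "5 minutes in a row" is expressed as five consecutive 60-second periods, and since API Gateway reports `Latency` in milliseconds, 1 second becomes a threshold of 1000.

```python
def p99_latency_alarm(api_name: str, stage: str, path: str, method: str,
                      alarm_topic_arn: str) -> dict:
    """Build PutMetricAlarm parameters for one endpoint's p99 latency."""
    return {
        # Hypothetical naming convention, not taken from the original handler.
        "AlarmName": f"API [{api_name}] stage [{stage}] {method} {path} : p99 > 1s",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "Latency",
        # Per-endpoint dimensions are only populated once detailed metrics
        # are enabled on the stage.
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Resource", "Value": path},
            {"Name": "Method", "Value": method},
            {"Name": "Stage", "Value": stage},
        ],
        "ExtendedStatistic": "p99",
        "Period": 60,              # seconds per datapoint
        "EvaluationPeriods": 5,    # 5 x 60s = 5 minutes in a row
        "Threshold": 1000.0,       # milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [alarm_topic_arn],
    }
```

The SNS topic ARN would come from the function's environment variables.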
All this, with just a few lines of code.
You can take this further, and have other Lambda functions to:
- Create CloudWatch Alarms for 5xx errors for each endpoint
- Create CloudWatch Dashboard for the API
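As a rough idea of the first of these, a 5xx alarm could reuse the same dimensions with the `5XXError` metric. Everything beyond the metric and namespace is my own assumption here: the function name, the threshold default, the one-minute window, and treating missing data as healthy are all choices you would want to tune.

```python
def error_5xx_alarm(api_name: str, stage: str, path: str, method: str,
                    alarm_topic_arn: str, max_errors: int = 0) -> dict:
    """Build PutMetricAlarm parameters for one endpoint's 5xx error count."""
    return {
        # Hypothetical naming convention.
        "AlarmName": f"API [{api_name}] stage [{stage}] {method} {path} : 5xx errors",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Resource", "Value": path},
            {"Name": "Method", "Value": method},
            {"Name": "Stage", "Value": stage},
        ],
        "Statistic": "Sum",        # total errors within each period
        "Period": 60,              # assumed: evaluate minute by minute
        "EvaluationPeriods": 1,    # assumed: alert on the first bad minute
        "Threshold": float(max_errors),
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumed: an endpoint with no traffic should not raise the alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [alarm_topic_arn],
    }
```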
So there you have it! A useful pattern for automating away manual operational tasks.
And before you tell me about the AWS Alerts Serverless plugin by the ACloudGuru folks, yes, I’m aware of it. It looks neat, but it’s ultimately still something the developer has to remember to do.
That requires discipline.
My experience tells me that you cannot rely on discipline, ever. Which is why I prefer to have a platform in place that will generate these alarms instead.