AWS Lambda Is Not a Magic Reliability Wand

A 100-line Lambda function runs fine for months, then goes down for two hours, and finally recovers on its own. Cost savings or reliability: pick one.

I recently got an email alert about a Lambda with an elevated error rate. The key error message in the logs was: “getaddrinfo EMFILE rds.mycompany.com”. No, that is not a DNS server failure. It is Node.js saying it has run out of file descriptors and is stuck. API Gateway was returning HTTP 502 errors for all requests.

We looked at the source code but found nothing obvious (disclaimer: I don’t write Node.js). The code has minimal dependencies. It creates a MySQL connection during each invocation, and there are no global variables referencing it, so presumably garbage collection should eventually close the socket…?

After two hours, the problem just went away. My guess is the Lambda container was recycled.

The next day I decided to do more testing and added this instrumentation code to the main function (of course _getActiveHandles is undocumented):

let handles = process._getActiveHandles()
console.info("HANDLE COUNT: " + handles.length + "\n")
console.info("HANDLES\n" + JSON.stringify(handles, null, 2))

And sure enough, when calling the Lambda in a loop, the handle count climbs past 900 (no, sadly it didn’t get over 9000) and then every invocation fails with FunctionError: Unhandled. The Lambda file descriptor limit is 1024, so this makes sense.

Neither the mysql2 nor the mysql docs for Node.js had an example of closing file descriptors in an exception-safe way in code that uses await. But we added a try/finally that manually closes the database connection, and that fixed the leak.

let conn = await MysqlDb.connect();
try {
  await do_queries_with_connection(conn);
} finally {
  // Without this, sockets are leaked
  conn.end();
}
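
The same pattern can be wrapped in a small helper so that no call site can forget the cleanup. This is only a sketch: `withConnection` and `makeFakeConnect` are hypothetical names, and the only assumption about the connection object is that it exposes an `end()` method, as in the snippet above. The fake connection exists so the example runs without a database.

```javascript
// Hypothetical helper: run `work` with a connection and always close it.
// `connect` is any async factory returning an object with an `end()` method.
async function withConnection(connect, work) {
  const conn = await connect();
  try {
    return await work(conn);
  } finally {
    // Runs on success and on throw, so the socket is released either way.
    await conn.end();
  }
}

// Stand-in connection factory (illustrative only, not a real driver).
function makeFakeConnect(log) {
  return async () => ({
    query: async (sql) => {
      if (sql === "BAD") throw new Error("query failed");
      return [{ ok: 1 }];
    },
    end: async () => {
      log.push("closed");
    },
  });
}
```

A caller would write something like `await withConnection(connect, (conn) => conn.query("SELECT 1"))`. One trade-off to keep in mind: if `end()` itself throws inside the `finally`, that error replaces the original one.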

Lambda: Back to the Past

I have learned to be very wary of “connection pools” and “caches” when making reliable services. These add hard-to-test, timing-dependent edge cases. Connection caching causes problems with load balancing (not shifting load quickly to the least loaded servers) and DNS failover (not honoring the TTL). I have seen downtime caused by a popular open-source connection pool getting stuck after a weird TLS error its developers had never encountered. In contrast, I admire the Route 53 design concept of “constant work”, which is the opposite of caching.

But the Lambda docs highly recommend connection pooling and caching, and don’t point out the drawbacks — i.e. premature optimization. Lambda itself caches your warm containers. Sure, it improves performance and reduces cost. But there is always a cost somewhere — in this case, a big reliability and testing cost. There is no free lunch. How many of you test that your Lambdas succeed after 1024 invocations? :)
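
One way to test for this is a soak test that calls the handler more than 1024 times in a single process and checks that some resource counter has not grown. The sketch below is an assumption-laden illustration: `soakTest`, `leakyHandler`, and `tidyHandler` are hypothetical names, and the "resources" are entries in a plain Set so the example runs anywhere. In a real Node.js test, `countOpen` could wrap the undocumented `process._getActiveHandles().length` shown earlier.

```javascript
// Hypothetical soak test: invoke a handler repeatedly in one process and
// return how much the caller-supplied resource counter grew.
async function soakTest(handler, countOpen, iterations = 1100) {
  const before = countOpen();
  for (let i = 0; i < iterations; i++) {
    await handler({ requestNumber: i });
  }
  return countOpen() - before;
}

// Fake resource table standing in for open sockets.
const open = new Set();

// "Opens" a resource on every call and never releases it.
async function leakyHandler(event) {
  open.add(event.requestNumber);
}

// Releases the resource in a finally block, even if the work throws.
async function tidyHandler(event) {
  open.add(event.requestNumber);
  try {
    // ... the real work would happen here ...
  } finally {
    open.delete(event.requestNumber);
  }
}
```

Running `soakTest(tidyHandler, () => open.size)` should report no growth, while the leaky handler grows by one entry per call. The default of 1100 iterations is chosen only because it exceeds the 1024 descriptor limit.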

So writing “serverless” Lambda code is, sadly, just like any other “serverful” programming you have done: you have to carefully ensure all your file descriptors are closed after every request, which even garbage-collected languages struggle with, or ensure you have a connection pool that is reliable. Neither option is trivial.

“Adjusting to the requirement for perfection is, I think, the most difficult part of learning to program.” (The Mythical Man-Month)

The recent RDS Proxy service acknowledges this problem:

With RDS Proxy, you no longer need code that handles cleaning up idle connections and managing connection pools. Your function code is cleaner, simpler, and easier to maintain.

I can attest that it is indeed simpler, but only for languages that dispose of sockets sanely… I wish more languages used RAII or refcounted GC to force immediate cleanup, because a language should serve us and not be a constant source of foot-guns.

Ironically, we were using provisioned concurrency on this Lambda — we were running it like a “serverful” instance (with higher cost) but had no way to SSH in and debug it when it was hung. Be extra careful when running in this mode, because your container is even less likely to be recycled, and ask yourself why you’re not just using ECS or EC2.

Perhaps Lambda needs a container-level shallow health check, just like we have for EC2 and ECS. This could check whether file descriptor or memory usage is above 50% and, if so, force a container recycle. Because if it walks like a server, quacks like a server, and hangs like a server…
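
Until something like that exists, the decision part of such a check is easy to sketch in user code. Everything here is hypothetical: `shouldRecycle` and `snapshot` are made-up names, the 1024 limit is the figure quoted earlier, and the memory limit is a parameter the caller must supply. Note that `process._getActiveHandles()` is undocumented and counts libuv handles such as sockets, not every file descriptor, so it only approximates descriptor usage.

```javascript
// Assumed descriptor limit, taken from the figure quoted in this post.
const FD_LIMIT = 1024;

// Pure decision function: true when either resource exceeds the threshold.
function shouldRecycle(
  { openHandles, fdLimit = FD_LIMIT, memoryUsedBytes, memoryLimitBytes },
  threshold = 0.5
) {
  const fdRatio = openHandles / fdLimit;
  const memRatio = memoryUsedBytes / memoryLimitBytes;
  return fdRatio > threshold || memRatio > threshold;
}

// Snapshot of the current process. `memoryLimitBytes` is supplied by the
// caller (for example, the function's configured memory size).
function snapshot(memoryLimitBytes) {
  return {
    openHandles: process._getActiveHandles().length,
    memoryUsedBytes: process.memoryUsage().rss,
    memoryLimitBytes,
  };
}
```

A handler could log `shouldRecycle(snapshot(limit))` at the end of each invocation and alarm on it. What to do when it returns true (how to actually force a recycle) is left open here, since that part is the platform's job in the proposal above.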

Translated from: https://medium.com/@karl.pickett/aws-lambda-is-not-a-magic-reliability-wand-91da728acba
