by Yan Cui
AWS Step Functions: how to implement semaphores for state machines
Here at DAZN, we are migrating from our legacy platform into the brave new world of microfrontends and microservices. Along the way, we also discovered the delights that AWS Step Functions have to offer. For example…
- flexible error handling and retry
- the understated ability to wait between tasks
- the ability to mix automated steps with activities that require human intervention
In some cases, we need to control the number of concurrent state machine executions that can access a shared resource. This might be a business requirement. Or it could be due to scalability concerns for the shared resource. It might also be a result of the design of our state machine which makes it difficult to parallelise.
We came up with a few solutions that fall into two general categories:
- Control the number of executions that you can start
- Allow concurrent executions to start, but block an execution from entering the critical path until it's able to acquire a semaphore (that is, a signal to proceed)
Control the number of concurrent executions
You can control the MAX number of concurrent executions by introducing an SQS queue. A CloudWatch schedule will trigger a Lambda function to:
- check how many concurrent executions there are
- if there are N executions, then we can start MAX-N executions
- poll SQS for MAX-N messages, and start a new execution for each
We're not using the new SQS trigger for Lambda here, because the purpose is to slow down the creation of new executions, whereas the SQS trigger would push tasks to our Lambda function eagerly.
Also, you should use a FIFO queue so that tasks are processed in the same order they’re added to the queue.
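The three steps of the scheduled function can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the implementation used at DAZN: `startQueuedExecutions`, `queueUrl`, and `max` are hypothetical names, the `sfn` and `sqs` arguments are assumed to be AWS SDK v2 `StepFunctions` and `SQS` clients, and each message body is assumed to hold the JSON input for one execution.

```javascript
// Sketch of the scheduled Lambda logic. The clients are passed in
// (an assumption made here) so the logic can be exercised with fakes.
const SQS_BATCH_LIMIT = 10 // ReceiveMessage returns at most 10 messages per call

async function startQueuedExecutions ({ sfn, sqs, stateMachineArn, queueUrl, max }) {
  // 1. check how many concurrent executions there are
  //    (only the first page is read; follow nextToken if MAX is large)
  const { executions } = await sfn
    .listExecutions({ stateMachineArn, statusFilter: 'RUNNING' })
    .promise()

  // 2. with N running executions, we can start MAX-N more
  const slots = Math.max(0, max - executions.length)
  if (slots === 0) return 0

  // 3. poll SQS for up to MAX-N messages and start one execution for each
  const { Messages = [] } = await sqs
    .receiveMessage({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: Math.min(slots, SQS_BATCH_LIMIT)
    })
    .promise()

  for (const msg of Messages) {
    // assumption: the message body is the JSON input for the execution
    await sfn.startExecution({ stateMachineArn, input: msg.Body }).promise()
    // delete only after the execution has started, so a failure leaves the task queued
    await sqs.deleteMessage({ QueueUrl: queueUrl, ReceiptHandle: msg.ReceiptHandle }).promise()
  }
  return Messages.length
}
```

Because a single receive call is capped at 10 messages, this sketch starts at most 10 executions per run; a larger MAX would need a loop around the receive step.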
Block execution using semaphores
You can use the ListExecutions API to find out how many executions are in the RUNNING state. You can then sort them by startDate and only allow the oldest executions to transition to states that access the shared resource.
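As a rough illustration of the listing and sorting step, the sketch below pages through ListExecutions and returns the RUNNING executions oldest first. The function name `listRunningOldestFirst` is hypothetical, and `sfn` is assumed to be an AWS SDK v2 `StepFunctions` client passed in by the caller.

```javascript
// Collect every RUNNING execution, following nextToken across pages,
// then order them by startDate so the oldest comes first.
async function listRunningOldestFirst (sfn, stateMachineArn) {
  const all = []
  let nextToken
  do {
    const page = await sfn
      .listExecutions({ stateMachineArn, statusFilter: 'RUNNING', nextToken })
      .promise()
    all.push(...page.executions)
    nextToken = page.nextToken
  } while (nextToken)

  return all.sort((a, b) => a.startDate.getTime() - b.startDate.getTime())
}
```

Each page is a separate API call, so a large backlog of running executions multiplies the number of ListExecutions requests made per check.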
Take the following state machine for instance.
The OnlyOneShallRunAtOneTime state invokes the one-shall-pass Lambda function and returns a proceed flag. The Shall Pass? state then branches the flow of this execution based on the proceed flag.
    OnlyOneShallRunAtOneTime:
      Type: Task
      Resource: arn:aws:lambda:us-east-1:xxx:function:one-shall-pass
      Next: Shall Pass?
    Shall Pass?:
      Type: Choice
      Choices:
        - Variable: $.proceed # check if this execution should proceed
          BooleanEquals: true
          Next: SetWriteThroughputDeltaForScaleUp
      Default: WaitToProceed # otherwise wait and try again later
    WaitToProceed:
      Type: Wait
      Seconds: 60
      Next: OnlyOneShallRunAtOneTime
The tricky thing here is how to associate the Lambda invocation with the corresponding Step Function execution. Unfortunately, Step Functions do not pass the execution ARN to the Lambda function. Instead, we have to pass the execution name as part of the input when we start the execution.
    const name = uuid().replace(/-/g, '_')
    const input = JSON.stringify({ name, bucketName, fileName, mode })
    const req = { stateMachineArn, name, input }
    const resp = await SFN.startExecution(req).promise()
When the one-shall-pass function runs, it can use the execution name from the input. It's then able to match the invocation against the executions returned by ListExecutions.
In this particular case, only the oldest execution can proceed. All other executions would transition to the WaitToProceed state.
    module.exports.handler = async (input, context) => {
      const executions = await listRunningExecutions()
      Log.info(`found ${executions.length} RUNNING executions`)

      const oldest = _.sortBy(executions, x => x.startDate.getTime())[0]
      Log.info(`the oldest execution is [${oldest.name}]`)

      if (oldest.name === input.name) {
        return { ...input, proceed: true }
      } else {
        return { ...input, proceed: false }
      }
    }
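The handler above admits a single execution at a time. If the shared resource can tolerate more than one, the same idea might be extended to a counting semaphore. The sketch below is an assumption rather than part of the original design: `shallPass` and `permits` are hypothetical names, and `executions` is expected to hold items shaped like ListExecutions results, each with a `name` and a `startDate`.

```javascript
// Returns true when the named execution is among the `permits` oldest
// running executions, i.e. when it holds one of the semaphore's slots.
function shallPass (executions, name, permits = 1) {
  const oldest = [...executions] // copy so the caller's array is not reordered
    .sort((a, b) => a.startDate.getTime() - b.startDate.getTime())
    .slice(0, permits)
  return oldest.some(x => x.name === name)
}
```

With `permits` left at 1, this gives the same answer as the comparison against the single oldest execution; the handler would then return `{ ...input, proceed: shallPass(executions, input.name, permits) }`.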
Compare the approaches
Let’s compare the two approaches against the following criteria:
- Scalability. How well does the approach cope as the number of concurrent executions goes up?
- Simplicity. How many moving parts does the approach add?
- Cost. How much extra cost does the approach add?
Scalability
Approach 2 (blocking executions) has two problems when you have a large number of concurrent executions.
First, you can hit the regional throttling limit on the ListExecutions API call, because every blocked execution calls it each time it checks whether it can proceed.
Second, if you have configured a timeout on your state machine (and you should!), blocked executions can time out while they wait their turn. This creates backpressure on the system.
Approach 1 (with SQS) is far more scalable by comparison. Queued tasks are not started until they are allowed to start, so no backpressure. Only the cron Lambda function needs to list executions, so you’re also unlikely to reach API limits.
Simplicity
Approach 1 introduces new pieces to the infrastructure: an SQS queue, a CloudWatch schedule, and a Lambda function. It also forces the producers to change, since they now have to queue tasks in SQS instead of starting executions themselves.
With approach 2, a new Lambda function is needed for the additional step, but it’s part of the state machine.
Cost
Approach 1 introduces minimal baseline cost even when there are no executions. However, we are talking about cents here…
Approach 2 introduces additional state transitions, which cost around $25 per million. See the Step Functions pricing page for more details. Since each execution incurs 3 transitions per minute while it's blocked, the cost of these transitions can pile up quickly.
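To get a feel for how the transitions add up, here is a small back-of-the-envelope calculation. It uses only the figures quoted above (roughly $25 per million transitions and 3 transitions per blocked minute); the function name `blockedCost` and the example workloads are illustrative assumptions, and actual pricing should be checked against the pricing page.

```javascript
const PRICE_PER_TRANSITION = 25 / 1e6 // USD, from "around $25 per million"
const TRANSITIONS_PER_MINUTE = 3 // task, choice and wait states, once per 60-second loop

// Approximate cost of keeping executions blocked in the wait loop.
function blockedCost (blockedExecutions, minutesBlocked) {
  const transitions = blockedExecutions * minutesBlocked * TRANSITIONS_PER_MINUTE
  return transitions * PRICE_PER_TRANSITION
}

console.log(blockedCost(1, 60).toFixed(4)) // one execution blocked for an hour: 0.0045
console.log(blockedCost(100, 60).toFixed(2)) // 100 executions blocked for an hour: 0.45
```

A single blocked execution is cheap, but the cost grows linearly with both the number of waiting executions and how long they wait.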
Conclusions
Given the two approaches we considered here, using SQS is by far the more scalable. It is also more cost effective as the number of concurrent executions goes up.
But, you need to manage additional infrastructure and force upstream systems to change. This can impact other teams, and ultimately affects your ability to deliver on time.
If you do not expect a high number of executions, then you might be better off going with the second approach.