Queue-Based Load Leveling Pattern

  • Article
  • 08/26/2015
  • 5 minutes to read

Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads that may otherwise cause the service to fail or the task to time out. This pattern can help to minimize the impact of peaks in demand on availability and responsiveness for both the task and the service.


Context and Problem

Many solutions in the cloud involve running tasks that invoke services. In this environment, if a service is subjected to intermittent heavy loads, it can cause performance or reliability issues.


A service could be a component that is part of the same solution as the tasks that utilize it, or it could be a third-party service providing access to frequently used resources such as a cache or a storage service. If the same service is utilized by a number of tasks running concurrently, it can be difficult to predict the volume of requests to which the service might be subjected at any given point in time.


It is possible that a service might experience peaks in demand that cause it to become overloaded and unable to respond to requests in a timely manner. Flooding a service with a large number of concurrent requests may also result in the service failing if it is unable to handle the contention that these requests could cause.


Solution

Refactor the solution and introduce a queue between the task and the service. The task and the service run asynchronously. The task posts a message containing the data required by the service to a queue. The queue acts as a buffer, storing the message until it is retrieved by the service. The service retrieves the messages from the queue and processes them. Requests from a number of tasks, which can be generated at a highly variable rate, can be passed to the service through the same message queue. Figure 1 shows this structure.



Figure 1 - Using a queue to level the load on a service


The queue effectively decouples the tasks from the service, and the service can handle the messages at its own pace irrespective of the volume of requests from concurrent tasks. Additionally, there is no delay to a task if the service is not available at the time it posts a message to the queue.
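The decoupling described above can be sketched in a few lines of C#. This is a minimal illustration under stated assumptions, not a production design: an in-memory `Channel<string>` stands in for a durable message queue, and the names `LoadLevelingSketch` and `RunAsync`, as well as the one-millisecond simulated processing time, are hypothetical and do not come from the original guidance.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class LoadLevelingSketch
{
    // Several tasks post messages in a burst; a single service loop drains
    // the queue at its own pace. Returns the messages the service processed.
    public static async Task<List<string>> RunAsync(int taskCount, int messagesPerTask)
    {
        // Assumption: the in-memory channel stands in for a durable message queue.
        var queue = Channel.CreateUnbounded<string>();
        var processed = new List<string>();

        // Service: retrieves messages one at a time, however fast they arrive.
        Task service = Task.Run(async () =>
        {
            await foreach (string message in queue.Reader.ReadAllAsync())
            {
                await Task.Delay(1); // Simulated processing time.
                processed.Add(message);
            }
        });

        // Tasks: post their messages and finish without waiting for the service.
        var producers = new List<Task>();
        for (int t = 0; t < taskCount; t++)
        {
            int taskId = t;
            producers.Add(Task.Run(async () =>
            {
                for (int i = 0; i < messagesPerTask; i++)
                {
                    await queue.Writer.WriteAsync($"task{taskId}-msg{i}");
                }
            }));
        }

        await Task.WhenAll(producers);
        queue.Writer.Complete(); // No more messages: lets the service loop end.
        await service;
        return processed;
    }
}
```

Calling `await LoadLevelingSketch.RunAsync(4, 5)` posts 20 messages from four concurrent tasks. The producers complete as soon as their messages are queued, while the service works through the backlog at a steady rate.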


This pattern provides the following benefits:


  • It can help to maximize availability because delays arising in services will not have an immediate and direct impact on the application, which can continue to post messages to the queue even when the service is not available or is not currently processing messages.
  • It can help to maximize scalability because both the number of queues and the number of services can be varied to meet demand.
  • It can help to control costs because the number of service instances deployed needs only to be sufficient to meet average load rather than the peak load.



Some services may implement throttling if demand reaches a threshold beyond which the system could fail. Throttling may reduce the functionality available. You might be able to implement load leveling with these services to ensure that this threshold is not reached.


Issues and Considerations

Consider the following points when deciding how to implement this pattern:


  • It is necessary to implement application logic that controls the rate at which services handle messages to avoid overwhelming the target resource. Avoid passing spikes in demand to the next stage of the system. Test the system under load to ensure that it provides the required leveling, and adjust the number of queues and the number of service instances that handle messages to achieve this.
  • Message queues are a one-way communication mechanism. If a task expects a reply from a service, it may be necessary to implement a mechanism that the service can use to send a response. For more information, see the Asynchronous Messaging Primer.
  • You must be careful if you apply autoscaling to services that are listening for requests on the queue because this may result in increased contention for any resources that these services share, and diminish the effectiveness of using the queue to level the load.

When to Use this Pattern

This pattern is ideally suited to any type of application that uses services that may be subject to overloading.


This pattern might not be suitable if the application expects a response from the service with minimal latency.


Example

A Microsoft Azure web role stores data by using a separate storage service. If a large number of instances of the web role run concurrently, it is possible that the storage service could be overwhelmed and be unable to respond to requests quickly enough to prevent these requests from timing out or failing. Figure 2 highlights this issue.


Figure 2 - A service being overwhelmed by a large number of concurrent requests from instances of a web role

To resolve this issue, you can use a queue to level the load between the web role instances and the storage service. However, the storage service is designed to accept synchronous requests and cannot be easily modified to read messages and manage throughput. Therefore, you can introduce a worker role to act as a proxy service that receives requests from the queue and forwards them to the storage service. The application logic in the worker role can control the rate at which it passes requests to the storage service to prevent the storage service from being overwhelmed. Figure 3 shows this solution.


Figure 3 - Using a queue and a worker role to level the load between instances of the web role and the service
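The proxy logic in the worker role might look something like the following sketch. It is an illustration only and makes several assumptions: a `ConcurrentQueue<string>` stands in for the real queue, the `forwardToStorage` delegate stands in for the call to the storage service, and "controlling the rate" is interpreted here as capping the number of requests in flight with a `SemaphoreSlim`. The class name `ThrottlingProxy` is hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class ThrottlingProxy
{
    private readonly SemaphoreSlim _slots;
    private readonly Func<string, Task> _forwardToStorage;

    // maxConcurrentRequests: assumed limit the storage service can tolerate.
    // forwardToStorage: hypothetical delegate that calls the storage service.
    public ThrottlingProxy(int maxConcurrentRequests, Func<string, Task> forwardToStorage)
    {
        _slots = new SemaphoreSlim(maxConcurrentRequests);
        _forwardToStorage = forwardToStorage;
    }

    // Drains the queue while never allowing more than the configured number
    // of requests to be in flight against the storage service at once.
    public async Task DrainAsync(ConcurrentQueue<string> queue)
    {
        var inFlight = new List<Task>();
        while (queue.TryDequeue(out var request))
        {
            await _slots.WaitAsync(); // Wait for a free slot before forwarding.
            inFlight.Add(ForwardAsync(request));
        }
        await Task.WhenAll(inFlight);
    }

    private async Task ForwardAsync(string request)
    {
        try
        {
            await _forwardToStorage(request);
        }
        finally
        {
            _slots.Release(); // Free the slot even if the forward fails.
        }
    }
}
```

A concurrency cap is only one way to pace the proxy. A fixed delay between requests or a token-bucket limiter would serve the same purpose if the storage service is sensitive to request rate rather than to the number of simultaneous requests.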

Related Patterns and Guidance

The following patterns and guidance may also be relevant when implementing this pattern:


  • Asynchronous Messaging Primer. Message queues are an inherently asynchronous communications mechanism. It may be necessary to redesign the application logic in a task if it is adapted from communicating directly with a service to using a message queue. Similarly, it may be necessary to refactor a service to accept requests from a message queue (alternatively, it may be possible to implement a proxy service, as described in the example).
  • Competing Consumers Pattern. It may be possible to run multiple instances of a service, each of which acts as a message consumer from the load-leveling queue. You can use this approach to adjust the rate at which messages are received and passed to a service.
  • Throttling Pattern. A simple way to implement throttling with a service is to use queue-based load leveling and route all requests to a service through a message queue. The service can then process requests at a rate that ensures the resources it requires are not exhausted and that reduces the amount of contention that could occur.

Retry Pattern

  • Article
  • 08/26/2015
  • 10 minutes to read

Enable an application to handle anticipated, temporary failures when it attempts to connect to a service or network resource by transparently retrying an operation that has previously failed in the expectation that the cause of the failure is transient. This pattern can improve the stability of the application.


Context and Problem

An application that communicates with elements running in the cloud must be sensitive to the transient faults that can occur in this environment. Such faults include the momentary loss of network connectivity to components and services, the temporary unavailability of a service, or timeouts that arise when a service is busy.


These faults are typically self-correcting, and if the action that triggered a fault is repeated after a suitable delay it is likely to be successful. For example, a database service that is processing a large number of concurrent requests may implement a throttling strategy that temporarily rejects any further requests until its workload has eased. An application attempting to access the database may fail to connect, but if it tries again after a suitable delay it may succeed.


Solution

In the cloud, transient faults are not uncommon and an application should be designed to handle them elegantly and transparently, minimizing the effects that such faults might have on the business tasks that the application is performing.


If an application detects a failure when it attempts to send a request to a remote service, it can handle the failure by using the following strategies:


  • If the fault indicates that the failure is not transient, or that the operation is unlikely to succeed if repeated (for example, an authentication failure caused by providing invalid credentials is unlikely to succeed no matter how many times it is attempted), the application should abort the operation and report a suitable exception.
  • If the specific fault reported is unusual or rare, it may have been caused by freak circumstances such as a network packet becoming corrupted while it was being transmitted. In this case, the application could retry the failing request immediately because the same failure is unlikely to be repeated and the request will probably be successful.
  • If the fault is caused by one of the more commonplace connectivity or “busy” failures, the network or service may require a short period while the connectivity issues are rectified or the backlog of work is cleared. The application should wait for a suitable time before retrying the request.

For the more common transient failures, the period between retries should be chosen so as to spread requests from multiple instances of the application as evenly as possible. This can reduce the chance of a busy service continuing to be overloaded. If many instances of an application are continually bombarding a service with retry requests, it may take the service longer to recover.


If the request still fails, the application can wait for a further period and make another attempt. If necessary, this process can be repeated with increasing delays between retry attempts until some maximum number of requests have been attempted and failed. The delay time can be increased incrementally, or a timing strategy such as exponential back-off can be used, depending on the nature of the failure and the likelihood that it will be corrected during this time.
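The timing strategies mentioned above can be expressed as small delay calculations. The following is a sketch only: the class name `RetryDelays`, the 1-based `attempt` numbering, and the choice of "full jitter" (a random delay between zero and the exponential ceiling) as the way to spread retries from multiple instances are all assumptions made for illustration.

```csharp
using System;

public static class RetryDelays
{
    // Incremental: baseDelay, 2 x baseDelay, 3 x baseDelay, ...
    // attempt is 1 for the first retry.
    public static TimeSpan Incremental(int attempt, TimeSpan baseDelay) =>
        TimeSpan.FromMilliseconds(baseDelay.TotalMilliseconds * attempt);

    // Exponential back-off: baseDelay, 2 x, 4 x, 8 x, ... capped at maxDelay.
    public static TimeSpan Exponential(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        double milliseconds = baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1);
        return TimeSpan.FromMilliseconds(Math.Min(milliseconds, maxDelay.TotalMilliseconds));
    }

    // Picks a random delay between zero and the exponential ceiling so that
    // many application instances do not all retry at the same moment.
    public static TimeSpan ExponentialWithJitter(
        int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random random)
    {
        TimeSpan ceiling = Exponential(attempt, baseDelay, maxDelay);
        return TimeSpan.FromMilliseconds(random.NextDouble() * ceiling.TotalMilliseconds);
    }
}
```

With a base delay of 100 milliseconds and a cap of 10 seconds, the exponential strategy waits 100, 200, and 400 milliseconds before the first three retries and never more than 10 seconds. The jittered variant uses those values as upper bounds.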


Figure 1 illustrates this pattern. If the request is unsuccessful after a predefined number of attempts, the application should treat the fault as an exception and handle it accordingly.



Figure 1 - Invoking an operation in a hosted service using the Retry pattern

The application should wrap all attempts to access a remote service in code that implements a retry policy matching one of the strategies listed above. Requests sent to different services can be subject to different policies, and some vendors provide libraries that encapsulate this approach. These libraries typically implement policies that are parameterized, and the application developer can specify values for items such as the number of retries and the time between retry attempts.


The code in an application that detects faults and retries failing operations should log the details of these failures. This information may be useful to operators. If a service is frequently reported as unavailable or busy, it is often because the service has exhausted its resources. You may be able to reduce the frequency with which these faults occur by scaling out the service. For example, if a database service is continually overloaded, it may be beneficial to partition the database and spread the load across multiple servers.




Microsoft Azure provides extensive support for the Retry pattern. The patterns & practices Transient Fault Handling Block enables an application to handle transient faults in many Azure services using a range of retry strategies. The Microsoft Entity Framework version 6 provides facilities for retrying database operations. Additionally, many of the Azure Service Bus and Azure Storage APIs implement retry logic transparently.

Issues and Considerations

You should consider the following points when deciding how to implement this pattern:


  • The retry policy should be tuned to match the business requirements of the application and the nature of the failure. It may be better for some noncritical operations to fail fast rather than retry several times and impact the throughput of the application. For example, in an interactive web application that attempts to access a remote service, it may be better to fail after a smaller number of retries with only a short delay between retry attempts, and display a suitable message to the user (for example, “please try again later”) to prevent the application from becoming unresponsive. For a batch application, it may be more appropriate to increase the number of retry attempts with an exponentially increasing delay between attempts.
  • A highly aggressive retry policy with minimal delay between attempts, and a large number of retries, could further degrade a busy service that is running close to or at capacity. This retry policy could also affect the responsiveness of the application if it is continually attempting to perform a failing operation rather than doing useful work.
  • If a request still fails after a significant number of retries, it may be better for the application to prevent further requests going to the same resource for a period and simply report a failure immediately. When the period expires, the application may tentatively allow one or more requests through to see whether they are successful. For more details of this strategy, see the Circuit Breaker pattern.
  • The operations in a service that are invoked by an application that implements a retry policy may need to be idempotent. For example, a request sent to a service may be received and processed successfully but, due to a transient fault, the service may be unable to send a response indicating that the processing has completed. The retry logic in the application might then attempt to repeat the request on the assumption that the first request was not received.
  • A request to a service may fail for a variety of reasons and raise different exceptions, depending on the nature of the failure. Some exceptions may indicate a failure that could be resolved very quickly, while others may indicate that the failure is longer lasting. It may be beneficial for the retry policy to adjust the time between retry attempts based on the type of the exception.
  • Consider how retrying an operation that is part of a transaction will affect the overall transaction consistency. It may be useful to fine tune the retry policy for transactional operations to maximize the chance of success and reduce the need to undo all the transaction steps.
  • Ensure that all retry code is fully tested against a variety of failure conditions. Check that it does not severely impact the performance or reliability of the application, cause excessive load on services and resources, or generate race conditions or bottlenecks.
  • Implement retry logic only where the full context of a failing operation is understood. For example, if a task that contains a retry policy invokes another task that also contains a retry policy, this extra layer of retries can add long delays to the processing. It may be better to configure the lower-level task to fail fast and report the reason for the failure back to the task that invoked it. This higher-level task can then decide how to handle the failure based on its own policy.
  • It is important to log all connectivity failures that prompt a retry so that underlying problems with the application, services, or resources can be identified.
  • Investigate the faults that are most likely to occur for a service or a resource to discover if they are likely to be long lasting or terminal. If this is the case, it may be better to handle the fault as an exception. The application can report or log the exception, and then attempt to continue either by invoking an alternative service (if there is one available), or by offering degraded functionality. For more information on how to detect and handle long-lasting faults, see the Circuit Breaker pattern.

When to Use this Pattern

Use this pattern:


  • When an application could experience transient faults as it interacts with a remote service or accesses a remote resource. These faults are expected to be short lived, and repeating a request that has previously failed could succeed on a subsequent attempt.

This pattern might not be suitable:


  • When a fault is likely to be long lasting, because this can affect the responsiveness of an application. The application may simply be wasting time and resources attempting to repeat a request that is most likely to fail.
  • For handling failures that are not due to transient faults, such as internal exceptions caused by errors in the business logic of an application.
  • As an alternative to addressing scalability issues in a system. If an application experiences frequent “busy” faults, it is often an indication that the service or resource being accessed should be scaled up.

Example

This example illustrates an implementation of the Retry pattern. The OperationWithBasicRetryAsync method, shown below, invokes an external service asynchronously through the TransientOperationAsync method (the details of this method will be specific to the service and are omitted from the sample code).

C#

using System;
using System.Diagnostics;
using System.Threading.Tasks;

private int retryCount = 3;
// Assumed value: the listing does not show how the delay is defined.
private readonly TimeSpan delay = TimeSpan.FromSeconds(5);
...
public async Task OperationWithBasicRetryAsync()
{
  int currentRetry = 0;

  for (; ;)
  {
    try
    {
      // Calling external service.
      await TransientOperationAsync();

      // Return or break.
      break;
    }
    catch (Exception ex)
    {
      Trace.TraceError("Operation Exception");

      currentRetry++;

      // Check if the exception thrown was a transient exception
      // based on the logic in the error detection strategy.
      // Determine whether to retry the operation, as well as how
      // long to wait, based on the retry strategy.
      if (currentRetry > this.retryCount || !IsTransient(ex))
      {
        // If this is not a transient error
        // or we should not retry re-throw the exception.
        throw;
      }
    }

    // Wait to retry the operation.
    // Consider calculating an exponential delay here and
    // using a strategy best suited for the operation and fault.
    await Task.Delay(delay);
  }
}

// Async method that wraps a call to a remote service (details not shown).
private async Task TransientOperationAsync()
{
  ...
}

The statement that invokes this method is encapsulated within a try/catch block wrapped in a for loop. The for loop exits if the call to the TransientOperationAsync method succeeds without throwing an exception. If the TransientOperationAsync method fails, the catch block examines the reason for the failure, and if it is deemed to be a transient error the code waits for a short delay before retrying the operation.

The for loop also tracks the number of times that the operation has been attempted. If the operation still fails after three retries (four attempts in total), the fault is assumed to be more long lasting. If the exception is not transient or it is long lasting, the catch handler rethrows the exception. This exception exits the for loop and should be caught by the code that invokes the OperationWithBasicRetryAsync method.

The IsTransient method, shown below, checks for a specific set of exceptions that are relevant to the environment in which the code is run. The definition of a transient exception may vary according to the resources being accessed and the environment in which the operation is being performed.

C#

using System;
using System.Linq;
using System.Net;

private bool IsTransient(Exception ex)
{
  // Determine if the exception is transient.
  // In some cases this may be as simple as checking the exception type, in other
  // cases it may be necessary to inspect other properties of the exception.
  if (ex is OperationTransientException)
    return true;

  var webException = ex as WebException;
  if (webException != null)
  {
    // If the web exception contains one of the following status values
    // it may be transient.
    return new[] {WebExceptionStatus.ConnectionClosed,
                  WebExceptionStatus.Timeout,
                  WebExceptionStatus.RequestCanceled }.
            Contains(webException.Status);
  }

  // Additional exception checking logic goes here.
  return false;
}

Related Patterns and Guidance

The following pattern may also be relevant when implementing this pattern:


  • Circuit Breaker Pattern. The Retry pattern is ideally suited to handling transient faults. If a failure is expected to be more long lasting, it may be more appropriate to implement the Circuit Breaker Pattern. The Retry pattern can also be used in conjunction with a circuit breaker to provide a comprehensive approach to handling faults.

Runtime Reconfiguration Pattern

  • Article
  • 08/26/2015
  • 12 minutes to read

Design an application so that it can be reconfigured without requiring redeployment or restarting the application. This helps to maintain availability and minimize downtime.


Context and Problem

A primary aim for important applications such as commercial and business websites is to minimize downtime and the consequent interruption to customers and users. However, at times it is necessary to reconfigure the application to change specific behavior or settings while it is deployed and in use. Therefore, it is an advantage for the application to be designed in such a way as to allow these configuration changes to be applied while it is running, and for the components of the application to detect the changes and apply them as soon as possible.


Examples of the kinds of configuration changes to be applied might be adjusting the granularity of logging to assist in debugging a problem with the application, swapping connection strings to use a different data store, or turning on or off specific sections or functionality of the application.


Solution

The solution for implementing this pattern depends on the features available in the application hosting environment. Typically, the application code will respond to one or more events that are raised by the hosting infrastructure when it detects a change to the application configuration. This is usually the result of uploading a new configuration file, or in response to changes in the configuration through the administration portal or by accessing an API.

Code that handles the configuration change events can examine the changes and apply them to the components of the application. It is necessary for these components to detect and react to the changes, and so the values they use will usually be exposed as writable properties or methods that the code in the event handler can set to new values or execute. From this point, the components should use the new values so that the required changes to the application behavior occur.


If it is not possible for the components to apply the changes at runtime, it will be necessary to restart the application so that these changes are applied when the application starts up again. In some hosting environments it may be possible to detect these types of changes, and indicate to the environment that the application must be restarted. In other cases it may be necessary to implement code that analyses the setting changes and forces an application restart when necessary.
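As a rough sketch of the two paragraphs above, a handler might apply the settings it recognizes to writable properties on the affected components and report the rest as requiring a restart. All names here (`LoggingComponent`, `ConfigurationChangeHandler`, the `LogLevel` setting key) are hypothetical, and the way the hosting environment delivers the changed settings is deliberately left out because it depends on the platform.

```csharp
using System;
using System.Collections.Generic;

public sealed class LoggingComponent
{
    // Writable at run time so that a configuration handler can change it
    // without restarting the application.
    public string LogLevel { get; set; } = "Warning";
}

public sealed class ConfigurationChangeHandler
{
    private readonly LoggingComponent _logging;

    public ConfigurationChangeHandler(LoggingComponent logging) => _logging = logging;

    // Applies the settings this handler knows how to change at run time.
    // Returns the names of settings it could not apply, which would
    // therefore need an application restart to take effect.
    public IReadOnlyList<string> Apply(IReadOnlyDictionary<string, string> changedSettings)
    {
        var needsRestart = new List<string>();
        foreach (var setting in changedSettings)
        {
            switch (setting.Key)
            {
                case "LogLevel":
                    _logging.LogLevel = setting.Value;
                    break;
                default:
                    needsRestart.Add(setting.Key);
                    break;
            }
        }
        return needsRestart;
    }
}
```

The code that receives the configuration change event would call `Apply` with the changed settings, then either signal the environment or schedule a restart if the returned list is not empty.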


Figure 1 shows an overview of this pattern.



Figure 1 - A basic overview of this pattern


Most environments expose events raised in response to configuration changes. In those that do not, a polling mechanism that regularly checks for changes to the configuration and applies these changes will be necessary. It may also be necessary to restart the application if the changes cannot be applied at runtime. For example, it may be possible to compare the date and time of a configuration file at preset intervals, and run code to apply the changes when a newer version is found. Another approach would be to incorporate a control in the administration UI of the application, or expose a secured endpoint that can be accessed from outside the application, that executes code that reads and applies the updated configuration.
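The file-timestamp approach described above could be sketched as follows. This is a minimal illustration, assuming the configuration lives in a single local file and that some external timer calls `CheckForChanges` at the chosen interval. The class name `ConfigFilePoller` and the `applyConfiguration` callback are hypothetical.

```csharp
using System;
using System.IO;

public sealed class ConfigFilePoller
{
    private readonly string _path;
    private readonly Action<string> _applyConfiguration;
    private DateTime _lastAppliedWriteTimeUtc = DateTime.MinValue;

    // path: location of the configuration file (an assumption of this sketch).
    // applyConfiguration: hypothetical callback that parses and applies the file.
    public ConfigFilePoller(string path, Action<string> applyConfiguration)
    {
        _path = path;
        _applyConfiguration = applyConfiguration;
    }

    // Call at a preset interval, for example from a timer.
    // Returns true if a newer version of the file was found and applied.
    public bool CheckForChanges()
    {
        if (!File.Exists(_path))
        {
            return false;
        }

        DateTime writeTimeUtc = File.GetLastWriteTimeUtc(_path);
        if (writeTimeUtc <= _lastAppliedWriteTimeUtc)
        {
            return false; // Nothing newer than what was already applied.
        }

        _applyConfiguration(File.ReadAllText(_path));
        _lastAppliedWriteTimeUtc = writeTimeUtc;
        return true;
    }
}
```

Recording the timestamp only after the callback returns means that a failed apply is retried on the next poll. Whether that is desirable depends on how the application should behave when it is given an invalid configuration.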

Alternatively, the application could react to some other change in the environment. For example, occurrences of a specific runtime error might change the logging configuration to automatically collect additional information, or the code could use the current date to read and apply a theme that reflects the season or a special event.


Issues and Considerations 问题及考虑

Consider the following points when deciding how to implement this pattern:


  • The configuration settings must be stored outside of the deployed application so that they can be updated without requiring the entire package to be redeployed. Typically the settings are stored in a configuration file, or in an external repository such as a database or online storage. Access to the runtime configuration mechanism should be strictly controlled, as well as strictly audited when used. 配置设置必须存储在已部署的应用程序之外,以便可以在不需要重新部署整个包的情况下更新它们。通常,这些设置存储在配置文件中,或存储在外部存储库(如数据库或联机存储)中。应严格控制对运行时配置机制的访问,并在使用时进行严格审核
  • If the hosting infrastructure does not automatically detect configuration change events, and expose these events to the application code, you must implement an alternative mechanism to detect and apply the changes. This may be through a polling mechanism, or by exposing an interactive control or endpoint that initiates the update process. 如果宿主基础结构不能自动检测配置更改事件,并将这些事件公开给应用程序代码,则必须实现一种替代机制来检测和应用更改。这可以通过轮询机制实现,也可以通过公开启动更新过程的交互式控件或端点实现
  • If you need to implement a polling mechanism, consider how often checks for updates to the configuration should take place. A long polling interval will mean that changes might not be applied for some time. A short interval might adversely affect operation by absorbing available compute and I/O resources. 如果需要实现轮询机制,请考虑对配置进行更新检查的频率。较长的轮询间隔将意味着更改可能在一段时间内不会应用。较短的间隔可能会吸收可用的计算和 I/O 资源,从而对操作产生不利影响
  • If there is more than one instance of the application, additional factors should be considered, depending on how changes are detected. If changes are detected automatically through events raised by the hosting infrastructure, these changes may not be detected by all instances of the application at the same time. This means that some instances will be using the original configuration for a period while others will use the new settings. If the update is detected through a polling mechanism, that mechanism must communicate the change to all instances in order to maintain consistency.
  • Some configuration changes may require the application to be restarted, or even require the hosting server to be rebooted. You must identify these types of configuration settings and perform the appropriate action for each one. For example, a change that requires the application to be restarted might trigger the restart automatically, or it might be the responsibility of the administrator to initiate the restart at a suitable time when the application is not under excessive load and other instances of the application can handle the load.
  • Plan for a staged rollout of updates and confirm they are successful, and that the updated application instances are performing correctly, before applying the update to all instances. This can prevent a total outage of the application should an error occur. Where the update requires a restart or a reboot of the application, particularly where the application has a significant startup or warm-up time, use a staged rollout approach to prevent multiple instances being offline at the same time.
  • Consider how you will roll back configuration changes that cause issues, or that result in failure of the application. For example, it should be possible to roll back a change immediately instead of waiting for a polling interval to detect the change.
  • Consider how the location of the configuration settings might affect application performance. For example, you should handle the error that will occur if the external store you use is unavailable when the application starts, or when configuration changes are to be applied, perhaps by using a default configuration or by caching the settings locally on the server and reusing these values while retrying access to the remote data store.
  • Caching can help to reduce delays if a component needs to repeatedly access configuration settings. However, when the configuration changes, the application code will need to invalidate the cached settings, and the component must use the updated settings.
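
The last two points can be sketched in code. The following is a minimal illustration, not part of the downloadable example: the CachedSettings class and its members are hypothetical names, and the external store is represented by a delegate that may throw when the store is unavailable. It serves cached values, re-reads a setting after it has been invalidated, and falls back to the last known value (or a default) if the store cannot be reached.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: caches settings read from an external store.
public sealed class CachedSettings
{
    private readonly Func<string, string> _readFromStore;   // may throw if the store is down
    private readonly IReadOnlyDictionary<string, string> _defaults;
    private readonly Dictionary<string, string> _cache = new Dictionary<string, string>();
    private readonly HashSet<string> _stale = new HashSet<string>();
    private readonly object _gate = new object();

    public CachedSettings(Func<string, string> readFromStore,
                          IReadOnlyDictionary<string, string> defaults)
    {
        _readFromStore = readFromStore;
        _defaults = defaults;
    }

    public string Get(string name)
    {
        lock (_gate)
        {
            // Serve from the cache unless the entry has been invalidated.
            if (_cache.TryGetValue(name, out var fresh) && !_stale.Contains(name))
                return fresh;
        }

        try
        {
            var value = _readFromStore(name);
            lock (_gate) { _cache[name] = value; _stale.Remove(name); }
            return value;
        }
        catch (Exception)
        {
            lock (_gate)
            {
                // Store unavailable: reuse the last known value if there is one,
                // otherwise the default. The entry stays stale, so a later call retries.
                return _cache.TryGetValue(name, out var last) ? last : _defaults[name];
            }
        }
    }

    // Call when a configuration change is detected so the next Get re-reads the store.
    public void Invalidate(string name)
    {
        lock (_gate) { _stale.Add(name); }
    }
}
```

In this sketch a setting that has neither a cached value nor a default causes a KeyNotFoundException; a real implementation would decide explicitly how to treat that case.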

When to Use this Pattern

This pattern is ideally suited for:


  • Applications for which you must avoid all unnecessary downtime, while still being able to apply changes to the application configuration.
  • Environments that expose events raised automatically when the main configuration changes. Typically this is when a new configuration file is detected, or when changes are made to an existing configuration file.
  • Applications where the configuration changes often and the changes can be applied to components without requiring the application to be restarted, or without requiring the hosting server to be rebooted.

This pattern might not be suitable if the runtime components are designed so they can be configured only at initialization time, and the effort of updating those components cannot be justified in comparison to restarting the application and enduring a short downtime.


Example

Microsoft Azure Cloud Services roles detect and expose two events that are raised when the hosting environment detects a change to the ServiceConfiguration.cscfg files:

  • RoleEnvironment.Changing. This event is raised after a configuration change is detected, but before it is applied to the application. You can handle the event to query the changes and to cancel the runtime reconfiguration. If you cancel the change, the web or worker role will be restarted automatically so that the new configuration is used by the application.
  • RoleEnvironment.Changed. This event is raised after the application configuration has been applied. You can handle the event to query the changes that were applied.

When you cancel a change in the RoleEnvironment.Changing event you are indicating to Azure that a new setting cannot be applied while the application is running, and that it must be restarted in order to use the new value. Effectively you will cancel a change only if your application or component cannot react to the change at runtime, and requires a restart in order to use the new value.


For more information see RoleEnvironment.Changing Event and Use the RoleEnvironment.Changing Event on MSDN.

To handle the RoleEnvironment.Changing and RoleEnvironment.Changed events you will typically add a custom handler to the event. For example, the following code from the Global.asax.cs file in the Runtime Reconfiguration solution of the examples you can download for this guide shows how to add a custom function named RoleEnvironment_Changed to the event handler chain.


The examples for this pattern are in the RuntimeReconfiguration.Web project of the RuntimeReconfiguration solution.

protected void Application_Start(object sender, EventArgs e)
{
  RoleEnvironment.Changed += this.RoleEnvironment_Changed;
}

In a web or worker role you can use similar code in the OnStart event handler of the role to handle the RoleEnvironment.Changing event. This is from the WebRole.cs file of the example.

public override bool OnStart()
{
  // Add the trace listener. The web role process is not configured by web.config.
  Trace.Listeners.Add(new DiagnosticMonitorTraceListener());

  RoleEnvironment.Changing += this.RoleEnvironment_Changing;
  return base.OnStart();
}

Be aware that, in the case of web roles, the OnStart event handler runs in a separate process from the web application process itself. This is why you will typically handle the RoleEnvironment.Changed event in the Global.asax file, so that you can update the runtime configuration of your web application, and the RoleEnvironment.Changing event in the role itself. In the case of a worker role, you can subscribe to both the RoleEnvironment.Changing and RoleEnvironment.Changed events within the OnStart event handler.


You can store custom configuration settings in the service configuration file, in a custom configuration file, in a database such as Azure SQL Database or SQL Server in a Virtual Machine, or in Azure blob or table storage. You will need to create code that can access the custom configuration settings and apply these to the application—typically by setting the properties of components within the application.

For example, the following custom function reads the value of a setting, whose name is passed as a parameter, from the Azure service configuration file and then applies it to the current instance of a runtime component named SomeRuntimeComponent. This is from the Global.asax.cs file of the example.

private static void ConfigureFromSetting(string settingName)
{
  var value = RoleEnvironment.GetConfigurationSettingValue(settingName);
  SomeRuntimeComponent.Instance.CurrentValue = value;
}


Some configuration settings, such as those for Windows Identity Foundation, cannot be stored in the Azure service configuration file and must be in the App.config or Web.config file.

In Azure, some configuration changes are detected and applied automatically. This includes the configuration of the Windows Azure diagnostics system in the Diagnostics.wadcfg file, which specifies the types of information to collect and how to persist the log files. Therefore, it is only necessary to write code that handles the custom settings you add to the service configuration file. Your code should either:

  • Apply the custom settings from an updated configuration to the appropriate components of your application at runtime so that their behavior reflects the new configuration.
  • Cancel the change to indicate to Azure that the new value cannot be applied at runtime, and that the application must be restarted in order for the change to be applied.

For example, the following code from the WebRole.cs file in the Runtime Reconfiguration solution of the examples you can download for this guide shows how you can use the RoleEnvironment.Changing event to cancel the update for all settings except the ones that can be applied at runtime without requiring a restart. This example allows a change to the setting named “CustomSetting” to be applied at runtime without restarting the application (the component that uses this setting will be able to read the new value and change its behavior accordingly at runtime). Any other change to the configuration will automatically cause the web or worker role to restart.

private void RoleEnvironment_Changing(object sender,
                               RoleEnvironmentChangingEventArgs e)
{
  var changedSettings = e.Changes.OfType<RoleEnvironmentConfigurationSettingChange>()
                                 .Select(c => c.ConfigurationSettingName).ToList();
  Trace.TraceInformation("Changing notification. Settings being changed: "
                         + string.Join(", ", changedSettings));

  // Restart if any changed setting is not the one that can be applied at runtime.
  if (changedSettings
    .Any(settingName => !string.Equals(settingName, CustomSettingName,
                                       StringComparison.Ordinal)))
  {
    Trace.TraceInformation("Cancelling dynamic configuration change (restarting).");

    // Setting this to true will restart the role gracefully. If Cancel is not
    // set to true, and the change is not handled by the application, the
    // application will not use the new value until it is restarted (either
    // manually or for some other reason).
    e.Cancel = true;
  }
  else
  {
    Trace.TraceInformation("Handling configuration change without restarting. ");
  }
}



This approach demonstrates good practice because it ensures that a change to any setting that the application code is not aware of (and so cannot be sure that it can be applied at runtime) will cause a restart. If any one of the changes is cancelled, the role will be restarted.


Updates that are not cancelled in the RoleEnvironment.Changing event handler can then be detected and applied to the application components after the new configuration has been accepted by the Azure framework. For example, the following code in the Global.asax file of the example solution handles the RoleEnvironment.Changed event. It examines each configuration setting and, when it finds the setting named “CustomSetting”, calls a function (shown earlier) that applies the new setting to the appropriate component in the application.

private void RoleEnvironment_Changed(object sender,
                               RoleEnvironmentChangedEventArgs e)
{
  Trace.TraceInformation("Updating instance with new configuration settings.");

  foreach (var settingChange in
           e.Changes.OfType<RoleEnvironmentConfigurationSettingChange>())
  {
    if (string.Equals(settingChange.ConfigurationSettingName,
                      CustomSettingName, StringComparison.Ordinal))
    {
      // Execute a function to update the configuration of the component.
      ConfigureFromSetting(CustomSettingName);
    }
  }
}

Note that if you do not cancel a configuration change, and also do not apply the new value to your application component, then the change will not take effect until the next time that the application is restarted. This may lead to unpredictable behavior, particularly if the hosting role instance is restarted automatically by Azure as part of its regular maintenance operations, at which point the new setting value will be applied.

Related Patterns and Guidance

The following pattern may also be relevant when implementing this pattern:


  • External Configuration Store Pattern. Moving configuration information out of the application deployment package to a centralized location can provide opportunities for easier management and control of configuration data, and for sharing configuration data across applications and application instances. The External Configuration Store pattern explains how you can do this.

Scheduler Agent Supervisor Pattern

  • 08/26/2015
    Coordinate a set of actions across a distributed set of services and other remote resources, attempt to transparently handle faults if any of these actions fail, or undo the effects of the work performed if the system cannot recover from a fault. This pattern can add resiliency to a distributed system by enabling it to recover and retry actions that fail due to transient exceptions, long-lasting faults, and process failures.


Context and Problem

An application performs tasks that comprise a number of steps, some of which may invoke remote services or access remote resources. The individual steps may be independent of each other, but they are orchestrated by the application logic that implements the task.


Whenever possible, the application should ensure that the task runs to completion and resolve any failures that might occur when accessing remote services or resources. These failures could occur for a variety of reasons. For example, the network might be down, communications could be interrupted, a remote service may be unresponsive or in an unstable state, or a remote resource might be temporarily inaccessible—perhaps due to resource constraints. In many cases these failures may be transient and can be handled by using the Retry pattern.

If the application detects a more permanent fault from which it cannot easily recover, it must be able to restore the system to a consistent state and ensure integrity of the entire end-to-end operation.


Solution

The Scheduler Agent Supervisor pattern defines the following actors. These actors orchestrate the steps (individual items of work) to be performed as part of the task (the overall process):

  • The Scheduler arranges for the individual steps that comprise the overall task to be executed and orchestrates their operation. These steps can be combined into a pipeline or workflow, and the Scheduler is responsible for ensuring that the steps in this workflow are performed in the appropriate order. The Scheduler maintains information about the state of the workflow as each step is performed (such as “step not yet started,” “step running,” or “step completed”) and records information about this state. This state information should also include an upper limit of the time allowed for the step to finish (referred to as the Complete By time). If a step requires access to a remote service or resource, the Scheduler invokes the appropriate Agent, passing it the details of the work to be performed. The Scheduler typically communicates with an Agent by using asynchronous request/response messaging. This can be implemented by using queues, although other distributed messaging technologies could be used instead.

    Note

    The Scheduler performs a similar function to the Process Manager in the Process Manager pattern. The actual workflow is typically defined and implemented by a workflow engine that is controlled by the Scheduler. This approach decouples the business logic in the workflow from the Scheduler.

  • The Agent contains logic that encapsulates a call to a remote service, or access to a remote resource referenced by a step in a task. Each Agent typically wraps calls to a single service or resource, implementing the appropriate error handling and retry logic (subject to a timeout constraint, described later). If the steps in the workflow being run by the Scheduler utilize several services and resources across different steps, each step might reference a different Agent (this is an implementation detail of the pattern).

  • The Supervisor monitors the status of the steps in the task being performed by the Scheduler. It runs periodically (the frequency will be system-specific) and examines the status of steps as maintained by the Scheduler. If it detects any that have timed out or failed, it arranges for the appropriate Agent to recover the step or execute the appropriate remedial action (this may involve modifying the status of a step). Note that the recovery or remedial actions are typically implemented by the Scheduler and Agents. The Supervisor should simply request that these actions be performed.

The Scheduler, Agent, and Supervisor are logical components and their physical implementation depends on the technology being used. For example, several logical agents may be implemented as part of a single web service.

The Scheduler maintains information about the progress of the task and the state of each step in a durable data store, referred to as the State Store. The Supervisor can use this information to help determine whether a step has failed. Figure 1 illustrates the relationship between the Scheduler, the Agents, the Supervisor, and the State Store.


Figure 1 - The actors in the Scheduler Agent Supervisor pattern


Note

This diagram shows a simplified illustration of the pattern. In a real implementation, there may be many instances of the Scheduler running concurrently, each handling a subset of tasks. Similarly, the system could run multiple instances of each Agent, or even multiple Supervisors. In this case, Supervisors must coordinate their work with each other carefully to ensure that they don’t compete to recover the same failed steps and tasks. The Leader Election pattern provides one possible solution to this problem.

When an application wishes to run a task, it submits a request to the Scheduler. The Scheduler records initial state information about the task and its steps (for example, “step not yet started”) in the State Store and then commences performing the operations defined by the workflow. As the Scheduler starts each step, it updates the information about the state of that step in the State Store (for example, “step running”).

If a step references a remote service or resource, the Scheduler sends a message to the appropriate Agent. The message may contain the information that the Agent needs to pass to the service or access the resource, in addition to the Complete By time for the operation. If the Agent completes its operation successfully, it returns a response to the Scheduler. The Scheduler can then update the state information in the State Store (for example, “step completed”) and perform the next step. This process continues until the entire task is complete.


An Agent can implement any retry logic that is necessary to perform its work. However, if the Agent does not complete its work before the Complete By period expires, the Scheduler will assume that the operation has failed. In this case, the Agent should stop its work and not attempt to return anything to the Scheduler (not even an error message), or attempt any form of recovery. The reason for this restriction is that, after a step has timed out or failed, another instance of the Agent may be scheduled to run the failing step (this process is described later).

If the Agent itself fails, the Scheduler will not receive a response. The pattern may not make a distinction between a step that has timed out and one that has genuinely failed.


If a step times out or fails, the State Store will contain a record that indicates that the step is running (“step running”), but the Complete By time will have passed. The Supervisor looks for steps such as this and attempts to recover them. One possible strategy is for the Supervisor to update the Complete By value to extend the time available to complete the step, and then send a message to the Scheduler identifying the step that has timed out. The Scheduler can then attempt to repeat this step. However, such a design requires the tasks to be idempotent.

It may be necessary for the Supervisor to prevent the same step from being retried if it continually fails or times out. To achieve this, the Supervisor could maintain a retry count for each step, along with the state information, in the State Store. If this count exceeds a predefined threshold the Supervisor can adopt a strategy such as waiting for an extended period before notifying the Scheduler that it should retry the step, in the expectation that the fault will be resolved during this period. Alternatively, the Supervisor can send a message to the Scheduler to request the entire task be undone by implementing a Compensating Transaction (this approach will depend on the Scheduler and Agents providing the information necessary to implement the compensating operations for each step that completed successfully).

Note

It is not the purpose of the Supervisor to monitor the Scheduler and Agents, and restart them if they fail. This aspect of the system should be handled by the infrastructure in which these components are running. Similarly, the Supervisor should not have knowledge of the actual business operations that the tasks being performed by the Scheduler are running (including how to compensate should these tasks fail). This is the purpose of the workflow logic implemented by the Scheduler. The sole responsibility of the Supervisor is to determine whether a step has failed and arrange either for it to be repeated or for the entire task containing the failed step to be undone.


If the Scheduler is restarted after a failure, or the workflow being performed by the Scheduler terminates unexpectedly, the Scheduler should be able to determine the status of any in-flight task that it was handling when it failed, and be prepared to resume this task from the point at which it failed. The implementation details of this process are likely to be system specific. If the task cannot be recovered, it may be necessary to undo the work already performed by the task. This may also require implementing a Compensating Transaction.


The key advantage of this pattern is that the system is resilient in the event of unexpected temporary or unrecoverable failures. The system can be constructed to be self-healing. For example, if an Agent or the Scheduler crashes, a new one can be started and the Supervisor can arrange for a task to be resumed. If the Supervisor fails, another instance can be started and can take over from where the failure occurred. If the Supervisor is scheduled to run periodically, a new instance may be automatically started after a predefined interval. The State Store may be replicated to achieve an even greater degree of resiliency.


Issues and Considerations

You should consider the following points when deciding how to implement this pattern:


  • This pattern may be nontrivial to implement and requires thorough testing of each possible failure mode of the system.
  • The recovery/retry logic implemented by the Scheduler may be complex and dependent on state information held in the State Store. It may also be necessary to record the information required to implement a Compensating Transaction in a durable data store.
  • The frequency with which the Supervisor runs will be important. It should run frequently enough to prevent any failed steps from blocking an application for an extended period, but it should not run so frequently that it becomes an overhead.
  • The steps performed by an Agent could be run more than once. The logic that implements these steps should be idempotent.

When to Use this Pattern

Use this pattern when a process that runs in a distributed environment such as the cloud must be resilient to communications failure and/or operational failure.


This pattern might not be suitable for tasks that do not invoke remote services or access remote resources.


Example

A web application that implements an ecommerce system has been deployed on Microsoft Azure. Users can run this application to browse the products available from an organization, and place orders for these products. The user interface runs as a web role, and the order processing elements of the application are implemented as a set of worker roles. Part of the order processing logic involves accessing a remote service, and this aspect of the system could be prone to transient or more long-lasting faults. For this reason, the designers used the Scheduler Agent Supervisor pattern to implement the order processing elements of the system.

When a customer places an order, the application constructs a message that describes the order and posts this message to a queue. A separate Submission process, running in a worker role, retrieves this message, inserts the details of the order into the Orders database, and creates a record for the order process in the State Store. Note that the inserts into the Orders database and the State Store are performed as part of the same operation. The Submission process is designed to ensure that both inserts complete together.

The state information that the Submission process creates for the order includes:


  • OrderID: The ID of the order in the Orders database.

  • LockedBy: The instance ID of the worker role handling the order. There may be multiple current instances of the worker role running the Scheduler, but each order should only be handled by a single instance.

  • CompleteBy: The time by which the order should be processed.

  • ProcessState: The current state of the task handling the order. The possible states are:

    • Pending. The order has been created but processing has not yet been initiated.
    • Processing. The order is currently being processed.
    • Processed. The order has been processed successfully.
    • Error. The order processing has failed.

  • FailureCount: The number of times that processing has been attempted for the order.

In this state information, the OrderID field is copied from the order ID of the new order. The LockedBy and CompleteBy fields are set to null, the ProcessState field is set to Pending, and the FailureCount field is set to 0.


In this example, the order handling logic is relatively simple and only comprises a single step that invokes a remote service. In a more complex multi-step scenario, the Submission process would likely involve several steps, and so several records would be created in the State Store—each one describing the state of an individual step.

The Scheduler also runs as part of a worker role and implements the business logic that handles the order. An instance of the Scheduler polling for new orders examines the State Store for records where the LockedBy field is null and the ProcessState field is Pending. When the Scheduler finds a new order, it immediately populates the LockedBy field with its own instance ID, sets the CompleteBy field to an appropriate time, and sets the ProcessState field to Processing. The code that does this is designed to be exclusive and atomic to ensure that two concurrent instances of the Scheduler cannot attempt to handle the same order simultaneously.
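
The claim step described above can be sketched as follows. This is an illustration only: InMemoryStateStore, OrderRecord, and TryClaimNext are hypothetical names, and an in-memory list guarded by a lock stands in for the real State Store, where the same effect would typically come from a single conditional update or a transaction. The record fields mirror the state information listed earlier; the type of OrderID is assumed to be a string.

```csharp
using System;
using System.Collections.Generic;

public enum ProcessState { Pending, Processing, Processed, Error }

// One State Store record per order, created with the initial values
// described in the text (null, null, Pending, 0).
public sealed class OrderRecord
{
    public string OrderID;
    public string LockedBy;                            // null until a Scheduler claims the order
    public DateTime? CompleteBy;                       // null until processing starts
    public ProcessState State = ProcessState.Pending;
    public int FailureCount;                           // starts at 0
}

// Hypothetical in-memory stand-in for the State Store.
public sealed class InMemoryStateStore
{
    private readonly object _gate = new object();
    private readonly List<OrderRecord> _records = new List<OrderRecord>();

    public void Add(OrderRecord record)
    {
        lock (_gate) { _records.Add(record); }
    }

    // Finds an unclaimed Pending order and claims it inside one critical
    // section, so two Scheduler instances cannot claim the same order.
    // Returns null when there is nothing to claim.
    public OrderRecord TryClaimNext(string instanceId, DateTime now, TimeSpan timeAllowed)
    {
        lock (_gate)
        {
            foreach (var record in _records)
            {
                if (record.LockedBy == null && record.State == ProcessState.Pending)
                {
                    record.LockedBy = instanceId;
                    record.CompleteBy = now + timeAllowed;
                    record.State = ProcessState.Processing;
                    return record;
                }
            }
            return null;
        }
    }
}
```

The important property is that the check and the update happen as one indivisible operation; a design that reads the record first and writes it later would allow two instances to pick up the same order.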

The Scheduler then runs the business workflow to process the order asynchronously, passing it the value in the OrderID field from the State Store. The workflow handling the order retrieves the details of the order from the Orders database and performs its work. When a step in the order processing workflow needs to invoke the remote service, it uses an Agent. The workflow step communicates with the Agent by using a pair of Azure Service Bus message queues acting as a request/response channel. Figure 2 shows a high-level view of the solution.


Figure 2 - Using the Scheduler Agent Supervisor pattern to handle orders in an Azure solution

The message sent to the Agent from a workflow step describes the order and includes the CompleteBy time. If the Agent receives a response from the remote service before the CompleteBy time expires, it constructs a reply message that it posts on the Service Bus queue on which the workflow is listening. When the workflow step receives the valid reply message, it completes its processing and the Scheduler sets the ProcessState field of the order state to Processed. At this point, the order processing has completed successfully.

If the CompleteBy time expires before the Agent receives a response from the remote service, the Agent simply halts its processing and stops handling the order. Similarly, if the workflow handling the order exceeds the CompleteBy time, it also terminates. In both of these cases, the state of the order in the State Store remains set to Processing, but the CompleteBy time indicates that the time for processing the order has passed and the process is deemed to have failed. Note that if the Agent that is accessing the remote service or the workflow that is handling the order (or both) terminates unexpectedly, the order in the State Store again remains set to Processing and eventually has an expired CompleteBy value.

If the Agent detects an unrecoverable non-transient fault while it is attempting to contact the remote service, it can send an error response back to the workflow. The Scheduler can set the status of the order to Error and raise an event that alerts an operator. The operator can then attempt to resolve the reason for the failure manually and resubmit the failed processing step.

The Supervisor periodically examines the State Store looking for orders with an expired CompleteBy value. If the Supervisor finds such a record, it increments the FailureCount field. If the FailureCount value is below a specified threshold value, the Supervisor resets the LockedBy field to null, updates the CompleteBy field with a new expiration time, and sets the ProcessState field to Pending. An instance of the Scheduler can pick up this order and perform its processing as before. If the FailureCount value exceeds a specified threshold, the reason for the failure is assumed to be non-transient. The Supervisor sets the status of the order to Error and raises an event that alerts an operator, as previously described.
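
The recovery pass described above can be sketched as a single sweep over the State Store records. This is an illustration under stated assumptions rather than the pattern's actual code: the Supervisor class, the Sweep method, the MaxFailures and RetryWindow values, and the in-memory OrderState type are hypothetical. The sketch also assumes that only orders still marked Processing are candidates, and that an order is escalated as soon as FailureCount reaches the threshold; the text leaves both details open.

```csharp
using System;
using System.Collections.Generic;

public enum OrderProcessState { Pending, Processing, Processed, Error }

// Hypothetical in-memory stand-in for one State Store record.
public class OrderState
{
    public int OrderID;
    public string LockedBy;
    public DateTime CompleteBy;
    public OrderProcessState ProcessState;
    public int FailureCount;
}

public static class Supervisor
{
    // Hypothetical policy values; the text only says "a specified threshold".
    public const int MaxFailures = 3;
    public static readonly TimeSpan RetryWindow = TimeSpan.FromMinutes(5);

    // Performs one pass over the records. Returns the orders that were set
    // to Error so that the caller can raise the operator alert.
    public static List<OrderState> Sweep(IEnumerable<OrderState> orders, DateTime now)
    {
        var escalated = new List<OrderState>();
        foreach (var order in orders)
        {
            // Assumption: only in-flight orders can be considered expired.
            bool expired = order.ProcessState == OrderProcessState.Processing
                           && order.CompleteBy < now;
            if (!expired) continue;

            order.FailureCount++;
            if (order.FailureCount < MaxFailures)
            {
                // Treated as transient: release the order for a Scheduler to retry.
                order.LockedBy = null;
                order.CompleteBy = now + RetryWindow;
                order.ProcessState = OrderProcessState.Pending;
            }
            else
            {
                // Treated as non-transient: stop retrying and flag for an operator.
                order.ProcessState = OrderProcessState.Error;
                escalated.Add(order);
            }
        }
        return escalated;
    }
}
```

Returning the escalated orders keeps the alerting mechanism out of the sweep itself, so the same logic can be run from a worker role, a timer, or a test.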


In this example, the Supervisor is implemented in a separate worker role. You can utilize a variety of strategies to arrange for the Supervisor task to be run, including using the Azure Scheduler service (not to be confused with the Scheduler component in this pattern). For more information about the Azure Scheduler service, visit the Scheduler page.

Although it is not shown in this example, the Scheduler may need to keep the application that submitted the order in the first place informed about the progress and status of the order. The application and the Scheduler are isolated from each other to eliminate any dependencies between them. The application has no knowledge of which instance of the Scheduler is handling the order, and the Scheduler is unaware of which specific application instance posted the order.


To enable the order status to be reported, the application could use its own private response queue. The details of this response queue would be included as part of the request sent to the Submission process, which would include this information in the State Store. The Scheduler would then post messages to this queue indicating the status of the order (“request received,” “order completed,” “order failed,” and so on). It should include the Order ID in these messages so that they can be correlated with the original request by the application.

Related Patterns and Guidance

The following patterns and guidance may also be relevant when implementing this pattern:


  • Retry Pattern. An Agent can use this pattern to transparently retry an operation that accesses a remote service or resource, and that has previously failed, in the expectation that the cause of the failure is transient and may be corrected.
  • Circuit Breaker Pattern. An Agent can use this pattern to handle faults that may take a variable amount of time to rectify when connecting to a remote service or resource.
  • Compensating Transaction Pattern. If the workflow being performed by a Scheduler cannot be completed successfully, it may be necessary to undo any work it has previously performed. The Compensating Transaction pattern describes how this can be achieved for operations that follow the eventual consistency model. These are the types of operations that are commonly implemented by a Scheduler that performs complex business processes and workflows.
  • Asynchronous Messaging Primer. The components in the Scheduler Agent Supervisor pattern typically run decoupled from each other and communicate asynchronously. The Asynchronous Messaging primer describes some of the approaches that can be used to implement asynchronous communication based on message queues.
  • Leader Election Pattern. It may be necessary to coordinate the actions of multiple instances of a Supervisor to prevent them from attempting to recover the same failed process. The Leader Election pattern describes how this coordination can be achieved.

Sharding Pattern

  • Article
  • 08/26/2015
  • 20 minutes to read
    Divide a data store into a set of horizontal partitions or shards. This pattern can improve scalability when storing and accessing large volumes of data.


Context and Problem

A data store hosted by a single server may be subject to the following limitations:


  • Storage space. A data store for a large-scale cloud application may be expected to contain a huge volume of data that could increase significantly over time. A server typically provides only a finite amount of disk storage, but it may be possible to replace existing disks with larger ones, or add further disks to a machine as data volumes grow. However, the system will eventually reach a hard limit whereby it is not possible to easily increase the storage capacity on a given server.
  • Computing resources. A cloud application may be required to support a large number of concurrent users, each of whom runs queries that retrieve information from the data store. A single server hosting the data store may not be able to provide the necessary computing power to support this load, resulting in extended response times for users and frequent failures as applications attempting to store and retrieve data time out. It may be possible to add memory or upgrade processors, but the system will reach a limit when it is not possible to increase the compute resources any further.
  • Network bandwidth. Ultimately, the performance of a data store running on a single server is governed by the rate at which the server can receive requests and send replies. It is possible that the volume of network traffic might exceed the capacity of the network used to connect to the server, resulting in failed requests.
  • Geography. It may be necessary to store data generated by specific users in the same region as those users for legal, compliance, or performance reasons, or to reduce latency of data access. If the users are dispersed across different countries or regions, it may not be possible to store all of the data for the application in a single data store.

Scaling vertically by adding more disk capacity, processing power, memory, and network connections may postpone the effects of some of these limitations, but it is likely to be only a temporary solution. A commercial cloud application capable of supporting large numbers of users and high volumes of data must be able to scale almost indefinitely, so vertical scaling is not necessarily the best solution.


Solution

Divide the data store into horizontal partitions or shards. Each shard has the same schema, but holds its own distinct subset of the data. A shard is a data store in its own right (it can contain the data for many entities of different types), running on a server acting as a storage node.

This pattern offers the following benefits:


  • You can scale the system out by adding further shards running on additional storage nodes.
  • A system can use off-the-shelf commodity hardware rather than specialized (and expensive) computers for each storage node.
  • You can reduce contention and improve performance by balancing the workload across shards.
  • In the cloud, shards can be located physically close to the users that will access the data.

When dividing a data store up into shards, decide which data should be placed in each shard. A shard typically contains items that fall within a specified range determined by one or more attributes of the data. These attributes form the shard key (sometimes referred to as the partition key). The shard key should be static. It should not be based on data that might change.


Sharding physically organizes the data. When an application stores and retrieves data, the sharding logic directs the application to the appropriate shard. This sharding logic may be implemented as part of the data access code in the application, or it could be implemented by the data storage system if it transparently supports sharding.


Abstracting the physical location of the data in the sharding logic provides a high level of control over which shards contain which data, and enables data to migrate between shards without reworking the business logic of an application should the data in the shards need to be redistributed later (for example, if the shards become unbalanced). The tradeoff is the additional data access overhead required in determining the location of each data item as it is retrieved.

To ensure optimal performance and scalability, it is important to split the data in a way that is appropriate for the types of queries the application performs. In many cases, it is unlikely that the sharding scheme will exactly match the requirements of every query. For example, in a multi-tenant system an application may need to retrieve tenant data by using the tenant ID, but it may also need to look up this data based on some other attribute such as the tenant’s name or location. To handle these situations, implement a sharding strategy with a shard key that supports the most commonly performed queries.

If queries regularly retrieve data by using a combination of attribute values, it may be possible to define a composite shard key by concatenating attributes together. Alternatively, use a pattern such as Index Table to provide fast lookup to data based on attributes that are not covered by the shard key.

Sharding Strategies

Three strategies are commonly used when selecting the shard key and deciding how to distribute data across shards. Note that there does not have to be a one-to-one correspondence between shards and the servers that host them—a single server can host multiple shards. The strategies are:


  • The Lookup strategy. In this strategy the sharding logic implements a map that routes a request for data to the shard that contains that data by using the shard key. In a multi-tenant application all the data for a tenant might be stored together in a shard by using the tenant ID as the shard key. Multiple tenants might share the same shard, but the data for a single tenant will not be spread across multiple shards. Figure 1 shows an example of this strategy.


    Figure 1 - Sharding tenant data based on tenant IDs

    The mapping between the shard key and the physical storage may be based on physical shards where each shard key maps to a physical partition. Alternatively, a technique that provides more flexibility when rebalancing shards is to use a virtual partitioning approach where shard keys map to the same number of virtual shards, which in turn map to fewer physical partitions. In this approach, an application locates data by using a shard key that refers to a virtual shard, and the system transparently maps virtual shards to physical partitions. The mapping between a virtual shard and a physical partition can change without requiring the application code to be modified to use a different set of shard keys.


  • The Range strategy. This strategy groups related items together in the same shard, and orders them by shard key—the shard keys are sequential. It is useful for applications that frequently retrieve sets of items by using range queries (queries that return a set of data items for a shard key that falls within a given range). For example, if an application regularly needs to find all orders placed in a given month, this data can be retrieved more quickly if all orders for a month are stored in date and time order in the same shard. If each order was stored in a different shard, they would have to be fetched individually by performing a large number of point queries (queries that return a single data item). Figure 2 shows an example of this strategy.


    Figure 2 - Storing sequential sets (ranges) of data in shards


    In this example, the shard key is a composite key comprising the order month as the most significant element, followed by the order day and the time. The data for orders is naturally sorted when new orders are created and appended to a shard. Some data stores support two-part shard keys comprising a partition key element that identifies the shard and a row key that uniquely identifies an item within the shard. Data is usually held in row key order within the shard. Items that are subject to range queries and need to be grouped together can use a shard key that has the same value for the partition key but a unique value for the row key.


  • The Hash strategy. The purpose of this strategy is to reduce the chance of hotspots in the data. It aims to distribute the data across the shards in a way that achieves a balance between the size of each shard and the average load that each shard will encounter. The sharding logic computes the shard in which to store an item based on a hash of one or more attributes of the data. The chosen hashing function should distribute data evenly across the shards, possibly by introducing some random element into the computation. Figure 3 shows an example of this strategy.



    Figure 3 - Sharding tenant data based on a hash of tenant IDs

    To understand the advantage of the Hash strategy over other sharding strategies, consider how a multi-tenant application that enrolls new tenants sequentially might assign the tenants to shards in the data store. When using the Range strategy, the data for tenants 1 to n will all be stored in shard A, the data for tenants n+1 to m will all be stored in shard B, and so on. If the most recently registered tenants are also the most active, most data activity will occur in a small number of shards, which could cause hotspots. In contrast, the Hash strategy allocates tenants to shards based on a hash of their tenant ID. This means that sequential tenants are most likely to be allocated to different shards, as shown in Figure 3 for tenants 55 and 56, which will distribute the load across these shards.
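
The three strategies above differ mainly in how the sharding logic turns a shard key into a shard. The sketch below puts the three routing functions side by side. It is an illustration only: the ShardRouting class, the sample tenant-to-shard maps, the choice of one shard per calendar quarter for the Range strategy, and the use of a 32-bit FNV-1a hash for the Hash strategy are all assumptions made for the example, and shards are identified by a simple zero-based index.

```csharp
using System;
using System.Collections.Generic;

public static class ShardRouting
{
    // Lookup strategy: shard key -> virtual shard -> physical partition.
    // Both maps contain hypothetical sample data.
    private static readonly Dictionary<int, int> TenantToVirtualShard =
        new Dictionary<int, int> { { 55, 0 }, { 56, 1 }, { 57, 2 } };
    private static readonly Dictionary<int, int> VirtualToPhysical =
        new Dictionary<int, int> { { 0, 0 }, { 1, 0 }, { 2, 1 } };

    public static int Lookup(int tenantId)
    {
        return VirtualToPhysical[TenantToVirtualShard[tenantId]];
    }

    // Range strategy: each shard owns a contiguous span of keys.
    // Assumption: one shard per calendar quarter (index 0 to 3).
    public static int Range(DateTime orderDate)
    {
        return (orderDate.Month - 1) / 3;
    }

    // Hash strategy: a stable hash of the key selects the shard.
    public static int Hash(int tenantId, int shardCount)
    {
        return (int)(Fnv1a(tenantId) % (uint)shardCount);
    }

    // 32-bit FNV-1a over the four bytes of the key, low byte first.
    private static uint Fnv1a(int value)
    {
        unchecked
        {
            uint hash = 2166136261;
            for (int i = 0; i < 4; i++)
            {
                byte b = (byte)(value >> (8 * i));
                hash = (hash ^ b) * 16777619;
            }
            return hash;
        }
    }
}
```

With this sample data, tenants 55 and 56 both resolve to physical partition 0 through different virtual shards, so one of those virtual shards could later be remapped to another partition without changing any tenant's shard key. The Range function sends every order placed in the same quarter to the same shard, which is what makes range queries cheap and hotspots possible.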

The following table lists the main advantages and considerations for these three sharding strategies.


  • Lookup
    Advantages: More control over the way that shards are configured and used. Using virtual shards reduces the impact when rebalancing data because new physical partitions can be added to even out the workload. The mapping between a virtual shard and the physical partitions that implement the shard can be modified without affecting application code that uses a shard key to store and retrieve data.
    Considerations: Looking up shard locations can impose an additional overhead.
  • Range
    Advantages: Easy to implement and works well with range queries because they can often fetch multiple data items from a single shard in a single operation. Easier data management. For example, if users in the same region are in the same shard, updates can be scheduled in each time zone based on the local load and demand pattern.
    Considerations: May not provide optimal balancing between shards. Rebalancing shards is difficult and may not resolve the problem of uneven load if the majority of activity is for adjacent shard keys.
  • Hash
    Advantages: Better chance of a more even data and load distribution. Request routing can be accomplished directly by using the hash function. There is no need to maintain a map.
    Considerations: Computing the hash may impose an additional overhead. Rebalancing shards is difficult.

Most common sharding schemes implement one of the approaches described above, but you should also consider the business requirements of your applications and their patterns of data usage. For example, in a multi-tenant application:


  • You can shard data based on workload. You could segregate the data for highly volatile tenants in separate shards. The speed of data access for other tenants may be improved as a result.
  • You can shard data based on the location of tenants. It may be possible to take the data for tenants in a specific geographic region offline for backup and maintenance during off-peak hours in that region, while the data for tenants in other regions remains online and accessible during their business hours.
  • High-value tenants could be assigned their own private high-performing, lightly loaded shards, whereas lower-value tenants might be expected to share more densely-packed, busy shards.
  • The data for tenants that require a high degree of data isolation and privacy could be stored on a completely separate server.

Scaling and Data Movement Operations

Each of the sharding strategies implies different capabilities and levels of complexity for managing scale in, scale out, data movement, and maintaining state.


The Lookup strategy permits scaling and data movement operations to be carried out at the user level, either online or offline. The technique is to suspend some or all user activity (perhaps during off-peak periods), move the data to the new virtual partition or physical shard, change the mappings, invalidate or refresh any caches that hold this data, and then allow user activity to resume. Often this type of operation can be centrally managed. The Lookup strategy requires state to be highly cacheable and replica friendly.

The Range strategy imposes some limitations on scaling and data movement operations, which must typically be carried out when a part or all of the data store is offline because the data must be split and merged across the shards. Moving the data to rebalance shards may not resolve the problem of uneven load if the majority of activity is for adjacent shard keys or data identifiers that are within the same range. The Range strategy may also require some state to be maintained in order to map ranges to the physical partitions.

The Hash strategy makes scaling and data movement operations more complex because the partition keys are hashes of the shard keys or data identifiers. The new location of each shard must be determined from the hash function, or the function modified to provide the correct mappings. However, the Hash strategy does not require maintenance of state.
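
A small calculation shows why data movement is costly under the Hash strategy. The sketch below counts how many keys land on a different shard when the shard count changes. It rests on simplifying assumptions: the HashRebalance class and its methods are hypothetical, the hash is reduced to a shard number by taking the remainder, keys are non-negative, and each key is used as its own hash so that the arithmetic is easy to follow.

```csharp
using System;
using System.Linq;

public static class HashRebalance
{
    // Places a key on a shard. The key acts as its own hash here
    // (an assumption made to keep the arithmetic easy to follow).
    public static int ShardFor(int key, int shardCount)
    {
        return key % shardCount;
    }

    // Counts the keys in 0..keyCount-1 whose shard changes when the
    // number of shards goes from oldShardCount to newShardCount.
    public static int CountMoved(int keyCount, int oldShardCount, int newShardCount)
    {
        return Enumerable.Range(0, keyCount)
            .Count(k => ShardFor(k, oldShardCount) != ShardFor(k, newShardCount));
    }
}
```

Under these assumptions, HashRebalance.CountMoved(100, 4, 5) returns 80: adding a fifth shard relocates four out of every five items, because only keys with the same remainder under both shard counts stay where they are.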

Issues and Considerations

Consider the following points when deciding how to implement this pattern:


  • Sharding is complementary to other forms of partitioning, such as vertical partitioning and functional partitioning. For example, a single shard may contain entities that have been partitioned vertically, and a functional partition may be implemented as multiple shards. For more information about partitioning, see the Data Partitioning Guidance.


  • Keep shards balanced so that they all handle a similar volume of I/O. As data is inserted and deleted, it may be necessary to periodically rebalance the shards to guarantee an even distribution and to reduce the chance of hotspots. Rebalancing can be an expensive operation. To reduce the frequency with which rebalancing becomes necessary you should plan for growth by ensuring that each shard contains sufficient free space to handle the expected volume of changes. You should also develop strategies and scripts that you can use to quickly rebalance shards should this become necessary.

  • Use stable data for the shard key. If the shard key changes, the corresponding data item may have to move between shards, increasing the amount of work performed by update operations. For this reason, avoid basing the shard key on potentially volatile information. Instead, look for attributes that are invariant or that naturally form a key.


  • Ensure that shard keys are unique. For example, avoid using auto-incrementing fields as the shard key. In some systems, auto-incremented fields may not be coordinated across shards, possibly resulting in items in different shards having the same shard key.


    Note

    Auto-incremented values in fields that do not comprise the shard key can also cause problems. For example, if you use auto-incremented fields to generate unique IDs, then two different items located in different shards may be assigned the same ID.

  • It may not be possible to design a shard key that matches the requirements of every possible query against the data. Shard the data to support the most frequently performed queries, and if necessary create secondary index tables to support queries that retrieve data by using criteria based on attributes that are not part of the shard key. For more information, see the Index Table pattern.


  • Queries that access only a single shard will be more efficient than those that retrieve data from multiple shards, so avoid implementing a sharding scheme that results in applications performing large numbers of queries that join data held in different shards. Remember that a single shard can contain the data for multiple types of entities. Consider denormalizing your data to keep related entities that are commonly queried together (such as the details of customers and the orders that they have placed) in the same shard to reduce the number of separate reads that an application performs.


    Note

    If an entity in one shard references an entity stored in another shard, include the shard key for the second entity as part of the schema for the first entity. This can help to improve the performance of queries that reference related data across shards.


  • If an application must perform queries that retrieve data from multiple shards, it may be possible to fetch this data by using parallel tasks. Examples include fan-out queries, where data from multiple shards is retrieved in parallel and then aggregated into a single result. However, this approach inevitably adds some complexity to the data access logic of a solution.


  • For many applications, creating a larger number of small shards can be more efficient than having a small number of large shards because they can offer increased opportunities for load balancing. This approach can also be useful if you anticipate the need to migrate shards from one physical location to another. Moving a small shard is quicker than moving a large one.


  • Make sure that the resources available to each shard storage node are sufficient to handle the scalability requirements in terms of data size and throughput. For more information, see the section “Designing Partitions for Scalability” in the Data Partitioning Guidance.


  • Consider replicating reference data to all shards. If an operation that retrieves data from a shard also references static or slow-moving data as part of the same query, add this data to the shard. The application can then fetch all of the data for the query easily, without having to make an additional round trip to a separate data store.


    Note

    If reference data held in multiple shards changes, the system must synchronize these changes across all shards. The system may experience a degree of inconsistency while this synchronization occurs. If you follow this approach, you should design your applications to be able to handle this inconsistency.


  • It can be difficult to maintain referential integrity and consistency between shards, so you should minimize operations that affect data in multiple shards. If an application must modify data across shards, evaluate whether complete data consistency is actually a requirement. Instead, a common approach in the cloud is to implement eventual consistency. The data in each partition is updated separately, and the application logic must take responsibility for ensuring that the updates all complete successfully, as well as handling the inconsistencies that can arise from querying data while an eventually consistent operation is running. For more information about implementing eventual consistency, see the Data Consistency Primer.


  • Configuring and managing a large number of shards can be a challenge. Tasks such as monitoring, backing up, checking for consistency, and logging or auditing must be accomplished on multiple shards and servers, possibly held in multiple locations. These tasks are likely to be implemented by using scripts or other automation solutions, but scripting and automation might not be able to completely eliminate the additional administrative requirements.


  • Shards can be geo-located so that the data that they contain is close to the instances of an application that use it. This approach can considerably improve performance, but requires additional consideration for tasks that must access multiple shards in different locations.
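
The point above about uncoordinated auto-incremented fields producing the same ID in different shards can be illustrated with one possible workaround: combine the shard number with the shard's local sequence value. This is a sketch of one option, not a recommendation from the guidance itself. The ShardScopedId class, the 48-bit split, and the limit of 32,768 shards are assumptions chosen for the example; a globally unique identifier such as a GUID is another common choice.

```csharp
using System;

public static class ShardScopedId
{
    private const int SequenceBits = 48;

    // Packs the shard number into the high bits and the shard's local
    // auto-incremented value into the low 48 bits of a 64-bit ID.
    public static long Create(int shardId, long localSequence)
    {
        if (shardId < 0 || shardId > short.MaxValue)
            throw new ArgumentOutOfRangeException(nameof(shardId));
        if (localSequence < 0 || localSequence >= (1L << SequenceBits))
            throw new ArgumentOutOfRangeException(nameof(localSequence));
        return ((long)shardId << SequenceBits) | localSequence;
    }

    // Recovers the shard number from an ID created by Create.
    public static int ShardOf(long id)
    {
        return (int)(id >> SequenceBits);
    }
}
```

Two shards that both hand out local value 42 now produce different IDs, and the shard that holds an item can be read back from the ID without a separate lookup.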


When to Use this Pattern

Use this pattern:


  • When a data store is likely to need to scale beyond the limits of the resources available to a single storage node.
  • To improve performance by reducing contention in a data store.

Note

The primary focus of sharding is to improve the performance and scalability of a system, but as a by-product it can also improve availability by virtue of the way in which the data is divided into separate partitions. A failure in one partition does not necessarily prevent an application from accessing data held in other partitions, and an operator can perform maintenance or recovery of one or more partitions without making the entire data for an application inaccessible. For more information, see the Data Partitioning Guidance.


Example

The following example uses a set of SQL Server databases acting as shards. Each database holds a subset of the data used by an application. The application retrieves data that is distributed across the shards by using its own sharding logic (this is an example of a fan-out query). The details of the data that is located in each shard are returned by a method called GetShards. This method returns an enumerable list of ShardInformation objects, where the ShardInformation type contains an identifier for each shard and the SQL Server connection string that an application should use to connect to the shard (the connection strings are not shown in the code example).

private IEnumerable<ShardInformation> GetShards()
{
  // This retrieves the connection information from a shard store
  // (commonly a root database).
  return new[]
  {
    new ShardInformation
    {
      Id = 1,
      ConnectionString = ...
    },
    new ShardInformation
    {
      Id = 2,
      ConnectionString = ...
    }
  };
}

The code below shows how the application uses the list of ShardInformation objects to perform a query that fetches data from each shard in parallel. The details of the query are not shown, but in this example the data that is retrieved comprises a string which could hold information such as the name of a customer if the shards contain the details of customers. The results are aggregated into a ConcurrentBag collection for processing by the application.

// Requires: System.Collections.Concurrent, System.Data.SqlClient,
// System.Diagnostics, System.Threading.Tasks.
// Retrieve the shards as a ShardInformation[] instance.
var shards = GetShards();
var results = new ConcurrentBag<string>();
// Execute the query against each shard in the shard list.
// This list would typically be retrieved from configuration
// or from a root/master shard store.
Parallel.ForEach(shards, shard =>
{
  // NOTE: Transient fault handling is not included,
  // but should be incorporated when used in a real world application.
  using (var con = new SqlConnection(shard.ConnectionString))
  {
    // The connection must be open before the command can run.
    con.Open();
    var cmd = new SqlCommand("SELECT ... FROM ...", con);
    Trace.TraceInformation("Executing command against shard: {0}", shard.Id);
    var reader = cmd.ExecuteReader();
    // Read the results in to a thread-safe data structure.
    // Assumes the first column of the result set holds the string value.
    while (reader.Read())
    {
      results.Add(reader.GetString(0));
    }
  }
});
Trace.TraceInformation("Fanout query complete - Record Count: {0}", results.Count);

Related Patterns and Guidance

The following patterns and guidance may also be relevant when implementing this pattern:


  • Data Consistency Primer. It may be necessary to maintain consistency for data distributed across different shards. The Data Consistency Primer summarizes the issues surrounding maintaining consistency over distributed data, and describes the benefits and tradeoffs of different consistency models.
  • Data Partitioning Guidance. Sharding a data store can introduce a range of additional issues. The Data Partitioning Guidance describes these issues in relation to partitioning data stores in the cloud to improve scalability, reduce contention, and optimize performance.
  • Index Table Pattern. Sometimes it is not possible to completely support queries just through the design of the shard key. The Index Table pattern enables an application to quickly retrieve data from a large data store by specifying a key other than the shard key.
  • Materialized View Pattern. To maintain the performance of some query operations, it may be beneficial to create materialized views that aggregate and summarize data, especially if this summary data is based on information that is distributed across shards. The Materialized View pattern describes how to generate and populate these views.

Static Content Hosting Pattern

  • Article
  • 08/26/2015
  • 7 minutes to read
    Deploy static content to a cloud-based storage service that can deliver these directly to the client. This pattern can reduce the requirement for potentially expensive compute instances.


Context and Problem

Web applications typically include some elements of static content. This static content may include HTML pages and other resources such as images and documents that are available to the client, either as part of an HTML page (such as inline images, style sheets, and client-side JavaScript files) or as separate downloads (such as PDF documents).

Although web servers are well tuned to optimize requests through efficient dynamic page code execution and output caching, they must still handle requests to download static content. This absorbs processing cycles that could often be put to better use.

Solution

In most cloud hosting environments it is possible to minimize the requirement for compute instances (for example, to use a smaller instance or fewer instances), by locating some of an application’s resources and static pages in a storage service. The cost for cloud-hosted storage is typically much less than for compute instances.


When hosting some parts of an application in a storage service, the main considerations are related to deployment of the application and to securing resources that are not intended to be available to anonymous users.


Issues and Considerations

Consider the following points when deciding how to implement this pattern:


  • The hosted storage service must expose an HTTP endpoint that users can access to download the static resources. Some storage services also support HTTPS, which means that it is possible to host resources that require the use of SSL in the storage service.
  • For maximum performance and availability, consider using a content delivery network (where available) to cache the contents of the storage container in multiple datacenters around the world. However, this will incur additional cost for the use of the content delivery network.
  • Storage accounts are often geo-replicated by default to provide resiliency against events that might impact a datacenter. This means that the IP address may change, but the URL will remain the same.
  • When some content is located in a storage account and other content is in a hosted compute instance, it becomes more challenging to deploy an application and to update it. It may be necessary to perform separate deployments, and version the application and content in order to manage it more easily, especially when the static content includes script files or UI components. However, if only static resources are to be updated they can simply be uploaded to the storage account without needing to redeploy the application package.
  • Storage services may not support the use of custom domain names. In this case it is necessary to specify the full URL of the resources in links because they will be in a different domain from the dynamically generated content containing the links.
  • The storage containers must be configured for public read access, but it is vital to ensure that they are not configured for public write access to prevent users being able to upload content. Consider using a valet key or token to control access to resources that should not be available anonymously. See the Valet Key Pattern for more information.

When to Use this Pattern

This pattern is ideally suited for:


  • Minimizing the hosting cost for websites and applications that contain some static resources.
  • Minimizing the hosting cost for websites that consist of only static content and resources. Depending on the capabilities of the hosting provider’s storage system, it might be possible to host a fully static website in its entirety within a storage account.
  • Exposing static resources and content for applications running in other hosting environments or on-premises servers.
  • Locating content in more than one geographical area by using a content delivery network that caches the contents of the storage account in multiple datacenters around the world.
  • Monitoring costs and bandwidth usage. Using a separate storage account for some or all of the static content allows the costs to be more easily distinguished from hosting and runtime costs.

This pattern might not be suitable in the following situations:


  • The application needs to perform some processing on the static content before delivering it to the client. For example, it may be necessary to add a timestamp to a document.
  • The volume of static content is very small. The overhead of retrieving this content from separate storage may outweigh the cost benefit of separating it out from the compute resources.

Note

It is sometimes possible to store a complete website that contains only static content such as HTML pages, images, style sheets, client-side JavaScript files, and downloadable documents such as PDF files in a cloud-hosted storage. For more information see An efficient way of deploying a static web site on Microsoft Azure on the Infosys blog.

Example

Static content located in Azure blob storage can be accessed directly by a web browser. Azure provides an HTTP-based interface over storage that can be publicly exposed to clients. For example, content in an Azure blob storage container is exposed using a URL of the form:

http://[storage-account-name].blob.core.windows.net/[container-name]/[file-name]

When uploading the content for the application it is necessary to create one or more blob containers to hold the files and documents. Note that the default permission for a new container is Private, and you must change this to Public to allow clients to access the contents. If it is necessary to protect the content from anonymous access, you can implement the Valet Key pattern so users must present a valid token in order to download the resources.

Note

The page Blob Service Concepts on the Azure website contains information about blob storage, and the ways that you can access it and use it.

The links in each page will specify the URL of the resource and the client will access this resource directly from the storage service. Figure 1 shows this approach.


Figure 1 - Delivering static parts of an application directly from a storage service


The links in the pages delivered to the client must specify the full URL of the blob container and resource. For example, a page that contains a link to an image in a public container might contain the following.

<img src="http://mystorageaccount.blob.core.windows.net/myresources/image1.png"
     alt="My image" />

Note

If the resources are protected by using a valet key, such as an Azure Shared Access Signature (SAS), this signature must be included in the URLs in the links.
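
As a rough illustration of the note above, the following sketch appends a token to a resource URL as query-string parameters, which is how a Shared Access Signature travels with a request. The SignedUrl class, the AppendToken method, and the sample token value are hypothetical names used only for this sketch; generating a real signature is not shown.

```csharp
using System;

// Hypothetical helper, not part of the StaticContentHosting sample.
public static class SignedUrl
{
    // Appends a query-string token (for example, a SAS token) to a resource URL.
    public static string AppendToken(string resourceUrl, string token)
    {
        if (string.IsNullOrEmpty(token)) return resourceUrl;

        // Accept tokens with or without a leading '?'.
        string trimmed = token.TrimStart('?');

        // Use '&' when the URL already carries a query string.
        string separator = resourceUrl.IndexOf('?') >= 0 ? "&" : "?";
        return resourceUrl + separator + trimmed;
    }
}
```

A page helper could call SignedUrl.AppendToken on each generated link before writing it to the src or href attribute.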

The examples available for this guide contain a solution named StaticContentHosting that demonstrates using external storage for static resources. The StaticContentHosting.Cloud project contains configuration files that specify the storage account and container that holds the static content.

<Setting name="StaticContent.StorageConnectionString" 
         value="UseDevelopmentStorage=true" />
<Setting name="StaticContent.Container" value="static-content" />

The Settings class in the file Settings.cs of the StaticContentHosting.Web project contains methods to extract these values and build a string value containing the cloud storage account container URL.

public class Settings {
    public static string StaticContentStorageConnectionString {
        get {
            return RoleEnvironment.GetConfigurationSettingValue(
                "StaticContent.StorageConnectionString");
        }
    }
    public static string StaticContentContainer {
        get {
            return RoleEnvironment.GetConfigurationSettingValue("StaticContent.Container");
        }
    }
    public static string StaticContentBaseUrl {
        get {
            // Build "<blob endpoint>/<container>" without doubled slashes.
            var account = CloudStorageAccount.Parse(StaticContentStorageConnectionString);
            return string.Format("{0}/{1}", account.BlobEndpoint.ToString().TrimEnd('/'),
                                 StaticContentContainer.TrimStart('/'));
        }
    }
}

The StaticContentUrlHtmlHelper class in the file StaticContentUrlHtmlHelper.cs exposes a method named StaticContentUrl that generates a URL containing the path to the cloud storage account if the URL passed to it starts with the ASP.NET root path character (~).

public static class StaticContentUrlHtmlHelper {
    public static string StaticContentUrl(this HtmlHelper helper, string contentPath) {
        if (contentPath.StartsWith("~")) {
            // Drop the ASP.NET root path character before building the storage URL.
            contentPath = contentPath.Substring(1);
        }
        contentPath = string.Format("{0}/{1}", Settings.StaticContentBaseUrl.TrimEnd('/'),
                                    contentPath.TrimStart('/'));
        var url = new UrlHelper(helper.ViewContext.RequestContext);
        return url.Content(contentPath);
    }
}

The file Index.cshtml in the Views\Home folder contains an image element that uses the StaticContentUrl method to create the URL for its src attribute.

<img src="@Html.StaticContentUrl("~/Images/orderedList1.png")" alt="Test Image" />

Related Patterns and Guidance

The following pattern may also be relevant when implementing this pattern:


  • Valet Key Pattern. If the target resources are not supposed to be available to anonymous users it is necessary to implement security over the store that holds the static content. The Valet Key pattern describes how to use a token or key that provides clients with restricted direct access to a specific resource or service such as a cloud-hosted storage service.

Throttling Pattern

  • Article
  • 08/26/2015
  • 8 minutes to read
    Control the consumption of resources used by an instance of an application, an individual tenant, or an entire service. This pattern can allow the system to continue to function and meet service level agreements, even when an increase in demand places an extreme load on resources.


Context and Problem

The load on a cloud application typically varies over time based on the number of active users or the types of activities they are performing. For example, more users are likely to be active during business hours, or the system may be required to perform computationally expensive analytics at the end of each month. There may also be sudden and unanticipated bursts in activity. If the processing requirements of the system exceed the capacity of the resources that are available, it will suffer from poor performance and may even fail. The system may be obliged to meet an agreed level of service, and such failure could be unacceptable.


There are many strategies available for handling varying load in the cloud, depending on the business goals for the application. One strategy is to use autoscaling to match the provisioned resources to the user needs at any given time. This has the potential to consistently meet user demand, while optimizing running costs. However, while autoscaling may trigger the provisioning of additional resources, this provisioning is not instantaneous. If demand grows quickly, there may be a window of time where there is a resource deficit.


Solution

An alternative strategy to autoscaling is to allow applications to use resources only up to some soft limit, and then throttle them when this limit is reached. The system should monitor how it is using resources so that, when usage exceeds some system-defined threshold, it can throttle requests from one or more users to enable the system to continue functioning and meet any service level agreements (SLAs) that are in place. For more information on monitoring resource usage, see the Instrumentation and Telemetry Guidance.


The system could implement several throttling strategies, including:


  • Rejecting requests from an individual user who has already accessed system APIs more than n times per second over a given period of time. This requires that the system meters the use of resources for each tenant or user running an application. For more information, see the Service Metering Guidance.
  • Disabling or degrading the functionality of selected nonessential services so that essential services can run unimpeded with sufficient resources. For example, if the application is streaming video output, it could switch to a lower resolution.
  • Using load leveling to smooth the volume of activity (this approach is covered in more detail by the Queue-based Load Leveling pattern). In a multitenant environment, this approach will reduce the performance for every tenant. If the system must support a mix of tenants with different SLAs, the work for high-value tenants might be performed immediately. Requests for other tenants can be held back, and handled when the backlog has eased. The Priority Queue pattern could be used to help implement this approach.
  • Deferring operations being performed on behalf of lower priority applications or tenants. These operations can be suspended or curtailed, with an exception generated to inform the tenant that the system is busy and that the operation should be retried later.

Figure 1 shows an area graph for resource utilization (a combination of memory, CPU, bandwidth, and other factors) against time for applications that are making use of three features. A feature is an area of functionality, such as a component that performs a specific set of tasks, a piece of code that performs a complex calculation, or an element that provides a service such as an in-memory cache. These features are labeled A, B, and C.


Figure 1 - Graph showing resource utilization against time for applications running on behalf of three users


Note

The area immediately below the line for a feature indicates the resources used by applications when they invoke this feature. For example, the area below the line for Feature A shows the resources used by applications that are making use of Feature A, and the area between the lines for Feature A and Feature B indicates the resources used by applications invoking Feature B. Aggregating the areas for each feature shows the total resource utilization of the system.

The graph in Figure 1 illustrates the effects of deferring operations. Just prior to time T1, the total resources allocated to all applications using these features reach a threshold (the soft limit of resource utilization). At this point, the applications are in danger of exhausting the resources available. In this system, Feature B is less critical than Feature A or Feature C, so it is temporarily disabled and the resources that it was using are released. Between times T1 and T2, the applications using Feature A and Feature C continue running as normal. Eventually, the resource use of these two features diminishes to the point when, at time T2, there is sufficient capacity to enable Feature B again.
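
The behavior described above can be sketched as a small gate around the less critical feature. This is a minimal illustration, not the guide's implementation: the FeatureGate class and its members are hypothetical, and using a separate, lower resume level (so the feature does not flap on and off around the soft limit) is an assumption that the text does not state.

```csharp
using System;

// Hypothetical gate for a non-essential feature ("Feature B" in Figure 1).
public sealed class FeatureGate
{
    private readonly double _softLimit;    // Utilization at which the feature is disabled (time T1).
    private readonly double _resumeLevel;  // Assumed utilization at which it is enabled again (time T2).

    public bool IsEnabled { get; private set; }

    public FeatureGate(double softLimit, double resumeLevel)
    {
        if (resumeLevel >= softLimit)
            throw new ArgumentException("resumeLevel must be below softLimit.");
        _softLimit = softLimit;
        _resumeLevel = resumeLevel;
        IsEnabled = true;
    }

    // Call with each new utilization sample (0.0 to 1.0); returns the resulting state.
    public bool Update(double utilization)
    {
        if (IsEnabled && utilization >= _softLimit)
        {
            IsEnabled = false;   // Soft limit reached: release the feature's resources.
        }
        else if (!IsEnabled && utilization <= _resumeLevel)
        {
            IsEnabled = true;    // Enough capacity is available again.
        }
        return IsEnabled;
    }
}
```

The application would feed the gate from its monitoring data and check IsEnabled before invoking the feature.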

The autoscaling and throttling approaches can also be combined to help keep the applications responsive and within SLAs. If the demand is expected to remain high, throttling may provide a temporary solution while the system scales out. At this point, the full functionality of the system can be restored.

Figure 2 shows an area graph of the overall resource utilization by all applications running in a system against time, and illustrates how throttling can be combined with autoscaling.



Figure 2 - Graph showing the effects of combining throttling with autoscaling


At time T1, the threshold specifying the soft limit of resource utilization is reached. At this point, the system can start to scale out. However, if the new resources do not become available sufficiently quickly then the existing resources may be exhausted and the system could fail. To prevent this from occurring, the system is temporarily throttled, as described earlier. When autoscaling has completed and the additional resources are available, throttling can be relaxed.

Issues and Considerations

You should consider the following points when deciding how to implement this pattern:


  • Throttling an application, and the strategy to use, is an architectural decision that impacts the entire design of a system. Throttling should be considered early on in the application design because it is not easy to add it once a system has been implemented.
  • Throttling must be performed quickly. The system must be capable of detecting an increase in activity and reacting accordingly. The system must also be able to revert to its original state quickly after the load has eased. This requires that the appropriate performance data is continually captured and monitored.
  • If a service needs to temporarily deny a user request, it should return a specific error code so that the client application understands that the operation was refused because of throttling. The client application can wait for a period before retrying the request.
  • Throttling can be used as an interim measure while a system autoscales. In some cases it may be better to simply throttle, rather than to scale, if a burst in activity is sudden and is not expected to be long lived because scaling can add considerably to running costs.
  • If throttling is being used as a temporary measure while a system autoscales, and if resource demands grow very quickly, the system might not be able to continue functioning, even when operating in a throttled mode. If this is not acceptable, consider maintaining larger reserves of capacity and configuring more aggressive autoscaling.
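
The point above about returning a specific error code and retrying after a wait can be sketched from the client's side as follows. This is only an illustration under stated assumptions: the ThrottleAwareClient class is hypothetical, the request is modeled as a delegate that returns a status code, and HTTP status 429 (Too Many Requests) is used as the throttling code, which the text does not prescribe.

```csharp
using System;
using System.Threading;

// Hypothetical client-side helper.
public static class ThrottleAwareClient
{
    // Assumed status code meaning "throttled"; the guidance only asks for a specific code.
    public const int ThrottledStatus = 429;

    // Calls send until it returns something other than ThrottledStatus,
    // or until maxAttempts calls have been made. Returns the last status code seen.
    public static int SendWithRetry(Func<int> send, int maxAttempts, TimeSpan delay,
                                    Action<TimeSpan> wait = null)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        // The wait action is injectable so the delay can be replaced in tests.
        wait = wait ?? (d => Thread.Sleep(d));

        int status = send();
        for (int attempt = 1; attempt < maxAttempts && status == ThrottledStatus; attempt++)
        {
            wait(delay);      // Back off before trying again.
            status = send();
        }
        return status;
    }
}
```

A fixed delay keeps the sketch short; a production client would more likely increase the delay between attempts.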

When to Use this Pattern

Use this pattern:


  • To ensure that a system continues to meet service level agreements.
  • To prevent a single tenant from monopolizing the resources provided by an application.
  • To handle bursts in activity.
  • To help cost-optimize a system by limiting the maximum resource levels needed to keep it functioning.

Example

Figure 3 illustrates how throttling can be implemented in a multi-tenant system. Users from each of the tenant organizations access a cloud-hosted application where they fill out and submit surveys. The application contains instrumentation that monitors the rate at which these users are submitting requests to the application.


In order to prevent the users from one tenant affecting the responsiveness and availability of the application for all other users, a limit is applied to the number of requests per second that the users from any one tenant can submit. The application blocks requests that exceed this limit.
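
One way to picture the per-tenant limit described above is a fixed-window counter keyed by tenant. This is a minimal sketch, not the application's actual implementation: the TenantRateLimiter class, its TryAcquire method, and the injectable clock are hypothetical, and a fixed one-second window is an assumed choice (a sliding window or token bucket would smooth bursts at window boundaries).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical limiter: at most `limit` requests per tenant in each one-second window.
public sealed class TenantRateLimiter
{
    private readonly int _limit;
    private readonly Func<DateTime> _clock;
    private readonly object _gate = new object();
    private readonly Dictionary<string, Window> _windows = new Dictionary<string, Window>();

    private sealed class Window
    {
        public long Second;  // Window identifier: whole seconds derived from the clock's ticks.
        public int Count;    // Requests accepted so far in that window.
    }

    public TenantRateLimiter(int limit, Func<DateTime> clock = null)
    {
        if (limit <= 0) throw new ArgumentOutOfRangeException(nameof(limit));
        _limit = limit;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    // Returns true if the request may proceed, false if it should be blocked.
    public bool TryAcquire(string tenantId)
    {
        long second = _clock().Ticks / TimeSpan.TicksPerSecond;
        lock (_gate)
        {
            Window window;
            if (!_windows.TryGetValue(tenantId, out window) || window.Second != second)
            {
                // First request from this tenant in the current second: start a new window.
                window = new Window { Second = second, Count = 0 };
                _windows[tenantId] = window;
            }
            if (window.Count >= _limit)
            {
                return false;
            }
            window.Count++;
            return true;
        }
    }
}
```

The application would call TryAcquire with the tenant identifier at the start of each request and return its throttling error code when the call returns false.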



Figure 3 - Implementing throttling in a multi-tenant application


Related Patterns and Guidance

The following patterns and guidance may also be relevant when implementing this pattern:


  • Instrumentation and Telemetry Guidance. Throttling depends on gathering information on how heavily a service is being used. The Instrumentation and Telemetry Guidance describes how to generate and capture custom monitoring information.
  • Service Metering Guidance. This guidance describes how to meter the use of services in order to gain an understanding of how they are used. This information can be useful in determining how to throttle a service.
  • Autoscaling Guidance. Throttling can be used as an interim measure while a system autoscales, or to remove the need for a system to autoscale. The Autoscaling Guidance contains more information on autoscaling strategies.
  • Queue-based Load Leveling pattern. Queue-based load leveling is a commonly used mechanism for implementing throttling. A queue can act as a buffer that helps to even out the rate at which requests sent by an application are delivered to a service.
  • Priority Queue Pattern. A system can use priority queuing as part of its throttling strategy to maintain performance for critical or higher value applications, while reducing the performance of less important applications.

Valet Key Pattern

  • Article
  • 08/26/2015
  • 13 minutes to read
    Use a token or key that provides clients with restricted direct access to a specific resource or service in order to offload data transfer operations from the application code. This pattern is particularly useful in applications that use cloud-hosted storage systems or queues, and can minimize cost and maximize scalability and performance.


Context and Problem

Client programs and web browsers often need to read and write files or data streams to and from an application’s storage. Typically, the application will handle the movement of the data—either by fetching it from storage and streaming it to the client, or by reading the uploaded stream from the client and storing it in the data store. However, this approach absorbs valuable resources such as compute, memory, and bandwidth.

Data stores have the capability to handle upload and download of data directly, without requiring the application to perform any processing to move this data, but this typically requires the client to have access to the security credentials for the store. While this can be a useful technique to minimize data transfer costs and the requirement to scale out the application, and to maximize performance, it means that the application is no longer able to manage the security of the data. Once the client has a connection to the data store for direct access, the application cannot act as the gatekeeper. It is no longer in control of the process and cannot prevent subsequent uploads or downloads from the data store.


This is not a realistic approach in modern distributed systems that may need to serve untrusted clients. Instead, applications must be able to securely control access to data in a granular way, but still reduce the load on the server by setting up this connection and then allowing the client to communicate directly with the data store to perform the required read or write operations.


Solution

To resolve the problem of controlling access to a data store where the store itself cannot manage authentication and authorization of clients, one typical solution is to restrict access to the data store’s public connection and provide the client with a key or token that the data store itself can validate.


This key or token is usually referred to as a valet key. It provides time-limited access to specific resources and allows only predefined operations such as reading and writing to storage or queues, or uploading and downloading in a web browser. Applications can create and issue valet keys to client devices and web browsers quickly and easily, allowing clients to perform the required operations without requiring the application to directly handle the data transfer. This removes the processing overhead, and the consequent impact on performance and scalability, from the application and the server.

The client uses this token to access a specific resource in the data store for only a specific period, and with specific restrictions on access permissions, as shown in Figure 1. After the specified period, the key becomes invalid and will not allow subsequent access to the resource.



Figure 1 - Overview of the pattern


It is also possible to configure a key that has other dependencies, such as the scope of the location of the data. For example, depending on the data store capabilities, the key may specify a complete table in a data store, or only specific rows in a table. In cloud storage systems the key may specify a container, or just a specific item within a container.


The key can also be invalidated by the application. This is a useful approach if the client notifies the server that the data transfer operation is complete. The server can then invalidate that key to prevent its use for any subsequent access to the data store.


Using this pattern can simplify managing access to resources because there is no requirement to create and authenticate a user, grant permissions, and then remove the user again. It also makes it easy to constrain the location, the permission, and the validity period—all by simply generating a suitable key at runtime. The important factors are to limit the validity period, and especially the location of the resource, as tightly as possible so that the recipient can use it for only the intended purpose.
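
To make the idea of a key that constrains location, permission, and validity period more concrete, the following sketch signs those three values with a secret shared by the application and the data store. It is an illustration only: the ValetKeyIssuer class, the key layout, and the use of HMAC-SHA256 are assumptions rather than the format of any particular storage service, and revocation is not shown. Resource names containing the '|' separator are rejected by validation in this simplified layout.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical valet key layout: "<resource>|<permission>|<expiry ticks>|<signature>".
public sealed class ValetKeyIssuer
{
    private readonly byte[] _secret;

    public ValetKeyIssuer(byte[] secret)
    {
        _secret = secret;
    }

    // Issued by the application: grants one permission on one resource until expiresUtc.
    public string Issue(string resource, string permission, DateTime expiresUtc)
    {
        string payload = resource + "|" + permission + "|" + expiresUtc.Ticks;
        return payload + "|" + Sign(payload);
    }

    // Checked by the data store before it serves the request.
    public bool Validate(string key, string resource, string permission, DateTime nowUtc)
    {
        int split = key.LastIndexOf('|');
        if (split < 0) return false;
        string payload = key.Substring(0, split);
        string signature = key.Substring(split + 1);
        if (!FixedTimeEquals(Sign(payload), signature)) return false;  // Tampered key or wrong secret.

        string[] parts = payload.Split('|');
        long expiryTicks;
        if (parts.Length != 3 || !long.TryParse(parts[2], out expiryTicks)) return false;
        return parts[0] == resource
            && parts[1] == permission
            && nowUtc.Ticks < expiryTicks;
    }

    private string Sign(string payload)
    {
        using (var hmac = new HMACSHA256(_secret))
        {
            return Convert.ToBase64String(hmac.ComputeHash(Encoding.UTF8.GetBytes(payload)));
        }
    }

    // Compares every character so timing does not reveal where the first mismatch is.
    private static bool FixedTimeEquals(string a, string b)
    {
        if (a.Length != b.Length) return false;
        int diff = 0;
        for (int i = 0; i < a.Length; i++) diff |= a[i] ^ b[i];
        return diff == 0;
    }
}
```

Because the resource, permission, and expiry are all part of the signed payload, a client cannot widen any of them without invalidating the signature.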


Issues and Considerations

Consider the following points when deciding how to implement this pattern:


  • Manage the validity status and period of the key. The key is a bearer instrument that, if leaked or compromised, effectively unlocks the target item and makes it available for malicious use during the validity period. A key can usually be revoked or disabled, depending on how it was issued. Server-side policies can be changed or, in the ultimate case, the server key it was signed with can be invalidated. Specify a short validity period to minimize the risk of allowing subsequent unwarranted operations to take place against the data store. However, if the validity period is too short, the client may not be able to complete the operation before the key expires. Allow authorized users to renew the key before the validity period expires if multiple accesses to the protected resource are required.
  • Control the level of access the key will provide. Typically, the key should allow the user to perform only the actions necessary to complete the operation, such as read-only access if the client should not be able to upload data to the data store. For file uploads, it is common to specify a key that provides write-only permission, as well as the location and the validity period. It is vital to accurately specify the resource or the set of resources to which the key applies.
  • Consider how to control users' behavior. Implementing this pattern means some loss of control over the resources to which users are granted access. The level of control that can be exerted is limited by the capabilities of the policies and permissions available for the service or the target data store. For example, it is usually not possible to create a key that limits the size of the data to be written to storage, or the number of times the key can be used to access a file. This can result in huge unexpected costs for data transfer, even when the key is used by the intended client, and might be caused by an error in the code that causes repeated uploads or downloads. To limit the number of times a file can be uploaded or downloaded it may be necessary, where possible, to force the client to notify the application when one operation has completed. For example, some data stores raise events that the application code can use to monitor operations and control user behavior. However, it may be hard to enforce quotas for individual users in a multi-tenant scenario where the same key is used by all the users from one tenant.
  • Validate, and optionally sanitize, all uploaded data. A malicious user who gains access to the key could upload data aimed at further compromising the system. Alternatively, authorized users might upload data that is invalid and, when processed, could result in an error or system failure. To protect against this, ensure that all uploaded data is validated and checked for malicious content before use.
  • Audit all operations. Many key-based mechanisms can log operations such as uploads, downloads, and failures. These logs can usually be incorporated into an audit process, and also used for billing if the user is charged based on file size or data volume. Use the logs to detect authentication failures that might be caused by issues with the key provider, or inadvertent removal of a stored access policy.
  • Deliver the key securely. The key may be embedded in a URL that the user activates in a web page, or it may be used in a server redirection operation so that the download occurs automatically. Always use HTTPS to deliver the key over a secure channel.
  • Protect sensitive data in transit. Sensitive data delivered through the application will usually be transferred using SSL or TLS, and this should also be enforced for clients accessing the data store directly.
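
To make the first two considerations more concrete, the sketch below shows how a resource could check a presented key against the exact resource, the exact permission, and the validity window before allowing an operation. It assumes a hypothetical token layout of "resource|permission|startUtc|expiryUtc|signature" signed with HMAC-SHA256; the ValetKeyCheck class and its methods are illustrative only and do not reflect how any particular data store implements its checks.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public static class ValetKeyCheck
{
    // Signs a payload with the server-side secret (hypothetical scheme).
    public static string Sign(byte[] secret, string payload)
    {
        using (var hmac = new HMACSHA256(secret))
        {
            return Convert.ToBase64String(hmac.ComputeHash(Encoding.UTF8.GetBytes(payload)));
        }
    }

    // Returns true only if the token is untampered, currently valid,
    // and issued for exactly this resource and permission.
    public static bool IsAllowed(byte[] secret, string token, string resource,
                                 string permission, DateTime nowUtc)
    {
        int split = token.LastIndexOf('|');
        if (split < 0) return false;
        string payload = token.Substring(0, split);
        string signature = token.Substring(split + 1);
        if (!FixedTimeEquals(Sign(secret, payload), signature)) return false;

        string[] parts = payload.Split('|');
        if (parts.Length != 4) return false;
        DateTime start, expiry;
        if (!TryParseUtc(parts[2], out start) || !TryParseUtc(parts[3], out expiry)) return false;

        return parts[0] == resource
            && parts[1] == permission
            && nowUtc >= start && nowUtc < expiry;
    }

    private static bool TryParseUtc(string text, out DateTime value)
    {
        return DateTime.TryParseExact(text, "o", CultureInfo.InvariantCulture,
            DateTimeStyles.RoundtripKind, out value);
    }

    // Compares without exiting early so timing reveals less about the signature.
    private static bool FixedTimeEquals(string a, string b)
    {
        if (a.Length != b.Length) return false;
        int diff = 0;
        for (int i = 0; i < a.Length; i++) diff |= a[i] ^ b[i];
        return diff == 0;
    }
}
```

A key issued for write access to one item is rejected for any other item, for read access, and at any time outside its validity window, which is why a tightly scoped key limits the damage if it leaks.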

Other issues to be aware of when implementing this pattern are:


  • If the client does not, or cannot, notify the server of completion of the operation, and the only limit is the expiry period of the key, the application will not be able to perform auditing operations such as counting the number of uploads or downloads, or preventing multiple uploads or downloads.
  • The flexibility of key policies that can be generated may be limited. For example, some mechanisms may allow only the use of a timed expiry period. Others may not be able to specify a sufficient granularity of read/write permissions.
  • If the start time for the key or token validity period is specified, ensure that it is a little earlier than the current server time to allow for client clocks that might be slightly out of synchronization. If the start time is not specified, the default is usually the current server time.
  • The URL containing the key will be recorded in server log files. While the key will typically have expired before the log files are used for analysis, ensure that you limit access to them. If log data is transmitted to a monitoring system or stored in another location, consider implementing a delay to prevent leakage of keys until after their validity period has expired.
  • If the client code runs in a web browser, the browser may need to support cross-origin resource sharing (CORS) to enable code that executes within the web browser to access data in a different domain from the originating domain that served the page. Some older browsers and some data stores do not support CORS, and code that runs in these browsers may not be able to use a valet key to provide access to data in a different domain, such as a cloud storage account.
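
Where the application itself writes URLs to its own logs, one complementary measure (an assumption of this sketch, not something the guidance above prescribes) is to mask the signature before logging. The example assumes the signature travels in a query parameter named sig, as it does in an Azure SAS URL; the SasLogRedactor class is hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class SasLogRedactor
{
    // Matches the value of a "sig" query parameter, up to the next '&' or '#'.
    private static readonly Regex SignatureValue =
        new Regex(@"(?<=[?&]sig=)[^&#]*", RegexOptions.IgnoreCase);

    // Returns the URL with the signature value masked; other parameters are untouched.
    public static string Redact(string url)
    {
        return SignatureValue.Replace(url, "REDACTED");
    }
}
```

This does not affect logs written by the data store or by intermediate servers, so access to those logs still needs to be limited as described above.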

When to Use this Pattern

This pattern is ideally suited for the following situations:


  • To minimize resource loading and maximize performance and scalability. Using a valet key does not require the resource to be locked, no remote server call is required, there is no limit on the number of valet keys that can be issued, and it avoids a single point of failure that would arise from performing the data transfer through the application code. Creating a valet key is typically a simple cryptographic operation of signing a string with a key.
  • To minimize operational cost. Enabling direct access to stores and queues is resource and cost efficient, can result in fewer network round trips, and may allow for a reduction in the number of compute resources required.
  • When clients regularly upload or download data, particularly where there is a large volume or when each operation involves large files.
  • When the application has limited compute resources available, either due to hosting limitations or cost considerations. In this scenario, the pattern is even more advantageous if there are many concurrent data uploads or downloads because it relieves the application from handling the data transfer.
  • When the data is stored in a remote data store or a different datacenter. If the application were required to act as a gatekeeper, there may be a charge for the additional bandwidth of transferring the data between datacenters, or across public or private networks between the client and the application, and then between the application and the data store.

This pattern might not be suitable in the following situations:


  • If the application must perform some task on the data before it is stored or before it is sent to the client. For example, the application may need to perform validation, log access success, or execute a transformation on the data. However, some data stores and clients are able to negotiate and carry out simple transformations such as compression and decompression (for example, a web browser can usually handle GZip formats).
  • If the design and implementation of an existing application makes the pattern difficult and costly to implement. Using this pattern typically requires a different architectural approach for delivering and receiving data.
  • If it is necessary to maintain audit trails or control the number of times a data transfer operation is executed, and the valet key mechanism in use does not support notifications that the server can use to manage these operations.
  • If it is necessary to limit the size of the data, especially during upload operations. The only solution to this is for the application to check the data size after the operation is complete, or check the size of uploads after a specified period or on a scheduled basis.
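
The after-the-fact size check mentioned in the last point could look something like the following sketch. IUploadStore is a hypothetical abstraction over the data store, not a real SDK interface, and InMemoryUploadStore is only a stand-in so that the example is self-contained.

```csharp
using System.Collections.Generic;

// Hypothetical abstraction over the data store used for uploads.
public interface IUploadStore
{
    long GetSizeInBytes(string name);
    void Delete(string name);
}

public static class UploadSizeCheck
{
    // Returns true if the completed upload is within the limit;
    // otherwise removes the upload and returns false.
    public static bool EnforceLimit(IUploadStore store, string name, long maxBytes)
    {
        if (store.GetSizeInBytes(name) <= maxBytes) return true;
        store.Delete(name);
        return false;
    }
}

// In-memory stand-in for a real store, for illustration only.
public sealed class InMemoryUploadStore : IUploadStore
{
    public readonly Dictionary<string, byte[]> Items = new Dictionary<string, byte[]>();

    public long GetSizeInBytes(string name) { return Items[name].LongLength; }

    public void Delete(string name) { Items.Remove(name); }
}
```

Such a check would run when the client reports completion or on a schedule; it cannot stop the oversized transfer itself, only clean up afterwards.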

Example

Microsoft Azure supports Shared Access Signatures (SAS) on Azure storage for granular access control to data in blobs, tables, and queues, and for Service Bus queues and topics. A SAS token can be configured to provide specific access rights such as read, write, update, and delete to a specific table; a key range within a table; a queue; a blob; or a blob container. The validity period can be a specified length of time or have no time limit.

Azure SAS also supports server-stored access policies that can be associated with a specific resource such as a table or blob. This feature provides additional control and flexibility compared to application-generated SAS tokens, and should be used whenever possible. Settings defined in a server-stored policy can be changed and are reflected in the token without requiring a new token to be issued, but settings defined in the token itself cannot be changed without issuing a new token. This approach also makes it possible to revoke a valid SAS token before it has expired.
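
The effect of a server-stored policy can be pictured with a small in-memory model. This is a conceptual sketch only: StoredPolicy and PolicyStore are hypothetical types, not Azure APIs, and signature checking is left out so that the example stays focused on the lookup. The point it illustrates is that a token naming a stored policy is evaluated against whatever that policy says at the time of use.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical server-side policy record.
public sealed class StoredPolicy
{
    public string Permissions;   // for example "r", "w", or "rw"
    public DateTime ExpiryUtc;
}

public sealed class PolicyStore
{
    private readonly Dictionary<string, StoredPolicy> policies =
        new Dictionary<string, StoredPolicy>();

    // Creates or replaces a policy; tokens that name it pick up the change.
    public void Set(string policyId, StoredPolicy policy) { policies[policyId] = policy; }

    // Removing the policy revokes every token that refers to it.
    public bool Remove(string policyId) { return policies.Remove(policyId); }

    // Evaluates a request against the policy as it is stored right now.
    public bool Allows(string policyId, string permission, DateTime nowUtc)
    {
        StoredPolicy policy;
        if (!policies.TryGetValue(policyId, out policy)) return false;
        return policy.Permissions.Contains(permission) && nowUtc < policy.ExpiryUtc;
    }
}
```

Shortening the expiry time or removing the policy takes effect immediately for tokens already handed out, whereas settings written into a token itself stay fixed until that token expires.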

Note

For more information see "Introducing Table SAS (Shared Access Signature), Queue SAS and update to Blob SAS" in the Azure Storage Team blog and "Shared Access Signatures, Part 1: Understanding the SAS Model" on MSDN.

The following code demonstrates how to create a SAS that is valid for five minutes. The GetSharedAccessReferenceForUpload method returns a SAS that can be used to upload a file to Azure Blob Storage.

using System;
using System.Web.Http;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public class ValuesController : ApiController
{
  private readonly CloudStorageAccount account;
  private readonly string blobContainer;

  /// <summary>
  /// Return a limited access key that allows the caller to upload a file
  /// to this specific destination for a defined period of time.
  /// </summary>
  private StorageEntitySas GetSharedAccessReferenceForUpload(string blobName)
  {
      var blobClient = this.account.CreateCloudBlobClient();
      var container = blobClient.GetContainerReference(this.blobContainer);
      var blob = container.GetBlockBlobReference(blobName);

      var policy = new SharedAccessBlobPolicy
      {
          // Grant write-only access, which is all that an upload requires.
          Permissions = SharedAccessBlobPermissions.Write,
          // Specify a start time five minutes earlier to allow for client clock skew.
          SharedAccessStartTime = DateTime.UtcNow.AddMinutes(-5),
          // Specify a validity period of five minutes starting from now.
          SharedAccessExpiryTime = DateTime.UtcNow.AddMinutes(5)
      };

      // Create the signature.
      var sas = blob.GetSharedAccessSignature(policy);

      return new StorageEntitySas
      {
          BlobUri = blob.Uri,
          Credentials = sas,
          Name = blobName
      };
  }

  public struct StorageEntitySas
  {
      public string Credentials;
      public Uri BlobUri;
      public string Name;
  }
}

Note

The complete sample containing this code is available in the ValetKey solution available for download with this guidance. The ValetKey.Web project in this solution contains a web application that includes the ValuesController class shown above. A sample client application that uses this web application to retrieve a SAS key and upload a file to blob storage is available in the ValetKey.Client project.
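
The client side of the exchange can be sketched as follows. This is not the code from the ValetKey.Client project; SasUploadRequest is a hypothetical helper that assumes the client has received the BlobUri and Credentials values returned by the web application, and that it uploads with a plain HTTP PUT carrying the x-ms-blob-type header that the Blob service expects for a block blob.

```csharp
using System;
using System.Net.Http;

public static class SasUploadRequest
{
    // Builds (but does not send) a PUT request that uploads content using a SAS.
    public static HttpRequestMessage Create(Uri blobUri, string sasQuery, byte[] content)
    {
        // The SAS is appended as the query string; add the '?' if it is missing.
        string query = sasQuery.StartsWith("?", StringComparison.Ordinal) ? sasQuery : "?" + sasQuery;
        var request = new HttpRequestMessage(HttpMethod.Put, new Uri(blobUri.AbsoluteUri + query));

        // Tells the Blob service to store the payload as a block blob.
        request.Headers.Add("x-ms-blob-type", "BlockBlob");
        request.Content = new ByteArrayContent(content);
        return request;
    }
}
```

The request would then be sent with HttpClient.SendAsync over HTTPS. No storage account key is present on the client; the SAS in the query string is the only credential.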

Related Patterns and Guidance

The following patterns and guidance may also be relevant when implementing this pattern:


  • Gatekeeper Pattern. This pattern can be used in conjunction with the Valet Key pattern to protect applications and services by using a dedicated host instance that acts as a broker between clients and the application or service. The gatekeeper validates and sanitizes requests, and passes requests and data between the client and the application. This pattern can provide an additional layer of security, and reduce the attack surface of the system.
  • Static Content Hosting Pattern. This pattern describes how to deploy static resources to a cloud-based storage service that can deliver these resources directly to the client in order to reduce the requirement for expensive compute instances. Where the resources are not intended to be publicly available, the Valet Key pattern can be used to secure them.