Azure Best Practices Availability Checklist

Application design

  • Avoid any single point of failure. All components, services, resources, and compute instances should be deployed as multiple instances to prevent a single point of failure from affecting availability. This includes authentication mechanisms. Design the application to be configurable to use multiple instances, and to automatically detect failures and redirect requests to non-failed instances where the platform does not do this automatically.
  • Decompose workloads by service-level agreement. If a service is composed of critical and less-critical workloads, manage them differently, and specify the service features and number of instances each needs to meet its availability requirements.
  • Minimize and understand service dependencies. Minimize the number of different services used where possible, and ensure you understand all of the feature and service dependencies that exist in the system. This includes the nature of these dependencies, and the impact of failure or reduced performance in each one on the overall application. Microsoft guarantees at least 99.9 percent availability for most services, but the composite availability of an application is the product of the SLAs of all the services it depends on, so each additional service can reduce the overall availability SLA of your system by up to 0.1 percent. For example, two dependencies that each offer 99.9 percent availability compose to approximately 99.8 percent.
  • Design tasks and messages to be idempotent (safely repeatable) where possible, so that duplicated requests will not cause problems. For example, a service can act as a consumer that handles messages sent as requests by other parts of the system that act as producers. If the consumer fails after processing the message, but before acknowledging that it has been processed, a producer might submit a repeat request which could be handled by another instance of the consumer. For this reason, consumers and the operations they carry out should be idempotent so that repeating a previously executed operation does not render the results invalid. This may mean detecting duplicated messages, or ensuring consistency by using an optimistic approach to handling conflicts. A minimal idempotent-consumer sketch appears after this list.
  • Use a message broker that implements high availability for critical transactions. Many scenarios for initiating tasks or accessing remote services use messaging to pass instructions between the application and the target service. For best performance, the application should be able to send the message and then return to process more requests, without needing to wait for a reply. To guarantee delivery of messages, the messaging system should provide high availability. Azure Service Bus message queues implement at-least-once semantics. This means that each message posted to a queue will not be lost, although duplicate copies may be delivered under certain circumstances. If message processing is idempotent (see the previous item), repeated delivery should not be a problem.
  • Design applications to gracefully degrade when reaching resource limits, and take appropriate action to minimize the impact for the user. In some cases, the load on the application may exceed the capacity of one or more parts, causing reduced availability and failed connections. Scaling can help to alleviate this, but it may reach a limit imposed by other factors, such as resource availability or cost. Design the application so that, in this situation, it can automatically degrade gracefully. For example, in an ecommerce system, if the order-processing subsystem is under strain (or has even failed completely), it can be temporarily disabled while allowing other functionality (such as browsing the product catalog) to continue. It might be appropriate to postpone requests to a failing subsystem, for example still enabling customers to submit orders but saving them for later processing, when the orders subsystem is available again.
  • Gracefully handle rapid burst events. Most applications need to handle varying workloads over time, such as peaks first thing in the morning in a business application or when a new product is released in an ecommerce site. Auto-scaling can help to handle the load, but it may take some time for additional instances to come online and handle requests. Prevent sudden and unexpected bursts of activity from overwhelming the application: design it to queue requests to the services it uses and degrade gracefully when queues are near to full capacity. Ensure that there is sufficient performance and capacity available under non-burst conditions to drain the queues and handle outstanding requests. For more information, see the Queue-Based Load Leveling Pattern.
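
To make the idempotent-consumer guidance above concrete, the following is a minimal sketch in Python. It assumes messages carry a unique message_id and that processed IDs are recorded somewhere durable; process_order and processed_ids are hypothetical names for illustration, not part of any Azure SDK.

    # In production the set of processed IDs would live in a durable store,
    # not in process memory.
    processed_ids = set()

    def process_order(body):
        print("processing", body)          # stands in for the real work

    def handle_message(message):
        """Handle at-least-once delivery: repeat deliveries are harmless."""
        msg_id = message["message_id"]
        if msg_id in processed_ids:
            # Duplicate delivery, for example a redelivery after a missed
            # acknowledgement: skip the work but still acknowledge.
            return
        process_order(message["body"])     # the actual business operation
        processed_ids.add(msg_id)          # record only after success

    # Simulated at-least-once delivery: the same message arrives twice,
    # but the order is processed only once.
    message = {"message_id": "42", "body": "order #1001"}
    handle_message(message)
    handle_message(message)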

Deployment and maintenance

  • Deploy multiple instances of roles for each service. Microsoft makes availability guarantees for services that you create and deploy, but these guarantees are only valid if you deploy at least two instances of each role in the service. This enables one role to be unavailable while the other remains active. This is especially important if you need to deploy updates to a live system without interrupting clients’ activities; instances can be taken down and upgraded individually while the others continue online.
  • Host applications in multiple datacenters. Although extremely unlikely, it is possible for an entire datacenter to go offline through an event such as a natural disaster or Internet failure. Vital business applications should be hosted in more than one datacenter to provide maximum availability. This can also reduce latency for local users, and provide additional opportunities for flexibility when updating applications.
  • Automate and test deployment and maintenance tasks. Distributed applications consist of multiple parts that must work together. Deployment should therefore be automated, using tested and proven mechanisms such as scripts and deployment applications. These can update and validate configuration, and automate the deployment process. Automated techniques should also be used to perform updates of all or parts of applications. It is vital to test all of these processes fully to ensure that errors do not cause additional downtime. All deployment tools must have suitable security restrictions to protect the deployed application; define and enforce deployment policies carefully and minimize the need for human intervention.
  • Consider using staging and production features of the platform where these are available. For example, using Azure Cloud Services staging and production environments allows applications to be switched from one to another instantly through a virtual IP address swap (VIP Swap). However, if you prefer to stage on-premises, or deploy different versions of the application concurrently and gradually migrate users, you may not be able to use a VIP Swap operation.
  • Apply configuration changes without recycling the instance when possible. In many cases, the configuration settings for an Azure application or service can be changed without requiring the role to be restarted. Roles expose events that can be handled to detect configuration changes and apply them to components within the application. However, some changes to the core platform settings do require a role to be restarted. When building components and services, maximize availability and minimize downtime by designing them to accept changes to configuration settings without requiring the application as a whole to be restarted. A sketch of this approach appears after this list.
  • Use upgrade domains for zero downtime during updates. Azure compute units such as web and worker roles are allocated to upgrade domains. Upgrade domains group role instances together so that, when a rolling update takes place, each role in the upgrade domain is stopped, updated, and restarted in turn. This minimizes the impact on application availability. You can specify how many upgrade domains should be created for a service when the service is deployed.

    Note

    Roles are also distributed across fault domains, each of which is reasonably independent from other fault domains in terms of server rack, power, and cooling provision, in order to minimize the chance of a failure affecting all role instances. This distribution occurs automatically, and you cannot control it.

  • Configure availability sets for Azure virtual machines. Placing two or more virtual machines in the same availability set guarantees that they will not be deployed to the same fault domain. To maximize availability, create multiple instances of each critical virtual machine used by your system and place them in the same availability set. If you run virtual machines that serve different purposes, create a separate availability set for each purpose. For example, if you have created separate virtual machines to act as a web server and a reporting server, create one availability set for the web server instances and another for the reporting server instances.
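
As a sketch of applying configuration changes without recycling an instance (see the configuration bullet above), the following Python example polls a settings file and reloads it in place while the process keeps serving requests. The file name settings.json and the polling approach are illustrative assumptions; Azure roles expose change events through their SDKs instead.

    import json
    import os
    import threading
    import time

    SETTINGS_PATH = "settings.json"    # hypothetical configuration file
    current_settings = {}

    def reload_settings():
        """Re-read the settings and apply them in place, with no restart."""
        global current_settings
        with open(SETTINGS_PATH) as f:
            current_settings = json.load(f)

    def watch_settings(interval=5):
        """Poll the file's modification time and reload when it changes."""
        last_mtime = os.path.getmtime(SETTINGS_PATH)
        while True:
            time.sleep(interval)
            mtime = os.path.getmtime(SETTINGS_PATH)
            if mtime != last_mtime:
                last_mtime = mtime
                reload_settings()

    # Run the watcher on a background thread so that request handling
    # continues uninterrupted while settings are refreshed.
    if os.path.exists(SETTINGS_PATH):
        reload_settings()
        threading.Thread(target=watch_settings, daemon=True).start()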

Data management

  • Geo-replicate data in Azure Storage. Data in Azure Storage is automatically replicated within a datacenter. For even higher availability, use read-access geo-redundant storage (RA-GRS), which replicates your data to a secondary region and provides read-only access to the data in the secondary location. The data is durable even in the case of a complete regional outage or a disaster. For more information, see Azure Storage replication.
  • Geo-replicate databases. Azure SQL Database and Cosmos DB both support geo-replication, which enables you to configure secondary database replicas in other regions. Secondary databases are available for querying and for failover in the case of a datacenter outage or the inability to connect to the primary database. For more information, see Failover groups and active geo-replication (SQL Database) and How to distribute data globally with Azure Cosmos DB?.
  • Use optimistic concurrency and eventual consistency where possible. Transactions that block access to resources through locking (pessimistic concurrency) can cause poor performance and considerably reduce availability. These problems can become especially acute in distributed systems. In many cases, careful design and techniques such as partitioning can minimize the chances of conflicting updates occurring. Where data is replicated, or is read from a separately updated store, the data will only be eventually consistent. However, the advantages of this approach usually far outweigh the availability impact of using transactions to ensure immediate consistency. A sketch of optimistic concurrency using a version column appears after this list.
  • Use periodic backup and point-in-time restore, and ensure it meets the Recovery Point Objective (RPO). Regularly and automatically back up data that is not preserved elsewhere, and verify you can reliably restore both the data and the application itself should a failure occur. Data replication is not a backup feature because errors and inconsistencies introduced through failure, error, or malicious operations will be replicated across all stores. The backup process must be secure to protect the data in transit and in storage. Databases or parts of a data store can usually be recovered to a previous point in time by using transaction logs. Microsoft Azure provides a backup facility for data stored in Azure SQL Database. The data is exported to a backup package on Azure blob storage, and can be downloaded to a secure on-premises location for storage.
  • Enable the high availability option to maintain a secondary copy of an Azure Redis Cache. When using Azure Redis Cache, choose the Standard tier to maintain a secondary copy of the contents. For more information, see Create a cache in Azure Redis Cache.
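
To illustrate the optimistic concurrency bullet above, here is a minimal sketch using a version column and sqlite3 from the Python standard library. The accounts table and the credit function are hypothetical; Azure Storage and SQL Database provide equivalent mechanisms such as ETags and row versions.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts "
                 "(id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
    conn.execute("INSERT INTO accounts VALUES (1, 100, 0)")

    def credit(account_id, amount, retries=3):
        """Apply an update only if the row is unchanged since it was read."""
        for _ in range(retries):
            balance, version = conn.execute(
                "SELECT balance, version FROM accounts WHERE id = ?",
                (account_id,)).fetchone()
            # The version check in the WHERE clause detects a conflicting
            # update: if another writer got in first, no row matches and
            # the loop re-reads and retries instead of holding a lock.
            updated = conn.execute(
                "UPDATE accounts SET balance = ?, version = ? "
                "WHERE id = ? AND version = ?",
                (balance + amount, version + 1, account_id, version))
            if updated.rowcount == 1:
                return True
        return False    # give up after repeated conflicts

    credit(1, 50)
    print(conn.execute(
        "SELECT balance, version FROM accounts WHERE id = 1").fetchone())
    # -> (150, 1)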

Errors and failures

  • Introduce the concept of a timeout. Services and resources may become unavailable, causing requests to fail. Ensure that the timeouts you apply are appropriate for each service or resource as well as the client that is accessing them. (In some cases, it may be appropriate to allow a longer timeout for a particular instance of a client, depending on the context and other actions that the client is performing.) Very short timeouts may cause excessive retry operations for services and resources that have considerable latency. Very long timeouts can cause blocking if a large number of requests are queued, waiting for a service or resource to respond.
  • Retry failed operations caused by transient faults. Design a retry strategy for access to all services and resources where they do not inherently support automatic connection retry. Use a strategy that includes an increasing delay between retries as the number of failures increases, to prevent overloading of the resource and to allow it to gracefully recover and handle queued requests. Continual retries with very short delays are likely to exacerbate the problem. A sketch of such a backoff strategy appears after this list.
  • Stop sending requests to avoid cascading failures when remote services are unavailable. There may be situations in which transient or other faults, ranging in severity from a partial loss of connectivity to the complete failure of a service, take much longer than expected to return to normal. Additionally, if a service is very busy, failure in one part of the system may lead to cascading failures, and result in many operations becoming blocked while holding onto critical system resources such as memory, threads, and database connections. Instead of continually retrying an operation that is unlikely to succeed, the application should quickly accept that the operation has failed, and gracefully handle this failure. You can use the circuit breaker pattern to reject requests for specific operations for defined periods. For more information, see Circuit Breaker Pattern.
  • Compose or fall back to multiple components to mitigate the impact of a specific service being offline or unavailable. Design applications to take advantage of multiple instances without affecting operation and existing connections where possible. Use multiple instances and distribute requests between them, and detect and avoid sending requests to failed instances, in order to maximize availability.
  • Fall back to a different service or workflow where possible. For example, if writing to SQL Database fails, temporarily store data in blob storage. Provide a facility to replay the writes in blob storage to SQL Database when the service becomes available. In some cases, a failed operation may have an alternative action that allows the application to continue to work even when a component or service fails. If possible, detect failures and redirect requests to other services that can offer a suitable alternative functionality, or to backup or reduced-functionality instances that can maintain core operations while the primary service is offline.
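
The retry guidance above can be summarized in a short sketch: exponential backoff with random jitter, so that many clients do not retry in lockstep. TransientError and the wrapped operation are hypothetical stand-ins; a circuit breaker (see the circuit breaker bullet) would sit around calls like this and fail fast once repeated failures are observed.

    import random
    import time

    class TransientError(Exception):
        """Stands in for a transient fault such as a timeout or throttling."""

    def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                           max_delay=30.0):
        """Retry an operation, doubling the delay after each failure."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except TransientError:
                if attempt == max_attempts:
                    raise                   # exhausted: surface the failure
                delay = min(base_delay * 2 ** (attempt - 1), max_delay)
                time.sleep(delay * random.uniform(0.5, 1.5))  # add jitter

    # Usage: wrap the remote call in a zero-argument callable, for example
    # result = retry_with_backoff(lambda: client.get("/orders"))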

Monitoring and disaster recovery

  • Provide rich instrumentation for likely failures and failure events to report the situation to operations staff. For failures that are likely but have not yet occurred, provide sufficient data to enable operations staff to determine the cause, mitigate the situation, and ensure that the system remains available. For failures that have already occurred, the application should return an appropriate error message to the user but attempt to continue running, albeit with reduced functionality. In all cases, the monitoring system should capture comprehensive details to enable operations staff to effect a quick recovery, and if necessary, for designers and developers to modify the system to prevent the situation from arising again.
  • Monitor system health by implementing checking functions. The health and performance of an application can degrade over time, without being noticeable until it fails. Implement probes or check functions that are executed regularly from outside the application. These checks can be as simple as measuring response time for the application as a whole, for individual parts of the application, for individual services that the application uses, or for individual components. Check functions can execute processes to ensure they produce valid results, measure latency and check availability, and extract information from the system. A minimal external probe sketch appears after this list.
  • Regularly test all failover and fallback systems to ensure they are available and operate as expected. Changes to systems and operations may affect failover and fallback functions, but the impact may not be detected until the main system fails or becomes overloaded. Test these systems before they are required to compensate for a live problem at runtime.
  • Test the monitoring systems. Automated failover and fallback systems, and manual visualization of system health and performance by using dashboards, all depend on monitoring and instrumentation functioning correctly. If these elements fail, miss critical information, or report inaccurate data, an operator might not realize that the system is unhealthy or failing.
  • Track the progress of long-running workflows and retry on failure. Long-running workflows are often composed of multiple steps. Ensure that each step is independent and can be retried to minimize the chance that the entire workflow will need to be rolled back, or that multiple compensating transactions need to be executed. Monitor and manage the progress of long-running workflows by implementing a pattern such as Scheduler Agent Supervisor Pattern.
  • Plan for disaster recovery. Create an accepted, fully-tested plan for recovery from any type of failure that may affect system availability. Choose a multi-site disaster recovery architecture for any mission-critical applications. Identify a specific owner of the disaster recovery plan, including automation and testing. Ensure the plan is well-documented, and automate the process as much as possible. Establish a backup strategy for all reference and transactional data, and test the restoration of these backups regularly. Train operations staff to execute the plan, and perform regular disaster simulations to validate and improve the plan.
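
As a sketch of the external health probe described above, the following uses only the Python standard library. The endpoint URLs and the response-time threshold are hypothetical assumptions; a real probe would run on a schedule from a host outside the application.

    import time
    import urllib.request

    ENDPOINTS = [
        "https://example.com/",           # the application as a whole
        "https://example.com/health",     # a dedicated health-check route
    ]

    def probe(url, timeout=5):
        """Measure response status and latency for one endpoint."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status, time.monotonic() - start
        except Exception:
            return None, time.monotonic() - start

    for url in ENDPOINTS:
        status, elapsed = probe(url)
        healthy = status == 200 and elapsed < 2.0   # hypothetical threshold
        print(f"{url}: status={status} time={elapsed:.2f}s healthy={healthy}")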