Application Design Standards

Published 2017-10-13 | Updated 2023-08-04 | Category: 樱桃沟

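From an operations point of view, I think these standards go a long way toward improving system stability, though they demand a very high level of quality from the development team. Much of this is written with Microsoft's own Azure in mind, but it works as a general reference, since the underlying ideas are the same everywhere.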

Application design

Avoid any single point of failure.

All components, services, resources, and compute instances should be deployed as multiple instances to prevent a single point of failure from affecting availability. This includes authentication mechanisms. Design the application to be configurable to use multiple instances, and to automatically detect failures and redirect requests to non-failed instances where the platform does not do this automatically.

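Avoid any single point of failure. Every component, service, resource, and compute instance must be deployed as multiple instances so that one failure point cannot affect availability, and this includes authentication and authorization mechanisms. Design the application so it can be configured to use multiple instances and, where the platform does not do so automatically, so it can detect failures itself and redirect requests to healthy instances.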
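The redirect-on-failure behaviour described above can be sketched in a few lines. This is a minimal illustration, not an Azure API: the instance callables and the choice of `ConnectionError` as the "instance is down" signal are assumptions made for the example.

```python
from typing import Callable, Sequence


class AllInstancesFailed(Exception):
    """Raised when no instance could serve the request."""


def call_with_failover(instances: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each instance in order and return the first successful response."""
    failures = 0
    for instance in instances:
        try:
            return instance(request)
        except ConnectionError:
            # Assumption: a ConnectionError means this instance is down.
            failures += 1
    raise AllInstancesFailed(f"{failures} instance(s) failed")


# Hypothetical stand-ins for two deployed instances of the same service.
def broken_instance(request: str) -> str:
    raise ConnectionError("instance unreachable")


def healthy_instance(request: str) -> str:
    return f"handled {request}"


result = call_with_failover([broken_instance, healthy_instance], "GET /orders")
```

A real client would usually add health checks and timeouts so that a known-bad instance is skipped rather than retried on every request.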

Decompose workload per different service-level agreement.

If a service is composed of critical and less-critical workloads, manage them differently and specify the service features and number of instances to meet their availability requirements.

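Decompose workloads by service-level agreement (SLA). If a service combines critical and less-critical workloads, manage them separately, and specify the service features and instance counts each one needs to meet its availability requirement.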

Minimize and understand service dependencies.

Minimize the number of different services used where possible, and ensure you understand all of the feature and service dependencies that exist in the system. This includes the nature of these dependencies, and the impact of failure or reduced performance in each one on the overall application. Microsoft guarantees at least 99.9 percent availability for most services, but this means that every additional service an application relies on potentially reduces the overall availability SLA of your system by 0.1 percent.

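Minimize and understand service dependencies. Use as few distinct services as possible, and make sure you understand every feature and service dependency in the system: the nature of each dependency, and how a failure or slowdown in it affects the application as a whole. Microsoft guarantees at least 99.9% availability for most services, which means each additional service the application depends on can lower the overall availability SLA of your system by 0.1%.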
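The arithmetic behind the 0.1 percent figure can be made concrete. The sketch below assumes that every dependency is required for a request to succeed and that failures are independent, so the availabilities multiply; real systems may behave better or worse than this simple model.

```python
def composite_availability(availabilities):
    """Availability of a system that needs ALL of its dependencies at once."""
    total = 1.0
    for availability in availabilities:
        total *= availability
    return total


def expected_downtime_hours(availability, hours_in_period=30 * 24):
    """Expected unavailable hours over a period (default: a 30-day month)."""
    return (1.0 - availability) * hours_in_period


one_service = composite_availability([0.999])
three_services = composite_availability([0.999] * 3)  # roughly 99.7 percent

print(f"1 dependency:   {one_service:.4%}, "
      f"{expected_downtime_hours(one_service):.2f} h/month")
print(f"3 dependencies: {three_services:.4%}, "
      f"{expected_downtime_hours(three_services):.2f} h/month")
```

Under these assumptions each extra 99.9 percent dependency costs a little under 0.1 percentage points, which is where the rule of thumb comes from.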

Design tasks and messages to be idempotent (safely repeatable) where possible, so that duplicated requests will not cause problems.

For example, a service can act as a consumer that handles messages sent as requests by other parts of the system that act as producers. If the consumer fails after processing the message, but before acknowledging that it has been processed, a producer might submit a repeat request which could be handled by another instance of the consumer. For this reason, consumers and the operations they carry out should be idempotent so that repeating a previously executed operation does not render the results invalid. This may mean detecting duplicated messages, or ensuring consistency by using an optimistic approach to handling conflicts.

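Design tasks and messages to be idempotent (safely repeatable) wherever possible, so that duplicate requests cause no problems. For example, a service may act as a consumer that handles messages sent as requests by other parts of the system acting as producers. If the consumer fails after processing a message but before acknowledging it, the producer may resubmit the request, and another consumer instance may handle it. Consumers and the operations they perform should therefore be idempotent, so that repeating an already executed operation does not invalidate the result. This can mean detecting duplicate messages, or ensuring consistency by handling conflicts optimistically.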
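One common way to get idempotent handling is to detect duplicates by message ID. The following is a minimal in-memory sketch; the `Message` fields and the `IdempotentConsumer` class are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str  # unique per logical request, reused on redelivery
    account: str
    amount: int


@dataclass
class IdempotentConsumer:
    balances: dict = field(default_factory=dict)
    processed_ids: set = field(default_factory=set)

    def handle(self, message: Message) -> bool:
        """Apply the message once; return False if it was a duplicate."""
        if message.message_id in self.processed_ids:
            return False  # already applied, repeating would double-credit
        current = self.balances.get(message.account, 0)
        self.balances[message.account] = current + message.amount
        self.processed_ids.add(message.message_id)
        return True


consumer = IdempotentConsumer()
deposit = Message(message_id="m-1", account="alice", amount=100)
first = consumer.handle(deposit)
second = consumer.handle(deposit)  # simulated redelivery of the same message
```

In a real deployment the set of processed IDs would have to live in a store shared by all consumer instances, and be committed atomically with the state change; the in-memory set here only illustrates the idea.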

Use a message broker that implements high availability for critical transactions.

Many scenarios for initiating tasks or accessing remote services use messaging to pass instructions between the application and the target service. For best performance, the application should be able to send the message and then return to process more requests, without needing to wait for a reply. To guarantee delivery of messages, the messaging system should provide high availability. Azure Service Bus message queues implement at least once semantics. This means that each message posted to a queue will not be lost, although duplicate copies may be delivered under certain circumstances. If message processing is idempotent (see the previous item), repeated delivery should not be a problem.

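Use a highly available message broker for critical transactions. Many scenarios that start tasks or call remote services use messaging to pass instructions between the application and the target service. For best performance, the application should send the message and return to handling other requests without waiting for a reply. To guarantee delivery, the messaging system must itself be highly available. Azure Service Bus queues implement at-least-once semantics: a message posted to a queue is not lost, although duplicates may be delivered in some circumstances. If message processing is idempotent (see the previous item), repeated delivery should not be a problem.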

Design applications to gracefully degrade when reaching resource limits, and take appropriate action to minimize the impact for the user.

In some cases, the load on the application may exceed the capacity of one or more parts, causing reduced availability and failed connections. Scaling can help to alleviate this, but it may reach a limit imposed by other factors, such as resource availability or cost. Design the application so that, in this situation, it can automatically degrade gracefully. For example, in an ecommerce system, if the order-processing subsystem is under strain (or has even failed completely), it can be temporarily disabled while allowing other functionality (such as browsing the product catalog) to continue. It might be appropriate to postpone requests to a failing subsystem, for example still enabling customers to submit orders but saving them for later processing, when the orders subsystem is available again.

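Design the application to degrade gracefully when it reaches resource limits, and take appropriate action to minimize the impact on users. Sometimes the load exceeds the capacity of one or more parts of the application, which reduces availability and causes failed connections. Scaling out can relieve this, but it may hit limits set by other factors such as resource availability or cost. Design the application to degrade gracefully and automatically in that situation. In an e-commerce system, for example, if the order-processing subsystem is under strain (or has failed completely), it can be disabled temporarily while other functionality, such as browsing the product catalog, keeps working. It may also make sense to defer requests to the failing subsystem: customers can still submit orders, which are saved and processed once the order subsystem is available again.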
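The e-commerce example can be sketched as a storefront that keeps accepting orders while the order subsystem is down and replays them later. All class and method names here are illustrative assumptions, and the deferred orders are held in memory only to keep the example short.

```python
from collections import deque


class OrderService:
    """Hypothetical order-processing subsystem that can become unavailable."""

    def __init__(self):
        self.available = True
        self.processed = []

    def process(self, order: str) -> None:
        if not self.available:
            raise RuntimeError("order subsystem unavailable")
        self.processed.append(order)


class Storefront:
    def __init__(self, orders: OrderService):
        self.orders = orders
        self.deferred = deque()  # orders saved for later processing

    def submit_order(self, order: str) -> str:
        try:
            self.orders.process(order)
            return "processed"
        except RuntimeError:
            # Degrade instead of failing: accept the order and defer it.
            self.deferred.append(order)
            return "accepted for later processing"

    def replay_deferred(self) -> int:
        """Process saved orders once the subsystem is available again."""
        replayed = 0
        while self.deferred and self.orders.available:
            self.orders.process(self.deferred.popleft())
            replayed += 1
        return replayed


service = OrderService()
shop = Storefront(service)
service.available = False
status = shop.submit_order("order-1")
service.available = True
replayed = shop.replay_deferred()
```

A production version would persist deferred orders durably (for example in a queue), since an in-memory deque is lost if the storefront itself restarts.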

Gracefully handle rapid burst events.

Most applications need to handle varying workloads over time, such as peaks first thing in the morning in a business application or when a new product is released in an ecommerce site. Auto-scaling can help to handle the load, but it may take some time for additional instances to come online and handle requests. Prevent sudden and unexpected bursts of activity from overwhelming the application: design it to queue requests to the services it uses and degrade gracefully when queues are near to full capacity. Ensure that there is sufficient performance and capacity available under non-burst conditions to drain the queues and handle outstanding requests. For more information, see the Queue-Based Load Leveling Pattern.

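Handle rapid bursts gracefully. Most applications face workloads that vary over time, such as the early-morning peak in a business application or a new product launch on an e-commerce site. Autoscaling helps absorb the load, but additional instances can take some time to come online and start handling requests. To keep sudden, unexpected bursts from overwhelming the application, design it to queue requests to the services it uses and to degrade gracefully when the queues are nearly full. Make sure there is enough performance and capacity under non-burst conditions to drain the queues and handle outstanding requests. For more information, see the Queue-Based Load Leveling pattern.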
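A small sketch of queue-based load leveling with graceful degradation near capacity follows. The `LoadLeveler` class, its `shed_at` threshold, and the split between essential and non-essential requests are assumptions made for illustration, not part of the pattern's definition.

```python
import queue


class LoadLeveler:
    """Buffer requests in a bounded queue and shed optional work when nearly full."""

    def __init__(self, capacity: int, shed_at: int):
        self._queue = queue.Queue(maxsize=capacity)
        self._shed_at = shed_at  # queue depth at which optional work is refused

    def submit(self, request: str, essential: bool = True) -> bool:
        """Return True if the request was queued, False if it was rejected."""
        if not essential and self._queue.qsize() >= self._shed_at:
            return False  # degrade: drop non-essential work first
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            return False  # hard limit reached, caller should retry later

    def drain(self, handler, max_items: int) -> int:
        """Let the backend consume at its own pace, up to max_items."""
        handled = 0
        while handled < max_items:
            try:
                request = self._queue.get_nowait()
            except queue.Empty:
                break
            handler(request)
            handled += 1
        return handled


leveler = LoadLeveler(capacity=5, shed_at=4)
accepted = [leveler.submit(f"req-{i}") for i in range(4)]
optional_accepted = leveler.submit("recommendations", essential=False)
handled_requests = []
drained = leveler.drain(handled_requests.append, max_items=10)
```

The producer side returns immediately in every case, which is what lets the front end stay responsive while the backend catches up.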

Deployment and maintenance

Deploy multiple instances of roles for each service.

Microsoft makes availability guarantees for services that you create and deploy, but these guarantees are only valid if you deploy at least two instances of each role in the service. This enables one role to be unavailable while the other remains active. This is especially important if you need to deploy updates to a live system without interrupting clients’ activities; instances can be taken down and upgraded individually while the others continue online.

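Deploy multiple instances of each role in every service. Microsoft makes availability guarantees for the services you create and deploy, but they hold only if you deploy at least two instances of each role in the service, so that one instance can be unavailable while the other stays active. This matters most when you need to update a live system without interrupting clients: instances can be taken offline and upgraded one at a time while the others remain online.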

Host applications in multiple datacenters.

Although extremely unlikely, it is possible for an entire datacenter to go offline through an event such as a natural disaster or Internet failure. Vital business applications should be hosted in more than one datacenter to provide maximum availability. This can also reduce latency for local users, and provide additional opportunities for flexibility when updating applications.

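Host applications in multiple datacenters. Although it is very unlikely, an entire datacenter can go offline because of an event such as a natural disaster or an Internet failure. Vital business applications should be hosted in more than one datacenter for maximum availability. Doing so can also reduce latency for local users and offers extra flexibility when updating applications.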

Automate and test deployment and maintenance tasks.

Distributed applications consist of multiple parts that must work together. Deployment should therefore be automated, using tested and proven mechanisms such as scripts and deployment applications. These can update and validate configuration, and automate the deployment process. Automated techniques should also be used to perform updates of all or parts of applications. It is vital to test all of these processes fully to ensure that errors do not cause additional downtime. All deployment tools must have suitable security restrictions to protect the deployed application; define and enforce deployment policies carefully and minimize the need for human intervention.

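Automate and test deployment and maintenance tasks. A distributed application consists of multiple parts that must work together, so deployment should be automated using tested and proven mechanisms such as scripts and deployment applications. These can update and validate configuration and automate the deployment process. Automation should also be used to update all or part of an application. It is essential to test all of these processes thoroughly so that errors do not cause additional downtime. Every deployment tool must have suitable security restrictions to protect the deployed application; define and enforce deployment policies carefully, and keep the need for human intervention to a minimum.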

Consider using staging and production features of the platform where these are available.

For example, using Azure Cloud Services staging and production environments allows applications to be switched from one to another instantly through a virtual IP address swap (VIP Swap). However, if you prefer to stage on-premises, or deploy different versions of the application concurrently and gradually migrate users, you may not be able to use a VIP Swap operation.

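Consider using the platform's staging and production features where they are available. For example, the staging and production environments in Azure Cloud Services let you switch an application from one to the other instantly through a virtual IP address swap (VIP Swap). However, if you prefer to stage on-premises, or to deploy different versions of the application side by side and migrate users gradually, you may not be able to use a VIP Swap.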

Apply configuration changes without recycling the instance when possible.

In many cases, the configuration settings for an Azure application or service can be changed without requiring the role to be restarted. Roles expose events that can be handled to detect configuration changes and apply them to components within the application. However, some changes to the core platform settings do require a role to be restarted. When building components and services, maximize availability and minimize downtime by designing them to accept changes to configuration settings without requiring the application as a whole to be restarted.

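Apply configuration changes without recycling the instance when possible. In many cases the configuration settings of an Azure application or service can be changed without restarting the role. Roles expose events that can be handled to detect configuration changes and apply them to components within the application. Some changes to core platform settings, however, do require a role restart. When building components and services, maximize availability and minimize downtime by designing them to accept configuration changes without restarting the whole application.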
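The idea of components reacting to configuration-change events can be sketched in plain Python. This is not the Azure role API; `ReloadableSettings` and its listener mechanism are hypothetical and only show the shape of a design that applies changes without a restart.

```python
import threading


class ReloadableSettings:
    """Hold settings that can change at runtime and notify interested components."""

    def __init__(self, initial: dict):
        self._lock = threading.Lock()
        self._values = dict(initial)
        self._listeners = []

    def on_change(self, listener) -> None:
        """Register a callback invoked as listener(key, new_value)."""
        self._listeners.append(listener)

    def get(self, key: str):
        with self._lock:
            return self._values[key]

    def apply(self, new_values: dict) -> dict:
        """Apply new settings in place and report only what actually changed."""
        with self._lock:
            changed = {k: v for k, v in new_values.items()
                       if self._values.get(k) != v}
            self._values.update(changed)
        # Notify outside the lock so listeners may call get() safely.
        for listener in self._listeners:
            for key, value in changed.items():
                listener(key, value)
        return changed


settings = ReloadableSettings({"log_level": "INFO", "pool_size": 10})
seen = []
settings.on_change(lambda key, value: seen.append((key, value)))
changed = settings.apply({"log_level": "DEBUG", "pool_size": 10})
```

A component such as a connection pool would register a listener and resize itself when `pool_size` changes, rather than reading the value once at startup.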

Use upgrade domains for zero downtime during updates.

Azure compute units such as web and worker roles are allocated to upgrade domains. Upgrade domains group role instances together so that, when a rolling update takes place, each role in the upgrade domain is stopped, updated, and restarted in turn. This minimizes the impact on application availability. You can specify how many upgrade domains should be created for a service when the service is deployed.

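Use upgrade domains to achieve zero downtime during updates. Azure compute units such as web and worker roles are allocated to upgrade domains. An upgrade domain groups role instances together so that, during a rolling update, the roles in each upgrade domain are stopped, updated, and restarted in turn, which minimizes the impact on application availability. You can specify how many upgrade domains to create for a service when you deploy it.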
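A toy simulation shows why upgrade domains keep part of the service online during a rolling update. The round-robin assignment of instances to domains is an assumption of this sketch; the platform decides the actual placement.

```python
def assign_upgrade_domains(instances, domain_count):
    """Spread instances over upgrade domains (round-robin, an assumption here)."""
    domains = [[] for _ in range(domain_count)]
    for index, name in enumerate(instances):
        domains[index % domain_count].append(name)
    return domains


def rolling_update(domains, versions, new_version):
    """Update one domain at a time; return the fewest instances online at any point."""
    total = sum(len(domain) for domain in domains)
    min_online = total
    for domain in domains:
        # Only the instances in the current domain are stopped.
        min_online = min(min_online, total - len(domain))
        for name in domain:
            versions[name] = new_version  # stop, update, restart
    return min_online


instances = ["web-0", "web-1", "web-2", "web-3"]
versions = {name: "v1" for name in instances}
domains = assign_upgrade_domains(instances, domain_count=2)
min_online = rolling_update(domains, versions, "v2")
```

With a single upgrade domain the same function returns 0, meaning every instance is offline at once; more domains trade a longer update for a smaller capacity dip.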

Note
Roles are also distributed across fault domains, each of which is reasonably independent from other fault domains in terms of server rack, power, and cooling provision, in order to minimize the chance of a failure affecting all role instances. This distribution occurs automatically, and you cannot control it.

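Note
Roles are also distributed across fault domains. Each fault domain is reasonably independent of the others in terms of server rack, power, and cooling, which minimizes the chance that a single failure affects every role instance. The distribution happens automatically and you cannot control it.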

Configure availability sets for Azure virtual machines.

Placing two or more virtual machines in the same availability set guarantees that these virtual machines will not be deployed to the same fault domain. To maximize availability, you should create multiple instances of each critical virtual machine used by your system and place these instances in the same availability set. If you are running multiple virtual machines that serve different purposes, create an availability set for each virtual machine. Add instances of each virtual machine to each availability set. For example, if you have created separate virtual machines to act as a web server and a reporting server, create an availability set for the web server and another availability set for the reporting server. Add instances of the web server virtual machine to the web server availability set, and add instances of the reporting server virtual machine to the reporting server availability set.

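Configure availability sets for Azure virtual machines. Placing two or more virtual machines in the same availability set ensures that they are not deployed to the same fault domain. To maximize availability, create multiple instances of each critical virtual machine your system uses and place those instances in the same availability set. If you run virtual machines that serve different purposes, create one availability set per purpose and add the instances of each virtual machine to its own set. For example, if you have separate virtual machines acting as a web server and a reporting server, create one availability set for the web server and another for the reporting server, then add the web server instances to the first and the reporting server instances to the second.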

Data management

Geo-replicate data in Azure Storage.

Data in Azure Storage is automatically replicated within a datacenter. For even higher availability, use read-access geo-redundant storage (RA-GRS), which replicates your data to a secondary region and provides read-only access to the data in the secondary location. The data is durable even in the case of a complete regional outage or a disaster. For more information, see Azure Storage replication.

Geo-replicate databases.

Azure SQL Database and Cosmos DB both support geo-replication, which enables you to configure secondary database replicas in other regions. Secondary databases are available for querying and for failover in the case of a data center outage or the inability to connect to the primary database. For more information, see "Failover groups and active geo-replication" (SQL Database) and "How to distribute data globally with Azure Cosmos DB?".

Use optimistic concurrency and eventual consistency where possible.

Transactions that block access to resources through locking (pessimistic concurrency) can cause poor performance and considerably reduce availability. These problems can become especially acute in distributed systems. In many cases, careful design and techniques such as partitioning can minimize the chances of conflicting updates occurring. Where data is replicated, or is read from a separately updated store, the data will only be eventually consistent. But the advantages usually far outweigh the impact on availability of using transactions to ensure immediate consistency.
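Optimistic concurrency is often implemented with a version number that is checked at write time instead of a lock held during the whole operation. The sketch below uses a hypothetical in-memory `VersionedStore`; real stores typically expose the same idea through ETags or row versions.

```python
class VersionConflict(Exception):
    """Raised when the stored version no longer matches the one that was read."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing key has version 0 and value None."""
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Write only if nobody else has written since expected_version was read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(f"{key}: expected {expected_version}, "
                                  f"found {current_version}")
        self._data[key] = (current_version + 1, value)
        return current_version + 1


def update_with_retry(store, key, transform, max_attempts=3):
    """Re-read and retry on conflict instead of locking the record."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        try:
            return store.write(key, transform(value), version)
        except VersionConflict:
            continue
    raise VersionConflict(f"{key}: gave up after {max_attempts} attempts")


store = VersionedStore()
stale_version, _ = store.read("stock")         # two clients read version 0
store.write("stock", 10, expected_version=0)   # first writer wins
try:
    store.write("stock", 7, expected_version=stale_version)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
new_version = update_with_retry(store, "stock", lambda v: v - 1)
```

No reader or writer is ever blocked here; the cost is that a losing writer must redo its work, which is cheap when conflicts are rare.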

Use periodic backup and point-in-time restore, and ensure it meets the Recovery Point Objective (RPO).

Regularly and automatically back up data that is not preserved elsewhere, and verify you can reliably restore both the data and the application itself should a failure occur. Data replication is not a backup feature because errors and inconsistencies introduced through failure, error, or malicious operations will be replicated across all stores. The backup process must be secure to protect the data in transit and in storage. Databases or parts of a data store can usually be recovered to a previous point in time by using transaction logs. Microsoft Azure provides a backup facility for data stored in Azure SQL Database. The data is exported to a backup package on Azure blob storage, and can be downloaded to a secure on-premises location for storage.
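Whether a backup schedule meets the RPO can be checked mechanically: the worst-case data loss is the longest gap between consecutive backups, or between the last backup and now. The function names and timestamps below are illustrative assumptions.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times, now):
    """Longest window in which a failure would lose un-backed-up data."""
    if not backup_times:
        raise ValueError("no backups recorded")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times, now, rpo):
    """True if no gap between backups (or since the last one) exceeds the RPO."""
    if not backup_times:
        return False
    return worst_case_data_loss(backup_times, now) <= rpo


day = datetime(2023, 8, 4)
backups = [day, day + timedelta(hours=6), day + timedelta(hours=12)]
now = day + timedelta(hours=15)

ok_six_hours = meets_rpo(backups, now, rpo=timedelta(hours=6))
ok_four_hours = meets_rpo(backups, now, rpo=timedelta(hours=4))
```

This only checks the schedule; it says nothing about whether the backups can actually be restored, which still needs a periodic restore test.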

Enable the high availability option to maintain a secondary copy of an Azure Redis cache.

When using Azure Redis Cache, choose the standard option to maintain a secondary copy of the contents. For more information, see Create a cache in Azure Redis Cache.