- Asynchronous - Use asynchronous communication when possible. Synchronous calls tie the availability of the two services together: if one fails or is slow, the other is affected.
- Swim Lanes – Create fault-isolated “swim lanes” of hardware by customer segmentation. This prevents problems with one customer from causing issues across all customers. It also helps with diagnosing issues and with code rollouts.
- Cache - Use caches at multiple layers, including object caches in front of databases (such as memcached), page or item caches for content (such as Squid), and edge caches (such as Akamai).
- Monitoring - Understand your application’s performance from a customer’s perspective. Monitor from outside your network and have tests that simulate a real user’s experience. Also monitor the application’s internals: query and transaction execution counts and timings.
- Replication - Replicate databases for recovery as well as to offload reads to multiple instances.
- Sharding - Split the application and databases by service and/or by customer, for example by taking a customer ID modulo the number of shards. While this requires slightly more complicated logic in the application, it allows for massive scaling.
- Use Few RDBMS Features – Use the OLTP database as a persistent storage device as much as possible. The more you rely on RDBMS features for your transactions, the greater the load you put on the hardest component in your system to scale. Remove all business logic, such as stored procedures, from the database and move it into the application. When significant scaling is required, perform joins in the application rather than in SQL.
- Slow Roll – Roll out new code versions slowly, to a small subset of your servers at a time, without bringing the entire site down. This requires that all code be backwards compatible, because two versions of the code will be running in production during the rollout. This method lets you find problems that your quality and load & performance (L&P) testing missed, with minimal impact on customers.
- Load & Performance Testing – Test the performance of each application version before it goes into production. This will not catch every issue, which is why you need the ability to roll back, but it is very worthwhile.
- Capacity Planning / Scalability Summits – Know how much capacity you have on all tiers and services in your system. Use Scalability Summits to plan for the increased capacity demands.
- Rollback – Always have the ability to roll back a code release.
- Root Cause Analysis - Build a learning culture, evidenced by the use of Root Cause Analysis to find and fix the real cause of issues.
- Quality From The Beginning – Quality can’t be tested into a product; it must be designed in from the beginning.
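
The sharding item in the list above mentions splitting by customer using a modulus. A minimal sketch of that idea, assuming integer customer IDs and a fixed shard count; `Shard` and `ShardRouter` are hypothetical names, and the in-memory dict merely stands in for a real database instance:

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Stand-in for one database instance (hypothetical, in-memory only)."""
    name: str
    rows: dict = field(default_factory=dict)


class ShardRouter:
    """Routes each customer to one shard using customer_id % shard_count."""

    def __init__(self, shard_count: int) -> None:
        if shard_count < 1:
            raise ValueError("shard_count must be at least 1")
        self.shards = [Shard(f"shard-{i}") for i in range(shard_count)]

    def shard_for(self, customer_id: int) -> Shard:
        # The modulus maps every customer ID to exactly one shard index.
        return self.shards[customer_id % len(self.shards)]

    def save(self, customer_id: int, record: dict) -> None:
        self.shard_for(customer_id).rows[customer_id] = record

    def load(self, customer_id: int):
        # Returns None when the customer has no record on its shard.
        return self.shard_for(customer_id).rows.get(customer_id)


router = ShardRouter(shard_count=4)
router.save(10, {"name": "example customer"})
```

One trade-off of a plain modulus: changing the shard count remaps most customers to a different shard, so existing data has to be migrated when shards are added.
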
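1. Use asynchronous communication wherever possible.
2. Introduce fault isolation for the hardware that serves different services.
3. Use caches at multiple layers of the system.
4. Monitor your system's performance from the user's perspective.
5. Use database replication to reduce the read load on any single instance.
6. Shard the application or database by user and by business function.
7. Avoid the complex features of the relational database and treat it as a persistent storage device as much as possible. Relational databases scale poorly, so move all logic into the application layer; scalability is solved in the application layer, not in SQL. In my opinion, a DSL like SQL is well suited to describing a problem declaratively, but scalability and performance are too technical for it.
8. Upgrade the system incrementally: upgrade a small subset of servers first, then gradually upgrade all of them.
9. Always run performance and load tests before an application goes into production.
10. Do capacity planning and prepare a scale-out plan when designing the system.
11. Make sure the system can be rolled back.
12. Make sure the team is capable of root cause analysis, so that when a problem occurs it can be located quickly and easily.
13. Quality is designed in, not tested in.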
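
The advice to join in the application rather than in SQL can be illustrated with a small hash join. This is only a sketch under assumptions: the `customers` and `orders` lists stand in for the results of two separate single-table queries, and the function name and field names are hypothetical.

```python
from collections import defaultdict


def join_in_application(customers, orders):
    """Hash join: attach each customer's orders without a SQL JOIN."""
    # Build an index of orders keyed by customer ID in one pass.
    orders_by_customer = defaultdict(list)
    for order in orders:
        orders_by_customer[order["customer_id"]].append(order)
    # Look up each customer's orders; customers with none get an empty list.
    return [
        {**customer, "orders": orders_by_customer.get(customer["id"], [])}
        for customer in customers
    ]


# Rows as they might come back from two simple single-table queries.
customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]
orders = [
    {"id": 100, "customer_id": 1, "total": 30},
    {"id": 101, "customer_id": 1, "total": 12},
]
joined = join_in_application(customers, orders)
```

The database only serves two simple reads here, and the join work moves to application servers, which are usually easier to add than database capacity.
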