程序员的20个常见瓶颈

在扩展性的艺术一书中,Russell给出了20个有意思的估计:大约有20个经典瓶颈。
Russell说,如果在他年轻时他就知道这些瓶颈该有多好!这些论断包括:


* Database (数据库)

  1. 数据规模超出了最大内存限制
  2. 大查询和小查询
  3. 写写冲突
  4. 大表join超占内存


* Virtualization (虚拟化)
  1. 共享磁盘,抢磁道,磁头各种抖
  2. 网络IO波动


* programming(编程)
  1. 线程:死锁、相对于事件驱动来说过于重量级、调试、线程数与性能比非线性
  2. 事件驱动编程:回调的复杂性、函数调用中如何保存状态(how-to-store-state-in-function-calls)
  3. 缺少profile工具、缺少trace工具、缺少日志工具
  4. 单点故障、横向不可扩展
  5. 有状态的应用
  6. 搓设计:一台机器上能跑,几个用户也能跑,几个月后,几年后,尼玛,发现扛不住了,整个架构需要重写。
  7. 算法复杂度
  8. 依赖于诸如DNS查找等比较搞人的外部组件(Dependent services like DNS lookups and whatever else you may block on.)
  9. 栈空间


* Disk (磁盘)
  1. 本地磁盘访问
  2. 随机磁盘IO
  3. 磁盘碎片
  4. 当写入的数据块大于SSD块大小时SSD性能下降


* OS (操作系统)
  1. Fsync flushing,Linux缓冲区耗尽(linux buffer cache filling up)
  2. TCP缓冲区过小
  3. 文件描述符数限制
  4. 电源管理(Power budget)


* Caching (缓存)
  1. 不使用memcached
  2. HTTP中,header,etags,不压缩(headers, etags, not gzipping)
  3. 没有充分使用浏览器缓存功能
  4. 字节码缓存(如PHP)
  5. L1/L2缓存. 这是个很大的瓶颈. 把频繁使用的数据保持在L1/L2中. 设计到的方面很多:网络数据压缩后再发送,基于列压缩的DB中不解压直接计算等等。有TLB友好的算法。最重要的是牢固掌握以下基础知识:多核CPU、L1/L2,共享L3,NUMA内存,CPU、内存之间的数据传输带宽延迟,磁盘页缓存,脏页,TCP从CPU到DRAM到网卡的流程。


* CPU
  1. CPU负载
  2. 上下文切换。一个核上线程数过多,linux调度器对应用不友好,系统调用过多
  3. IO等待->所有CPU都等起
  4. CPU缓存。(Caching data is a fine grained process (In Java think volatile for instance), in order to find the right balance between having multiple instances with different values for data and heavy synchronization to keep the cached data consistent.)
  5. 背板总线的吞吐能力


* Network (网络)
  1.  网卡的最大输出带宽,IRQ达到饱和状态,软件中断占用了100%的CPU
  2. DNS查找
  3. 丢包
  4. 网络路由瞎指挥
  5. 网络磁盘访问
  6. 共享SAN(Storage Area Network)
  7  服务器失败 -> 服务器无响应


* Process (过程)
  1. 测试时间
  2. 开发时间
  3. 团队规模
  4. 预算
  5. 码债(不良代码带来的维护成本)


* Memory (内存)
  1. 内存耗尽 -> 杀进程,swap
  2. 内存不足导致的磁盘抖动
  3. 内存库的开销
  4. 内存碎片(Java中需要GC的停顿,C中无解)


上面用英文表达的,是我还不太理解的,望诸位赐教。


========================================原文===========================

In Zen And The Art Of Scaling - A Koan And Epigram ApproachRussell Sullivan offered an interesting conjecture: there are 20 classic bottlenecks. This sounds suspiciously like the idea that there only 20 basic story plots. And depending on how you chunkify things, it may be true, but in practice we all know bottlenecks come in infinite flavors, all tasting of sour and ash.

One day Aurelien Broszniowski from Terracotta emailed me his list of bottlenecks, we cc’ed Russell in on the conversation, he gave me his list, I have a list, and here’s the resulting stone soup. 

Russell said this is his “I wish I knew when I was younger" list and I think that’s an enriching way to look at it. The more experience you have, the more different types of projects you tackle, the more lessons you’ll be able add to a list like this. So when you read this list, and when you make your own, you are stepping through years of accumulated experience and more than a little frustration, but in each there is a story worth grokking.

  • Database:
    • Working size exceeds available RAM
    • Long & short running queries
    • Write-write conflicts
    • Large joins taking up memory
  • Virtualisation:
    • Sharing a HDD, disk seek death
    • Network I/O fluctuations in the cloud
  • Programming:
    • Threads: deadlocks, heavyweight as compared to events, debugging, non-linear scalability, etc...
    • Event driven programming: callback complexity, how-to-store-state-in-function-calls, etc...
    • Lack of profiling, lack of tracing, lack of logging
    • One piece can't scale, SPOF, non horizontally scalable, etc...
    • Stateful apps
    • Bad design : The developers create an app which runs fine on their computer. The app goes into production, and runs fine, with a couple of users. Months/Years later, the application can't run with thousands of users and needs to be totally re-architectured and rewritten.
    • Algorithm complexity
    • Dependent services like DNS lookups and whatever else you may block on.
    • Stack space
  • Disk:
    • Local disk access
    • Random disk I/O -> disk seeks
    • Disk fragmentation
    • SSDs performance drop once  data written is greater than SSD size
  • OS:
    • Fsync flushing, linux buffer cache filling up
    • TCP buffers too small
    • File descriptor limits
    • Power budget
  • Caching:
    • Not using memcached (database pummeling)
    • In HTTP: headers, etags, not gzipping, etc..
    • Not utilising the browser's cache enough
    • Byte code caches (e.g. PHP)
    • L1/L2 caches. This is a huge bottleneck. Keep important hot/data in L1/L2. This spans so much: snappy for network I/O, column DBs run algorithms directly on compressed data, etc. Then there are techniques to not destroy your TLB. The most important idea is to have a firm grasp on computer architecture in terms of CPUs multi-core, L1/L2, shared L3, NUMA RAM, data transfer bandwidth/latency from DRAM to chip, DRAM caches DiskPages, DirtyPages, TCP packets travel thru CPU<->DRAM<->NIC.
  • CPU:
    • CPU overload
    • Context switches -> too many threads on a core, bad luck w/ the linux scheduler, too many system calls, etc...
    • IO waits -> all CPUs wait at the same speed
    • CPU Caches: Caching data is a fine grained process (In Java think volatile for instance), in order to find the right balance between having multiple instances with different values for data and heavy synchronization to keep the cached data consistent.
    • Backplane throughput
  • Network:
    • NIC maxed out, IRQ saturation, soft interrupts taking up 100% CPU
    • DNS lookups
    • Dropped packets
    • Unexpected routes with in the network
    • Network disk access
    • Shared SANs
    • Server failure -> no answer anymore from the server
  • Process:
    • Testing time
    • Development time
    • Team size
    • Budget
    • Code debt
  • Memory:
    • Out of memory -> kills process, go into swap & grind to a halt
    • Out of memory causing Disk Thrashing (related to swap)
    • Memory library overhead
    • Memory fragmentation
      • In Java requires GC pauses
      • In C, malloc's start taking forever


If you have any more to add or you have suggested fixes, please jump in.

Thanks to Aurelien and Russell for all their applied brain power.


©️2020 CSDN 皮肤主题: 数字20 设计师:CSDN官方博客 返回首页