State Threads for Internet Applications


State Threads is an application library which provides a foundation for writing fast and highly scalable Internet Applications on UNIX-like platforms. It combines the simplicity of the multithreaded programming paradigm, in which one thread supports each simultaneous connection, with the performance and scalability of an event-driven state machine architecture.

1. Definitions

1.1 Internet Applications

An Internet Application (IA) is either a server or client network application that accepts connections from clients and may or may not connect to servers. In an IA the arrival or departure of network data often controls processing (that is, IA is a data-driven application). For each connection, an IA does some finite amount of work involving data exchange with its peer, where its peer may be either a client or a server. The typical transaction steps of an IA are to accept a connection, read a request, do some finite and predictable amount of work to process the request, then write a response to the peer that sent the request. One example of an IA is a Web server; the most general example of an IA is a proxy server, because it both accepts connections from clients and connects to other servers.

We assume that the performance of an IA is constrained by available CPU cycles rather than network bandwidth or disk I/O (that is, CPU is a bottleneck resource).

1.2 Performance and Scalability

The performance of an IA is usually evaluated as its throughput measured in transactions per second or bytes per second (one can be converted to the other, given the average transaction size). There are several benchmarks that can be used to measure throughput of Web serving applications for specific workloads (such as SPECweb96, WebStone, WebBench). Although there is no common definition for scalability, in general it expresses the ability of an application to sustain its performance when some external condition changes. For IAs this external condition is either the number of clients (also known as "users," "simultaneous connections," or "load generators") or the underlying hardware system size (number of CPUs, memory size, and so on). Thus there are two types of scalability: load scalability and system scalability, respectively.

The figure below shows how the throughput of an idealized IA changes with the increasing number of clients (solid blue line). Initially the throughput grows linearly (the slope represents the maximal throughput that one client can provide). Within this initial range, the IA is underutilized and CPUs are partially idle. Further increase in the number of clients leads to a system saturation, and the throughput gradually stops growing as all CPUs become fully utilized. After that point, the throughput stays flat because there are no more CPU cycles available. In the real world, however, each simultaneous connection consumes some computational and memory resources, even when idle, and this overhead grows with the number of clients. Therefore, the throughput of the real world IA starts dropping after some point (dashed blue line in the figure below). The rate at which the throughput drops depends, among other things, on application design.

We say that an application has a good load scalability if it can sustain its throughput over a wide range of loads. Interestingly, the SPECweb99 benchmark somewhat reflects the Web server's load scalability because it measures the number of clients (load generators) given a mandatory minimal throughput per client (that is, it measures the server's capacity). This is unlike SPECweb96 and other benchmarks that use the throughput as their main metric (see the figure below).
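The difference between the two metrics can be made concrete with a small C sketch. The function names and the idea of a sampled throughput curve (one measurement per client count) are assumptions made for illustration; they are not taken from any benchmark's definition.

```c
#include <stddef.h>

/* Peak total throughput over the sampled curve (the SPECweb96-style metric).
 * total[i] is the measured total throughput with i + 1 clients. */
static double max_throughput(const double *total, size_t n)
{
    double best = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        if (total[i] > best)
            best = total[i];
    return best;
}

/* Capacity (the SPECweb99-style metric): the largest sampled client count
 * at which each client still receives at least min_per_client on average.
 * Returns 0 if no sampled load meets the requirement. */
static size_t capacity(const double *total, size_t n, double min_per_client)
{
    size_t best = 0, i;
    for (i = 0; i < n; i++) {
        double clients = (double)(i + 1);
        if (total[i] / clients >= min_per_client)
            best = i + 1;
    }
    return best;
}
```

Two servers with the same peak throughput can have very different capacities: the one whose curve drops off more slowly under load keeps more clients above the minimum.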

System scalability is the ability of an application to sustain its performance per hardware unit (such as a CPU) with the increasing number of these units. In other words, good system scalability means that doubling the number of processors will roughly double the application's throughput (dashed green line). We assume here that the underlying operating system also scales well. Good system scalability allows you to initially run an application on the smallest system possible, while retaining the ability to move that application to a larger system if necessary, without excessive effort or expense. That is, an application need not be rewritten or even undergo a major porting effort when changing system size.

Although scalability and performance are more important in the case of server IAs, they should also be considered for some client applications (such as benchmark load generators).

1.3 Concurrency

Concurrency reflects the parallelism in a system. The two unrelated types are virtual concurrency and real concurrency.

Virtual (or apparent) concurrency is the number of simultaneous connections that a system supports.

Real concurrency is the number of hardware devices, including CPUs, network cards, and disks, that actually allow a system to perform tasks in parallel.

An IA must provide virtual concurrency in order to serve many users simultaneously. To achieve maximum performance and scalability in doing so, the number of programming entities that an IA creates to be scheduled by the OS kernel should be kept close to (within an order of magnitude of) the real concurrency found on the system. These programming entities scheduled by the kernel are known as kernel execution vehicles. Examples of kernel execution vehicles include Solaris lightweight processes and IRIX kernel threads. In other words, the number of kernel execution vehicles should be dictated by the system size and not by the number of simultaneous connections.

2. Existing Architectures

There are a few different architectures that are commonly used by IAs. These include the Multi-Process, Multi-Threaded, and Event-Driven State Machine architectures.

2.1 Multi-Process Architecture

In the Multi-Process (MP) architecture, an individual process is dedicated to each simultaneous connection. A process performs all of a transaction's initialization steps and services a connection completely before moving on to service a new connection.
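A minimal sketch of the process-per-connection pattern, assuming a POSIX system. The names `echo_session`, `spawn_session`, `reap_sessions`, and `serve_forever` are hypothetical helpers invented for this illustration; real MP servers such as Apache 1.x are considerably more elaborate (for example, they pre-fork their worker processes).

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical session handler: echo bytes back until the peer stops sending. */
static void echo_session(int fd)
{
    char buf[512];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        if (write(fd, buf, (size_t)n) != n)
            break;
    }
}

/* Dedicate a fresh process to one connection. The child services the
 * connection completely and exits; the parent gets the child's pid
 * (or -1 on failure) and should close its own copy of conn_fd. */
static pid_t spawn_session(int conn_fd, void (*handler)(int))
{
    pid_t pid = fork();
    if (pid == 0) {
        handler(conn_fd);
        close(conn_fd);
        _exit(0);
    }
    return pid;
}

/* Reap children that have already finished, without blocking. */
static int reap_sessions(void)
{
    int reaped = 0;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped++;
    return reaped;
}

/* Accept loop: one kernel-scheduled process per simultaneous connection. */
static void serve_forever(int listen_fd)
{
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);
        if (conn < 0)
            break;
        spawn_session(conn, echo_session);
        close(conn);        /* the child keeps its own copy */
        reap_sessions();
    }
}
```

Every call to `spawn_session` adds one more kernel entity, which is why the number of processes tracks the number of connections in this design.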

User sessions in IAs are relatively independent; therefore, no synchronization between processes handling different connections is necessary. Because each process has its own private address space, this architecture is very robust. If a process serving one of the connections crashes, the other sessions will not be affected. However, to serve many concurrent connections, an equal number of processes must be employed. Because processes are kernel entities (and are in fact the heaviest ones), the number of kernel entities will be at least as large as the number of concurrent sessions. On most systems, good performance will not be achieved when more than a few hundred processes are created because of the high context-switching overhead. In other words, MP applications have poor load scalability.

On the other hand, MP applications have very good system scalability, because no resources are shared among different processes and there is no synchronization overhead.

The Apache Web Server 1.x ([Reference 1]) uses the MP architecture on UNIX systems.

2.2 Multi-Threaded Architecture

In the Multi-Threaded (MT) architecture, multiple independent threads of control are employed within a single shared address space. Like a process in the MP architecture, each thread performs all of a transaction's initialization steps and services a connection completely before moving on to service a new connection.

Many modern UNIX operating systems implement a many-to-few model when mapping user-level threads to kernel entities. In this model, an arbitrarily large number of user-level threads is multiplexed onto a lesser number of kernel execution vehicles. Kernel execution vehicles are also known as virtual processors. Whenever a user-level thread makes a blocking system call, the kernel execution vehicle it is using will become blocked in the kernel. If there are no other non-blocked kernel execution vehicles and there are other runnable user-level threads, a new kernel execution vehicle will be created automatically. This prevents the application from blocking when it can continue to make useful forward progress.

Because IAs are by nature network I/O driven, all concurrent sessions block on network I/O at various points. As a result, the number of virtual processors created in the kernel grows close to the number of user-level threads (or simultaneous connections). When this occurs, the many-to-few model effectively degenerates to a one-to-one model. Again, as in the MP architecture, the number of kernel execution vehicles is dictated by the number of simultaneous connections rather than by the number of CPUs. This reduces an application's load scalability. However, because kernel threads (lightweight processes) use fewer resources than traditional UNIX processes, an MT application should scale better with load than an MP application.

Unexpectedly, the small number of virtual processors sharing the same address space in the MT architecture destroys an application's system scalability because of contention among the threads on various locks. Even if an application itself is carefully optimized to avoid lock contention around its own global data (a non-trivial task), there are still standard library functions and system calls that use common resources hidden from the application. For example, on many platforms thread safety of memory allocation routines (malloc(3), free(3), and so on) is achieved by using a single global lock. Another example is a per-process file descriptor table. This common resource table is shared by all kernel execution vehicles within the same process and must be protected when one modifies it via certain system calls (such as open(2), close(2), and so on). In addition to that, maintaining the caches coherent among CPUs on multiprocessor systems hurts performance when different threads running on different CPUs modify data items on the same cache line.

In order to improve load scalability, some applications employ a different type of MT architecture: they create one or more thread(s) per task rather than one thread per connection. For example, one small group of threads may be responsible for accepting client connections, another for request processing, and yet another for serving responses. The main advantage of this architecture is that it eliminates the tight coupling between the number of threads and number of simultaneous connections. However, in this architecture, different task-specific thread groups must share common work queues that must be protected by mutual exclusion locks (a typical producer-consumer problem). This adds synchronization overhead that causes an application to perform badly on multiprocessor systems. In other words, in this architecture, the application's system scalability is sacrificed for the sake of load scalability.

Of course, the usual nightmares of threaded programming, including data corruption, deadlocks, and race conditions, also make the MT architecture (in any form) far from simple to use.

2.3 Event-Driven State Machine Architecture

In the Event-Driven State Machine (EDSM) architecture, a single process is employed to concurrently process multiple connections. The basics of this architecture are described in Comer and Stevens [Reference 2]. The EDSM architecture performs one basic data-driven step associated with a particular connection at a time, thus multiplexing many concurrent connections. The process operates as a state machine that receives an event and then reacts to it.

In the idle state the EDSM calls select(2) or poll(2) to wait for network I/O events. When a particular file descriptor is ready for I/O, the EDSM completes the corresponding basic step (usually by invoking a handler function) and starts the next one. This architecture uses non-blocking system calls to perform asynchronous network I/O operations. For more details on non-blocking I/O see Stevens [Reference 3].
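The loop described above can be sketched in C roughly as follows. This is an illustration under stated assumptions, not code from any particular server: the `conn` structure, the two-state echo machine, and the names `conn_init`, `conn_step`, and `edsm_poll_once` are all invented here, and the connection limit is arbitrary.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_CONNS 64

enum conn_state { WAIT_READ, WAIT_WRITE };

struct conn {
    int fd;                 /* -1 once the connection is closed */
    enum conn_state state;  /* the event this connection is waiting for */
    char buf[256];
    ssize_t len, off;       /* bytes buffered, bytes already echoed back */
};

static void conn_init(struct conn *c, int fd)
{
    c->fd = fd;
    c->state = WAIT_READ;
    c->len = c->off = 0;
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
}

static void conn_close(struct conn *c)
{
    close(c->fd);
    c->fd = -1;
}

/* Handler: perform exactly one basic step for a ready connection. */
static void conn_step(struct conn *c)
{
    ssize_t n;
    if (c->state == WAIT_READ) {
        n = read(c->fd, c->buf, sizeof c->buf);
        if (n > 0) {
            c->len = n;
            c->off = 0;
            c->state = WAIT_WRITE;
        } else if (n == 0 || (errno != EAGAIN && errno != EINTR)) {
            conn_close(c);
        }
    } else {
        n = write(c->fd, c->buf + c->off, (size_t)(c->len - c->off));
        if (n > 0) {
            c->off += n;
            if (c->off == c->len)
                c->state = WAIT_READ;
        } else if (n < 0 && errno != EAGAIN && errno != EINTR) {
            conn_close(c);
        }
    }
}

/* One pass of the event loop: wait in poll(2), then run one step for each
 * ready connection. Returns the number of steps run, or -1 on error. */
static int edsm_poll_once(struct conn *conns, int n, int timeout_ms)
{
    struct pollfd pfds[MAX_CONNS];
    int i, steps = 0;
    if (n > MAX_CONNS)
        return -1;
    for (i = 0; i < n; i++) {
        pfds[i].fd = conns[i].fd;   /* poll() ignores negative fds */
        pfds[i].events = conns[i].state == WAIT_READ ? POLLIN : POLLOUT;
        pfds[i].revents = 0;
    }
    if (poll(pfds, (nfds_t)n, timeout_ms) < 0)
        return -1;
    for (i = 0; i < n; i++) {
        if (pfds[i].revents != 0) {
            conn_step(&conns[i]);
            steps++;
        }
    }
    return steps;
}
```

Note how the per-connection "stack" has to be kept by hand in `struct conn` (state, buffer, offsets); every additional protocol step means another explicit state.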

To take advantage of hardware parallelism (real concurrency), multiple identical processes may be created. This is called Symmetric Multi-Process EDSM and is used, for example, in the Zeus Web Server ([Reference 4]). To more efficiently multiplex disk I/O, special "helper" processes may be created. This is called Asymmetric Multi-Process EDSM and was proposed for Web servers by Druschel and others [Reference 5].

EDSM is probably the most scalable architecture for IAs. Because the number of simultaneous connections (virtual concurrency) is completely decoupled from the number of kernel execution vehicles (processes), this architecture has very good load scalability. It requires only minimal user-level resources to create and maintain each additional connection.

Like MP applications, Multi-Process EDSM has very good system scalability because no resources are shared among different processes and there is no synchronization overhead.

Unfortunately, the EDSM architecture is monolithic rather than based on the concept of threads, so new applications generally need to be implemented from the ground up. In effect, the EDSM architecture simulates threads and their stacks the hard way.

3. State Threads Library

The State Threads library combines the advantages of all of the above architectures. The interface preserves the programming simplicity of thread abstraction, allowing each simultaneous connection to be treated as a separate thread of execution within a single process. The underlying implementation is close to the EDSM architecture as the state of each particular concurrent session is saved in a separate memory segment.

3.1 State Changes and Scheduling

The state of each concurrent session includes its stack environment (stack pointer, program counter, CPU registers) and its stack. Conceptually, a thread context switch can be viewed as a process changing its state. There are no kernel entities involved other than processes. Unlike other general-purpose threading libraries, the State Threads library is fully deterministic. The thread context switch (process state change) can only happen in a well-known set of functions (at I/O points or at explicit synchronization points). As a result, process-specific global data does not have to be protected by mutual exclusion locks in most cases. The entire application is free to use all the static variables and non-reentrant library functions it wants, greatly simplifying programming and debugging while increasing performance. This is somewhat similar to a co-routine model (co-operatively multitasked threads), except that no explicit yield is needed -- sooner or later, a thread performs a blocking I/O operation and thus surrenders control. All threads of execution (simultaneous connections) have the same priority, so scheduling is non-preemptive, like in the EDSM architecture. Because IAs are data-driven (processing is limited by the size of network buffers and data arrival rates), scheduling is non-time-slicing.
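To illustrate what "a thread context switch is a process changing its state" means, here is a toy sketch that uses the `ucontext` functions available on glibc-based Linux systems. It is a stand-in chosen for readability and makes no claim about how the State Threads library switches contexts internally; the names `session`, `run_demo`, and `trace` are invented for the example.

```c
#include <ucontext.h>

static ucontext_t sched_ctx;    /* saved state of the "scheduler" */
static ucontext_t session_ctx;  /* saved state of one session */
static char session_stack[64 * 1024];
static int trace[8];
static int trace_len;

/* One session: it runs until it would block, then hands control back.
 * No lock is needed around trace[] because a switch can only happen at
 * the explicit swapcontext() call. */
static void session(void)
{
    trace[trace_len++] = 1;                 /* first step of the session */
    swapcontext(&session_ctx, &sched_ctx);  /* "block on I/O": yield */
    trace[trace_len++] = 3;                 /* resumed by the scheduler */
}

/* Scheduler side: start the session, regain control when it yields,
 * resume it, and regain control again when it finishes. */
static int run_demo(void)
{
    trace_len = 0;
    getcontext(&session_ctx);
    session_ctx.uc_stack.ss_sp = session_stack;
    session_ctx.uc_stack.ss_size = sizeof session_stack;
    session_ctx.uc_link = &sched_ctx;       /* where to go when session returns */
    makecontext(&session_ctx, session, 0);

    swapcontext(&sched_ctx, &session_ctx);  /* run session until it yields */
    trace[trace_len++] = 2;
    swapcontext(&sched_ctx, &session_ctx);  /* resume session until it ends */
    trace[trace_len++] = 4;
    return trace_len;
}
```

The session's registers and private stack are all that has to be saved and restored; the kernel sees a single process throughout.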

Only two types of external events are handled by the library's scheduler, because only these events can be detected by select(2) or poll(2): I/O events (a file descriptor is ready for I/O) and time events (some timeout has expired). However, other types of events (such as a signal sent to a process) can also be handled by converting them to I/O events. For example, a signal handling function can perform a write to a pipe (write(2) is reentrant/asynchronous-safe), thus converting a signal event to an I/O event.

To take advantage of hardware parallelism, as in the EDSM architecture, multiple processes can be created in either a symmetric or asymmetric manner. Process management is not in the library's scope but instead is left up to the application.

There are several general-purpose threading libraries that implement a many-to-one model (many user-level threads to one kernel execution vehicle), using the same basic techniques as the State Threads library (non-blocking I/O, event-driven scheduler, and so on). For an example, see GNU Portable Threads ([Reference 6]). Because they are general-purpose, these libraries have different objectives than the State Threads library. The State Threads library is not a general-purpose threading library, but rather an application library that targets only certain types of applications (IAs) in order to achieve the highest possible performance and scalability for those applications.

3.2 Scalability

State threads are very lightweight user-level entities, and therefore creating and maintaining user connections requires minimal resources. An application using the State Threads library scales very well with the increasing number of connections.

On multiprocessor systems an application should create multiple processes to take advantage of hardware parallelism. Using multiple separate processes is the only way to achieve the highest possible system scalability. This is because duplicating per-process resources is the only way to avoid significant synchronization overhead on multiprocessor systems. Creating separate UNIX processes naturally offers resource duplication. Again, as in the EDSM architecture, there is no connection between the number of simultaneous connections (which may be very large and changes within a wide range) and the number of kernel entities (which is usually small and constant). In other words, the State Threads library makes it possible to multiplex a large number of simultaneous connections onto a much smaller number of separate processes, thus allowing an application to scale well with both the load and system size.
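Since process management is left to the application, a symmetric setup might look like the following sketch: fork one worker per online CPU, each of which would then run its own scheduler and set of connections. The function names are hypothetical, `demo_worker` is only a placeholder for a real worker body, and `_SC_NPROCESSORS_ONLN` is a widely supported extension (Linux, Solaris, the BSDs) rather than a POSIX requirement.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Number of processes to create: one per online CPU, and at least one. */
static int worker_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}

/* Fork n identical workers. Each child runs worker_main(index) and exits
 * with its return value. Child pids are stored in pids[]; the return
 * value is the number of workers actually started. */
static int spawn_workers(int n, int (*worker_main)(int), pid_t *pids)
{
    int i;
    for (i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0)
            _exit(worker_main(i));
        pids[i] = pid;
    }
    return i;
}

/* Placeholder worker body. A real worker would initialize its threading
 * library here and serve connections until shutdown. */
static int demo_worker(int index)
{
    return index;
}
```

The number of kernel entities is fixed by `worker_count()` at startup and stays the same no matter how many connections each worker later multiplexes.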

3.3 Performance

Performance is one of the library's main objectives. The State Threads library is implemented to minimize the number of system calls and to make thread creation and context switching as fast as possible. For example, per-thread signal mask does not exist (unlike POSIX threads), so there is no need to save and restore a process's signal mask on every thread context switch. This eliminates two system calls per context switch. Signal events can be handled much more efficiently by converting them to I/O events (see above).

3.4 Portability

The library uses the same general, underlying concepts as the EDSM architecture, including non-blocking I/O, file descriptors, and I/O multiplexing. These concepts are available in some form on most UNIX platforms, making the library very portable across many flavors of UNIX. There are only a few platform-dependent sections in the source.

3.5 State Threads and NSPR

The State Threads library is a derivative of the Netscape Portable Runtime library (NSPR) [Reference 7]. The primary goal of NSPR is to provide a platform-independent layer for system facilities, where system facilities include threads, thread synchronization, and I/O. Performance and scalability are not the main concern of NSPR. The State Threads library addresses performance and scalability while remaining much smaller than NSPR. It is contained in 8 source files as opposed to more than 400, but provides all the functionality that is needed to write efficient IAs on UNIX-like platforms.

                                        NSPR        State Threads
Lines of code                           ~150,000    ~3000
Dynamic library size (debug version)
    IRIX                                ~700 KB     ~60 KB
    Linux                               ~900 KB     ~70 KB


4. Conclusion

State Threads is an application library which provides a foundation for writing Internet Applications. To summarize, it has the following advantages:

It allows the design of fast and highly scalable applications. An application will scale well with both load and number of CPUs.

It greatly simplifies application programming and debugging because, as a rule, no mutual exclusion locking is necessary and the entire application is free to use static variables and non-reentrant library functions.

The library's main limitation:

All I/O operations on sockets must use the State Threads library's I/O functions because only those functions perform thread scheduling and prevent the application's processes from blocking.


  1. Apache Software Foundation, http://www.apache.org.
  2. Douglas E. Comer, David L. Stevens, Internetworking With TCP/IP, Vol. III: Client-Server Programming And Applications, Second Edition, Ch. 8, 12.
  3. W. Richard Stevens, UNIX Network Programming, Second Edition, Vol. 1, Ch. 15.
  4. Zeus Technology Limited, http://www.zeus.co.uk.
  5. Peter Druschel, Vivek S. Pai, Willy Zwaenepoel, Flash: An Efficient and Portable Web Server. In Proceedings of the USENIX 1999 Annual Technical Conference, Monterey, CA, June 1999.
  6. GNU Portable Threads, http://www.gnu.org/software/pth/.
  7. Netscape Portable Runtime, http://www.mozilla.org/docs/refList/refNSPR/.

Other resources covering various architectural issues in IAs

  1. Dan Kegel, The C10K problem, http://www.kegel.com/c10k.html.
  2. James C. Hu, Douglas C. Schmidt, Irfan Pyarali, JAWS: Understanding High Performance Web Systems, http://www.cs.wustl.edu/~jxh/research/research.html.

Portions created by SGI are Copyright © 2000 Silicon Graphics, Inc. All rights reserved.





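1. Definitions

1.1 Internet Applications

An Internet Application (IA) is a network client or server program that accepts client connections and may also need to connect to other servers. In an IA, the arrival and departure of data usually drive the control flow; that is, an IA is a data-driven program. For each connection, an IA does a finite amount of work, including exchanging data with its peer, which may be a client or a server. The typical transaction steps of an IA are: accept a connection, read the request, do a finite amount of work to process it, and write the response to the peer. A Web server is one example of an IA; a more typical example is a proxy server, because it accepts client connections and also connects to other servers.

1.2 Performance and Scalability

The performance of an IA is usually evaluated by its throughput, measured in transactions per second or bytes per second (one can be converted to the other given the average transaction size). Several benchmarks, such as SPECweb96, WebStone, and WebBench, measure Web applications under specific workloads. Although scalability has no common definition, it generally means the ability of a system to sustain its performance when external conditions change. For IAs, the external condition is either the number of connections (concurrency) or the underlying hardware (number of CPUs, memory, and so on). There are therefore two kinds of scalability: load scalability and system scalability.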



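We say that a system has good load scalability if it still works well under high load. The SPECweb99 benchmark reflects load scalability fairly well, because it measures the maximum number of connections the system can support while each connection receives a required minimum throughput (Translator's note: this is the point marked Capacity in the figure, where the gray diagonal line crosses the blue line). This is unlike SPECweb96 and other benchmarks, which use the system's throughput as their metric (Translator's note: Max throughput in the figure, that is, the ceiling of the blue line).

The gray line (min acceptable throughput per client) shows the throughput that each client requires; service is smooth only at or above this level.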








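1.3 Concurrency

An IA must provide virtual concurrency to serve its users simultaneously. To achieve maximum performance, the number of kernel-scheduled programming entities that an IA creates should stay close to the real concurrency of the system (within an order of magnitude) (Translator's note: use as many processes as there are CPUs). These kernel-scheduled entities are kernel execution vehicles, such as Solaris lightweight processes and IRIX kernel threads. In other words, the number of kernel execution vehicles should be determined by the hardware, not by the number of concurrent connections (Translator's note: the number of processes should be dictated by the number of CPUs, not by the number of connections).

2. Existing Architectures

Several architectures are in common use for IAs (Internet Applications), including the Multi-Process, Multi-Threaded, and Event-Driven State Machine architectures.

2.1 Multi-Process Architecture: MP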






2.2 Multi-Threaded Architecture: MT








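2.3 Event-Driven State Machine Architecture: EDSM

In the Event-Driven State Machine (EDSM) architecture, a single process handles multiple concurrent connections. Comer and Stevens [Reference 2] describe the basics of this architecture. In EDSM, each connection advances by only one data-driven step at a time (Translator's note: for example, receive one packet and act on it once), so many concurrent connections must be multiplexed (Translator's note: one process is multiplexed across many connections). The process is designed as a state machine: each time it receives an event, it handles the event and moves to the next state.

In the idle state, the EDSM calls select/poll/epoll to wait for network events. When a particular connection becomes readable or writable, the EDSM invokes the corresponding handler function and then moves on to the next connection. The EDSM architecture uses non-blocking system calls to perform asynchronous network I/O. For more on non-blocking I/O, see Stevens [Reference 3].

To take advantage of hardware parallelism, multiple independent processes can be created. This is called Symmetric Multi-Process EDSM and is used, for example, by the Zeus Web Server [Reference 4] (Translator's note: a commercial high-performance server). To multiplex disk I/O more efficiently, helper processes can be created. This is called Asymmetric Multi-Process EDSM and was proposed for Web servers by Druschel and others [Reference 5].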




3. State Threads Library



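3.1 State Changes and Scheduling

The library's scheduler handles only two types of external events:

1. I/O events: a file descriptor is ready for reading or writing.

2. Timer events: a specified timeout has expired.

There are several general-purpose threading libraries that implement a many-to-one model (many user-level threads to one kernel execution vehicle) using techniques similar to those of the State Threads library (non-blocking I/O, an event-driven scheduler, and so on). GNU Portable Threads [Reference 6] is one example. Because they are general-purpose, their objectives differ from those of State Threads. State Threads is not a general-purpose threading library; it is designed for the small class of IAs that need high performance, high concurrency, high scalability, and readable code.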

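3.2 Scalability

3.3 Performance

3.4 Portability

3.5 State Threads and NSPR

The State Threads library is derived from the Netscape Portable Runtime library (NSPR) [Reference 7]. The primary goal of NSPR is to provide a platform-independent layer for system facilities, including threads, thread synchronization, and I/O. Performance and scalability are not NSPR's main concern. State Threads addresses performance and scalability while being much smaller than NSPR: it consists of only 8 source files, yet provides the functionality needed to write efficient IAs on UNIX:

                                        NSPR        State Threads
Lines of code                           ~150,000    ~3000
Dynamic library size (debug version)
    IRIX                                ~700 KB     ~60 KB
    Linux                               ~900 KB     ~70 KB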



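To summarize, State Threads has the following advantages:

1. It allows the design of efficient IAs with very good load scalability and system scalability.

2. It simplifies programming and debugging, because no synchronization locks are needed and static variables and non-reentrant functions can be used.

The library's main limitation:

1. All socket I/O must use the library's I/O functions, because only then can the scheduler keep the process from blocking (Translator's note: I/O done through the operating system's own socket functions is outside the scheduler's control).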





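/*
 * build and execute
 *     gcc -I../obj -g huge_threads.c ../obj/libst.a -o huge_threads;
 *     ./huge_threads 10000
 * 10K report:
 *     10000 threads, running on 1 CPU 512M machine,
 *     CPU 6%, MEM 8.2% (~42M = 42991K = 4.3K/thread)
 * 30K report:
 *     30000 threads, running on 1 CPU 512M machine,
 *     CPU 3%, MEM 24.3% (4.3K/thread)
 */
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <st.h>

/* Thread body: arg carries the thread index i; thread i sleeps i * 10 ms per pass. */
void* do_calc(void* arg){
    int sleep_ms = (int)(long)arg * 10;

    /* The loop header is an assumption: the indentation of the two lines
       below implies an enclosing loop that was lost from the listing.
       st_usleep() yields to the scheduler while this thread sleeps. */
    for(;;){
        printf("in sthread #%dms\n", sleep_ms);
        st_usleep((st_utime_t)sleep_ms * 1000);
    }

    return NULL;
}

int main(int argc, char** argv){
    if(argc <= 1){
        printf("Test the concurrence of state-threads!\n"
            "Usage: %s <sthread_count>\n"
            "eg. %s 10000\n", argv[0], argv[0]);
        return -1;
    }

    if(st_init() < 0){
        fprintf(stderr, "st_init failed\n");
        return -1;
    }

    int i;
    int count = atoi(argv[1]);
    for(i = 1; i <= count; i++){
        /* Pass the index through the pointer argument; not joinable, default stack size. */
        if(st_thread_create(do_calc, (void*)(long)i, 0, 0) == NULL){
            fprintf(stderr, "st_thread_create failed for thread %d\n", i);
            return -1;
        }
    }

    /* Park the primordial thread so that the created threads get to run;
       returning from main here would end the process immediately. */
    st_thread_exit(NULL);

    return 0;
}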
