Today's increasing bandwidth and faster networking hardware have made it difficult for a single CPU to keep up. Multiple cores and packages have helped matters on the transmit side, but the receive side is trickier. Tom Herbert's receive packet steering (RPS) patches, which we looked at back in November, provide a way to steer packets to particular CPUs based on a hash of the packet's protocol data. Those patches were applied to the network subsystem tree and are bound for 2.6.35, but now Herbert is back with an enhancement to RPS that will attempt to steer packets to the CPU on which the receiving application is running: receive flow steering (RFS).
RFS uses the RPS hash table to store the CPU of an application when it calls recvmsg() or sendmsg(). Instead of picking an arbitrary CPU based on the hash and a CPU mask optionally set by an administrator, as RPS does, RFS tries to use the CPU where the receiving application is running. Based on the hash calculated on the incoming packet, RFS can look up the "proper" CPU and assign the packet there.
The RPS CPU masks, which can be set via sysfs for each device (and queue for devices with multiple queues), represent the allowable CPUs to assign for a packet. But dynamically changing those values introduces the possibility of out-of-order packets. For RPS, with largely static CPU masks, it was not necessarily a big problem. For RFS, however, multiple threads trying to read from the same socket, while potentially bouncing around to different CPUs, would cause the CPU value in the hash table to change frequently, thus increasing the likelihood of out-of-order packets.
For RFS, that was considered to be a "non-starter", Herbert said, so a different approach was required. To eliminate the out-of-order packets, two types of hash tables are created, both indexed by the hash calculated from the packet information. The global rps_sock_flow_table is populated by the recvmsg() or sendmsg() call with the CPU number where the application is running (this is called the "desired" CPU). Each device queue then gets an rps_dev_flow_table which contains the most recent CPU used to handle packets for that connection (which is called the "current" CPU). In addition, the value of the queue tail counter for the current CPU's backlog queue is stored in the rps_dev_flow_table entry.
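The relationship between the two tables can be pictured with a small user-space model. This is only a sketch in Python, not the kernel's C code: the class names, the `NO_CPU` sentinel, the table size, and the field names are all hypothetical, chosen to mirror the description above.

```python
from dataclasses import dataclass

NO_CPU = -1        # hypothetical sentinel meaning "no CPU recorded yet"
TABLE_SIZE = 256   # hypothetical size; a power of two so the hash can be masked


@dataclass
class DevFlow:
    """Per-queue entry: the "current" CPU plus the saved queue tail counter."""
    cpu: int = NO_CPU
    last_qtail: int = 0


class SockFlowTable:
    """Models the global table: flow hash -> "desired" CPU."""

    def __init__(self, size=TABLE_SIZE):
        self.mask = size - 1
        self.cpus = [NO_CPU] * size

    def record(self, flow_hash, cpu):
        # In this model, called when the application does recvmsg()/sendmsg().
        self.cpus[flow_hash & self.mask] = cpu

    def desired_cpu(self, flow_hash):
        return self.cpus[flow_hash & self.mask]


class DevFlowTable:
    """Models the per-device-queue table: flow hash -> DevFlow entry."""

    def __init__(self, size=TABLE_SIZE):
        self.mask = size - 1
        self.flows = [DevFlow() for _ in range(size)]

    def entry(self, flow_hash):
        return self.flows[flow_hash & self.mask]


sock_table = SockFlowTable()
dev_table = DevFlowTable()

# An application running on CPU 3 reads from a socket whose flow hash is 0xBEEF.
sock_table.record(flow_hash=0xBEEF, cpu=3)
```

In this model, two flows whose hashes share the same low bits land on the same entry, so a lookup is a hint about where the application ran rather than an exact per-connection record.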
The two CPU values are compared when deciding which CPU to process the packet on (which is done in get_rps_cpu()). If the current CPU (as determined from the rps_dev_flow_table hash table) is unset (presumably for the first packet) or that CPU is offline, the desired CPU (from rps_sock_flow_table) is used. If the two CPU values are the same, that CPU is used. If both are valid CPU numbers but differ, the queue tail counter saved for the current CPU's backlog queue is consulted.
Backlog queues have a queue head counter that gets incremented when packets are removed from the queue. Using that and the queue length, a queue tail counter value can be calculated. That is what gets stored in rps_dev_flow_table. When the kernel makes its decision about which CPU to assign the packet to, it needs to consider both the current (really last used by the kernel) CPU and the desired (last used by an application for sending or receiving) CPU.
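The head and tail bookkeeping can be illustrated with a toy backlog queue. This is a sketch under stated assumptions rather than the kernel's data structure: the `Backlog` class and its method names are hypothetical, and the 32-bit counter width is an assumption made only so that wraparound is visible in the model.

```python
from collections import deque

COUNTER_BITS = 32                        # assumption: counters wrap like 32-bit unsigned values
COUNTER_MASK = (1 << COUNTER_BITS) - 1


class Backlog:
    """Toy per-CPU backlog queue with a head counter (hypothetical model)."""

    def __init__(self):
        self.head = 0            # incremented each time a packet is removed
        self.packets = deque()

    def tail(self):
        # The tail counter is derived, not stored: head counter plus queue length.
        return (self.head + len(self.packets)) & COUNTER_MASK

    def enqueue(self, packet):
        self.packets.append(packet)
        # The returned value is what a flow entry would save for later comparison.
        return self.tail()

    def dequeue(self):
        packet = self.packets.popleft()
        self.head = (self.head + 1) & COUNTER_MASK
        return packet


queue = Backlog()
saved_tail = queue.enqueue("pkt-1")      # flow's most recent packet sits at this tail value
still_pending = saved_tail > queue.head  # the packet has not been processed yet
queue.dequeue()
drained = saved_tail <= queue.head       # the flow's packets are now all processed
```

The plain comparisons at the end ignore counter wraparound, which a real implementation would need to handle.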
The kernel compares the current CPU's queue tail counter (as stored in the hash table) with that CPU's queue head counter. If the tail counter is less than or equal to the head counter, all packets that were put on the queue by this connection have been processed, which in turn means that switching to the desired CPU will not result in out-of-order packets. Otherwise, packets from the connection are still waiting on the current CPU's queue, so the packet stays on the current CPU.
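Put together, the selection rules read roughly like the function below. It is a simplified Python model of the decision as described in this article, not a transcription of get_rps_cpu(): the function name, its parameters, and the `NO_CPU` sentinel are hypothetical, the handling of a missing desired CPU is an assumption, and counter wraparound is ignored.

```python
NO_CPU = -1   # hypothetical sentinel meaning "no CPU recorded"


def choose_cpu(current_cpu, desired_cpu, saved_tail, queue_heads, online_cpus):
    """Pick the CPU for an incoming packet of one flow.

    current_cpu -- CPU last used by the kernel for this flow (or NO_CPU)
    desired_cpu -- CPU where the application last ran (or NO_CPU)
    saved_tail  -- queue tail counter saved when the flow last used current_cpu
    queue_heads -- mapping of CPU number to that CPU's backlog head counter
    online_cpus -- set of CPU numbers that are currently online
    """
    if desired_cpu == NO_CPU:
        # Assumption: with no application hint, leave the flow where it is.
        return current_cpu
    if current_cpu == NO_CPU or current_cpu not in online_cpus:
        # First packet of the flow, or the old CPU has gone away.
        return desired_cpu
    if current_cpu == desired_cpu:
        return current_cpu
    if saved_tail <= queue_heads[current_cpu]:
        # Everything this flow queued on the old CPU has been processed,
        # so moving cannot reorder packets.
        return desired_cpu
    # Packets are still pending on the old CPU; stay there to keep ordering.
    return current_cpu


heads = {0: 10, 1: 0}
online = {0, 1}

first_packet = choose_cpu(NO_CPU, 1, 0, heads, online)
must_wait = choose_cpu(0, 1, 12, heads, online)
can_move = choose_cpu(0, 1, 10, heads, online)
```

Here `first_packet` and `can_move` select CPU 1, while `must_wait` stays on CPU 0 because the saved tail (12) is still ahead of CPU 0's head counter (10).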
Herbert's current patch is for TCP, but RFS should be "usable for other flow oriented protocols". The benefit is better CPU locality for the processing of the packet, both by the kernel and by the application itself. Depending on various factors (cache hierarchy and application are given as examples), it can increase the number of packets per second that can be processed and lower the latency before a packet gets processed. But, interestingly, "on simple benchmarks, we don't necessarily see improvement and sometimes see degradation".
For more complex benchmarks, the performance increase looks to be significant. Herbert gave numbers for a netperf run where the transactions per second went from 104K without either RFS or RPS, to 290K for the best RPS configuration, and to 303K with RFS and RPS. A different test, with 100 threads handling an RPC-like request/response with some user-space work being done, was even more dramatic. That test showed 103K, 174K, and 223K respectively, but also showed a marked decrease in the latency for both RPS and RPS + RFS.
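To put those figures in relative terms, the ratios can be worked out directly from the numbers quoted above. The short Python calculation below uses only those six values (in thousands of transactions per second); the dictionary layout and variable names are illustrative.

```python
# Transactions per second, in thousands, as quoted in the text above.
results = {
    "netperf": {"baseline": 104, "RPS": 290, "RPS+RFS": 303},
    "RPC-like, 100 threads": {"baseline": 103, "RPS": 174, "RPS+RFS": 223},
}

speedups = {}
for test, tps in results.items():
    speedups[test] = {
        # Improvement over running with neither RPS nor RFS.
        "RPS": tps["RPS"] / tps["baseline"],
        "RPS+RFS": tps["RPS+RFS"] / tps["baseline"],
        # Additional gain from RFS on top of the best RPS configuration.
        "RFS over RPS": tps["RPS+RFS"] / tps["RPS"],
    }

for test, ratios in speedups.items():
    summary = ", ".join(f"{name}: {ratio:.2f}x" for name, ratio in ratios.items())
    print(f"{test}: {summary}")
```

By this arithmetic, RFS adds roughly 4% on top of RPS in the netperf run but roughly 28% in the RPC-like test, which is where the application's own CPU placement matters most.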
These patches are coming from Google, which has been known to process a few packets using the Linux kernel. If RFS is being used on production systems at Google, that would seem to bode well for its reliability and performance beyond just benchmarks. The patches were posted April 2 and seemed to be generally well-received, though it is a little early to tell when they might make it into the mainline. It seems rather likely, however, that we will see them in either 2.6.35 or 2.6.36.