http://lwn.net/Articles/438256/
Since the dawn of time, or for at least as long as I have been involved, the Linux kernel has deployed a concept called "plugging" on block devices. When I/O is queued to an empty device, that device enters a plugged state. This means that I/O isn't immediately dispatched to the low-level device driver; instead it is held back by this plug. When a process is going to wait on the I/O to finish, the device is unplugged and request dispatching to the device driver is started. The idea behind plugging is to allow a buildup of requests to better utilize the hardware and to allow merging of sequential requests into one single larger request. The latter is an especially big win on most hardware; writing or reading bigger chunks of data at a time usually yields good improvements in bandwidth. With the release of the 2.6.39-rc1 kernel, block device plugging was drastically changed. Before we go into that, let's take a historical look at how plugging has evolved.
Back in the early days, plugging a device involved global state. This was before SMP scalability was an issue, and having global state made it easier to handle the unplugging. If a process was about to block for I/O, any plugged device was simply unplugged. This scheme persisted in pretty much the same form until the early versions of the 2.6 kernel, where it began to severely impact SMP scalability on I/O-heavy workloads.
In response to this problem, the plug state was turned into a per-device entity in 2004. This scaled well, but now you suddenly had no way to unplug all devices when going to sleep waiting for page I/O. This meant that the virtual memory subsystem had to be able to unplug the specific device that would be servicing page I/O. A special hack was added for this: sync_page() in struct address_space_operations; this hook would unplug the device of interest.
If you have a more complicated I/O setup with device mapper or RAID components, those layers would in turn unplug any lower-level device. The unplug event would thus percolate down the stack. Some heuristics were also added to auto-unplug the device if a certain depth of requests had been added, or if some period of time had passed before the unplug event was seen. With the asymmetric nature of plugging, where the device was automatically plugged but had to be explicitly unplugged, we've had our fair share of I/O stall bugs in the kernel. While crude, the auto-unplug would at least ensure that we would chug along if someone missed an unplug call after I/O submission.
With really fast devices hitting the market, once again plugging had become a scalability problem and hacks were again added to avoid this. Essentially we disabled plugging on solid-state devices that were able to do queueing. While plugging originally was a good win, it was time to reevaluate things. The asymmetric nature of the API was always ugly and a source of bugs, and the sync_page() hook was always hated by the memory management people. The time had come to rewrite the whole thing.
The primary use of plugging was to allow an I/O submitter to send down multiple pieces of I/O before handing it to the device. Instead of maintaining these I/O fragments as shared state in the device, a new on-stack structure was created to contain this I/O for a short period, allowing the submitter to build up a small queue of related requests. The state is now tracked in struct blk_plug, which is little more than a linked list and a should_sort flag informing blk_finish_plug() whether or not to sort this list before flushing the I/O. We'll come back to that later.
    struct blk_plug {
        unsigned long magic;
        struct list_head list;
        unsigned int should_sort;
    };
The magic member is a temporary addition to detect uninitialized use cases; it will eventually be removed. The new API to do this is straightforward and simple to use:
    struct blk_plug plug;

    blk_start_plug(&plug);
    submit_batch_of_io();
    blk_finish_plug(&plug);
blk_start_plug() takes care of initializing the structure and tracking it inside the task structure of the current process. The latter is important to be able to automatically flush the queued I/O should the task end up blocking between the call to blk_start_plug() and blk_finish_plug(). If that happens, we want to ensure that pending I/O is sent off to the devices immediately. This is important from a performance perspective, but also to ensure that we don't deadlock. If the task is blocking for a memory allocation, memory management reclaim could end up wanting to free a page belonging to a request that is currently residing on our private plug. Similarly, the caller may itself end up waiting for some of the plugged I/O to finish. By flushing this list when the process goes to sleep, we avoid these types of deadlocks.
If blk_start_plug() is called and the task already has a plug structure registered, it is simply ignored. This can happen in cases where the upper layers plug for submitting a series of I/O, and further down in the call chain someone else does the same. I/O submitted without the knowledge of the original plugger will thus end up on the originally assigned plug, and be flushed whenever the original caller ends the plug by calling blk_finish_plug(), or if some part of the call path goes to sleep or is scheduled out.
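To illustrate both the batching and the nesting rule, here is a minimal sketch of a caller writing several extents under one plug. Only blk_start_plug() and blk_finish_plug() are the real API; the extent helpers are hypothetical placeholders for whatever submission path the caller actually uses:

    #include <linux/blkdev.h>

    /* Hypothetical helper that builds and submits the bios for one extent. */
    void submit_extent_bios(struct block_device *bdev, sector_t start);

    static void write_one_extent(struct block_device *bdev, sector_t start)
    {
        struct blk_plug plug;

        /*
         * The caller below already has a plug registered, so this inner
         * blk_start_plug() is simply ignored: I/O submitted here lands on
         * the outer plug and is flushed when the outer caller finishes
         * its plug (or is scheduled out).
         */
        blk_start_plug(&plug);
        submit_extent_bios(bdev, start);
        blk_finish_plug(&plug);
    }

    static void write_many_extents(struct block_device *bdev,
                                   sector_t *starts, int nr)
    {
        struct blk_plug plug;
        int i;

        blk_start_plug(&plug);          /* outer plug owns all queued I/O */
        for (i = 0; i < nr; i++)
            write_one_extent(bdev, starts[i]);
        blk_finish_plug(&plug);         /* flush everything to the device(s) */
    }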
Since the plug state is now device-agnostic, we may end up in a situation where multiple devices have pending I/O on this plug list. These may end up on the plug list in an interleaved fashion, potentially causing blk_finish_plug() to grab and release the related queue locks multiple times. To avoid this problem, a should_sort flag in the blk_plug structure is used to keep track of whether we have I/O belonging to more than one distinct queue pending. If we do, the list is sorted to group identical queues together. This scales better than grabbing and releasing the same locks multiple times.
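To make the grouping concrete, a sort along these lines can be done with the kernel's list_sort() helper, comparing the queue pointers of neighboring requests so that requests for the same queue become adjacent and each queue lock is taken only once. The sketch below mirrors what the flush path does internally, but it is an illustration rather than a copy of the actual code:

    #include <linux/blkdev.h>
    #include <linux/list_sort.h>

    /* Order requests by the queue they belong to. */
    static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
    {
        struct request *rqa = container_of(a, struct request, queuelist);
        struct request *rqb = container_of(b, struct request, queuelist);

        return !(rqa->q <= rqb->q);
    }

    /* Sort the plug list before dispatch, but only when requests for
     * more than one queue were queued up. */
    static void sort_plug_list(struct blk_plug *plug)
    {
        if (plug->should_sort)
            list_sort(NULL, &plug->list, plug_rq_cmp);
    }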
With this new scheme in place, the device need no longer be notified of unplug events. The queue unplug_fn() used to exist for this purpose alone; it has now been removed. For most drivers it is safe to just remove this hook and the related code. However, some drivers used plugging to delay I/O operations in response to resource shortages. One example of that was the SCSI midlayer; if we failed to map a new SCSI request due to a memory shortage, the queue was plugged to ensure that we would call back into the dispatch functions later on. Since this mechanism no longer exists, a similar API has been provided for such use cases. Drivers may now use blk_delay_queue() for this:
    blk_delay_queue(queue, delay_in_msecs);
The block layer will re-invoke request queueing after the specified number of milliseconds have passed. It will be invoked from process context, just as it would have been with the unplug event. blk_delay_queue() honors the queue stopped state, so if blk_stop_queue() was called before blk_delay_queue(), or if it is called after the fact but before the delay has passed, the request handler will not be invoked. blk_delay_queue() must only be used for conditions where the caller doesn't necessarily know when that condition will change states. If resources internal to the driver cause it to need to halt operations for a while, it is more efficient to use blk_stop_queue() and blk_start_queue() to manage those directly.
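For a driver that previously relied on plugging to retry after a transient resource shortage, the replacement looks roughly like the sketch below. The my_driver_*() helpers are hypothetical and the 3 ms delay is arbitrary, but blk_fetch_request(), blk_requeue_request(), and blk_delay_queue() are the real block layer calls:

    #include <linux/blkdev.h>

    /* Hypothetical driver-internal helpers. */
    bool my_driver_resources_available(struct request_queue *q);
    void my_driver_dispatch(struct request *rq);

    static void my_driver_request_fn(struct request_queue *q)
    {
        struct request *rq;

        while ((rq = blk_fetch_request(q)) != NULL) {
            if (!my_driver_resources_available(q)) {
                /*
                 * Put the request back on the queue and have the block
                 * layer call us again from process context after 3 ms,
                 * instead of relying on a later unplug event.
                 */
                blk_requeue_request(q, rq);
                blk_delay_queue(q, 3);
                break;
            }
            my_driver_dispatch(rq);
        }
    }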
These changes have been merged for the 2.6.39 kernel. While a few problems have been found (and fixed), it would appear that the plugging changes have been integrated without greatly disturbing Linus's calm development cycle.
Explicit block device plugging
Posted Apr 15, 2011 23:31 UTC (Fri) by giraffedata (subscriber, #1954) [Link]
The idea behind plugging is to allow a buildup of requests to better utilize the hardware and to allow merging of sequential requests into one single larger request. The latter is an especially big win on most hardware; writing or reading bigger chunks of data at the time usually yields good improvements in bandwidth.
I have never understood this analysis.
First of all, plugging doesn't improve utilization -- it decreases it. It causes the device to be idle more than it otherwise would for a given workload. Improved utilization would be eliminating waste of hardware so you can have less hardware. Plug-free I/O doesn't waste hardware; it uses only time that in a plugged scenario would be unused.
"Bandwidth" here means the amount you can shove down the pipe to the disk, and is meaningful only in a situation where you drive the disk as fast as it will go -- it is never idle. In that case, plugging plays no role.
I saw plugging help once, when the device was poorly designed so that it sucked up all the I/Os at interface speed into a large buffer, then proceeded to do the mechanical processing without reordering or coalescing. In that case, defeating that in-device queue via Linux plugging was a win. The device should have simply pushed back as soon as it had one turnaround time's worth of I/O in its queue, then the block layer would have coalesced and reordered without any need for plugging.
I've also imagined some carefully constructed burst patterns where response time (not bandwidth or utilization) improves with plugging. But in more common cases, plugging hurts response time for the obvious reason that it lets the device be idle for a brief period while the user is waiting for I/O.
Posted Apr 17, 2011 7:05 UTC (Sun) by walex (subscriber, #69836) [Link]
The reason is that "giraffedata" seems to be entirely unaware that there are devices called "disks" that have extremely high and variable latencies, and given these it is possible that bunching IO requests allows the elevator to minimize latencies in such a way that the pauses imposed by plugging may be worthwhile.
There are two very big problems with plugging, or rather its current implementation, which makes it a tremendously stupid thing as a result:
* Putting it in the block layer is extremely bad, because there are storage devices that don't have high and variable latencies and for which plugging is counterproductive. If plugging makes any sense it should be in the device drivers.
* Plugging quantizes the flow of IO requests making them essentially synchronous with the periodic expiry of the plugging timer, and the resulting bunching, which is indeed the intended effect as to scheduling, can have bad consequences on page cache usage, and limits the bandwidth usable for the device in common cases (long nearly contiguous writes).
Plugging was introduced IIRC as a way to cheat on some common benchmark.
Plugging is a gross mistake that should be entirely removed from the page cache, and perhaps turned into some kind of scheduling library available to device drivers for use when the relevant device latency profiles might conceivably benefit (almost never actually).
Posted Apr 17, 2011 9:44 UTC (Sun) by neilbrown (subscriber, #359) [Link]
There is no unplugging timer. There was a timer in the previous code that would unplug after 3ms, but that was mainly a 'just in case' measure. Normally the unplug would happen much earlier. If it was only the timer that triggered an unplug then you are right, performance would be terrible.
Also plugging was only relevant when the device was idle. If the queue was not empty it would not get plugged. So with long, nearly continuous writes, the queue would only be plugged once at the very beginning.
The purpose of plugging is primarily about latency, not bandwidth. If bandwidth is an issue, your queue will not be empty, so the time it takes for a request to get to the front of the queue is long enough for any other related requests to be seen and merged so there is no point in plugging (at least the old style - the new style still brings benefits).
The new plugging code is a bit different. Rather than only plugging idle devices it doesn't really plug devices at all. It plugs request submitters instead. (So the new code is a lot closer to the page cache than the old code).
So when a process submits a request, it gets queued in the process, not in the device. Once a process has submitted all that it wants to submit (or whenever it schedules - to avoid deadlocks), the requests queued in the process are released.
So you get a similar effect on the starting transient - a large number of pre-sorted requests gets handled all at once. However there are other advantages as well in terms of lock contention.
I think it is very hard to imagine plugging slowing down even a very fast low-latency device. The page cache has a bunch of pages that it wants to perform IO on and it assembles them into a big chunk and sends them all to the device. It is true that the first request won't get there quite as soon, but unless the device is as fast as memory, then it will still get there fast enough that you probably cannot measure the difference.
And if a device is as fast as memory, then it probably shouldn't be under the request_queue code at all - a driver more like the umem.c driver might be appropriate. It just takes bios directly and turns them into DMA descriptors. But even that uses plugging so it can start a chain of DMAs at once rather than just one.
I've probably rambled a bit there, but I really think you aren't being fair to plugging. It may not be perfect but I would need a lot more evidence before I could see any justification for it being a gross mistake.
Posted Apr 27, 2011 5:26 UTC (Wed) by dlang (✭ supporter ✭, #313) [Link]
we had a similar discussion in rsyslog when we introduced the capability to write log entries to databases in batches rather than individually. my initial proposal was to try and queue a set number of entries to write at once (with a timer to make sure they get written _reasonably_ soon in any case), but it was pointed out that if you just write what's ready, and let everything else queue up in the meantime, the size of the writes auto-tunes itself.
i.e. by always writing whatever's pending in the queue (up to a max) when you have the ability to write, you achieve both low-latency for the initial writes (and low load), and high efficiency under heavy load (because the queue backs up while you are doing the 'inefficient' small writes). This auto-tunes for the lowest available latency.
I see two costs in doing this.
1. more device actions than an optimally batched mode
2. more CPU cycles used to process the small batches
if the system is idle enough, these don't matter, I could see them becoming an issue if they either cost more power, or use a resource that could otherwise be used by another process (cpu cycles, or bus bandwidth)
has anyone tried just doing away with plugging to see what the results are? (especially on anything that measures more than how short the device utilized time can be)
Posted Apr 29, 2011 7:04 UTC (Fri) by andresfreund (subscriber, #69562) [Link]
If your piece of code does two writes to a normal rotating media disk without plugging - as far as I understand it - the first write will cause a disk activity which will take up to 15ms for an idle disk.
Some microseconds later your code will submit a second page. Unfortunately that will wait in some queue until the disk is finished writing the first block.
Which means you will need ~30ms in the worst case.
On the other hand, if you plugged the device before doing those writes, it will sort those writes to be in disk order and the disk will be able to do it in one rotation. Which means it's ~15ms.
Posted Apr 29, 2011 7:27 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]
if the second write arrives 15ms after the first and the device has been plugged, then with plugging it takes 30ms to get both writes on disk (15ms of delay + 15 ms of activity), without plugging it takes 15ms to get both writes on disk (15ms of activity for the first one and 15ms of activity for the second one)
if the data is submitted in a shorter period of time, then the two writes may finish faster with plugging, but if they arrive further apart (and the device remains plugged) all you are doing is delaying how long it takes for the first one to get to disk.
so you should never plug longer than it would take to write the first block to disk, but figuring out how long that will be is hard, so the timer to unplug is set long
but if you are writing a larger amount, during the 15ms while the disk is processing the first write, multiple additional requests will queue up and be able to be combined.
and if they don't, then the disk is active for 30ms instead of for 15ms, but nothing else had a need for it so why do you care? If anything else had a need for the disk it would have generated additional requests that would be combined with the second request instead of the first and second being combined and finished with the third being processed independently (possibly with plugging of its own)
yes, some mobile users may care in an attempt to save power, but in that case they really want to have much higher latency, on the order of seconds or tens of seconds to avoid spinning up the drive, so that's not the relevant use case.
if this was being done under application control (because after all, the application is the only thing that knows what's going to happen in the future) I could see it. but you are trying to have the kernel guess if there is going to be more activity in the near future.
case 1. if there is no activity in the near future the plug just delayed the write
if there is a tiny bit of activity in the near future, each one can be treated independently as if there is no activity in the future.
case 2. if there is a lot of activity, the second request will get delayed by the time it takes to process the first request instead of by the time the plug is in place.
is there really good enough prediction of future activity to make the kernel's guess that case 2 will happen worth the complexity, the time spent managing the plugs, and the increased latency for the first activity?
Posted Apr 29, 2011 7:40 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]
assume that each disk action takes 15ms and data arrives every 10ms
with plugging that waits up to 15ms or until a second item arrives, you write
2 blocks starting at 10ms finishing at 25ms
2 blocks starting at 30ms finishing at 45ms
2 blocks starting at 50ms finishing at 65ms
2 blocks starting at 70ms finishing at 85ms
etc
without plugging you write
1 block starting at 0ms finishing at 15ms
1 block starting at 15ms finishing at 30ms
2 blocks starting at 30ms finishing at 45ms
1 block starting at 45 ms finishing at 60ms
2 blocks starting at 60ms finishing at 75ms
1 block starting at 75ms finishing at 90ms
etc
does this really make a difference? yes, in the second case the disk is busy continuously rather than having a 5ms pause between activity but does that matter?
say the data arrives twice as fast (every 5ms)
with plugging
2 blocks starting at 5ms finishing at 20ms
3 blocks starting at 20ms finishing at 35ms
3 blocks starting at 35ms finishing at 50ms
without plugging
1 block starting at 0ms finishing at 15ms
3 blocks starting at 15ms finishing at 30ms
3 blocks starting at 30ms finishing at 45ms
where is the gain?
Posted Apr 29, 2011 9:29 UTC (Fri) by neilbrown (subscriber, #359) [Link]
In the linux kernel, plugging is not timer based.
(There was a timer in the previous implementation, but it was only a last-ditch unplug in case there were bugs: slow is better than frozen).
In the old code a device would plug whenever it wanted to, which was typically when a new request arrived for an empty queue. It would then unplug as soon as some thread started waiting for a request on that device to complete. I think it also would unplug explicitly in some cases after submitting lots of writes that were expected to be synchronous, but I'm not 100% certain.
So in the read case for example a read syscall would submit a request to read a page, then another request to read the next page (because it was an 8K read), then maybe a few more requests to read ahead some more pages, then wait for that first read to complete. Waiting for the read-ahead requests maybe isn't critical, but waiting for that second page would reduce latency. Now to be fair, if the two pages were adjacent on the disk they would probably have been combined into a single request before being submitted, and if they aren't then maybe keeping them together isn't so important. But as soon as you get 3 non-adjacent pages in the read, there is a real possible gain from sorting before starting IO.
The new plugging code is quite different. The unplug happens when the thread submitting requests has finished submitting a bunch of requests. It is explicit rather than using the heuristic of 'unplug when someone waits' (hence the title of the article). This means it happens a little bit sooner - there is never any timer delay at all.
Rather than thinking of it as 'plugging' it is probably best to think of it as early-aggregation. hch has suggested that this be even more explicit. i.e. the thread generates a collection of related requests (quite possibly several files full of writes in the write-back case) and submits them all to the device at once. Not only does this clearly give a good opportunity to sort requests - more importantly it means we only take the per-device lock once for a large number of requests. If multiple threads are writing to a device concurrently, this will reduce lock contention making it useful even when the device queue is fairly full (when normal plugging would not apply at all).
The equivalent logic in a 'syslogd' style program would be to simply always service read requests before write requests.
So when a log message comes in, it is queued to be written.
Before you actually write it though you check if another log message is ready to be read from some incoming socket. If it is you read it and queue it. You only write when there is nothing to be read, or your queue is full.
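A rough userspace sketch of that "read before write" policy, assuming hypothetical queue_message(), read_one_message(), and flush_queue() helpers and an arbitrary MAX_QUEUED cap (rsyslog's real machinery looks different):

    #include <poll.h>

    #define MAX_QUEUED 64           /* arbitrary cap on the write batch */

    /* Hypothetical helpers provided elsewhere. */
    struct msg;
    struct msg *read_one_message(int fd);
    void queue_message(struct msg *m);
    void flush_queue(void);

    void service_loop(int sock_fd)
    {
        struct pollfd pfd = { .fd = sock_fd, .events = POLLIN };
        int queued = 0;

        for (;;) {
            /* Prefer reading: drain whatever is ready before writing. */
            while (queued < MAX_QUEUED && poll(&pfd, 1, 0) > 0) {
                queue_message(read_one_message(sock_fd));
                queued++;
            }
            if (queued > 0) {
                /* Nothing left to read (or the queue is full):
                 * write the whole batch in one go. */
                flush_queue();
                queued = 0;
            } else {
                poll(&pfd, 1, -1);  /* idle: block until input arrives */
            }
        }
    }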
I agree that having a timed unplug event doesn't make much sense.
Posted Apr 29, 2011 14:57 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]
how can the kernel know when the application has finished submitting a bunch of requests?
or is it that the application submits one request, but something in the kernel is breaking it into a bunch of requests that all get submitted at once, and plugging is an attempt to allow the kernel to recombine them? (but that doesn't match your comment about sorting 3 non-adjacent requests being a win, how can one action by an application generate 3 non-adjacent requests?)
I'm obviously missing something here.
if the application is doing multiple read/write commands, I don't see how the kernel can possibly know how soon the next activity will be submitted after each command is run.
if the application is doing something with a single command, it seems like the problem is that it shouldn't be broken up to begin with, so there would be no need to plug to try and combine them
Explicit block device plugging
Posted Apr 29, 2011 22:58 UTC (Fri) by neilbrown (subscriber, #359) [Link]
Actions of the application and requests to devices are fairly well disconnected thanks to the page cache. An app writes to the page cache and the page cache doesn't even think about writing to the device for typically 30 seconds. Of course if the app calls fsync, that expedites things. So a partial answer to "how can the kernel know when the application has finished submitting a bunch of requests" is "the application calls 'fsync' - if it cares".
On the read side, the page cache performs read-ahead so that hopefully every read request can be served from cache - and certainly the device gets large read requests even if the app is making lots of small read requests.
Also the kernel does break things into a bunch of requests which then need to be sorted. If a file is not contiguous on disk, then you need at least one request for each separate chunk. Plugging allows this sorting to happen before the first request is started.
There is a good reason why the page cache submits lots of individual requests rather than a single list with lots of requests. Every request requires an allocation. When memory gets tight (which affects writes more than reads) it could be that I cannot allocate memory for another request until the previous ones have been submitted and completed. So we submit the requests individually, but combine them at a level a little lower down, and 'unplug' that queue either when all have been submitted or when the thread 'schedules' - which it will typically only do if it blocks on a memory allocation.
So there are two distinct things here that could get confused.
Firstly there is the page cache which deliberately delays writes and expedites reads to allow large requests independent of the request size used by the application.
Then there is the fact that the page cache sends smallish requests to the device, but tends to send a lot in quick succession. These need to be combined when possible, but also flushed as soon as there is any sign of any complication. This last is what "plugging" does.