An alternative to suspend blockers

http://lwn.net/Articles/416690/

If you have been following Linux kernel development over the past few months, it has been hard to overlook the massive thread on the Linux Kernel Mailing List (LKML) resulting from an attempt to merge Google Android's suspend blockers framework into the main kernel tree. Arguably, the presentation of the patches might have been better and the explanation of the problems they addressed might have been more straightforward, but in the end it appears that merging them wouldn't be the smartest thing from the technical point of view. Unfortunately, though, it is difficult to explain that without diving into the technical issues behind the suspend blockers patchset, so I wrote a paper, Technical Background of the Android Suspend Blockers Controversy [PDF], discussing them in detail, which is summarized in this article.

Suspend blockers, or wakelocks in the original Android terminology, are a part of a specific approach to power management, which is based on aggressive utilization of full system suspend to save as much energy as reasonably possible. In this approach the natural state of the system is a sleep state, in which energy is only used for refreshing memory and providing power to a few devices that can generate wakeup signals. The working state, in which the CPUs are executing instructions and the system is generally doing some useful work, is only entered in response to a wakeup signal from one of the selected devices. The system stays in that state only as long as necessary to do certain work requested by the user. When the work has been completed, the system automatically goes back to the sleep state.

This approach can be referred to as opportunistic suspend to emphasize the fact that it causes the system to suspend every time there is an opportunity to do so. To implement it effectively one has to address a number of issues, including possible race conditions between system suspend and wakeup events (i.e. events that cause the system to wake up from sleep states). Namely, one of the first things done during system suspend is to freeze user space processes (except for the suspend process itself) and after that's been completed user space cannot react to any events signaled by the kernel. In consequence, if a wakeup event occurs exactly at the time the suspend process is started, user space may be frozen before it will have a chance to consume the event, which will be delivered to it only after the system is woken up from the sleep state as a result of another wakeup event. Unfortunately, on a cell phone the "deferred" wakeup event may be a very important incoming call, so the above scenario is hardly acceptable for this type of device.

Wakelocks

On Android this issue has been addressed with the help of wakelocks. Essentially, a wakelock is an object that can be in one of two states, active or inactive, and the system cannot be suspended if at least one wakelock is active. Thus, if the kernel subsystem handling a wakeup event activates a wakelock right after the event has been signaled and deactivates it after the event has been passed to user space, the race condition described in the previous paragraph can be avoided. Moreover, on Android, the suspend process is started from kernel space whenever there are no active wakelocks, which addresses the problem of deciding when to suspend, and user space is allowed to manipulate wakelocks. Unfortunately, that requires every user space process doing important work to use wakelocks, which creates unusual and cumbersome issues for application developers to deal with.
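
The rule described above (no suspend while any wakelock is active) can be illustrated with a toy user-space model. This is only a sketch of the idea, not Android's implementation; the class and method names are made up for the illustration.

```python
class Wakelock:
    """Toy model of a wakelock: an object that is either active or inactive."""

    def __init__(self, name):
        self.name = name
        self.active = False


class WakelockRegistry:
    """Hypothetical registry tracking every wakelock in the modeled system."""

    def __init__(self):
        self._locks = []

    def create(self, name):
        lock = Wakelock(name)
        self._locks.append(lock)
        return lock

    def may_suspend(self):
        # The system may be suspended only if no wakelock is active.
        return not any(lock.active for lock in self._locks)


registry = WakelockRegistry()
modem = registry.create("modem")

modem.active = True    # wakeup event signaled: block suspend
blocked = not registry.may_suspend()
modem.active = False   # event handed over to user space: allow suspend again
```

In this model the subsystem handling the event keeps its wakelock active for exactly the window in which the race described earlier could otherwise occur.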

Of course, processes using wakelocks can impact the system's battery life quite significantly, so the ability to use them has to be regarded as a privilege that should not be given unwittingly to all applications. Unfortunately, however, there is no general principle the system designer can rely on to figure out what applications will be important enough to the system user to allow them to use wakelocks by default. Therefore, ultimately the decision is left to the user which, naturally, is only going to really work if the user is qualified enough to make the decision. Moreover, if the user is expected to make such a decision, they should be informed exactly of its possible consequences. The user should also be able to revoke the use of wakelocks from chosen applications at any time. On Android, though, at least up to and including version 2.2, that simply doesn't happen.

Apart from this, some advertised features of applications don't really work on Android because of its use of opportunistic suspend. Namely, some applications are supposed to periodically check things on remote Internet servers. For this purpose they need to run when it is time to make their checks, but they obviously aren't running when the system is in a sleep state. Thus the periodic checks the applications are supposed to make aren't really made at that time. In fact, they are only made when the system happens to be in the working state for another reason at the moment a check is due. This most likely is not what the users of the affected applications would have expected.

Timekeeping issues

There is one more problem with full system suspend that is related to time measurements, although it is not limited to the opportunistic suspend initiated from kernel space. Namely, every suspend-resume cycle, regardless of the way it is initiated, introduces inaccuracies into the kernel's timekeeping subsystem. Usually, when the system goes into a sleep state, the hardware that the kernel's timekeeping subsystem relies on is powered off, so it has to be reinitialized during a subsequent system resume. Then, among other things, the global kernel variables representing the current time need to be readjusted to keep track of the time spent in the sleep state. This involves reading the current time value from a persistent clock which typically is much less accurate than the clock sources used by the kernel in the system's working state. So that introduces a random shift of the kernel's representation of current time, depending on the resolution of the persistent clock, during every suspend-resume cycle. Moreover, kernel timers used for scheduling the future execution of work inside of the kernel also are affected by this issue in a similar way. In consequence, the timing of some events in a suspending and resuming system is different from their analogous timing without a suspend-resume cycle.
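
To get a feeling for the size of that shift, here is a small numerical sketch. It assumes, purely for illustration, that the persistent clock reports time rounded down to whole ticks of its resolution; real persistent clocks and the kernel's resume code are more involved, and the function names are hypothetical.

```python
import math


def persistent_read(true_time, resolution):
    """Time reported by a coarse persistent clock (assumed to round down)."""
    return math.floor(true_time / resolution) * resolution


def sleep_time_error(suspend_at, resume_at, resolution):
    """Difference between the measured and the true time spent asleep."""
    measured = (persistent_read(resume_at, resolution)
                - persistent_read(suspend_at, resolution))
    return measured - (resume_at - suspend_at)


# Suspend at t = 100.4 s, resume at t = 113.1 s, 1-second persistent clock:
# the clock reports 100 and 113, so 13 s are accounted for instead of 12.7 s.
error = sleep_time_error(100.4, 113.1, 1.0)
```

Under this assumption the error of a single cycle is always smaller than the clock's resolution, but its sign and size depend on when exactly the suspend and resume happen, which is why it looks random from the kernel's point of view.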

If system suspend is initiated by user space, the kernel may assume that user space is ready for it and is somehow prepared to cope with the consequences. For example, it may want to use settimeofday() to set the kernel's wall-clock time using a time value taken from an NTP server right after the subsequent system resume. On the other hand, if system suspend is started by the kernel in an opportunistic fashion, user space doesn't really have a chance to do anything like that.

For this reason, one may think that it's better not to suspend the system at all and use the cpuidle framework for the entire system power management. This approach appears to allow some systems to be put into a low-power state resembling a sleep state. However, it may not guarantee that the system will be put into that state sufficiently often because of applications using busy loops to excess and kernel timers. PM quality of service (QoS) requests may also prevent cpuidle from using deep low-power states of the CPUs. Moreover, while only a few selected devices are enabled to signal wakeup during system suspend, the runtime power management routines that may be used by cpuidle for suspending I/O devices tend to enable all of them to signal wakeup. Thus the system wakes up from low-power states entered as a result of cpuidle transitions relatively more often than from "real" sleep states, so its ability to save energy is limited. This basically means that cpuidle-based system power management may not be sufficient to save as much energy as opportunistic suspend on the same system.

The alternative implementation

Even if opportunistic suspend is not going to be used on a given system, it generally makes sense to suspend the system sometimes, for example when its user knows in advance that it will not need to be in the working state in the near future. However, the problem of possible races between the suspend process and wakeup events, addressed on Android with the help of the wakelocks framework, affects all forms of system suspend, not only the opportunistic one. Thus this problem should be addressed in general and it is not really convenient to simply use Android's wakelocks for this purpose, because that would require all of user space to be modified to use wakelocks. While that may be good for Android, whose user space already is designed this way at least to some extent, it wouldn't be very practical for other Linux-based systems, whose user space is not aware of the wakelocks interface. This observation led to the kernel patch that introduced the wakeup events framework, which was shipped in the 2.6.36 kernel.

This patch introduced a running counter of signaled wakeup events, event_count, and a counter of wakeup events whose data is being processed by the kernel at the moment, events_in_progress. Two interfaces have been added to allow kernel subsystems to modify these counters in a consistent way. pm_stay_awake() is meant to keep the system from suspending, while pm_wakeup_event() ensures that the system stays awake during the processing of a wakeup event.

In order to do that, pm_stay_awake() increments events_in_progress and the complementary function pm_relax() decrements it and increments event_count at the same time. pm_wakeup_event() increments events_in_progress and sets up a timer to decrement it and increment event_count in the future.
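
The bookkeeping described above can be modeled in a few lines of user-space code. This is a toy model, not the kernel's code: the timer is simulated with an explicit advance() method, and everything apart from the two counter names and the three function names taken from the text is an assumption made for the illustration.

```python
class WakeupCounters:
    """Toy model of the event_count / events_in_progress bookkeeping."""

    def __init__(self):
        self.event_count = 0
        self.events_in_progress = 0
        self._now_ms = 0
        self._timers = []  # expiry times (ms) set up by pm_wakeup_event()

    def pm_stay_awake(self):
        # A wakeup event is being processed: keep the system awake.
        self.events_in_progress += 1

    def pm_relax(self):
        # Processing finished: one fewer event in progress, one more counted.
        self.events_in_progress -= 1
        self.event_count += 1

    def pm_wakeup_event(self, msec):
        # Like pm_stay_awake(), but the matching pm_relax() step is
        # performed by a timer after msec milliseconds.
        self.events_in_progress += 1
        self._timers.append(self._now_ms + msec)

    def advance(self, msec):
        """Simulate the passage of time and fire the expired timers."""
        self._now_ms += msec
        expired = [t for t in self._timers if t <= self._now_ms]
        self._timers = [t for t in self._timers if t > self._now_ms]
        for _ in expired:
            self.pm_relax()
```

For example, after pm_wakeup_event(100) the model reports one event in progress until advance() has moved the simulated clock forward by 100 ms, at which point the event is added to event_count.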

The current value of event_count can be read from the new sysfs file /sys/power/wakeup_count. In turn, writing to it causes the current value of event_count to be stored in the auxiliary variable saved_count, so that it can be compared with event_count in the future. However, the write operation will only succeed if the written number is already equal to event_count. If that happens, another auxiliary variable events_check_enabled is set, which tells the PM core to check whether event_count has changed or events_in_progress is different from zero while suspending the system.
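
The read/write protocol of that file can be sketched with another toy model. It follows the description above only; the text does not say what a failed write does to events_check_enabled, so the model leaves the flag unchanged in that case (an assumption), and the class and method names are hypothetical.

```python
class WakeupCountFile:
    """Toy model of the /sys/power/wakeup_count semantics described above."""

    def __init__(self):
        self.event_count = 0
        self.events_in_progress = 0
        self.saved_count = 0
        self.events_check_enabled = False

    def read(self):
        return self.event_count

    def write(self, value):
        # The write succeeds only if no wakeup event has been counted
        # since the value was read.
        if value != self.event_count:
            return False  # assumption: the flag is left untouched on failure
        self.saved_count = value
        self.events_check_enabled = True
        return True

    def suspend_should_abort(self):
        # The check the PM core is asked to perform while suspending.
        return self.events_check_enabled and (
            self.event_count != self.saved_count
            or self.events_in_progress != 0
        )
```

A wakeup event arriving between read() and write() makes the write fail, so user space knows it has to look at the new event before trying to suspend again; an event arriving after a successful write makes suspend_should_abort() return True instead.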

This relatively simple mechanism allows the PM core to react to wakeup events signaled during system suspend if it is asked to do so by user space and if the kernel subsystems detecting wakeup events use either pm_stay_awake() or pm_wakeup_event(). Still, its support for collecting device statistics related to wakeup events is not comparable to the one provided by the wakelocks framework. Moreover, it assumes that wakeup events will always be associated with devices, or at least with entities represented by device objects, which need not be the case in all situations. The need to address these shortcomings led to a kernel patch introducing wakeup source objects and adding some flexibility to the existing framework.

Most importantly, the new patch introduces objects of type struct wakeup_source to represent entities that can generate wakeup events. Those objects are created automatically for devices enabled to signal wakeup and are used internally by pm_wakeup_event(), pm_stay_awake(), and pm_relax(). Although the highest-level interfaces are still designed to report wakeup events relative to devices, which is particularly convenient to device drivers and subsystems that generally deal with device objects, the new framework makes it possible to use wakeup source objects directly.

A "standalone" wakeup source object is created by wakeup_source_create() and added to the kernel's list of wakeup sources by wakeup_source_add(). Afterward one can use three new interfaces, __pm_wakeup_event(), __pm_stay_awake() and __pm_relax(), to manipulate it and, when it is not necessary any more, it may be removed from the global list of wakeup sources by calling wakeup_source_remove(). It can then be deleted with the help of wakeup_source_destroy(). Thus reported wakeup events need not be associated with device objects any more. Also, at the kernel level, wakeup source objects may be used to replace Android's wakelocks on a one-for-one basis because the above interfaces are completely analogous to the ones introduced by the wakelocks framework.

The infrastructure described above ought to make it easier to port device drivers from Android to the mainline kernel. It hasn't been designed with opportunistic suspend in mind, but in theory it may be used for implementing a very similar power management technique. Namely, in principle, all wakelocks in the Android kernel can be replaced with wakeup source objects. Then, if the /sys/power/wakeup_count interface is used correctly, the resulting kernel will be able to abort a suspend in progress in reaction to wakeup events in the same circumstances in which the original Android kernel would do that. Yet, user space cannot access wakeup source objects, so the part of the wakelocks framework allowing user space to manipulate them has to be replaced with a different mechanism implemented entirely in user space, involving a power manager process and a suitable IPC interface for the processes that would use wakelocks on Android.

The IPC interface in question may be implemented using three components: a shared memory location containing a counter variable referred to as the "suspend counter" in what follows, a mutex, and a condition variable associated with that mutex. Then, a process wanting to prevent the system from suspending will acquire the mutex, increment the suspend counter, and release the mutex. In turn, a process wanting to permit the system to suspend will acquire the mutex and decrement the suspend counter. If the suspend counter happens to be equal to zero at that point, the processes waiting on the condition variable will be unblocked. The mutex will be released afterward.
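
The suspend counter protocol can be sketched as follows. To keep the example self-contained it uses threads within one process in place of separate processes sharing memory, which is a simplification; the class and method names are hypothetical.

```python
import threading


class SuspendCounter:
    """Suspend counter guarded by a mutex with an associated condition variable."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition(threading.Lock())

    def block_suspend(self):
        # Acquire the mutex, increment the counter, release the mutex.
        with self._cond:
            self._count += 1

    def allow_suspend(self):
        # Acquire the mutex, decrement the counter and, if it has dropped
        # to zero, unblock the waiters; the mutex is released afterward.
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_until_zero(self, timeout=None):
        # What the power manager does: sleep until nobody blocks suspend.
        # Returns False if the timeout expires first.
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)
```

A real implementation spanning several processes would also have to decide what happens when a process dies while holding a nonzero contribution to the counter, which this sketch ignores.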

With the above IPC interface in place the power manager process can perform the following steps in a loop:

  1. Read from /sys/power/wakeup_count (this will block until the events_in_progress kernel variable is equal to zero).
  2. Acquire the mutex.
  3. Check if the suspend counter is equal to zero. If that's not the case, block on the condition variable (that releases the mutex automatically) and go to step 2 when unblocked.
  4. Release the mutex.
  5. Write the value read from /sys/power/wakeup_count in step 1 back to this file. If the write fails, go to step 1.
  6. Start suspend or hibernation and go to step 1 when it returns.
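
One pass through steps 1-6 might look like the sketch below. The two sysfs paths are real kernel interfaces, but the helper names, the injectable callables and the choice of suspend-to-RAM are assumptions made for the example; the sysfs helpers need root privileges and a kernel providing /sys/power/wakeup_count.

```python
import threading

WAKEUP_COUNT = "/sys/power/wakeup_count"
POWER_STATE = "/sys/power/state"


def read_wakeup_count():
    # Step 1: per the text, blocks while events_in_progress is nonzero.
    with open(WAKEUP_COUNT) as f:
        return f.read().strip()


def write_wakeup_count(value):
    # Step 5: the kernel rejects the write if event_count has changed.
    try:
        with open(WAKEUP_COUNT, "w") as f:
            f.write(value)
        return True
    except OSError:
        return False


def suspend_to_ram():
    # Step 6: request suspend-to-RAM; the write fails if suspend is aborted.
    try:
        with open(POWER_STATE, "w") as f:
            f.write("mem")
        return True
    except OSError:
        return False


def power_manager_cycle(cond, get_suspend_counter,
                        read_count=read_wakeup_count,
                        write_count=write_wakeup_count,
                        start_suspend=suspend_to_ram):
    """Run steps 1-6 once; return True if suspend was started."""
    count = read_count()                      # step 1
    with cond:                                # step 2
        while get_suspend_counter() != 0:     # step 3
            cond.wait()                       # releases and reacquires the mutex
    # step 4: the mutex is released on leaving the "with" block
    if not write_count(count):                # step 5
        return False                          # caller goes back to step 1
    start_suspend()                           # step 6
    return True
```

The power manager would simply call power_manager_cycle() in an endless loop, passing the condition variable and a function returning the current value of the suspend counter. The cond.wait() call reacquires the mutex before the counter is checked again, which corresponds to going back to step 2.
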

Of course, this design will cause the system to be suspended very aggressively. Although it is not entirely equivalent to Android's opportunistic suspend, it appears to be close enough to yield the same level of energy savings. However, it also suffers from a number of problems affecting Android's approach. Some of them may be addressed by adding complexity to the power manager and the IPC interface between it and the processes permitted to block and unblock suspend, but the others are not really avoidable. Thus it may be better to use system suspend less aggressively, but in combination with some other techniques described above.

Overall, while the idea of suspending the system extremely aggressively may be controversial, it doesn't seem reasonable to entirely dismiss automatic suspend as a valid power management measure. Many different operating systems do that and they achieve good battery life with the help of it. There don't seem to be any valid reasons why Linux-based systems shouldn't do that, especially if they are battery-powered. As far as desktop and similar (e.g. laptop or netbook) systems are concerned, it makes sense to configure them to suspend automatically in specific situations so long as system suspend is known to work reliably on the given configuration of hardware. The new interfaces and ideas presented above may be used to this end.



An alternative to suspend blockers

Posted Nov 24, 2010 22:58 UTC (Wed) by Cyberax (subscriber, #52523)

>Then, if the /sys/power/wakeup_count interface is used correctly, the resulting kernel will be able to abort suspend in progress in reaction to wakeup events in the same circumstances in which the original Android kernel would do that. Yet, user space cannot access wakeup source objects, so the part of the wakelocks framework allowing user space to manipulate them has to be replaced with a different mechanics implemented entirely in user space, involving a power manager process and a suitable IPC interface for the processes that would use wakelocks on Android.

Why can't we just create a '/dev/wakelock' device, which calls wakeup_source_create() when it is opened? And add a few ioctl()s to allow userspace to set descriptive names for wakelocks.

Additional userspace IPC framework looks almost exactly like the kernel API.

Or am I missing something?

An alternative to suspend blockers

Posted Nov 25, 2010 22:57 UTC (Thu) by rvfh (subscriber, #31018)

My thought exactly. This should also take care of processes crashing (and thus having their fds closed).
One could:
* open device on app start if privileges satisfied (so not all apps can)
* ioctl to lock/unlock
* close automatically unlocks

Maybe the wakelock name should just be the app name and PID?

Or are we missing something?

An alternative to suspend blockers

Posted Nov 28, 2010 0:06 UTC (Sun) by rjw@sisk.pl (subscriber, #39252)

Kernel developers are generally opposed to adding a separate /dev interface specifically for this purpose, mainly because it would only be useful to Android at this point (no one else seems to be interested in it, because user space on other systems would have to be modified to use this interface).

An alternative to suspend blockers

Posted Nov 28, 2010 20:26 UTC (Sun) by Cyberax (subscriber, #52523)

So? It'll still be _cleaner_ than a userspace IPC daemon which essentially does the same thing.

And not accepting a driver for being Android-specific - that's also strange. Anyway, when suspend blockers infrastructure is in place, all it takes to provide /dev/wakelocks is a small loadable module which can live out-of-tree.

An alternative to suspend blockers

Posted Dec 2, 2010 21:46 UTC (Thu) by oak (guest, #2786)

Reading this and the following article gave an immediate feeling of wakelocks being presented as an example of future "High-maintenance designs" mistakes...
