Linux power management: The documentation I wanted to read

This is an article from LWN (http://lwn.net/Articles/505683/), which I found very helpful in understanding Linux Power Management so it deserves to be saved here.


Last week we discussed three elements that might serve to guide the creation of introductory technical documentation. This week we put those elements to the test by using them to create some introductory documentation for Linux power management. For me, this exercise precisely answers the question "What were you looking for that you didn't find?", as it is the documentation I would have liked to read.

This documentation is necessarily incomplete, partly because my own experience is not yet broad enough to provide a comprehensive document, and partly because doing so might try the patience of the present readership. As such it stops short of delving into the details of hibernation and completely omits any treatment of quality-of-service and wakeup sources, all of which would have an important place in a more complete document. Fortunately there are still sufficient topics to showcase the presentation of structure, purpose, and examples.

Three perspectives on Linux power management

The power management infrastructure in Linux is quite complex, but hopefully not intractably so. To get a handle on this complexity it is helpful to view it from three different perspectives. The first perspective highlights the different holistic states of the system which roughly divide into "in use", "not in use", and "indefinitely not in use", corresponding to "run time power management", "suspend" and "hibernate". One of the distinctions between these is the size of the power switch. The first uses lots of little power switches at different times, while the last turns off everything all at once (except maybe a real-time clock or similar).

The second of these states is somewhat harder to define. It covers a range of states which are not easy to clearly differentiate. At one end of the spectrum we have the traditional "suspend" mode of a laptop, which is almost like hibernation but uses a little more power and is a little quicker to get into and out of. Once the laptop has entered suspend it really must stay there using minimal power until it is explicitly wakened, as it might have been placed in a padded case for transport and any increase in power usage could result in over-heating and damage. This state is often entered with help from BIOS firmware so, to the OS, it is a bit like a single power switch which transitions from "on" to "suspend".

At the other end of the spectrum is the way that "suspend" is used in the Android mobile platform and similar devices. These devices are expected to wake up spontaneously for various reasons, whether due to an incoming phone call, a reminder alarm, or just a periodic check for new email or software updates. Management of power and temperature is generally better than in notebooks, so the risk of over-heating is not present. There is normally little or no firmware and the entire power-management transition is handled by the OS, so it is responsible for turning off each individual device in the correct order, and then restoring them again later.

Between these extremes of a light hibernation and a heavy snooze there is room for other possibilities. A server might use a BIOS-based suspend to save power after arranging for wake-ups via wake-on-LAN or a realtime clock alarm. This can be seen as a deeper sleep than an Android phone normally enters, but not as deep as the laptop in its padded case. The "suspend" mode in Linux attempts to cater to all of these and that flexibility leads to some of the complexity.

The second perspective highlights the broad variety in components that need to be managed. Some, like rotating disk drives, have a high cost in power and time for turning off and on again, while others like an LED have essentially no cost. Some, such as a UART, need to either be off or sufficiently on to be able to accept full-rate data at any moment. Others, such as USB, can enter intermediate states where they can receive external signaling, but are free to take some time to fully wake up.

Other sources of variability include the level of independence from other devices, the degree of involvement of user-space in management of the device, and how power is routed - whether through the same bus that commands and data flow, or through some separate regulator or "power domain". These are just some of the ways that devices can vary and thus some of the issues that Linux power management needs to be prepared for.

The final perspective highlights the different stages on the way towards a low-power state, and on the way back to full functionality. The key elements of the low-power transition are to move all relevant components to a quiescent state, to record that state, then to stop powering some or all of the components; similar elements apply on the way back up. Managing all of the aforementioned complexity through this simple transition means that we have quite a few stages, as we will shortly see.

Two Understandings

Part of understanding the solution to managing this complexity is understanding the balance that has been chosen between a "mid-layer" solution and a "library" solution. That is, how much responsibility for correct behavior and sequencing is taken by the core code and imposed on the drivers, and how much of the responsibility is left in the hands of the drivers. Centralizing responsibility tends to be safe but inflexible, while distributing it is risky but versatile. Linux power management takes a middle road, so it is important to understand where each responsibility lies.

The main imposition made by the PM core is the overall sequencing of suspend and resume. Allowing individual drivers to take a more active role in this process would probably require a general dependency solver and would undoubtedly make debugging a lot harder. In contrast, choices that are local to a specific device, such as timeouts before power management activates, or the use of a separate thread for performing power management actions, are actively supported by the core without being imposed on drivers that don't want them.

One other imposition, which will be raised again later, involves interaction with interrupts. The PM core strongly encourages a specific sequencing, but does provide hooks for a driver to escape it if absolutely necessary.

Understanding Linux power management also requires knowing how devices are classified in Linux. The most obvious classification is shown by the "subsystem" link that can be found in the sysfs entry for the device. This points to either a "bus" or a "class" that the device belongs to. This subsystem roughly describes the interface that the device provides. Together with this can be a "device type" which allows further specialization. A simple example is that members of the class "block" - which are block devices such as disk drives - can be of type "disk" or type "partition", reflecting the fact that both the whole device and each individual partition are block devices, but that they do have some specific behaviors that are quite different.

Finally each device can have a "power domain" (or pm_domain). This is an abstraction that is currently only used for ARM SoC modules and represents the fact that different collections of devices within the SoC can be powered on or off together, thus the power domain may need to know when each device changes power state so it can re-evaluate or adjust the overall state of the domain.

These classifications are used to direct all the power management calls that are described below. If a device has a power domain, it gets to handle the call. If not, but the type, class, or bus declares any PM operations, those operations get to handle the call, otherwise the call is handled by the driver for the device. The PM core doesn't attempt to call all of the possible handlers for a particular device, just the first that is found. This is an example of distribution of responsibility. The first handler has the freedom to call more specific handlers, or to do all the work itself, and equally has the responsibility to ensure all required handlers are called.

For example, the power domain handler for the OMAP platform (in arch/arm/plat-omap/omap_device.c) calls the driver-specific handler (bypassing any subsystem handlers) before or after doing any OMAP-specific handling. The MMC bus handlers call into driver-specific handlers which are stored in a non-standard location - presumably for historical reasons.
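The dispatch rule described above can be sketched as a small userspace model. This is only an illustration: the structure and function names (pm_ops_model, device_model, pick_suspend_callback) are hypothetical and are not the kernel's own definitions, and the relative order of type, class, and bus used here is an assumption of the sketch. It also simplifies by skipping any level that does not supply the callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel's structures. */
typedef int (*pm_callback_t)(void);

struct pm_ops_model {
    pm_callback_t suspend;  /* one callback is enough to show the rule */
};

struct device_model {
    const struct pm_ops_model *pm_domain;
    const struct pm_ops_model *type;
    const struct pm_ops_model *class_ops;
    const struct pm_ops_model *bus;
    const struct pm_ops_model *driver;
};

/* Sample callbacks; the return value only identifies which level ran. */
int domain_suspend(void) { return 1; }
int bus_suspend(void)    { return 2; }
int driver_suspend(void) { return 3; }

/*
 * Return the single callback that would be invoked: the first level, in
 * priority order, that supplies one. Lower levels are NOT called on the
 * caller's behalf; the chosen handler must call them itself if needed.
 */
pm_callback_t pick_suspend_callback(const struct device_model *dev)
{
    size_t i;

    assert(dev != NULL);

    const struct pm_ops_model *levels[] = {
        dev->pm_domain, dev->type, dev->class_ops, dev->bus, dev->driver,
    };

    for (i = 0; i < sizeof(levels) / sizeof(levels[0]); i++)
        if (levels[i] != NULL && levels[i]->suspend != NULL)
            return levels[i]->suspend;
    return NULL;  /* nobody handles this call */
}
```

With a bus and a driver present, the bus callback is chosen and the driver's is never called by the core. Adding a power domain moves the choice to the domain, which is why a handler such as the OMAP one has to call the driver-specific handler itself.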

With these perspectives and understandings in place, we can move on to some specifics.

Runtime PM

Runtime power management has the fewest states and so is probably the best place to start digging into details. This is the part of Linux PM that manages power for individual devices without taking the whole system into a low-power state.

In this case the most interesting stage of the transition to lower power is "move to quiescent state". Once that is done there is one method call (runtime_suspend()) which combines "record current state" and "remove power", and another (runtime_resume()) which must restore power and reload any needed device state.

For runtime PM, the "move to quiescent state" transition is a cause, not an effect - the new state isn't requested, it is simply noticed. The PM core keeps track of the activity of each device using two counters and an optional timer. One counter (usage_count) counts active references to the device. These may be external references such as open file handles, or other devices that are making use of this one, or they may be internal references used to hold the device active for the duration of some operation. The other counter (child_count) counts the number of children that are active. The timer can be used to add a delay between the counters reaching zero and the device being considered to be idle. This is useful for devices with a high cost for turning on or off.

This "autosuspend" timer is not widely used at present, with only nine drivers calling pm_runtime_put_autosuspend() to start the timer, while 14 call pm_runtime_set_autosuspend_delay() which sets the timeout (though that can be set via sysfs). One user is the omap_hsmmc driver for the High Speed Multi-Media Card interface in OMAP processors. It sets a 100ms delay before declaring a device to be truly idle, presumably due to costs in activating and deactivating the cards.
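The bookkeeping described above, two counters plus an optional delay, can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the PM core's implementation: the struct rpm_model, its fields delay_ms and quiet_since_ms, and the rpm_* helpers are hypothetical names, and time is passed in explicitly as a millisecond value instead of coming from a kernel timer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-device runtime PM bookkeeping. */
struct rpm_model {
    int  usage_count;     /* active references to the device */
    int  child_count;     /* number of active children */
    long delay_ms;        /* optional idle delay; 0 means no timer */
    long quiet_since_ms;  /* when both counters last reached zero */
};

/* Remember the moment the device stopped being referenced at all. */
static void rpm_note_if_quiet(struct rpm_model *d, long now_ms)
{
    if (d->usage_count == 0 && d->child_count == 0)
        d->quiet_since_ms = now_ms;
}

/* Take a reference, e.g. a file handle being opened. */
void rpm_get(struct rpm_model *d)
{
    d->usage_count++;
}

/* Drop a reference taken with rpm_get(). */
void rpm_put(struct rpm_model *d, long now_ms)
{
    assert(d->usage_count > 0);
    d->usage_count--;
    rpm_note_if_quiet(d, now_ms);
}

/* A child of this device has gone inactive. */
void rpm_child_idle(struct rpm_model *d, long now_ms)
{
    assert(d->child_count > 0);
    d->child_count--;
    rpm_note_if_quiet(d, now_ms);
}

/* Idle means: both counters are zero and the delay has elapsed. */
bool rpm_is_idle(const struct rpm_model *d, long now_ms)
{
    if (d->usage_count != 0 || d->child_count != 0)
        return false;
    return now_ms - d->quiet_since_ms >= d->delay_ms;
}
```

With delay_ms set to 100, as in the omap_hsmmc example, a device whose last reference is dropped at time 1000 is not considered idle at 1050 but is at 1100. With a delay of 0 it would be considered idle as soon as the counters reach zero.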

The counter of active children can optionally be ignored when determining whether a device is idle. Normally the parent is needed to access the child - typically the parent is a bus sending commands to the child - so powering down the parent while children are active would be counterproductive. Sometimes it is useful though.

One good example is an I2C bus. I2C (inter-integrated circuit) is a very simple 2-wire bus for signaling between integrated circuits on a board. It doesn't carry power, only a clock signal and a bidirectional data signal. The bus is entirely master-driven. Slaves (which appear as children in the Linux device tree) cannot signal the master directly at all; they simply respond to commands from the master.

As an I2C controller is very cheap to turn on before a command is sent, and off after the response is received, there is no need to keep it powered just because its child (which could be a sensor that is monitoring the environment and may have a higher turn-on cost) is left on. Consequently some I2C controllers, such as i2c-nomadik and i2c-sh_mobile, use pm_suspend_ignore_children() to allow them to report as idle even when they have active children.
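The effect of ignoring the child counter can be shown with a small userspace model. This is a sketch only: node_model and the node_* helpers are hypothetical names, and the way an active child bumps its parent's child_count is a simplification of what the text describes, not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device in a parent/child hierarchy. */
struct node_model {
    struct node_model *parent;
    int  usage_count;      /* active references to this device */
    int  child_count;      /* number of active children */
    bool active;           /* is this device currently powered? */
    bool ignore_children;  /* set by controllers that are cheap to wake */
};

/* Power a device on and record it as an active child of its parent. */
void node_resume(struct node_model *n)
{
    assert(n != NULL);
    if (!n->active) {
        n->active = true;
        if (n->parent != NULL)
            n->parent->child_count++;
    }
}

/* Power a device off and drop it from its parent's active children. */
void node_suspend(struct node_model *n)
{
    assert(n != NULL);
    if (n->active) {
        n->active = false;
        if (n->parent != NULL)
            n->parent->child_count--;
    }
}

/* May this device be reported as idle? */
bool node_may_idle(const struct node_model *n)
{
    assert(n != NULL);
    if (n->usage_count > 0)
        return false;
    /* Active children only block idleness if they are not ignored. */
    return n->ignore_children || n->child_count == 0;
}
```

In this model an I2C controller with ignore_children set may go idle while its sensor child stays active, whereas an ordinary bus with one active child may not go idle until that child has been suspended.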

When a device is deemed to be idle by the above criteria its runtime_idle() method is called. This function will normally perform any further checks (as does usb_runtime_idle()) and possibly call pm_runtime_suspend() to initiate the change in power state. For a slight variation, lnw_gpio_runtime_idle() in the gpio-langwell.c driver doesn't call pm_runtime_suspend() directly but rather calls pm_schedule_suspend() with a 500ms delay. Presumably this design predates the introduction of the autosuspend feature.

There is one class of devices that does not follow this structure for power management, and that is the CPU. The general pattern of entering a quiescent state, recording state information, and reducing power usage is the same, however the particular implementation is vastly different. This is partly due to the uniquely central role that the CPU plays, and partly due to the fact that a CPU typically has many more levels and styles of power saving. Runtime PM for the CPU is implemented using the cpuidle, cpufreq, and CPU hotplug subsystems, which will not be discussed further here; see this article for an introduction to cpuidle.

System Suspend

It can be helpful to view the "suspend" process as forcing all devices into a quiescent state, and then simply allowing runtime power management to put them all to sleep. The last to go to sleep would be the CPU (or CPUs) under the guidance of "cpuidle". While this isn't the way it is actually implemented, it provides a perspective which exposes the relationship between suspend and runtime PM quite well.

There are several reasons for not implementing it this way. Possibly the most unavoidable is that PM_RUNTIME and SUSPEND are separate kernel config options and there is a desire to keep it that way, so neither can rely on the other being present. There is also the fact that a BIOS (such as ACPI) might be involved in one or the other and may impose different handling requirements. Finally, individual drivers might want to make different decisions based on what sort of power management is happening, so it is generally best to tell them what is actually happening, rather than pretending that one thing is a form of another.

Forcing devices into a quiescent state has an important difference from just allowing them to get there on their own - any interdependencies between devices need to be explicitly handled. Linux PM has chosen to manage this by having a clear sequence of steps for transitioning to low power, and an explicit ordering of devices so that they make each step in a well defined order.

The ordering (stored in dpm_list linked through dev->power.entry) is normally the order in which devices are registered, with new devices added to the end, thus being after any devices that they depend on. However it is possible to reorder the list using device_move() which gives a device a new parent, and can place it directly after that parent, or at the end of the list. For example, when an rfcomm tty-over-bluetooth device is opened, a bluetooth connection is created and the tty device is reparented under the relevant bluetooth device and placed at the end of the device list.

The first stage of suspend, after some preliminaries like calling "sync" to flush out dirty data and switching to a separate virtual console, is to move all processes into a quiescent state. Devices which interact closely with processes need a chance to have one last chat before their process goes to sleep and this is achieved by registering a "notifier" which gets called before processes are put to sleep, and again when they are woken up.

This is variously used to:

  • load firmware in case it will be needed during resume

  • copy device state to swappable memory, which may be needed when the device state can be enormous, such as video RAM (drivers/gpu/drm/vmwgfx)

  • avoid deadlock when interacting with sysfs (drivers/acpi/battery.c)

  • preemptively remove devices that might be removed while the system is suspended, so appropriate cleanup can happen (drivers/mmc/core)

and a few other minor tasks. Once these notifiers run, all processes are sent a special signal which results in them being moved to the "freezer" where they are forced to wait for system resume to happen.

Once all processes are quiescent, the next step is to instruct all devices to also become quiescent. To do this we need to walk the list in reverse order, putting children to sleep before their parents - as the parent may be needed to help put the child to sleep. However as a new child could be born at any moment (e.g. due to a device being plugged in), and as children get added to the end of the list, we might miss some children on the first pass. To avoid this, the PM core makes two passes over the list. The first pass starts at the beginning and simply asks devices to stop adding children by calling their "prepare()" method. If children are born during this time they can only be added after the current pointer in the list, and so will not be missed. Once this is complete we know that no new devices will be added, so the list is walked in the reverse order calling the "suspend" method.
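The two-pass walk described above can be modelled in userspace with an array standing in for the device list. This is a minimal sketch, assuming a fixed-size list and a hotplug hook that is invoked once per device during the first pass. All names (dpm_model, dpm_add, dpm_prepare_all, dpm_suspend_all, sample_hotplug) are hypothetical and do not correspond to functions in the PM core.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS 16

enum dev_state { DEV_ACTIVE, DEV_PREPARED, DEV_SUSPENDED };

/* Hypothetical model of the device list, in registration order. */
struct dpm_model {
    int parent[MAX_DEVS];         /* index of the parent, or -1 */
    enum dev_state state[MAX_DEVS];
    int count;
    int suspend_order[MAX_DEVS];  /* device indices, in suspend order */
    int suspended;
};

/*
 * Append a device to the end of the list. Returns its index, or -1 if
 * the list is full or the parent has already been told to stop adding
 * children.
 */
int dpm_add(struct dpm_model *l, int parent)
{
    if (l->count >= MAX_DEVS)
        return -1;
    if (parent >= 0 && l->state[parent] != DEV_ACTIVE)
        return -1;
    l->parent[l->count] = parent;
    l->state[l->count] = DEV_ACTIVE;
    return l->count++;
}

/* Sample hotplug event: a child of device 1 appears while device 0 is
 * being prepared. */
void sample_hotplug(struct dpm_model *l, int being_prepared)
{
    if (being_prepared == 0 && l->count > 1)
        dpm_add(l, 1);
}

/*
 * Pass 1: walk forwards. New devices land after the cursor, and
 * l->count is re-read on every iteration, so they are still visited.
 */
void dpm_prepare_all(struct dpm_model *l,
                     void (*hotplug)(struct dpm_model *, int))
{
    for (int i = 0; i < l->count; i++) {
        if (hotplug != NULL)
            hotplug(l, i);
        l->state[i] = DEV_PREPARED;
    }
}

/* Pass 2: walk backwards, so children are suspended before parents. */
void dpm_suspend_all(struct dpm_model *l)
{
    for (int i = l->count - 1; i >= 0; i--) {
        assert(l->state[i] == DEV_PREPARED);
        l->state[i] = DEV_SUSPENDED;
        l->suspend_order[l->suspended++] = i;
    }
}
```

In this model a child that arrives during the first pass is appended after the cursor and therefore still gets prepared. Once its parent is prepared, further additions are refused, so the reverse pass sees a stable list.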

The "suspend" method is actually three separate methods, suspend(), suspend_late(), and suspend_noirq(), which can share among themselves the three tasks of making the device quiescent, saving any state, and reducing power usage. How much of which task is allocated to which methods is largely up to each driver, providing that the division works with the calling patterns of the three methods.

Calls to these methods are made to all devices in child-before-parent order and the sets of calls are interleaved with system-wide suspend operations, made largely through the suspend_ops dispatch table. The ordering is roughly:

  • system wide begin()
  • per-device suspend()
  • system wide prepare()
  • per-device suspend_late()
  • system wide disable (almost) all interrupt handlers
  • per-device suspend_noirq()
  • system wide prepare_late()
  • disable nonboot CPUs
  • syscore_suspend()
  • system wide enter()

Note that it is possible for the sequence from system wide prepare() onwards to be repeated (after being unwound by corresponding "resume" actions) without going all the way up to fully awake and starting the sequence from the top. This happens if the suspend_again() suspend operation requests it. Currently this is only requested by the charger manager which often needs to wake up parts of the system to check battery charging state, without wanting the cost of a full wakeup.
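The ordering listed above, together with the repetition that suspend_again() can request, can be written down as a small trace generator. This is purely a model of the sequence as described in the text: the enum values and build_suspend_trace() are hypothetical names, and the "resume" actions that unwind each repetition are deliberately left out of the trace.

```c
#include <assert.h>
#include <stddef.h>

/* One value per step of the list above, in the same order. */
enum pm_step {
    STEP_BEGIN,
    STEP_DEV_SUSPEND,
    STEP_PREPARE,
    STEP_DEV_SUSPEND_LATE,
    STEP_IRQS_OFF,
    STEP_DEV_SUSPEND_NOIRQ,
    STEP_PREPARE_LATE,
    STEP_CPUS_OFF,
    STEP_SYSCORE_SUSPEND,
    STEP_ENTER
};

/*
 * Fill 'out' with the suspend steps, repeating the part from the
 * system-wide prepare() onwards 'again' extra times, as if
 * suspend_again() had asked for that many repeats.
 * Returns the number of steps written, or 0 if 'cap' is too small.
 */
size_t build_suspend_trace(enum pm_step *out, size_t cap, unsigned again)
{
    static const enum pm_step once[] = {
        STEP_BEGIN, STEP_DEV_SUSPEND,
    };
    static const enum pm_step repeatable[] = {
        STEP_PREPARE, STEP_DEV_SUSPEND_LATE, STEP_IRQS_OFF,
        STEP_DEV_SUSPEND_NOIRQ, STEP_PREPARE_LATE, STEP_CPUS_OFF,
        STEP_SYSCORE_SUSPEND, STEP_ENTER,
    };
    const size_t n_once = sizeof(once) / sizeof(once[0]);
    const size_t n_rep = sizeof(repeatable) / sizeof(repeatable[0]);
    size_t n = 0;

    assert(out != NULL);
    if (n_once + n_rep * ((size_t)again + 1) > cap)
        return 0;

    for (size_t i = 0; i < n_once; i++)
        out[n++] = once[i];
    for (unsigned r = 0; r <= again; r++)
        for (size_t i = 0; i < n_rep; i++)
            out[n++] = repeatable[i];
    return n;
}
```

The point of the model is that begin() and the per-device suspend() calls appear only once, however many times the lower part of the sequence is repeated.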

Deducing the purpose of these method calls by looking for example usage in the code is problematic for a number of reasons.

  • For the system-wide operations (begin(), prepare(), prepare_late()), there are few users and those that exist do not make their purpose clear to an untrained observer. The most complete user is ACPI, so possibly a full understanding of that specification would help. Unfortunately that is beyond the scope of this article (and of this author).

    In general, ACPI recommends specific procedures for entering and leaving system sleep states (such as suspend) and Linux PM was modeled on that and then adjusted to meet broader needs. For example, prepare_late() was added to resolve a conflict between the needs of ACPI and the needs of the ARM platform.

  • For the per-device operations, suspend_late() was only recently added (commit cf579dfb82550e3) and there are no users in Linux-3.4. So any examples we find may be working around the absence of suspend_late() and so should not be copied.

  • The initial reason for producing this document was finding code in drivers that simply wasn't working correctly and trying to understand what "correctly" might mean. Those drivers clearly cannot be used as good examples and there is evidence that other drivers aren't always doing the right thing, so any example may be equally faulty.

Examining the documentation brings a little more useful information.

  • suspend() should leave the device in a quiescent state

  • suspend_late() can often be the same as runtime_suspend() (see also commit cf579dfb82550e3)

  • suspend_noirq() happens after interrupts are disabled and is useful when shared interrupts are used as you can be certain that the interrupt handler will not be called after suspend_noirq() runs. Some interrupts, such as the timer interrupt, are not disabled.

One observation from the code that seems to be important before we try to paint the big picture is that, after calling the suspend() method on a device, runtime power management is disabled. The purpose of this seems to be to stop runtime PM from racing with system suspend PM - we really don't want two threads trying to power off a device at once, and this is the interlock that prevents that. It also prevents runtime PM from powering the device back on again, so any device that might be needed in the late states of power management needs to be left on when runtime PM is disabled.

Tying all these threads together we get that:

  • The suspend() method should cause the device to stop doing anything, and enter a state much like it would be just before runtime PM might decide to turn it off. So it should wait for any DMA requests to complete and ensure new ones won't start. It should stop transmitting information and ensure that incoming information is either ignored, or triggers a wake-from-suspend (possibly marking the interrupt for wakeups). It should cancel any timers and generally prepare for nothing to happen for a while.

    If the device might be needed to power down other devices, such as an I2C controller that might be needed to tell some regulator to turn off, then the device should be activated for runtime PM purposes so that it will still be active when runtime PM is disabled.

    Part of the task of ignoring incoming information is to ensure that no new children will be created, much like the prepare() method does. Having new devices appear after suspend() would be awkward.

  • The suspend_late() method should power off the device in much the same way that runtime_suspend() does, and it may be exactly the same routine as runtime_suspend(). Occasionally preparing the device to wake up may differ between the system suspend and runtime PM cases. This would be one situation where suspend_late() might need to be different from runtime_suspend().

    The only case where suspend_late() should not be used is where interrupts might still be delivered, but the interrupt handler cannot tolerate the device being off. In many cases the suspend() routine will have put the device in a state in which it will not generate interrupts. Likely exceptions to this are when the interrupt line is shared, or when the device supports wake-from-suspend and so deliberately does not disable interrupts.

    If the platform that the device runs on uses BIOS support to enter suspend, then it is possible that this support will power off the device, so suspend_late() does not need to bother. If it doesn't, it could still be that the device gets powered off by instructing the BIOS to effect the state change, and it may require different power-off procedures for runtime PM and for entering suspend. If this is the case, then suspend_late() will quite likely be very different from runtime_suspend().

  • The suspend_noirq() method is an alternative to suspend_late() but is run without interrupts enabled. It is unlikely that any driver will provide both methods.

    Having interrupts disabled means not only that an interrupt will not occur at an awkward time, but also that any functionality that requires interrupts will not work. So if the driver uses an I2C bus or similar to tell the device to turn off, and if the I2C bus uses interrupts to indicate completion (which is normal), then either the device must be powered off in suspend_late(), or the I2C interrupt must be marked IRQF_NO_SUSPEND.
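The choice between powering off in suspend_late() and in suspend_noirq(), as described in the last two items, can be condensed into a small decision function. This is a sketch of the reasoning only: poweroff_needs, its fields, and phase_is_safe() are hypothetical names invented for illustration, and a real driver would make this decision at design time, not at run time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum poweroff_phase {
    POWEROFF_IN_SUSPEND_LATE,   /* interrupts still enabled */
    POWEROFF_IN_SUSPEND_NOIRQ   /* (almost) all interrupts disabled */
};

/* Hypothetical summary of what a driver knows about its device. */
struct poweroff_needs {
    /* Powering off goes over a bus (e.g. I2C) that signals completion
     * with an interrupt. */
    bool transport_uses_irq;
    /* That bus interrupt is marked IRQF_NO_SUSPEND. */
    bool transport_irq_no_suspend;
    /* The device's interrupt handler cannot cope with the device
     * being off. */
    bool handler_needs_device_on;
    /* Interrupts may still arrive after suspend(): the line is shared
     * or is deliberately left enabled for wake-from-suspend. */
    bool irq_may_still_fire;
};

/* Is it safe to power the device off in the given phase? */
bool phase_is_safe(enum poweroff_phase phase,
                   const struct poweroff_needs *n)
{
    assert(n != NULL);

    if (phase == POWEROFF_IN_SUSPEND_LATE)
        /* Unsafe only if an interrupt can arrive and the handler
         * would then touch a powered-off device. */
        return !(n->irq_may_still_fire && n->handler_needs_device_on);

    /* suspend_noirq(): the power-off itself must not depend on an
     * interrupt that has been disabled. */
    return !n->transport_uses_irq || n->transport_irq_no_suspend;
}
```

For a device that is switched off over an interrupt-driven I2C bus, the model reports suspend_late() as safe and suspend_noirq() as unsafe unless the bus interrupt is marked IRQF_NO_SUSPEND. For a device on a shared interrupt line whose handler needs the device powered, the answer is the other way round.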

Paired with each of these methods is a method that is called when returning back towards full functionality: resume_noirq(), resume_early(), and resume(). These simply do the reverse of what the corresponding "suspend" function did.

Closing

Structure, purpose, and examples - these seem to be the elements that distinguish good documentation and enable the reader not just to collect knowledge but to gain understanding. I'll leave you, dear reader, to be the judge of whether their presence here is sufficient to bring an understanding of power management, or indeed an understanding of quality documentation.

I would like to thank Rafael Wysocki for valuable review of an early draft of this article.
