Linux Power Management for x86 CPU ---- C-State

Modern CPUs are increasingly powerful, and whenever a CPU has no job to do
it sits in the idle state. During its idle period we can certainly cut off
its power and put it into a low-power state, provided we know when a new
assignment arrives so that we can re-activate the CPU and have it do its
jobs again. The process looks like this:

               no job                 cut off power
CPU in active ----------> CPU in idle --------------> low-power state
      ^                                                      |
      |                                                      |
      |                      re-power up                     |
      +------------------------------------------------------+

To achieve the above goal, we need to answer the following questions:

1) How do we know the CPU is idle, so that we can cut off power?
2) How do we cut off power?
3) When and how do we re-power the CPU?

1.  When CPU is idle

-----------------

The answer to the first question is actually very simple: when it is idle,
the CPU runs the swapper process (process ID 0; it should probably be
called the idle thread, but swapper is the legacy name and all textbooks
call it that). So the CPU must be idle whenever it runs the swapper.

Traditionally, the swapper process does nothing. In an endless loop, it
checks whether there is another task to run; if not, it delays for a while
and checks again; otherwise, it tells the process scheduler to schedule the
other task. The code looks like this:

    while (1)
    {
        while (no_job_to_do)
        {
            delay for a while;
        }
        schedule_other_process;
    }

So, to cut CPU power, we change the above code to:

    while (1)
    {
        while (no_job_to_do)
        {
            cut_off_cpu_power;
            ...
        }
        schedule_other_process;
    }

2. How to Cut Off Power

-----------------------

Note that a CPU consists of many units: besides the core logic, it has
caches, a BIU (Bus Interface Unit) and a Local APIC. When a CPU is idle, we
can cut the clock signal and power from some of these units. The more units
are stopped, the more power is saved.

Cutting CPU power has a side effect that we need to consider: each unit
takes some time to power up again. So the more units are stopped, the
longer it takes for the CPU to be re-activated (woken up). We call this
time the entry/exit latency.

2.1  C-State

-------------

To find a balance between power saving and entry/exit latency, Intel CPUs
provide a number of low-power states called C-states, or sleeping states.
Depending on the CPU model, Intel CPUs support the C-states C1, C2, C3, C4,
C5, C6, ... (C0 is the active state). While in a sleeping state (C1 or
above), the CPU does not execute any instructions, but consumes less power.

    C0 - The CPU is fully powered and executes instructions;
    C1 - Stops the main internal core clocks;
    C2 - Has two sub-modes: Stop-Grant & Stop-Clock;
         While in C1/C2, the CPU still processes bus snoops and snoops from
         other cores. That means the CPU automatically exits C1/C2, handles
         the snoop and then returns to C1/C2 again.
    C3 - Flushes the cache, so the CPU does not need to exit C3 to handle
         snoops.
    C4 - For multi-core processors. For example, on Core 2 Duo, if both
         cores are in C4, the package enters a deeper sleep state.
    C5 - I don't know :)
    C6 - On Intel Core i7, the package enters an even deeper sleep if all
         cores are in C6, with some additional power saving from the QPI
         link.
    Cn - ...    Sigh~

Besides Cx, some Intel CPUs have enhanced CxE states. For example, Intel
Core 2 Duo introduced the enhanced C-states C1E, C2E, C3E and C4E. Compared
with the plain Cx states, the enhanced states have one additional feature:
they reduce the CPU voltage before entering the Cx state (in fact, the
voltage reduction is implemented on top of EIST/T-States).

2.2  HLT, P_LVLx and MWait

---------------------------

So how do we enter a particular C-state? Intel provides three methods.

2.2.1  HLT instruction

----------------------

As we know, Intel x86 has a HLT (halt) instruction. Since the 486DX4, this
instruction causes the CPU to enter the C1 or C1E state: if the BIOS has
enabled the C1E feature, the CPU enters C1E, otherwise it enters C1. The
BIOS enables C1E via an MSR. For example, on Intel Xeon 7000 the BIOS can
set bit 25 of IA32_MISC_ENABLE_MSR (MSR 0x1A0).

Note that HLT can be used for C1 entry only. That means you cannot make the
CPU enter C2 or above with HLT.
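As a small illustration of the C1E enable bit mentioned above, the sketch below tests and toggles bit 25 in a raw IA32_MISC_ENABLE value. It only manipulates a 64-bit number; actually reading or writing the MSR needs ring-0 access (or a facility such as the msr driver) and is not shown. The helper names c1e_enabled() and set_c1e() are made up for this example, and the bit position is taken from the text, so check the manual for your CPU model before relying on it.

```c
#include <stdbool.h>
#include <stdint.h>

/* MSR address and bit position as quoted in the text (Xeon 7000 example). */
#define MSR_IA32_MISC_ENABLE  0x1A0u
#define MISC_ENABLE_C1E_BIT   25

/* Hypothetical helper: true if the C1E bit is set in a raw MSR value. */
static bool c1e_enabled(uint64_t misc_enable)
{
    return (misc_enable >> MISC_ENABLE_C1E_BIT) & 1u;
}

/* Hypothetical helper: return the value with the C1E bit set or cleared. */
static uint64_t set_c1e(uint64_t misc_enable, bool on)
{
    uint64_t mask = UINT64_C(1) << MISC_ENABLE_C1E_BIT;

    return on ? (misc_enable | mask) : (misc_enable & ~mask);
}
```

With C1E enabled this way, a HLT executed by the idle loop would land in C1E instead of C1.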

2.2.2  P_LVLx I/O registers

----------------------------

Intel also defines the P_LVLx I/O registers (x is 2 ~ 5). An I/O read of a
P_LVLx register causes the CPU to enter a C-state. Generally P_LVL2 is for
C2, but which C-state a register maps to depends on the CPU model: P_LVL3
means C6 on Core i7, while P_LVL3 means C3 on Core 2 Duo.

2.2.3  Monitor/MWait instruction

--------------------------------

Besides the HLT instruction and the P_LVLx registers, Intel provides
another way to make the CPU enter a C-state: MWait. This instruction should
be used together with Monitor. Normally, we use the monitor instruction to
watch a range of memory, and then use mwait with some hints to make the CPU
enter a Cx state.

Without these instructions, when a CPU is in a sleeping state and other
CPUs want to wake it up, the only way is to send an IPI. However, an IPI is
an expensive operation that takes much more time than Monitor/MWait. With
the Monitor/MWait pair, other CPUs can wake up the sleeping CPU simply by
modifying the memory watched (monitored) by the sleeping CPU.
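The "hints" passed to mwait can be illustrated with a little bit arithmetic. The sketch below assumes the EAX hint layout documented in the Intel SDM (bits 7:4 hold the target C-state minus 1, bits 3:0 hold the sub-state); the function names are invented for this example, and the C-state numbers here are MWAIT C-state numbers, which do not necessarily match the marketing names of a given CPU.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build an MWAIT hint (the value loaded into EAX before mwait).
 * Assumed layout: bits 7:4 = target C-state minus 1, bits 3:0 = sub-state.
 * cstate is 1 for C1, 2 for C2, and so on.
 */
static uint32_t mwait_hint(unsigned cstate, unsigned substate)
{
    assert(cstate >= 1 && cstate <= 15 && substate <= 15);
    return ((cstate - 1) << 4) | substate;
}

/* Recover the C-state number from a hint (0xF in bits 7:4 wraps to C0). */
static unsigned mwait_hint_cstate(uint32_t hint)
{
    return (((hint >> 4) & 0xF) + 1) & 0xF;
}

/* Recover the sub-state from a hint. */
static unsigned mwait_hint_substate(uint32_t hint)
{
    return hint & 0xF;
}
```

For instance, under this layout a request for C1 sub-state 0 is the hint 0x00, and a request for C3 sub-state 1 is 0x21.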

3.  Re-activate CPU

-----------------------------

When a CPU runs the swapper process, there might still be some processes in
various wait queues of this CPU. Once the conditions they wait for change,
those processes become runnable again. Because they have already been
assigned to this CPU, the CPU must, before sleeping, be prepared to run
those waiting processes in the near future.

So what conditions can a process wait for? Time and/or interrupts. A
process can wait on a timer, on an interrupt, or on some event that will be
triggered during interrupt handling.

An Intel CPU returns to C0 from a sleeping state as soon as it receives an
interrupt, and timers are implemented via a hardware timer interrupt. So
the processes in the wait queues will be executed once they become runnable
(we skip the tickless kernel and the LAPIC timer stopping in C3 for the
time being).

In addition, other CPUs can assign jobs to an idle CPU and wake it up via
an interrupt or via the method provided by monitor/mwait.

4.  ACPI & C-State

-------------------

ACPI defines two methods (control interfaces) to control CPU C-states, and
the ACPI specification defines 3 C-states. Note that ACPI C-states are not
the same as Intel CPU C-states. For example, we can map Intel C1/C1E to
ACPI C1, Intel C2/C2E to ACPI C2, and Intel C3, C4, C5, C6 to ACPI C3.
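The example mapping above can be written down as a tiny lookup. This is only a sketch of the mapping given in the text, not a rule that every platform follows (the real mapping is whatever the firmware reports); the enum and function names are hypothetical.

```c
/* ACPI C-state types as used in the text. */
enum acpi_cstate_type { ACPI_C1 = 1, ACPI_C2 = 2, ACPI_C3 = 3 };

/*
 * Map an Intel C-state number to an ACPI C-state type, following the
 * example in the text: C1/C1E -> C1, C2/C2E -> C2, C3..C6 -> C3.
 * Returns 0 for C0 or for numbers outside 1..6.
 */
static int intel_to_acpi_cstate(int intel_cstate)
{
    if (intel_cstate < 1 || intel_cstate > 6)
        return 0;
    if (intel_cstate <= 2)
        return intel_cstate;
    return ACPI_C3;
}
```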

4.1.  P_LVLx registers in P_BLK

-------------------------------

In the DSDT table, each processor can optionally have a P_BLK register
block. For example,

    Processor (
        \_PR.CPU0,      // Namespace name
        1,
        0x120,          // P_BLK system I/O address
        6               // size of P_BLK
    )
    {...}

    P_LVL2:   P_BLK + 4, 1 byte, system I/O space;
    P_LVL3:   P_BLK + 5, 1 byte, system I/O space;

Reading P_LVL2 causes the CPU to enter the C2 state; reading P_LVL3 causes
the CPU to enter the C3 state.

In the FADT table, there are two fields that give the C2 and C3 entry/exit
latency respectively:

    FADT.P_LVL2_LAT   The worst-case hardware latency to enter/exit a
                      C2 state. A value > 100 indicates the system does
                      not support a C2 state.
    FADT.P_LVL3_LAT   The worst-case hardware latency to enter/exit a
                      C3 state. A value > 1000 indicates the system does
                      not support a C3 state.

Based on the entry/exit latency, the OS can select which C-state to enter
when the CPU is idle. The OS should select the deepest sleeping state
possible, so as to save more power. In fact, the hardware entry/exit
latency is only used as a reference point, and the OS adjusts the
entry/exit latency for each C-state at runtime.

When the CPU is idle, the OS checks the nearest impending timer, compares
the remaining interval with the C-state latencies, and selects one of the
C-states to enter.
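The selection rule described above can be sketched in a few lines of C. This is a deliberately simplified model, not what the kernel does: it takes the two FADT latencies (in microseconds, as in the ACPI specification), applies the "> 100" and "> 1000" validity thresholds quoted earlier, and picks the deepest state whose latency is shorter than the time until the next timer. All function and parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Thresholds quoted above: C2 is unsupported above 100, C3 above 1000. */
static bool fadt_c2_supported(uint32_t p_lvl2_lat)
{
    return p_lvl2_lat <= 100;
}

static bool fadt_c3_supported(uint32_t p_lvl3_lat)
{
    return p_lvl3_lat <= 1000;
}

/*
 * Return 3, 2 or 1: the deepest supported C-state whose worst-case latency
 * is shorter than the interval until the next timer fires. C1 is the
 * fallback because it is always available.
 */
static int pick_cstate(uint32_t p_lvl2_lat, uint32_t p_lvl3_lat,
                       uint32_t us_until_next_timer)
{
    if (fadt_c3_supported(p_lvl3_lat) && p_lvl3_lat < us_until_next_timer)
        return 3;
    if (fadt_c2_supported(p_lvl2_lat) && p_lvl2_lat < us_until_next_timer)
        return 2;
    return 1;
}
```

A real governor also weighs factors such as how long the CPU is expected to stay idle and past prediction accuracy, so treat this only as the basic idea.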

4.2. _CST & _CSD ACPI objects

-----------------------------

4.2.1 _PDC

----------

_PDC: the OS uses it to inform the platform of the level of CPU power
management support provided by the OS.

Note that the OS must use the _PDC/_OSC method to inform the platform of
the level of power management the OS can handle. Based on this information,
the ACPI firmware can return different values (packages) for _CST and _CSD.

4.2.2 _CST

----------

_CST: the platform declares the supported C-states. ACPI can define a _CST
object for a processor like this:

    Name (_CST, Package() {Count, CState, ..., CState}),  where
    CState: Package (Register, Type, Latency, Power)

For example,

    Processor (\_PR.CPU0, 1, 0x120, 6)
    {
        ...
        Name (_CST, Package()
        {
            4,      // the number of supported C-States
            Package(){ResourceTemplate(){Register(FFixedHW, 0, 0, 0)},     1,  20, 1000},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x161)}, 2,  40,  750},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x162)}, 3,  60,  500},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x163)}, 3, 100,  250}
        })
        ...
    }

In this example, CPU0 has 4 C-states: C1, C2 and two C3 states with
different latency and average power consumption.

    C1: FFixedHW, which means using the "halt" or "mwait" instruction to
        enter C1;
    C2: SystemIO, 8-bit size, so a byte read from I/O address 0x161 enters
        C2;

If a Cx state uses FFixedHW, we check whether the CPU supports the mwait
instruction. Executing CPUID with EAX = 0x05 returns a value in the EDX
register that tells us which C-states are supported by the mwait
instruction (including the number of sub-states of each C-state).
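To make the CPUID check concrete, here is a small decoder for the EDX value returned by CPUID leaf 0x05. It assumes the layout documented in the Intel SDM, where each 4-bit field of EDX gives the number of MWAIT sub-states for one C-state (bits 3:0 for C0, bits 7:4 for C1, and so on). The function name is made up for this sketch, and the sample value mentioned below is only illustrative.

```c
#include <stdint.h>

/*
 * Number of MWAIT sub-states supported for a given C-state, decoded from
 * the EDX value of CPUID leaf 0x05. Assumed layout: one 4-bit field per
 * C-state, starting with C0 in bits 3:0. A result of 0 means that C-state
 * is not available through mwait.
 */
static unsigned mwait_substates(uint32_t edx, unsigned cstate)
{
    if (cstate > 7)
        return 0;   /* only eight 4-bit fields fit in EDX */
    return (edx >> (cstate * 4)) & 0xF;
}
```

With an illustrative EDX value of 0x00001120, this reports two sub-states for C1, one for C2, one for C3, and none for C0 or C4. On GCC or Clang, the EDX value itself can be obtained with the __get_cpuid() helper from <cpuid.h>.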

4.2.3 _CSD

------------

_CSD: the platform provides C-state control cross-logical-processor
dependency information to the OS.

    CSDPackage: Package (CStateDep, ..., CStateDep),  where
    CStateDep:  Package (NumberOfEntries, Revision, Domain, CoordType,
                         NumProcessors, Index)

For example,

    Processor (\_SB.CPU0, 1, 0x120, 6)
    {
        Name (_CST, Package()
        {
            3,
            Package(){ResourceTemplate(){Register(FFixedHW, 0, 0, 0)},     1, 20, 1000},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x161)}, 2, 40,  750},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x162)}, 3, 60,  500}
        })
        Name (_CSD, Package()
        {
            Package(){6, 0, 0, 0xFD, 2, 1},
            // 6 entries, Revision 0, Domain 0, OSPM Coordinate
            // Initiate on Any Proc, 2 Procs, Index 1 (C2-type)
            Package(){6, 0, 0, 0xFD, 2, 2}
            // 6 entries, Revision 0, Domain 0, OSPM Coordinate
            // Initiate on Any Proc, 2 Procs, Index 2 (C3-type)
        })
    }

    Processor (\_SB.CPU1, 2, 0x130, 6)
    {
        Name (_CST, Package()
        {
            3,
            Package(){ResourceTemplate(){Register(FFixedHW, 0, 0, 0)},     1, 20, 1000},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x161)}, 2, 40,  750},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x162)}, 3, 60,  500}
        })
        Name (_CSD, Package()
        {
            Package(){6, 0, 0, 0xFD, 2, 1},
            // 6 entries (fields in this package), Revision 0,
            // Domain 0, OSPM Coordinate
            // Initiate on Any Proc, 2 Procs, Index 1 (C2-type)
            Package(){6, 0, 0, 0xFD, 2, 2}
            // 6 entries, Revision 0, Domain 0, OSPM Coordinate
            // Initiate on Any Proc, 2 Procs, Index 2 (C3-type)
        })
    }

I am copying the following words from the ACPI spec:

    OSPM can coordinate the transitions between logical processors,
    choosing to initiate the transition when doing so does not lead to
    incorrect or non-optimal system behavior. This OSPM coordination is
    referred to as Software Coordination. Alternately, it might be possible
    for the underlying hardware to coordinate the state transition requests
    on multiple logical processors, causing the processors to transition to
    the target state when the transition is guaranteed to not lead to
    incorrect or non-optimal system behavior. This scenario is referred to
    as Hardware (HW) coordination.
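One way to picture a _CSD entry is as a plain C struct with the six fields listed in this section. The sketch below is hypothetical (the struct and function names are not taken from the kernel). It decodes the CoordType field using the values from the ACPI specification: 0xFC for software coordination where all processors must initiate, 0xFD for software coordination where any processor may initiate (the value used in the example), and 0xFE for hardware coordination.

```c
#include <stdint.h>
#include <string.h>

/* One CStateDep package, field for field as listed in this section. */
struct csd_entry {
    uint32_t num_entries;
    uint32_t revision;
    uint32_t domain;
    uint32_t coord_type;
    uint32_t num_processors;
    uint32_t index;
};

/* Name the coordination type; values assumed from the ACPI specification. */
static const char *csd_coord_name(uint32_t coord_type)
{
    switch (coord_type) {
    case 0xFC: return "SW_ALL";
    case 0xFD: return "SW_ANY";
    case 0xFE: return "HW_ALL";
    default:   return "unknown";
    }
}

/* Two entries describe the same dependency if domain and index match. */
static int csd_same_dependency(const struct csd_entry *a,
                               const struct csd_entry *b)
{
    return a->domain == b->domain && a->index == b->index;
}
```

Applied to the example, the first _CSD package of CPU0 and the first _CSD package of CPU1 describe the same dependency: both are in domain 0 with index 1, coordinated as SW_ANY.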

5. Linux C-State Related Code

--------------------------

Linux has a global function pointer, pm_idle. If nobody changes it, it is
set to default_idle(). The routine default_idle() just executes the HLT
instruction to put the CPU into the halt state. If the CPU supports
C-states, this causes the CPU to enter C1, or C1E if the BIOS has enabled
the C1E feature.

In fact, many modules try to make pm_idle point to their own routine. For
example,

    APM                          apm_cpu_idle()        // legacy APM power management
    cpuidle                      cpuidle_idle_call()
    AMD CPU                      c1e_idle()            // AMD C1E acts like Intel C3
    CPU supporting MWait         mwait_idle()          // C1 only
    idle=poll by kernel param    poll_idle()           // no-op, no power reduction
    idle=halt by kernel param    default_idle()
    ...

The priority of the swapper process is very low; it executes only when
there is no other runnable process. Any runnable process can preempt the
swapper process. In an endless loop, the swapper process executes
cpu_idle() like this:

    void cpu_idle(void)
    {
        ...
        while (1) {
            while (!need_resched()) {
                local_irq_disable();
                pm_idle();
            }
            ...
            schedule();
            ...
        }
    }

5.1  Architecture Overview

--------------------------

Linux CPU C-state related modules/drivers are organized as follows:

                      ----------------
                      |    sysfs     |
                      ----------------
                             |
     ----------   --------   |
     | ladder |   | menu |   |
     ----------   --------   |
          |           |      |
     ------------------------------
     |   cpuidle infrastructure   |
     ------------------------------
                   |
                   |
     ------------------------------
     |    acpi-cpuidle driver     |
     ------------------------------
                   |
                   |
     ------------------------------
     | ACPI processor bus driver  |
     ------------------------------

5.1.1 Driver Register

-----------------------

In acpi_processor_init(), which is a module initialization routine called
by do_initcalls(), two related drivers are registered: the ACPI processor
bus driver and acpi_idle_driver. If you really want to look into it, follow
this path:

    kernel_init()
      ==> do_basic_setup()
        ==> do_initcalls()
          ==> ... acpi_processor_init();
            ==> cpuidle_register_driver(&acpi_idle_driver);
                acpi_bus_register_driver(&acpi_processor_driver);

The driver registration code is in drivers/acpi/processor_core.c.

notes:

    a) The cpuidle infrastructure is NOT a driver; it is initialized by
       core_initcall(). It provides:
         I)   a sysfs interface through which userland apps/users can
              check/switch the cpuidle governor:
              /sys/devices/system/cpu/(cpuX)/cpuidle/
         II)  interfaces for governor registration;
         III) interfaces for cpuidle devices and the cpuidle driver;
         IV)  setting the global pm_idle pointer to cpuidle_idle_call();
    b) acpi_idle_driver is registered with the cpuidle infrastructure,
       while acpi_processor_driver is registered with the ACPI subsystem
       as an ACPI bus driver;
    c) The cpuidle infrastructure allows only one driver to register; it
       keeps a global pointer to the registered driver (here
       acpi_idle_driver). Refer to cpuidle_register_driver() provided by
       the cpuidle infrastructure in drivers/cpuidle/driver.c;
    d) The ACPI processor driver registers a callback for CPU hotplug, so
       it is notified when a CPU goes online/offline.

5.1.2 Device Discovery & Register

---------------------------------

The ACPI subsystem parses the ACPI tables, and for each ACPI processor
object it calls the ACPI processor bus driver's add entry point,
acpi_processor_add(), to add an ACPI processor device.

After adding an ACPI processor device, the ACPI subsystem calls the
processor driver's start entry point, acpi_processor_start().

In acpi_processor_start(), the routine acpi_processor_power_init() is
called to evaluate _PDC and to read and parse _CST and _CSD (or to use the
FADT/MADT info) in order to initialize the processor's power state
information. It then calls cpuidle_register_device() to register a cpuidle
device with the cpuidle infrastructure.

For hotplug CPUs, during acpi_processor_init() execution the routine
acpi_processor_install_hotplug_notify() is called to register a CPU hotplug
callback. When a CPU comes online, acpi_processor_start() is executed.

Please note that several drivers operate on the same physical CPUs: besides
the cpuidle driver, there are other processor-related drivers such as the
T-state driver, the P-state driver, the CPU-hotplug infrastructure, etc.
The ACPI processor driver acts as a bridge/coordinator among those drivers.

5.1.3 Driver/Device attach

-----------------------

The ACPI subsystem registers processors with acpi_processor_driver. If/when
a registered CPU is online, the start entry point, acpi_processor_start(),
is called. This entry function does a lot of initialization work for
T-states, P-states and C-states. For now we only look at C-states, for
which it calls

    acpi_processor_power_init();
      ==> acpi_processor_get_power_info();
      ==> acpi_processor_setup_cpuidle();

The first routine evaluates _CST, or reads the FADT if _CST fails, to get
the C-state description from the ACPI tables. Refer to sections 4.1/4.2 to
see how the C-state information is handled.

The second routine sets up some information for each valid C-state. Note
that in most cases (no special kernel parameters, bus master handling,
etc.):

    C1, state->enter = acpi_idle_enter_c1;
    C2, state->enter = acpi_idle_enter_simple;
    C3, state->enter = acpi_idle_enter_bm;

The enter routine is used to enter the corresponding C-state.

5.1.4  Governor

-----------------

The cpuidle governors are simple to read and understand. Each governor
provides a rating and three main callbacks to the cpuidle infrastructure:

    rating
    enable()
    select()
    reflect()

Each governor has a rating in its structure. When governors are registered
with the cpuidle infrastructure by the routine cpuidle_register_governor(),
cpuidle selects the one with the highest rating unless the user has
specified one via the sysfs interface. The cpuidle_curr_governor pointer
points to the selected one.

Only one governor can be in use at a time. When the OS decides to put a CPU
into a C-state, it calls the select entry point of the current governor,
and the governor chooses one C-state according to its policy:

    cpuidle_idle_call()
    {
        next_state = cpuidle_curr_governor->select();
        target_state = &dev->states[next_state];
        dev->last_state = target_state;
        dev->last_residency = target_state->enter(dev, target_state);
        cpuidle_curr_governor->reflect();
    }
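The rating-based choice of governor described in this section boils down to "take the maximum". The following sketch shows that idea in isolation, ignoring the sysfs override; the struct, the function name and the rating numbers used in the usage note are all made up for illustration and are not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical, stripped-down governor descriptor: a name and a rating. */
struct governor {
    const char *name;
    unsigned rating;
};

/*
 * Return the governor with the highest rating, or NULL if the list is
 * empty. On a tie, the governor registered first wins.
 */
static const struct governor *pick_governor(const struct governor *govs,
                                            size_t count)
{
    const struct governor *best = NULL;

    for (size_t i = 0; i < count; i++) {
        if (best == NULL || govs[i].rating > best->rating)
            best = &govs[i];
    }
    return best;
}
```

For example, with a "ladder" entry rated 10 and a "menu" entry rated 20 (illustrative numbers), pick_governor() returns the "menu" entry.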

6. Linux Files related to C-States

----------------------------------

    drivers/acpi/processor_core.c
    drivers/acpi/processor_idle.c
    drivers/cpuidle/cpuidle.c
    drivers/cpuidle/driver.c
    drivers/cpuidle/governor.c
    drivers/cpuidle/sysfs.c
    drivers/cpuidle/governors/ladder.c
    drivers/cpuidle/governors/menu.c

7. Some Kernel Parameters

-------------------------------

    idle=poll            polling, always in C0, almost no power saving;
    idle=halt            use the HLT instruction only, only enter C1;
    idle=nomwait         don't use mwait, the P_LVLx method is used;
    idle=mwait           force the OS to use mwait for C-states;
    max_cstate=n         specify the maximum available C-state, n is a number

Others (which may help locate the issue when C-states don't work):

    nohz=off             don't use dynamic tick/tickless mode
    nolapic_timer        don't use the local APIC timer
    lapic_timer_c2_ok    the local APIC timer is OK in C2
    clocksource=tsc (or hpet, pit, acpi_pm, jiffies)
                         override the clock source

8. Sysfs & Proc

-----------------

Check C-state statistics & state:

    /proc/acpi/processor/CPUX/

Check governor & driver:

    /sys/devices/system/cpu/cpuidle/         (system-wide)
    /sys/devices/system/cpu/cpuX/cpuidle/    (per CPU)
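To read the per-CPU directory from a program rather than by hand, something like the sketch below could be used. It assumes the usual layout in which each C-state appears as a stateN subdirectory with attribute files such as name and latency; that layout is not spelled out in this article, so verify it on your kernel. Both helper functions are hypothetical names for this example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the path of one cpuidle attribute, for example
 * /sys/devices/system/cpu/cpu0/cpuidle/state1/name.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int cpuidle_attr_path(char *buf, size_t size, unsigned cpu,
                             unsigned state, const char *attr)
{
    int n = snprintf(buf, size,
                     "/sys/devices/system/cpu/cpu%u/cpuidle/state%u/%s",
                     cpu, state, attr);

    return (n > 0 && (size_t)n < size) ? 0 : -1;
}

/*
 * Read the first line of a file into buf, without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty.
 */
static int read_first_line(const char *path, char *buf, size_t size)
{
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, (int)size, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

A caller would loop over state0, state1, ... for a given CPU, build the path of the name attribute, and stop at the first state for which read_first_line() fails.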

9. TBD

-----------

9.1 Broadcast Timer

------------------

When a CPU enters a deep C-state (C3 or above), its Local APIC timer stops
as well (Linux uses the LAPIC timer as the tick device in most cases). This
issue is handled by the "broadcast timer" scheme.

9.2 Dynamic Tick /Tickless

--------------------------

Linux supports tickless operation, which makes the C-state code more
complex.

9.3 Idle Load balancing

-----------------------

When CPUs enter the idle state, one of the idle CPUs is nominated as the
ILB (Idle Load Balancer). It is responsible for pulling tasks from busy
CPUs, re-assigning those tasks to idle CPUs and getting the idle CPUs
started again.
