Linux NUMA: how NUMA works

Contrasting the SMP/UMA and NUMA architectures

The SMP/UMA architecture

[Figure: The SMP/UMA architecture, simplified]

When the PC world first got multiple processors, they were all arranged with equal access to all of the memory in the system. This is called Symmetric Multi-Processing (SMP), or sometimes Uniform Memory Architecture (UMA, especially in contrast to NUMA). In the past few years this architecture has been largely phased out between physical socketed processors, but is still alive and well today within a single processor with multiple cores: all cores have equal access to the memory bank.

The NUMA architecture

[Figure: The NUMA architecture, simplified]

The newer architecture for multiple processors, found in recent multi-socket systems (we'll call these "modern PC CPUs"), is a Non-Uniform Memory Access (NUMA) architecture, or more correctly cache-coherent NUMA (ccNUMA). In this architecture, each processor has a "local" bank of memory, to which it has much closer (lower latency) access. The whole system may still operate as one unit, and all memory is basically accessible from everywhere, but at a potentially higher latency and lower performance. Fundamentally, some memory locations ("local" ones) are faster, that is, cost less to access, than other locations ("remote" ones attached to other processors). For a more detailed discussion of NUMA implementation and its support in Linux, see the Linux kernel's NUMA documentation.
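A quick way to check which of the two architectures a given machine presents is lscpu (part of util-linux), which reports how many NUMA nodes the kernel sees; a single node is effectively UMA, while two or more means NUMA is in play:

# count the NUMA nodes the kernel knows about
lscpu | grep -i numa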

How Linux handles a NUMA system

Linux automatically understands when it's running on a NUMA architecture system and does a few things:

Enumerates the hardware to understand the physical layout.

Divides the processors (not cores) into "nodes". With modern PC processors, this means one node per physical processor, regardless of the number of cores present.

Attaches each memory module in the system to the node for the processor it is local to.

Collects cost information about inter-node communication ("distance" between nodes).
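The result of this enumeration is also exposed under sysfs, so you can inspect it directly; a minimal sketch (paths are standard on NUMA-aware kernels, node numbers will vary by machine):

# list the nodes the kernel created
ls /sys/devices/system/node/
# cores attached to node 0, and node 0's distance to every node
cat /sys/devices/system/node/node0/cpulist
cat /sys/devices/system/node/node0/distance

The distance file contains the same cost values that numactl --hardware reports below.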

You can see how Linux enumerated your system's NUMA layout using the numactl --hardware command:

# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 32276 MB
node 0 free: 26856 MB
node 1 size: 32320 MB
node 1 free: 26897 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

This tells you a few important things:

The number of nodes, and their node numbers — In this case there are two nodes, numbered "0" and "1".

The amount of memory available within each node — This machine has 64GB of memory total, and two physical (quad core) CPUs, so it has 32GB in each node. Note that the sizes aren't exactly half of 64GB, and aren't exactly equal, due to some memory being stolen from each node for whatever internal purposes the kernel has in mind.

The "distance" between nodes — This is a representation of the cost of accessing memory located in (for example) Node 0 from Node 1. In this case, Linux claims a distance of "10" for local memory and "21" for non-local memory.
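Beyond the static layout, the kernel also keeps per-node allocation counters, readable with the numastat tool (shipped with the numactl package) or straight from sysfs; counters such as numa_hit and numa_miss show how often allocations were satisfied on the intended node versus pushed to another node:

# per-node allocation statistics for the whole system
numastat
# the raw counters behind them, for node 0
cat /sys/devices/system/node/node0/numastat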

How NUMA changes things for Linux

Technically, as long as everything runs just fine, there's no reason that being UMA or NUMA should change how things work at the OS level. However, to get the best possible performance (and, in some cases with extreme performance differences for non-local NUMA access, any reasonable performance at all), some additional work has to be done, directly dealing with the internals of NUMA. Linux does the following things, which might be unexpected if you think of CPUs and memory as black boxes:

Each process and thread inherits, from its parent, a NUMA policy. The inherited policy can be modified on a per-thread basis, and it defines the CPUs and even individual cores the process is allowed to be scheduled on, where it should be allocated memory from, and how strict to be about those two decisions.

Each thread is initially allocated a "preferred" node to run on. The thread can be run elsewhere (if policy allows), but the scheduler attempts to ensure that it is always run on the preferred node.

Memory allocated for the process is allocated on a particular node, by default "current", which means the same node the thread prefers to run on. On UMA/SMP architectures all memory was treated equally, and had the same cost, but now the system has to think a bit about where it comes from, because accessing non-local memory has implications on performance and may cause cache coherency delays.

Memory allocations made on one node will not be moved to another node, regardless of system needs. Once memory is allocated on a node, it will stay there.
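You can see the policy your current process has inherited with numactl --show, and observe inheritance in action by wrapping a child process; a small illustration (bash here is just a stand-in for any program you might launch):

# the policy and CPU/node bindings inherited by the current shell
numactl --show
# a child started under a modified policy reports that policy,
# because it survives fork and exec
numactl --membind=0 --cpunodebind=0 bash -c 'numactl --show'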

The NUMA policy of any process can be changed, with broad-reaching effects, very simply by using numactl as a wrapper for the program. With a bit of additional work, it can be fine-tuned in detail by linking in libnuma and writing some code yourself to manage the policy. Some interesting things that can be done simply with the numactl wrapper are:

Allocate memory with a particular policy:

locally on the "current" node — using --localalloc; this is also the default mode

preferably on a particular node, but elsewhere if necessary — using --preferred=node

always on a particular node or set of nodes — using --membind=nodes

interleaved, that is, spread evenly round-robin across all or a set of nodes — using --interleave=all or --interleave=nodes

Run the program on a particular node or set of nodes, in this case meaning physical CPUs (--cpunodebind=nodes), or on a particular core or set of cores (--physcpubind=cpus).
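For example, to confine a (hypothetical) program and its memory to node 0, or to spread its memory evenly across all nodes, you could run something like:

# ./some_program is a placeholder for whatever you want to run
numactl --cpunodebind=0 --membind=0 ./some_program
numactl --interleave=all ./some_program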

What NUMA means for MySQL and InnoDB

InnoDB, and really nearly all database servers, present an atypical workload (from the point of view of the majority of installations) to Linux: a single large multi-threaded process which consumes nearly all of the system's memory and should be expected to consume as much of the rest of the system resources as possible.

In a NUMA-based system, where the memory is divided into multiple nodes, how the system should handle this is not necessarily straightforward. The default behavior of the system is to allocate memory in the same node as a thread is scheduled to run on, and this works well for small amounts of memory, but when you want to allocate more than half of the system memory it's no longer physically possible to do so in a single NUMA node: in a two-node system, only 50% of the memory is in each node. Additionally, since many different queries will be running at the same time, on both processors, neither individual processor necessarily has preferential access to any particular part of memory needed by a particular query.
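A quick sanity check is to compare the configured buffer pool size against a single node's memory; for example (assuming a local mysql client that can connect without prompting for credentials):

# configured InnoDB buffer pool size, in GB
mysql -e "SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb"
# memory available in one node, for comparison
numactl --hardware | grep 'node 0 size'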

It turns out that this seems to matter in one very important way. Using /proc/<pid>/numa_maps we can see all of the allocations made by mysqld, and some interesting information about them. If you look for a really big number in the anon=size field, you can pretty easily find the buffer pool (which will consume more than 51GB of memory for the 48GB it has been configured to use) [line-wrapped for clarity]:

2aaaaad3e000 default anon=13240527 dirty=13223315
swapcache=3440324 active=13202235 N0=7865429 N1=5375098

The fields being shown here are:

2aaaaad3e000 — The virtual address of the memory region. Ignore this other than the fact that it's a unique ID for this piece of memory.

default — The NUMA policy in use for this region.

anon=number — The number of anonymous pages mapped.

dirty=number — The number of pages that are dirty because they have been modified. Generally memory allocated only within a single process is always going to be used, and thus dirty, but if a process forks it may have many copy-on-write pages mapped that are not dirty.

swapcache=number — The number of pages swapped out but unmodified since they were swapped out, and thus ready to be freed if needed, but still in memory at the moment.

active=number — The number of pages on the "active list"; if this field is shown, some memory is inactive (anon minus active), which means it may be paged out by the swapper soon.

N0=number and N1=number — The number of pages allocated on Node 0 and Node 1, respectively.
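If you want to find this mapping on your own system, a rough shortcut (assuming a single running mysqld, permission to read its numa_maps, and that only the buffer pool has a seven-digit-or-larger anon page count) is:

# look for very large anonymous mappings in mysqld's numa_maps
grep -E 'anon=[0-9]{7,}' /proc/$(pidof mysqld)/numa_maps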

The entire numa_maps can be quickly summarized by a simple script, numa-maps-summary.pl, which I wrote while analyzing this problem:

N0        :      7983584 ( 30.45 GB)
N1        :      5440464 ( 20.75 GB)
active    :     13406601 ( 51.14 GB)
anon      :     13422697 ( 51.20 GB)
dirty     :     13407242 ( 51.14 GB)
mapmax    :          977 (  0.00 GB)
mapped    :         1377 (  0.01 GB)
swapcache :      3619780 ( 13.81 GB)
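The script itself isn't reproduced here, but a rough awk approximation of what it does (summing each numeric field in numa_maps and converting page counts with an assumed 4 KiB page size) might look like:

# sum every numeric name=value field in mysqld's numa_maps and
# report totals in pages and GB (assumes 4 KiB pages)
awk '{
  for (i = 3; i <= NF; i++) {      # skip the address and policy fields
    split($i, kv, "=")
    if (kv[2] ~ /^[0-9]+$/)        # keep only numeric values
      sum[kv[1]] += kv[2]
  }
}
END {
  for (k in sum)
    printf "%-10s: %12d (%6.2f GB)\n", k, sum[k], sum[k] * 4096 / 2^30
}' /proc/$(pidof mysqld)/numa_maps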

A couple of interesting and somewhat unexpected things pop out to me:

The sheer imbalance in how much memory is allocated in Node 0 versus Node 1. This is actually absolutely normal per the default policy: memory was preferentially allocated in Node 0, and Node 1 was used only as a last resort.

The sheer amount of memory allocated in Node 0. This is absolutely critical — Node 0 is out of free memory! It only contains about 32GB of memory in total, and it has allocated a single large chunk of more than 30GB to InnoDB's buffer pool. A few other smaller allocations to other processes finish it off, and suddenly it has no memory free, and isn't even caching anything.
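You can confirm this kind of per-node exhaustion directly, without walking numa_maps, from the per-node meminfo files (or from the "free" lines of numactl --hardware shown earlier):

# per-node totals and free memory, one meminfo file per node
grep -E 'MemTotal|MemFree' /sys/devices/system/node/node*/meminfo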

The memory allocated by MySQL looks something like this:

[Figure: Allocating memory severely imbalanced, preferring Node 0]

Due to Node 0 being completely exhausted of free memory, even though the system has plenty of free memory overall (over 10GB has been used for caches), that free memory is entirely on Node 1. If any process scheduled on Node 0 needs local memory for anything, it will cause some of the already-allocated memory to be swapped out in order to free up some Node 0 pages. Even though there is free memory on Node 1, the Linux kernel in many circumstances (which admittedly I don't totally understand) prefers to page out Node 0 memory rather than free some of the cache on Node 1 and use that memory. Of course the paging is far more expensive than non-local memory access ever would be.

A small change, to great effect

An easy solution to this is to interleave the allocated memory. It is possible to do this using numactl as described above:

# numactl --interleave all command

We can use this with MySQL by making a small change to the mysqld_safe script, adding the following line (after cmd="$NOHUP_NICENESS"), which prefixes the command used to start mysqld with a call to numactl:

cmd="/usr/bin/numactl --interleave all $cmd"

Now, when MySQL needs memory it will allocate it interleaved across all nodes, effectively balancing the amount of memory allocated in each node. This will leave some free memory in each node, allowing the Linux kernel to cache data on both nodes, thus allowing memory to be easily freed on either node just by freeing caches (as it's supposed to work) rather than by paging.

Performance regression testing has been done comparing the two scenarios (default local-plus-spillover allocation versus interleaved allocation) using the DBT2 benchmark, and found that performance in the nominal case is identical. This is expected. The breakthrough is that in all cases where swap use could previously be triggered in a repeatable fashion, the system no longer swaps!

You can now see from numa_maps that all allocated memory has been spread evenly across Node 0 and Node 1 [line-wrapped for clarity]:

2aaaaad3e000 interleave=0-1 anon=13359067 dirty=13359067
N0=6679535 N1=6679532

And the summary looks like this:

N0        :      6814756 ( 26.00 GB)
N1        :      6816444 ( 26.00 GB)
anon      :     13629853 ( 51.99 GB)
dirty     :     13629853 ( 51.99 GB)
mapmax    :          296 (  0.00 GB)
mapped    :         1384 (  0.01 GB)

In graphical terms, the allocation of all memory within mysqld has been made in a balanced way:

[Figure: Allocating memory balanced (interleaved) across nodes]

An aside on zone_reclaim_mode

The zone_reclaim_mode tunable in /proc/sys/vm can be used to fine-tune memory reclamation policies in a NUMA system. Based on discussion on the linux-mm mailing list, it doesn't seem to help in this case.
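For completeness, it can be inspected and changed like any other sysctl; a value of 0 disables zone reclaim entirely:

# show the current setting
cat /proc/sys/vm/zone_reclaim_mode
# disable zone reclaim (run as root; takes effect immediately)
sysctl -w vm.zone_reclaim_mode=0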
