Memory barriers, starting from a NULL-pointer kernel panic

Consider this very simple C code:

    n->next = first;
    n->pprev = &h->first;
    h->first = n;

Could the third line take effect before the first two?

Perhaps contrary to what you might expect, most programmers will say YES.

And then they will go on and on about out-of-order execution, whether by the compiler or by the CPU…

Blah, blah, blah…

The real world, however, is often less complicated!


Writing your own tcp_v4_rcv is essential for programmers who frequently customize the TCP protocol, for example to optimize TCP.

Let’s take a look at an example of how to do this:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kallsyms.h>
#include <linux/tcp.h>
#include <net/protocol.h>

int (*orig_tcp_handler)(struct sk_buff *skb);
int my_tcp_v4_rcv(struct sk_buff *skb)
{
	// ... do sth...
	return orig_tcp_handler(skb);
}

struct net_protocol *prot = NULL;
int hook_init (void)
{
	if ((prot = (struct net_protocol*)kallsyms_lookup_name("tcp_protocol")) == NULL) {
		return -1;
	}

	orig_tcp_handler = prot->handler;
	prot->handler = my_tcp_v4_rcv;

	return 0;
}

void hook_cleanup(void)
{
	if (orig_tcp_handler) {
		prot->handler = orig_tcp_handler;
	}
}

MODULE_LICENSE("GPL");
module_init(hook_init);
module_exit(hook_cleanup);

But is it right?

Occasionally, but not often, we get a kernel panic caused by a NULL pointer dereference.

The reason is that orig_tcp_handler is still NULL when my_tcp_v4_rcv calls it.

Therefore, the code above is incorrect.

But why?

The CPU never sees the C code you wrote; it sees machine instructions, not C.

So it is more helpful to look at the disassembly and see what is actually going on than to daydream about more complex possibilities.

Let’s go:

objdump -D hook.ko

We see the following:

0000000000000010 <init_module>:
  10:   e8 00 00 00 00          callq  15 <init_module+0x5>
  15:   55                      push   %rbp
  16:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
  1d:   48 89 e5                mov    %rsp,%rbp
  20:   e8 00 00 00 00          callq  25 <init_module+0x15>
  25:   48 85 c0                test   %rax,%rax
  28:   74 17                   je     41 <init_module+0x31>
  2a:   48 8b 50 08             mov    0x8(%rax),%rdx
  2e:   48 c7 40 08 00 00 00    movq   $0x0,0x8(%rax)
  35:   00
  36:   31 c0                   xor    %eax,%eax
  38:   5d                      pop    %rbp
  39:   48 89 15 00 00 00 00    mov    %rdx,0x0(%rip)        # 40 <init_module+0x30>
  40:   c3                      retq
  41:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  46:   5d                      pop    %rbp
  47:   c3                      retq
  48:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  4f:   00

Look at offset 0x2a. The current value of prot->handler is loaded into RDX and held there; it has not yet been written to orig_tcp_handler.

Now the next instruction. "movq $0x0,0x8(%rax)" stores my_tcp_v4_rcv into prot->handler (the $0x0 is a placeholder that is filled in by relocation when the module is loaded). This is a pointer assignment, and on x86_64 a store to a naturally aligned pointer is atomic.

For atomic operations, see the Intel manual:
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
8.1.1 Guaranteed Atomic Operations

I would say that kernel programming is much easier than user-mode programming in this respect: pointers are kept 64-bit aligned, and the kernel rarely uses complex structures or casts between basic data types of different widths, so a pointer rarely crosses a cache line boundary.
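
To make the alignment point concrete, here is a minimal sketch in plain C. The helper names and the 64-byte cache line size are my own assumptions for illustration (64 bytes is typical on current x86_64 parts, but check your own CPU); nothing here comes from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size: typical on x86_64, not guaranteed. */
#define ASSUMED_CACHE_LINE 64u

/* True if an access of `size` bytes at `addr` is naturally aligned. */
static bool is_naturally_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0);
    return addr % size == 0;
}

/* True if the access touches more than one (assumed) cache line. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size != 0);
    return addr / ASSUMED_CACHE_LINE != (addr + size - 1) / ASSUMED_CACHE_LINE;
}
```

An 8-byte pointer at an 8-byte-aligned address can never straddle a 64-byte line, which is why an aligned pointer store such as the one to prot->handler falls under the atomicity guarantee cited above.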

As you can see, the store to orig_tcp_handler doesn’t happen until offset 0x39.

This means that my_tcp_v4_rcv lands in prot->handler before orig_tcp_handler is assigned.

Any CPU that calls prot->handler in that window sees my_tcp_v4_rcv, which then calls orig_tcp_handler while it is still NULL, and the kernel panics.

Obviously, the compiler has reordered the two stores.

We can stop it by inserting a compiler barrier between the two assignments:

asm volatile("" ::: "memory");
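
Here is a user-space sketch of the same fix, so that it can be compiled and tried outside the kernel. Everything in it is a hypothetical stand-in: struct fake_protocol plays the role of struct net_protocol, saved_handler plays orig_tcp_handler, and the two handler functions play tcp_v4_rcv and my_tcp_v4_rcv. It assumes GCC or Clang inline-asm syntax, and it only constrains the compiler, not the CPU.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_fn)(int);

/* Hypothetical stand-in for struct net_protocol. */
struct fake_protocol {
    handler_fn handler;
};

/* Hypothetical stand-in for orig_tcp_handler. */
static handler_fn saved_handler;

/* Compiler-only barrier (GCC/Clang syntax); emits no instruction. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

static int original_handler(int x) { return x + 1; }

/* Like my_tcp_v4_rcv: chains to the saved handler, then does its own work. */
static int hooked_handler(int x) { return saved_handler(x) * 2; }

static void install_hook(struct fake_protocol *p)
{
    assert(p != NULL && p->handler != NULL);
    saved_handler = p->handler;   /* 1. save the old handler first */
    compiler_barrier();           /* stores may not be moved across this */
    p->handler = hooked_handler;  /* 2. only then publish the new one */
}
```

With the barrier in place the compiler has to emit the store to saved_handler before the store to p->handler, which is the ordering the hook depends on.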

Beyond that, the Linux kernel provides a dedicated interface for this:

/*
 * ...
 * Inserts memory barriers on architectures that require them
 * (which is most of them), and also prevents the compiler from
 * reordering the code that initializes the structure after the pointer
 * assignment.  More importantly, this call documents which pointers
 * will be dereferenced by RCU read-side code.
 * ...
 */
#define rcu_assign_pointer(p, v) \
    __rcu_assign_pointer((p), (v), __rcu)

So, the correct code is as follows:

orig_tcp_handler = prot->handler;
rcu_assign_pointer(prot->handler, my_tcp_v4_rcv);
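
For readers who want to experiment outside the kernel, the closest portable analogue I know of is a C11 release store paired with an acquire load. The sketch below shows that idea only; it is not the kernel’s implementation of rcu_assign_pointer, and every name in it (active_handler, saved_handler, publish_hook, dispatch) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef int (*handler_fn)(int);

static int original_handler(int x) { return x + 1; }

/* Plays the role of orig_tcp_handler. */
static handler_fn saved_handler;

/* Plays the role of prot->handler. */
static _Atomic(handler_fn) active_handler = original_handler;

/* Plays the role of my_tcp_v4_rcv. */
static int hooked_handler(int x) { return saved_handler(x) + 100; }

static void publish_hook(void)
{
    saved_handler = atomic_load_explicit(&active_handler, memory_order_relaxed);
    assert(saved_handler != NULL);
    /* Release: the store to saved_handler above cannot be moved below
     * this store, neither by the compiler nor by the CPU. */
    atomic_store_explicit(&active_handler, hooked_handler, memory_order_release);
}

static int dispatch(int x)
{
    /* Acquire: pairs with the release store in publish_hook(). */
    handler_fn h = atomic_load_explicit(&active_handler, memory_order_acquire);
    return h(x);
}
```

Any thread whose dispatch() observes hooked_handler is then guaranteed to also observe the saved handler, so the NULL call from the broken module cannot happen in this model.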

Finally, let’s look at the corrected disassembly (from 0xd1 to 0xe3):

  b0:   e8 00 00 00 00          callq  b5 <func3+0x5>
  b5:   55                      push   %rbp
  b6:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
  bd:   48 89 e5                mov    %rsp,%rbp
  c0:   e8 00 00 00 00          callq  c5 <func3+0x15>
  c5:   48 85 c0                test   %rax,%rax
  c8:   48 89 05 00 00 00 00    mov    %rax,0x0(%rip)        # cf <func3+0x1f>
  cf:   74 1e                   je     ef <func3+0x3f>
  d1:   48 8b 40 08             mov    0x8(%rax),%rax
  d5:   48 89 05 00 00 00 00    mov    %rax,0x0(%rip)        # dc <func3+0x2c>
  dc:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # e3 <func3+0x33>
  e3:   48 c7 40 08 00 00 00    movq   $0x0,0x8(%rax)
  ea:   00
  eb:   31 c0                   xor    %eax,%eax
  ed:   5d                      pop    %rbp
  ee:   c3                      retq
  ef:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  f4:   5d                      pop    %rbp
  f5:   c3                      retq
  f6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  fd:   00 00 00

Now the stores to orig_tcp_handler and prot->handler land exactly in the order of the C code.

That’s the whole story. I don’t care how or when the compiler reorders the instructions, nor am I interested.


What? Is that all? What about CPU reordering?

Skinshoe!

Let’s go on.


x86_64 CPUs do almost no reordering.

Let’s take a look at the relevant sections of the Intel manual:
[Image: excerpt from the Intel manual]
from https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf [8.2.2 Memory Ordering in …]

For x86_64 alone, that’s all there is.

There is no need to look at strange or outdated architectures such as DEC Alpha.

If we don’t have a real experimental platform at hand, we can only dream about these strange and unusable systems. It makes no sense!

When something can’t be verified on real hardware, the more you say about it, the more confusing it gets, and the easier it is to fool people who don’t know any better into thinking you know a great deal!

We only have x86_64. So let’s just talk about x86_64, since we can prove what we say by experiment.

So? Throw away Documentation/memory-barriers.txt. It covers the union of all architectures, most of which you’ll almost never come across.

For the only reordering case, see:
https://preshing.com/20120515/memory-reordering-caught-in-the-act/

Here’s my simplified code for the Store-Load case (reordering):

// gcc -O0 StoreLoad.c -o StoreLoad -lpthread
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

sem_t beginSema1;
sem_t beginSema2;
sem_t endSema;

int X, Y;
int r1, r2;

void *thread1Func(void *param)
{
    for (;;) {
        sem_wait(&beginSema1);  // Wait for signal

        X = 1; // Store
        asm volatile("" ::: "memory");  // Prevent compiler reordering
        r1 = Y; // Load

        sem_post(&endSema);  // Notify transaction complete
    }
    return NULL;  // Never returns
};

void *thread2Func(void *param)
{
    for (;;) {
        sem_wait(&beginSema2);  // Wait for signal

        Y = 1; // Store
        asm volatile("" ::: "memory");  // Prevent compiler reordering
        r2 = X; //Load

        sem_post(&endSema);  // Notify transaction complete
    }
    return NULL;  // Never returns
};

int main(int argc, char **argv)
{
    sem_init(&beginSema1, 0, 0);
    sem_init(&beginSema2, 0, 0);
    sem_init(&endSema, 0, 0);

    pthread_t thread1, thread2;
    pthread_create(&thread1, NULL, thread1Func, NULL);
    pthread_create(&thread2, NULL, thread2Func, NULL);

    int detected = 0;
    for (int iterations = 1; ; iterations++) {
        X = 0;
        Y = 0;
        sem_post(&beginSema1);
        sem_post(&beginSema2);
        sem_wait(&endSema);
        sem_wait(&endSema);
        if (r1 == 0 && r2 == 0) {
            detected++;
            printf("%d reorders detected after %d iterations\n", detected, iterations);
        }
    }
    return 0;
}

In this one case, use mfence to prevent the CPU from reordering the store and the load:

	X = 1;
	asm volatile("mfence" ::: "memory");  // Prevent CPU reordering
	r1 = Y;
...
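
If you would rather not write inline assembly, a C11 sequentially consistent fence expresses the same intent portably. The sketch below is my own variant of the Store-Load test, not the program above: the function names are hypothetical, and it creates fresh threads on every iteration to keep the code short. On x86_64, GCC and Clang typically compile this fence to mfence or an equivalent locked instruction, but that is a property of the compiler, not something the C standard promises.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int X, Y;
static int r1, r2;

static void *store_x_load_y(void *arg)
{
    (void)arg;
    atomic_store_explicit(&X, 1, memory_order_relaxed);   /* Store */
    atomic_thread_fence(memory_order_seq_cst);            /* full fence */
    r1 = atomic_load_explicit(&Y, memory_order_relaxed);  /* Load */
    return NULL;
}

static void *store_y_load_x(void *arg)
{
    (void)arg;
    atomic_store_explicit(&Y, 1, memory_order_relaxed);   /* Store */
    atomic_thread_fence(memory_order_seq_cst);            /* full fence */
    r2 = atomic_load_explicit(&X, memory_order_relaxed);  /* Load */
    return NULL;
}

/* Returns how often both loads saw 0, or -1 if a thread could not start. */
static int count_store_load_reorders(int iterations)
{
    int detected = 0;

    assert(iterations > 0);
    for (int i = 0; i < iterations; i++) {
        pthread_t t1, t2;

        atomic_store(&X, 0);
        atomic_store(&Y, 0);
        if (pthread_create(&t1, NULL, store_x_load_y, NULL) != 0)
            return -1;
        if (pthread_create(&t2, NULL, store_y_load_x, NULL) != 0) {
            pthread_join(t1, NULL);
            return -1;
        }
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (r1 == 0 && r2 == 0)
            detected++;
    }
    return detected;
}
```

With both fences in place, r1 == 0 && r2 == 0 is ruled out, so count_store_load_reorders() should return 0. Because the threads are created per iteration and rarely overlap, treat this as a correctness sketch rather than a good reordering detector; the semaphore-based loop above is far better at provoking the race.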

Why does this happen?

That’s because of the store buffer. In other words, a write is buffered before it reaches the cache, while a read goes directly to the cache or memory.

Here is a quote from Linus:

IOW, on x86, loads are ordered wrt loads, and stores are ordered wrt other
stores, but loads are not ordered wrt other stores in the absence of a
serializing instruction, and it’s exactly because of the write buffer.

From: https://yarchive.net/comp/linux/store_buffer.html

So, case by case:

  • Load-Load. Loads stay in program order.
  • Load-Store. The load is already ahead, and the later store is buffered and lands even later, so it cannot overtake the load.
  • Store-Store. Always in order. But …see intel-x86-and-64-manual-vol3 section 8.2.2
  • Store-Load. Since the store is buffered, it lands later than a direct load, so the load can naturally complete before the store becomes visible.
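
The four cases above fit in a two-by-two table. The sketch below is only a summary of that list in C, with names I made up; it assumes ordinary write-back memory and two different addresses, and it ignores the exceptions that section 8.2.2 lists.

```c
#include <assert.h>
#include <stdbool.h>

enum mem_op { OP_LOAD = 0, OP_STORE = 1 };

/* May the CPU let `later` become visible before `earlier`? */
static bool x86_may_reorder(enum mem_op earlier, enum mem_op later)
{
    static const bool may_reorder[2][2] = {
        /* later:        LOAD   STORE */
        /* LOAD  */   { false, false },
        /* STORE */   { true,  false },
    };

    assert(earlier == OP_LOAD || earlier == OP_STORE);
    assert(later == OP_LOAD || later == OP_STORE);
    return may_reorder[earlier][later];
}
```

Only the (OP_STORE, OP_LOAD) entry is true, which is why mfence matters only in the Store-Load case.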

Here are my PoCs for the other three cases (which do not suffer from CPU reordering):

  • case for Load-Store (not reordering):
    // gcc -O0 LoadStore.c -o LoadStore -lpthread
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>
    
    sem_t beginSema1;
    sem_t beginSema2;
    sem_t endSema;
    
    int X, Y;
    int r1, r2;
    
    void *thread1Func(void *param)
    {
        for (;;) {
            sem_wait(&beginSema1);  // Wait for signal
    
            r1 = Y; // Load
            asm volatile("" ::: "memory");  // Prevent compiler reordering
            X = 1; // Store
    
            sem_post(&endSema);  // Notify transaction complete
        }
        return NULL;  // Never returns
    };
    
    void *thread2Func(void *param)
    {
        for (;;) {
            sem_wait(&beginSema2);  // Wait for signal
    
            r2 = X; //Load
            asm volatile("" ::: "memory");  // Prevent compiler reordering
            Y = 1; // Store
    
            sem_post(&endSema);  // Notify transaction complete
        }
        return NULL;  // Never returns
    };
    
    int main(int argc, char **argv)
    {
        sem_init(&beginSema1, 0, 0);
        sem_init(&beginSema2, 0, 0);
        sem_init(&endSema, 0, 0);
    
        pthread_t thread1, thread2;
        pthread_create(&thread1, NULL, thread1Func, NULL);
        pthread_create(&thread2, NULL, thread2Func, NULL);
    
        int detected = 0;
        for (int iterations = 1; ; iterations++) {
            X = 0;
            Y = 0;
            sem_post(&beginSema1);
            sem_post(&beginSema2);
            sem_wait(&endSema);
            sem_wait(&endSema);
            if (r1 == 1 && r2 == 1) {
                detected++;
                printf("%d reorders detected after %d iterations\n", detected, iterations);
            }
        }
        return 0;
    }
    
  • case for Load-Load (not reordering):
    // gcc -O0 LoadLoad.c -o LoadLoad -lpthread
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>
    
    sem_t beginSema1;
    sem_t beginSema2;
    sem_t endSema;
    
    int X, Y;
    int x, y;
    int xx, yy;
    
    void *thread1Func(void *param)
    {
        for (;;) {
            sem_wait(&beginSema1);  // Wait for signal
    
            X = x; // Load
            asm volatile("" ::: "memory");  // Prevent compiler reordering
            Y = y; // Load
    
            sem_post(&endSema);  // Notify transaction complete
        }
        return NULL;  // Never returns
    };
    
    void *thread2Func(void *param)
    {
        for (;;) {
            sem_wait(&beginSema2);  // Wait for signal
    
            yy = Y; //Load
            asm volatile("" ::: "memory");  // Prevent compiler reordering
            xx = X; // Load
    
            sem_post(&endSema);  // Notify transaction complete
        }
        return NULL;  // Never returns
    };
    
    int main(int argc, char **argv)
    {
        sem_init(&beginSema1, 0, 0);
        sem_init(&beginSema2, 0, 0);
        sem_init(&endSema, 0, 0);
    
        pthread_t thread1, thread2;
        pthread_create(&thread1, NULL, thread1Func, NULL);
        pthread_create(&thread2, NULL, thread2Func, NULL);
    
        int detected = 0;
        for (int iterations = 1; ; iterations++) {
            X = Y = xx = yy = 0;
            x = y = 1;
            sem_post(&beginSema1);
            sem_post(&beginSema2);
            sem_wait(&endSema);
            sem_wait(&endSema);
            if (yy == 1 && xx == 0) {
                detected++;
                printf("%d reorders detected after %d iterations\n", detected, iterations);
            }
        }
        return 0;
    }
    
  • case for Store-Store (not reordering):
    // gcc -O0 StoreStore.c -o StoreStore -lpthread
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>
    
    sem_t beginSema1;
    sem_t beginSema2;
    sem_t endSema;
    
    int X, Y;
    int x, y;
    
    void *thread1Func(void *param)
    {
        for (;;) {
            sem_wait(&beginSema1);  // Wait for signal
    
            X = 1; // Store
            asm volatile("" ::: "memory");  // Prevent compiler reordering
            Y = 1; // Store
    
            sem_post(&endSema);  // Notify transaction complete
        }
        return NULL;  // Never returns
    };
    
    void *thread2Func(void *param)
    {
        for (;;) {
            sem_wait(&beginSema2);  // Wait for signal
    
            y = Y; //Load
            asm volatile("" ::: "memory");  // Prevent compiler reordering
            x = X; // Load
    
            sem_post(&endSema);  // Notify transaction complete
        }
        return NULL;  // Never returns
    };
    
    int main(int argc, char **argv)
    {
        sem_init(&beginSema1, 0, 0);
        sem_init(&beginSema2, 0, 0);
        sem_init(&endSema, 0, 0);
    
        pthread_t thread1, thread2;
        pthread_create(&thread1, NULL, thread1Func, NULL);
        pthread_create(&thread2, NULL, thread2Func, NULL);
    
        int detected = 0;
        for (int iterations = 1; ; iterations++) {
            X = Y = x = y = 0;
            sem_post(&beginSema1);
            sem_post(&beginSema2);
            sem_wait(&endSema);
            sem_wait(&endSema);
            if (y == 1 && x == 0) {
                detected++;
                printf("%d reorders detected after %d iterations\n", detected, iterations);
            }
        }
        return 0;
    }
    

That is all.

Further, why is it designed this way?

Very simple. Let’s talk about metaphysics.

A load is a round trip: it lasts from the moment the load instruction is issued until the data comes back, so it is latency-sensitive.

A store, on the other hand, is one-way. The CPU just issues the instruction and moves on, so there is more freedom in when the data actually lands.

A modern x86_64 core has no internal storage unit that already holds the data a load instruction asks for. The core must therefore fetch the data right away from a storage unit outside it, which may be a CPU cache or main memory.

In contrast, stores can be merged and delayed for better performance, as with disk I/O.
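
A toy model makes the effect easy to see. In the sketch below each pretend CPU has a one-entry store buffer, loads read shared memory directly (or their own pending store), and draining the buffer stands in for what mfence forces. All names are hypothetical and the model is deliberately crude; real store buffers are deeper and drain on their own.

```c
#include <assert.h>
#include <stdbool.h>

enum { ADDR_X, ADDR_Y, NUM_ADDRS };

static int memory[NUM_ADDRS];   /* shared memory seen by both CPUs */

struct toy_cpu {
    bool pending;   /* is a store waiting in the buffer? */
    int addr;
    int value;
};

/* A store only enters the local buffer; other CPUs cannot see it yet. */
static void cpu_store(struct toy_cpu *cpu, int addr, int value)
{
    assert(!cpu->pending);      /* the toy buffer holds a single entry */
    cpu->pending = true;
    cpu->addr = addr;
    cpu->value = value;
}

/* A load is served at once: from the CPU's own buffer or from memory. */
static int cpu_load(const struct toy_cpu *cpu, int addr)
{
    if (cpu->pending && cpu->addr == addr)
        return cpu->value;
    return memory[addr];
}

/* Write the buffered store to memory; this is what a fence forces. */
static void cpu_drain(struct toy_cpu *cpu)
{
    if (cpu->pending) {
        memory[cpu->addr] = cpu->value;
        cpu->pending = false;
    }
}

/* Runs the Store-Load test once; true means r1 == 0 && r2 == 0. */
static bool litmus_sees_reorder(bool use_fence)
{
    struct toy_cpu cpu1 = { false, 0, 0 }, cpu2 = { false, 0, 0 };
    int r1, r2;

    memory[ADDR_X] = memory[ADDR_Y] = 0;
    cpu_store(&cpu1, ADDR_X, 1);
    if (use_fence)
        cpu_drain(&cpu1);
    cpu_store(&cpu2, ADDR_Y, 1);
    if (use_fence)
        cpu_drain(&cpu2);
    r1 = cpu_load(&cpu1, ADDR_Y);
    r2 = cpu_load(&cpu2, ADDR_X);
    cpu_drain(&cpu1);
    cpu_drain(&cpu2);
    return r1 == 0 && r2 == 0;
}
```

Without the drain both loads read 0 while both stores are still sitting in their buffers; with it, each store reaches memory before the following load and both loads read 1.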

Another interesting analogy is between the design of CPU caches and that of the VFS page cache.


Let’s get back to the real world.

A final truth.

If you read the CACHE COHERENCY section of Documentation/memory-barriers.txt, and I mean merely read it, you’ll remember it well:
[Image: excerpt from the CACHE COHERENCY section]

Read on, and you’ll find some really tricky details that no doubt enrich your soul, but are of little practical use beyond giving you the ability to blah blah!

Just pay attention to the last two paragraphs:
[Image: the last two paragraphs of that section]

If you’re using x86_64, please shut up!

Back to the very simple code:

    n->next = first;
    n->pprev = &h->first;
    h->first = n;

Same question again. Could the third line take effect before the first two?

The answer is:

  • On x86_64, if the compiler doesn’t reorder the instructions, the answer is NO!

But who guarantees the compiler’s behavior?

So, use rcu_assign_pointer:

    n->next = first;
    n->pprev = &h->first;
    rcu_assign_pointer(h->first, n);
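
The same pattern can be tried in user space with C11 atomics. The sketch below is an analogue under my own assumptions, not kernel code: struct node and struct head are simplified, hypothetical versions of hlist_node and hlist_head, and a release store stands in for rcu_assign_pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node;
typedef _Atomic(struct node *) node_ptr;

struct node {
    node_ptr next;
    node_ptr *pprev;   /* address of whichever pointer points at us */
    int value;
};

struct head {
    node_ptr first;
};

static void add_head(struct head *h, struct node *n)
{
    struct node *first;

    assert(h != NULL && n != NULL);
    first = atomic_load_explicit(&h->first, memory_order_relaxed);
    atomic_store_explicit(&n->next, first, memory_order_relaxed);
    n->pprev = &h->first;
    /* Publish last: the two stores above cannot be moved below this one. */
    atomic_store_explicit(&h->first, n, memory_order_release);
    if (first)
        first->pprev = &n->next;
}

/* Reader side: acquire loads pair with the release store above. */
static int sum_list(struct head *h)
{
    int sum = 0;
    struct node *n = atomic_load_explicit(&h->first, memory_order_acquire);

    while (n) {
        sum += n->value;
        n = atomic_load_explicit(&n->next, memory_order_acquire);
    }
    return sum;
}
```

A reader that observes the new node through h->first is then guaranteed to see its next field already filled in, which is exactly what the plain three-line version cannot promise once the compiler is free to reorder.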

On x86_64, this has nothing to do with CPU reordering. (It is not a Store-Load case.)

As for the rest, I don’t care. That’s all.


Leather shoes in Wenzhou, Zhejiang get wet; they don’t get fat when the rain gets in.
