【yebangyu博客】诡异的程序性能问题

最新推荐文章于 2024-07-11 16:48:43 发布

alex970448359

最新推荐文章于 2024-07-11 16:48:43 发布

阅读量150

点赞数

文章标签： c++

原文链接：http://www.yebangyu.org/blog/2015/12/30/falsesharing/

版权

转自：http://www.yebangyu.org/blog/2015/12/30/falsesharing/

本文所使用的环境是Ubuntu 14.04 32bit系统，Intel I5处理器，X86体系结构

提出问题

如果我说下面的程序存在性能问题，您信吗？

#include<thread>
int32_t global[2]={0};
void f()
{
  for(int i = 0; i < 100000000; i++) {
    ++global[0];
  }
}
void g()
{
  for(int i = 0; i < 100000000; i++) {
    ++global[1];
  }
}
int main()
{
  std::thread thread1(f);
  std::thread thread2(g);
  thread1.join();
  thread2.join();
  return 0;
}

这个程序，在我的电脑上，运行时间为：

real	0m0.822s
user	0m1.596s
sys     0m0.000s

分析问题

有人说，两个线程分别操作不同的计数器，这么完美的程序，会有性能问题？

答案是：有。

恩，原因在于大名鼎鼎的false sharing。如果您看过我以前写的这篇博客，应该还记得:现在的计算机一般由一个内存、一个CPU组成，而包含多个CPU Cores和Cache。如这幅图所示：

【】

cacheline是cache块单位，一个cacheline大小一般在32到256字节左右。cacheline是这张图中不同模块的数据交互元素。

在上面程序中，global是两个4字节变量构成的数组，大小为8字节，很可能被放到同一个cacheline里。当运行在CPU1 Core上的线程thread1修改了global[0]时，会让运行在CPU2 Core上对应global[0]和global[1]的cacheline失效，因此运行在CPU2 Core上的线程thread2修改global[1]时会发生cache miss。在它修改global[1]后，就让CPU1 Core中的cacheline失效。

很明显，这里面由于cacheline ping pong带来大量为了缓存一致性而产生的开销。

因此这种false sharing发生在多核、多线程环境中。单核或者单线程不会有false sharing问题。

遗憾的是，程序里存在这样的问题，并不容易通过肉眼发现。

幸运的是，这种问题一旦知道，就比较好解决。

解决问题

解决方法一：让这两个计数器间隔足够大，让它们不要在同一个cacheline里，不就行了么？

恩，定义一个global[10000]，然后线程1利用global[0]，线程2利用global[9999]，应该就可以了。

什么？这么愚蠢的方法都想得出来？接着往下看。

解决方法二：假如global不是一个数组呢？而是包含多个变量的结构体呢(这种情形也很常见)？上面的方法就不灵了吧？

恩，上面的方法不灵了，而且上面的方法太笨。网上有很多资料告诉您怎么定义变量让其cacheline aligned，这也是那些博客千篇一律的做法，不过还是值得在这里提及。

以gcc编译器为例：下面的做法不能保证global[0]和global[1]不在同一个cacheline里：

1
2

#define CACHELINE_SIZE 64
int32_t global[2] __attribute__((aligned(CACHELINE_SIZE)));

这样只能保证global数组的首元素是cacheline对齐的，不能保证global[0]和global[1]不在同一个cacheline。

以下是正确做法：

#define CACHELINE_SIZE 64
struct MyCounter
{
  int32_t counter;
}__attribute__((aligned (CACHELINE_SIZE)));
MyCounter global[2];

还有没有其他方法？有。接着往下看。

解决方法三：重点来了。

我们其实可以在线程里使用局部变量！

#include<thread>
int32_t global[2] = {0};
void f()
{
  int counter1 = 0;
  for(int i = 0; i < 100000000; i++) {
    ++counter1;
  }
  global[0] = counter1;
}
void g()
{
  int counter2 =0;
  for(int i = 0; i < 100000000; i++) {
    ++counter2;
  }
  global[1] = counter2;
}
int main()
{
  std::thread thread1(f);
  std::thread thread2(g);
  thread1.join();
  thread2.join();
  return 0;
}

counter1和counter2在自己的线程栈上，cacheline位于对应的CPU core里，大家相安无事。只有执行第9行和第17行时代价可能高点。

这个程序，在我的电脑上运行时间为：

real	0m0.293s
user	0m0.580s
sys     0m0.000s

解决方法四：

global神马变量？全局变量。counter1/counter2神马变量？局部变量。

有没有一种东东，既有全局的性质，又有局部的效果（线程私有）呢？

恩，如果您看过我以前写的这篇博客，就不会对__thread感到陌生。对！提供强大scalability的利器，就是它了！

#include<thread>
__thread int counter = 0;
void f()
{
  for(int i = 0; i < 100000000; i++) {
    ++counter;
  }
}
void g()
{
  for(int i = 0; i < 100000000; i++) {
    ++counter;
  }
}
int main()
{
  std::thread thread1(f);
  std::thread thread2(g);
  thread1.join();
  thread2.join();
  return 0;
}

这个程序在我的电脑上的运行时间为：

real	0m0.325s
user	0m0.644s
sys     0m0.000s

不过其他线程如何读取到每个计数线程的counter呢？不难，但是也不是那么简单，背后涉及到很多问题（其实本文最大的目的是通过false sharing，揭示partition这种并发编程里最大的设计原则）。我们下次专门聊。

附录

1，编译以上多线程程序的时候，请使用：

g++ -pthread -std=c++11 xxx.cpp

如果没有指定-pthread，那么程序可以编译链接通过，运行时报错：

terminate called after throwing an instance of ‘std::system_error’

what(): Enable multithreading to use std::thread: Operation not permitted

Aborted (core dumped)

2，程序计时我用的是 time ./a.out的方式。

alex970448359

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
【yebangyu博客】诡异的程序性能问题

转自：http://www.yebangyu.org/blog/2015/12/30/falsesharing/本文所使用的环境是Ubuntu 14.04 32bit系统，Intel I5处理器，X86体系结构提出问题如果我说下面的程序存在性能问题，您信吗？ 12345678910111213141516171819202122 #include<thread>int32_t global[2..
复制链接

扫一扫