Training becomes slow? Mysterious training slowdowns? Using tcmalloc with TensorFlow!

--------------------Preface------------------------

While training a video classification model, I found that TensorFlow had mysteriously slowed to 2~5 sec/batch when it had always been 0.4 sec/batch. This reminded me of a similar episode with mxnet classification training some time ago (on the same training server), so I started to investigate:

(1) Killed some zombie processes and extra parallel processes, e.g. im2rec; no effect, and CPU utilization was not high anyway, which ruled out CPU performance;

(2) Killed some system processes (kworker, falcon-agent, gpu-monitor, etc.), ruling out interference from system processes;

(3) iostat -x showed IO was not high. I also benchmarked the image data loader: its performance varied, but it was not the main bottleneck. Python cProfile hotspot analysis put the bottleneck inside session.run(); htop/top showed modest memory usage but a very large cache;

(4) top showed CPU utilization well below what my configured thread count should produce; I suspected multi-thread contention, but could not pin down the exact cause;

(5) Searched GitHub for solutions; tcmalloc worked (the simplest fix used to be a reboot, which I did not resort to this time ^-^). Summary below:

1. ----------------Overloading new and delete---------------------

First, let's clarify a C++ concept: operator overloading. Intuitively, people think of the overloadable operators as ++, --, +, -, *, /, (), [] and so on, but two of the operators are special: operator new and operator delete. Both can be overloaded, and the main reason to overload them is better memory management (memory pools).

Second, the concept and use of a memory pool. The operating system's native memory management often falls short, especially in large client programs. Big programs call new (malloc) and delete (free) constantly, which costs time and produces memory fragmentation (google the term if it is unfamiliar); moreover, in a complex program, memory leaks are hard to locate. Memory pools arose to address this, and a memory pool typically overloads the new and delete operators. Anyone familiar with MFC development on Windows will recognize this macro:

        #ifdef _DEBUG
        # define DEBUG_NEW new(THIS_FILE, __LINE__)
        #else
        # define DEBUG_NEW new
        #endif

MFC overloads new so that the file path and line of each allocation are recorded, making memory leaks easy to track down. Another example:

#include <stdio.h>
#include <stdlib.h>

// Overload the global operator new
void *operator new(size_t size)
{
	printf("operator new: allocate %zu bytes\n", size);
	void *ptr = malloc(size); // a custom memory pool could hand out memory here
	return ptr;
}

// Overload the global operator delete
void operator delete(void *ptr)
{
	puts("operator delete");
	free(ptr);
}

int main()
{
	int *p = new int(0); // uses our overloaded operator new instead of the system one
	delete p;
	return 0;
}
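To make the memory-pool idea concrete, here is a minimal sketch in the same spirit (all names here, Particle, g_pool and so on, are made up for illustration, and the pool is deliberately not thread-safe): a class-specific operator new/delete that hands out fixed-size blocks from a preallocated buffer threaded into a free list, falling back to malloc when the pool runs dry:

#include <cstdio>
#include <cstdlib>
#include <cstddef>

struct Particle {
	double x, y, z;
	static void *operator new(std::size_t size);
	static void operator delete(void *ptr);
};

namespace {
const std::size_t kBlocks = 1024;
const std::size_t kBlockSize = sizeof(Particle);

// One static buffer carved into fixed-size blocks; free blocks are threaded
// into a singly linked list so allocation and release are pointer pops/pushes.
alignas(Particle) char g_pool[kBlocks * kBlockSize];
void *g_freeList = nullptr;
bool g_initialized = false;

void initPool() {
	for (std::size_t i = 0; i + 1 < kBlocks; ++i) // chain each block to the next
		*reinterpret_cast<void **>(g_pool + i * kBlockSize) = g_pool + (i + 1) * kBlockSize;
	*reinterpret_cast<void **>(g_pool + (kBlocks - 1) * kBlockSize) = nullptr;
	g_freeList = g_pool;
	g_initialized = true;
}
} // namespace

void *Particle::operator new(std::size_t size) {
	if (!g_initialized) initPool();
	if (size != sizeof(Particle) || g_freeList == nullptr)
		return std::malloc(size);                   // fallback: pool exhausted
	void *block = g_freeList;
	g_freeList = *static_cast<void **>(g_freeList); // pop the free list
	return block;
}

void Particle::operator delete(void *ptr) {
	char *p = static_cast<char *>(ptr);
	if (p < g_pool || p >= g_pool + sizeof(g_pool)) { // not ours: came from the fallback
		std::free(ptr);
		return;
	}
	*static_cast<void **>(ptr) = g_freeList;        // push back onto the free list
	g_freeList = ptr;
}

int main() {
	Particle *p = new Particle; // served from the pool, no malloc call
	std::printf("pool block at %p\n", static_cast<void *>(p));
	delete p;                   // returned to the free list, no free call
	return 0;
}

A real pool would add locking (or per-thread caches, as tcmalloc does) and handle variable sizes; this sketch only shows why pooled allocation avoids the malloc/free round trip.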

The tcmalloc and jemalloc discussed today work on the same principle: if we choose tcmalloc, our program adopts tcmalloc's memory management machinery.

2. ---------------------What is tcmalloc?-----------------

tcmalloc is a memory allocator that manages heap memory; it mainly affects malloc and free, reducing the performance cost of frequent allocation and deallocation while effectively controlling memory fragmentation. The allocator in glibc is ptmalloc2, and tcmalloc is faster: a malloc/free pair takes ptmalloc2 about 300 ns, but only about 50 ns with tcmalloc. tcmalloc also stores small objects more compactly, needing less space. It is specially optimized for multithreading: small-object allocation has essentially no lock contention, while large objects use fine-grained, efficient spinlocks. A thread's local cache that stays idle for a long time is reclaimed for other threads to use, which improves memory utilization under multithreading without wasting memory, something ptmalloc2 cannot do.
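The 300 ns vs. 50 ns figures come from Google's own measurements; if you want a ballpark number for your machine, a rough micro-benchmark sketch like the one below can help. Build it, run it once as-is (glibc ptmalloc2) and once under LD_PRELOAD=/usr/lib/libtcmalloc.so, and compare:

#include <chrono>
#include <cstdio>
#include <cstdlib>

// Rough cost of a small malloc/free pair. Run twice: once as-is (glibc
// ptmalloc2) and once under LD_PRELOAD=/usr/lib/libtcmalloc.so, then compare.
int main()
{
	const int kIters = 1000000;
	const std::size_t kSize = 32; // a typical "small object"

	volatile void *sink = nullptr; // keeps the loop from being optimized away
	auto start = std::chrono::steady_clock::now();
	for (int i = 0; i < kIters; ++i) {
		void *p = std::malloc(kSize);
		sink = p;
		std::free(p);
	}
	auto end = std::chrono::steady_clock::now();
	(void)sink;

	double ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
	std::printf("avg malloc/free pair: %.1f ns\n", ns / kIters);
	return 0;
}

Numbers vary with CPU, allocator version, and object size, so treat the output as a relative comparison, not an absolute benchmark.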

The following excerpt from the gperftools documentation captures what makes tcmalloc special:

TCMalloc : Thread-Caching Malloc

Motivation

TCMalloc is faster than the glibc 2.3 malloc (available as a separate library called ptmalloc2) and other mallocs that I have tested. ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair. Speed is important for a malloc implementation because if malloc is not fast enough, application writers are inclined to write their own custom free lists on top of malloc. This can lead to extra complexity, and more memory usage unless the application writer is very careful to appropriately size the free lists and scavenge idle objects out of the free list

TCMalloc also reduces lock contention for multi-threaded programs. For small objects, there is virtually zero contention. For large objects, TCMalloc tries to use fine grained and efficient spinlocks. ptmalloc2 also reduces lock contention by using per-thread arenas but there is a big problem with ptmalloc2's use of per-thread arenas. In ptmalloc2 memory can never move from one arena to another. This can lead to huge amounts of wasted space. For example, in one Google application, the first phase would allocate approximately 300MB of memory for its data structures. When the first phase finished, a second phase would be started in the same address space. If this second phase was assigned a different arena than the one used by the first phase, this phase would not reuse any of the memory left after the first phase and would add another 300MB to the address space. Similar memory blowup problems were also noticed in other applications.

Another benefit of TCMalloc is space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space approximately 8N * 1.01 bytes. I.e., a one-percent space overhead. ptmalloc2 uses a four-byte header for each object and (I think) rounds up the size to a multiple of 8 bytes and ends up using 16N bytes.



For more detail on tcmalloc, see the Google perftools documentation:

http://goog-perftools.sourceforge.net/doc/tcmalloc.html

3. ---------------------Training becomes slow----------------

At some point, whether you are training with mxnet or with TensorFlow, you may be puzzled to observe phenomena like the following:

training speed randomly slows down,

or each batch periodically becomes slow,

or training had always been very fast, but that is gone today, even though I have not changed any training code~~

or GPU usage is low today (first rule out an IO bottleneck).

If you hit the situations above, it may well be a cache problem, caused precisely by using the system's default memory allocation machinery:

There may be some memory pressure caused by virtual address space fragmentation and high system buffer cache churn (reading large training datasets from the file system).

So, things to try:

(1) htop: an effective tool for inspecting the cache:

htop reveals the cache taking up too much memory; the yellow bars of the Mem meter represent cache usage.

[Figure: htop screenshot with the Mem meter annotated]

(2)Clear RAM Memory Cache

Linux provides a way to flush or clear the RAM cache.

How to clear the cache in Linux?

Every Linux system has three options to clear the cache without interrupting any processes or services:

    a. Clear PageCache only:

    # sync; echo 1 > /proc/sys/vm/drop_caches

    b. Clear dentries and inodes:

    # sync; echo 2 > /proc/sys/vm/drop_caches

    c. Clear PageCache, dentries and inodes:

    # sync; echo 3 > /proc/sys/vm/drop_caches

    sync will flush the file system buffers; writing to drop_caches cleans the cache without killing any application or service.

    If you have to clear the disk cache, the first command is safest in enterprise and production, as “echo 1 > …” clears the PageCache only. It is not recommended to use the third option (“echo 3 > …”) in production until you know what you are doing, as it will clear the PageCache, dentries, and inodes.

For a more detailed explanation see:

https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/
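If you prefer to drop the cache from inside a program (say, between training epochs) rather than from a shell, a minimal C++ equivalent of the first recipe might look like this (it must run as root, and it writes 1, the safest option):

#include <cstdio>
#include <unistd.h>

// Programmatic equivalent of `sync; echo 1 > /proc/sys/vm/drop_caches`.
// Must run as root; option 1 drops only the PageCache, the safest choice.
int main()
{
	sync(); // flush dirty file system buffers first, like the shell recipe
	std::FILE *f = std::fopen("/proc/sys/vm/drop_caches", "w");
	if (f == nullptr) {
		std::perror("open /proc/sys/vm/drop_caches (are you root?)");
		return 1;
	}
	std::fputs("1\n", f); // 1 = PageCache; 2 = dentries/inodes; 3 = both
	std::fclose(f);
	return 0;
}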

(3) Use tcmalloc when training:

TCMalloc seems to improve training speed and avoids the occasional slowdowns seen with the default allocator. You can enable it by installing it and setting LD_PRELOAD=/usr/lib/libtcmalloc.so:

LD_PRELOAD=/usr/lib/libtcmalloc.so python train.py

With this in place, training buffers are allocated from tcmalloc's memory pools, improving efficiency, speeding up training, and restoring normal training speed.
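One quick way to confirm the preload actually took effect is to look for tcmalloc in the process's memory map (for the running training process you can equally run grep tcmalloc /proc/<pid>/maps). A minimal self-check sketch:

#include <cstdio>
#include <cstring>

// Self-check that LD_PRELOAD took effect: scan this process's memory map
// for a loaded tcmalloc shared object.
int main()
{
	std::FILE *maps = std::fopen("/proc/self/maps", "r");
	if (maps == nullptr) {
		std::perror("open /proc/self/maps");
		return 1;
	}
	char line[512];
	bool found = false;
	while (std::fgets(line, sizeof(line), maps) != nullptr) {
		if (std::strstr(line, "tcmalloc") != nullptr) {
			found = true;
			break;
		}
	}
	std::fclose(maps);
	std::puts(found ? "tcmalloc is loaded" : "tcmalloc is NOT loaded");
	return found ? 0 : 1;
}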

Note: when deploying C++ programs, if you see frequent multi-threaded allocation and deallocation, or frequent operations on large buffers, it is also worth watching the cache and trying tcmalloc.

Also, tcmalloc cannot be used under JNI and similar setups, because JNI may load the system runtime first to allocate memory and then load tcmalloc to free it, causing a memory conflict (always free memory in the same library that allocated it).
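To illustrate that closing rule, a common defensive pattern is for a shared library to export a matching create/destroy pair, so the caller never frees the library's memory with its own allocator. A minimal sketch (mylib_create_buffer / mylib_destroy_buffer are illustrative names, not a real API):

#include <cstdio>
#include <cstdlib>

// Illustrative create/destroy pair as a library would export it: the caller
// never frees Buffer itself, so allocation and deallocation always happen
// inside the same library, and therefore inside the same allocator.
struct Buffer {
	unsigned long size;
	unsigned char *data;
};

extern "C" Buffer *mylib_create_buffer(unsigned long bytes)
{
	Buffer *buf = new Buffer; // this library's allocator (tcmalloc if preloaded here)
	buf->size = bytes;
	buf->data = static_cast<unsigned char *>(std::malloc(bytes));
	return buf;
}

extern "C" void mylib_destroy_buffer(Buffer *buf)
{
	if (buf == nullptr) return;
	std::free(buf->data); // freed by the same allocator that malloc'd it
	delete buf;
}

int main() // caller side: only ever uses the exported pair
{
	Buffer *buf = mylib_create_buffer(1024);
	std::printf("buffer of %lu bytes\n", buf->size);
	mylib_destroy_buffer(buf); // never `delete buf` here
	return 0;
}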
