
Training Deeper Models by GPU Memory Optimization on TensorFlow

With the advent of big data, readily available GPGPUs, and progress in neural network modeling techniques, training deep learning models on GPUs has become a popular choice. However, due to the inherent complexity of deep learning models and the limited memory resources of modern GPUs, training deep models is still a nontrivial task, especially when the model size is too big for a single GPU. In this paper, we propose a general dataflow-graph based GPU memory optimization strategy, i.e., "swap-out/in", which utilizes host memory as a bigger memory pool to overcome the limitation of GPU memory. Meanwhile, dedicated optimization strategies are also proposed for the memory-consuming sequence-to-sequence (Seq2Seq) models. These strategies are integrated into TensorFlow seamlessly without accuracy loss. In extensive experiments, significant reductions in memory usage are observed. The maximum training batch size can be increased by 2 to 30 times given a fixed model and system configuration.

2020-01-10
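The paper above implements "swap-out/in" inside TensorFlow's dataflow graph. The snippet below is only a conceptual sketch of the same idea at the user level: it uses explicit tf.device placement to park a large intermediate tensor in host memory and fetch it back when reused. The swap_out/swap_in helpers are hypothetical and assume TensorFlow 2.x with one visible GPU; this is not the paper's in-runtime implementation.

```python
# Conceptual sketch only: the paper modifies the TensorFlow runtime itself,
# whereas this user-level version just copies a tensor to host memory and back.
import tensorflow as tf

def swap_out(t):
    # Hypothetical helper: copy a tensor to host (CPU) memory.
    with tf.device("/CPU:0"):
        return tf.identity(t)

def swap_in(t):
    # Hypothetical helper: copy a host-resident tensor back onto the GPU.
    with tf.device("/GPU:0"):
        return tf.identity(t)

@tf.function
def forward(x, w1, w2):
    h = tf.nn.relu(tf.matmul(x, w1))  # large intermediate activation
    h_host = swap_out(h)              # swap-out: h need not stay GPU-resident
    z = tf.matmul(x, w2)              # other GPU work proceeds meanwhile
    return z + swap_in(h_host)        # swap-in: bring h back only when reused
```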

Distributed TensorFlow with MPI.pdf

Machine Learning and Data Mining (MLDM) algorithms are becoming increasingly important in analyzing large volumes of data generated by simulations, experiments and mobile devices. With increasing data volume, distributed-memory systems (such as tightly connected supercomputers or cloud computing systems) are becoming important in designing in-memory and massively parallel MLDM algorithms. Yet the majority of open source MLDM software is limited to sequential execution, with only a few packages supporting multi-core/many-core execution. In this paper, we extend the recently proposed Google TensorFlow for execution on large-scale clusters using the Message Passing Interface (MPI). Our approach requires minimal changes to the TensorFlow runtime, making the proposed implementation generic and readily usable by the increasingly large TensorFlow user base. We evaluate our implementation using an InfiniBand cluster and several well-known datasets. Our evaluation indicates the efficiency of the proposed implementation.

2020-01-09
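Below is a minimal data-parallel sketch of the general MPI approach using mpi4py: each rank computes gradients on its own shard and the gradients are averaged with allreduce before the update. This is not the paper's modified TensorFlow runtime; the model, data, and script name (train_mpi.py) are assumptions for illustration.

```python
# Minimal MPI data parallelism (run: mpirun -np 4 python train_mpi.py).
# A real setup would also broadcast rank 0's initial weights so all ranks
# start from the same model parameters.
from mpi4py import MPI
import numpy as np
import tensorflow as tf

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
model.build(input_shape=(None, 20))
opt = tf.keras.optimizers.SGD(0.1)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x))
    grads = tape.gradient(loss, model.trainable_variables)
    # Sum each gradient over all ranks, then divide by the world size.
    avg = [tf.convert_to_tensor(comm.allreduce(g.numpy(), op=MPI.SUM) / size)
           for g in grads]
    opt.apply_gradients(zip(avg, model.trainable_variables))
    return float(loss)

x = np.random.rand(32, 20).astype("float32")   # this rank's data shard
y = np.random.randint(0, 10, size=32)
print(f"rank {rank}/{size}: loss {train_step(x, y):.4f}")
```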

CUDA优化2.pptx

CUDA memory optimization: minimize CPU-GPU data transfers. If data transfers are not reduced, porting CPU code to the GPU may bring no performance gain. Batch small transfers into larger ones, and overlap memory transfers with computation.

2020-01-09
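As a rough illustration of the two ideas above (batching transfers and overlapping copies with compute), the sketch below uses CuPy streams from Python; it assumes CuPy with a CUDA-capable GPU and is not taken from the slides. For genuine copy/compute overlap the host buffers would additionally need to be pinned.

```python
# Double-buffering across two CUDA streams with CuPy: while one stream's kernel
# runs, the other stream can enqueue the next chunk's host-to-device copy.
# (For true overlap the host buffers should be pinned, e.g. via
# cp.cuda.alloc_pinned_memory; omitted here for brevity.)
import numpy as np
import cupy as cp

chunks = [np.random.rand(1 << 20).astype(np.float32) for _ in range(8)]
streams = [cp.cuda.Stream(non_blocking=True) for _ in range(2)]
results = []

for i, host_chunk in enumerate(chunks):
    s = streams[i % 2]                      # alternate streams (double buffering)
    with s:
        dev = cp.asarray(host_chunk)        # H2D copy enqueued on this stream
        results.append(cp.sqrt(dev) * 2.0)  # kernel on the same stream

for s in streams:
    s.synchronize()                         # wait for all queued work to finish
print(float(results[-1][:4].sum()))
```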

虚拟与离散变量回归模型.pdf

The regression models studied in the previous five chapters all involve variables that take real, generally continuous, numerical values. In practice, however, variables often take discrete values, and their regression models require special treatment. Economic analysis also frequently involves dependent variables that are not numerical at all, such as buy versus sell, rise versus fall, presence versus absence, or profit versus loss. Such cases can be handled by introducing a dummy variable and assigning it numerical codes, which gives the resulting regression a distinctive character of its own. This chapter studies this class of regression models.

2020-01-10
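As a small numerical illustration of the dummy-variable idea (with made-up data, not an example from the book), a qualitative factor such as profit versus loss can be coded 0/1 and included as a regressor in ordinary least squares:

```python
# Ordinary least squares with one continuous regressor and one 0/1 dummy.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)                      # continuous regressor
d = (rng.random(n) < 0.5).astype(float)     # dummy: 1 = profit, 0 = loss
y = 1.0 + 2.0 * x + 3.0 * d + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x, d])     # intercept, continuous, dummy
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", beta)      # should be close to [1, 2, 3]
```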

Tensorflow XLA详解.pdf

XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that optimizes TensorFlow computations. XLA uses JIT compilation to analyze the TensorFlow graph the user creates at runtime, specializes it to the actual runtime shapes and types, fuses multiple operations together, and generates efficient native code for them, targeting devices such as CPUs and GPUs as well as custom accelerators (for example, Google's TPU).

2020-01-14
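A minimal way to see XLA's JIT from user code, assuming a recent TensorFlow 2.x release that exposes the jit_compile flag, is sketched below; it simply marks a function for XLA compilation so the matmul, bias add and tanh can be fused.

```python
# Mark a function for XLA JIT compilation (TensorFlow 2.x, jit_compile flag).
import tensorflow as tf

@tf.function(jit_compile=True)
def fused_op(x, w, b):
    # XLA can fuse matmul + bias add + tanh and specialize to these shapes/dtypes.
    return tf.tanh(tf.matmul(x, w) + b)

x = tf.random.normal([64, 128])
w = tf.random.normal([128, 256])
b = tf.zeros([256])
print(fused_op(x, w, b).shape)  # (64, 256)
```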
