Training Deeper Models by GPU Memory Optimization on TensorFlow
With the advent of big data, readily available GPGPUs, and progress in neural network modeling techniques, training deep learning models on GPUs has become a popular choice. However, due to the inherent complexity of deep learning models and the limited memory resources of modern GPUs, training deep models remains nontrivial, especially when the model size is too big for a single GPU. In this paper, we propose a general dataflow-graph-based GPU memory optimization strategy, "swap-out/in", which uses host memory as a larger memory pool to overcome the limitation of GPU memory. In addition, dedicated optimization strategies are proposed for the memory-consuming sequence-to-sequence (Seq2Seq) models. These strategies are integrated into TensorFlow seamlessly and without accuracy loss. In extensive experiments, significant reductions in memory usage are observed. The maximum training batch size can be increased by a factor of 2 to 30 for a fixed model and system configuration.
2020-01-10
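A minimal sketch of the swap-out/in idea from the abstract above, written by hand with explicit tf.device placements; the paper inserts such copies automatically by rewriting the dataflow graph, and the tensor names and sizes here are made up for illustration:

```python
import tensorflow as tf

# Hedged sketch (requires a GPU): keep a large intermediate activation in
# host memory between its producer and consumer, trading PCIe transfer
# time for GPU memory headroom.
x = tf.random.normal([4096, 4096])

with tf.device("/GPU:0"):
    h = tf.nn.relu(tf.matmul(x, x))   # producer runs on the GPU

with tf.device("/CPU:0"):
    h_host = tf.identity(h)           # "swap out": copy the activation to host RAM

with tf.device("/GPU:0"):
    y = tf.matmul(h_host, x)          # "swap in": TF copies it back on demand
```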
Distributed TensorFlow with MPI.pdf
Machine Learning and Data Mining (MLDM) algorithms are becoming increasingly important for analyzing the large volumes of data generated by simulations, experiments, and mobile devices. With increasing data volume, distributed-memory systems (such as tightly connected supercomputers or cloud computing systems) are becoming important for designing in-memory and massively parallel MLDM algorithms. Yet the majority of open-source MLDM software is limited to sequential execution, with only a few packages supporting multi-core/many-core execution. In this paper, we extend the recently proposed Google TensorFlow for execution on large-scale clusters using the Message Passing Interface (MPI). Our approach requires minimal changes to the TensorFlow runtime, making the proposed implementation generic and readily usable by the growing TensorFlow user base. We evaluate our implementation using an InfiniBand cluster and several well-known datasets, and the evaluation indicates the efficiency of the proposed implementation.
2020-01-09
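A hedged sketch of the data-parallel pattern such an MPI port enables: each rank computes gradients on its own data shard, and an MPI allreduce averages them so every rank applies the same update. Plain NumPy least squares stands in for a TensorFlow model; all names and data below are illustrative.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)          # each rank gets its own shard
X = rng.normal(size=(256, 10))
y = rng.normal(size=256)
w = np.zeros(10)

for step in range(100):
    grad = 2.0 * X.T @ (X @ w - y) / len(y)     # local gradient on this shard
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)       # sum gradients across all ranks
    w -= 0.01 * (avg / size)                    # apply the averaged gradient

if rank == 0:
    print("final loss:", np.mean((X @ w - y) ** 2))
```

Launched with, e.g., `mpirun -np 4 python train.py`, every rank ends the loop holding identical weights.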
CUDA优化2.pptx
CUDA memory optimization: minimize CPU-GPU data transfers. If transfers are not reduced, porting CPU code to the GPU may yield no speedup; batch small transfers into larger ones, and overlap memory transfers with computation.
2020-01-09
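A hedged sketch of the transfer/compute overlap mentioned above, expressed from Python with Numba CUDA (the chunk count, sizes, and toy kernel are made up): each chunk's copies and kernel run on their own stream, so transfers for one chunk can overlap with compute on another.

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(arr, factor):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] *= factor

n, chunks = 1 << 20, 4
host_in = cuda.pinned_array(n, dtype=np.float32)    # pinned memory enables async copies
host_in[:] = np.arange(n, dtype=np.float32)
host_out = cuda.pinned_array(n, dtype=np.float32)

streams = [cuda.stream() for _ in range(chunks)]
step = n // chunks
for k, s in enumerate(streams):
    lo, hi = k * step, (k + 1) * step
    d_chunk = cuda.to_device(host_in[lo:hi], stream=s)   # async H2D on stream s
    scale[(step + 255) // 256, 256, s](d_chunk, 2.0)     # kernel on the same stream
    d_chunk.copy_to_host(host_out[lo:hi], stream=s)      # async D2H on stream s

cuda.synchronize()
```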
虚拟与离散变量回归模型.pdf
The regression models studied in the previous five chapters all involve variables that take actual numerical values, generally continuous ones. In practice, one often encounters variables that take discrete values, and their regression models require special treatment. In economic analysis, the dependent variable is often not numerical at all, for example buy vs. sell, rise vs. fall, presence vs. absence, or profit vs. loss. Such cases can be handled by introducing a dummy variable and assigning it numerical codes, which gives the regression a distinctive character. This chapter studies this class of regression models.
2020-01-10
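A hedged sketch of the dummy-variable idea the chapter describes: a 0/1 indicator (e.g., 1 = "profit", 0 = "loss") enters an ordinary least-squares design matrix like any other regressor, and its coefficient measures the shift between the two groups. All data below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)                    # continuous regressor
d = rng.integers(0, 2, size=n)            # dummy regressor: group 0 or 1
y = 1.0 + 2.0 * x + 3.0 * d + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x, d])   # [intercept, x, dummy]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, slope, dummy shift:", beta)   # approx [1, 2, 3]
```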
Tensorflow XLA详解.pdf
XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that optimizes TensorFlow computations.
XLA uses JIT compilation to analyze the TensorFlow graph the user creates, specializes it at run time to the actual dimensions and types, fuses multiple operations together, and generates efficient native code for them, targeting devices such as CPUs and GPUs as well as custom accelerators (for example, Google's TPU).
2020-01-14
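A minimal example of requesting the JIT path described above in TensorFlow 2.x via tf.function(jit_compile=True); the toy function and shapes are made up:

```python
import tensorflow as tf

# With jit_compile=True, XLA can fuse the matmul, add, and relu below
# into fewer kernels instead of launching one per op.
@tf.function(jit_compile=True)
def fused_op(x, w, b):
    return tf.nn.relu(x @ w + b)

x = tf.random.normal([128, 256])
w = tf.random.normal([256, 64])
b = tf.zeros([64])
print(fused_op(x, w, b).shape)   # (128, 64)
```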