
Distributed Machine Learning Paper Notes
可姆可汗
USTC CS
Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning (Paper Notes)
Paper notes. Original · 2022-11-03 14:50:54 · 1079 views · 1 comment
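Multi-Resource Interleaving for Deep Learning Training (Paper Notes)
Users submit DL training jobs to the Muri scheduler. The scheduler keeps a job queue that buffers the submitted jobs and makes the scheduling decisions. Original · 2022-10-30 15:17:49 · 1275 views · 0 comments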
Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (Paper Notes)
Paper notes. Original · 2022-10-27 19:32:08 · 597 views · 1 comment
Deep Learning-Based Job Placement in Distributed Machine Learning Clusters With Heterogeneous Workloads (Paper Notes)
Deep Learning-Based Job Placement in Distributed Machine Learning Clusters With Heterogeneous Workloads. Original · 2022-10-22 17:28:20 · 653 views · 1 comment
Efficient Sparse Collective Communication and its application to Accelerate Distributed Deep Learnin
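The most widely used algorithm in distributed deep learning (DDL) is stochastic gradient descent (SGD). Distributed SGD proceeds in two steps: (1) each worker trains a local copy of the model, processing a different subset of the training data in parallel; (2) the results from all workers are combined into an averaged gradient, which is used to update the model parameters. In DDL, communication accounts for a large share of the cost and severely limits parallel training speed; it is a well-known performance bottleneck. Gradients of deep learning models are also becoming increasingly sparse: for two deep neural networks, DeepLight and LSTM, gradient sparsity exceeds 94%. Original · 2021-09-29 22:05:53 · 686 views · 0 comments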
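DL2: A Deep Learning-Driven Scheduler for Deep Learning Clusters (Paper Notes)
DL2 aims to find the best resource scheduling policy, one that minimizes the average completion time of all concurrent jobs. Original · 2022-10-17 16:16:20 · 1105 views · 1 comment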
Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters (Paper Notes)
Paper notes. Original · 2022-10-13 18:36:18 · 715 views · 1 comment
Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training (Paper Notes)
Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training paper notes. Original · 2022-09-23 19:37:57 · 397 views · 3 comments
TOWARDS SCALABLE DISTRIBUTED TRAINING OF DEEP LEARNING ON PUBLIC CLOUD CLUSTERS (Paper Notes)
TOWARDS SCALABLE DISTRIBUTED TRAINING OF DEEP LEARNING ON PUBLIC CLOUD CLUSTERS paper notes. Original · 2022-09-16 14:33:06 · 787 views · 0 comments
A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (Notes)
BytePS paper notes. Original · 2022-09-13 19:47:55 · 932 views · 0 comments
Communication-efficient Decentralized Machine Learning over Heterogeneous Networks (Notes)
Communication-efficient Decentralized Machine Learning over Heterogeneous Networks paper notes. Original · 2022-09-01 20:06:39 · 2123 views · 0 comments
Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems
Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems paper notes. Original · 2022-08-28 17:41:19 · 1003 views · 0 comments
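
The two-step distributed SGD loop summarized in the Efficient Sparse Collective Communication entry can be illustrated with a toy sketch. It is a single-process simulation on a one-parameter least-squares problem, not code from any of the papers listed here; the function names, the data, and the learning rate are all assumptions made for illustration. A real system would exchange gradients over the network, for example with all-reduce or a parameter server, which is where the communication cost arises.

```python
# Illustrative sketch only: a single-process simulation of synchronous
# distributed SGD. Every name and value here is hypothetical.

def local_gradient(w, shard):
    """Step 1: one worker's gradient of 0.5 * (w*x - y)^2, averaged over its shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def average(grads):
    """Step 2: combine the per-worker gradients into one averaged gradient."""
    return sum(grads) / len(grads)


def sparsity(vec):
    """Fraction of exactly-zero entries in a gradient vector."""
    return sum(1 for g in vec if g == 0.0) / len(vec)


def train(shards, w=0.0, lr=0.05, steps=200):
    """Run synchronous SGD: every step, all workers compute, then average, then update."""
    for _ in range(steps):
        grads = [local_gradient(w, shard) for shard in shards]  # step 1, per worker
        w -= lr * average(grads)                                # step 2, shared update
    return w


# Data generated from y = 3x, split across two simulated workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w_final = train(shards)
```

With equally sized shards, the averaged gradient equals the full-batch gradient, so `w_final` approaches the true slope of 3. The `sparsity` helper shows how a figure such as "over 94% sparse" would be measured on a gradient vector; sparse collective communication exploits this by sending only the non-zero entries.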