[Paper reading attempt] Understanding data storage and ingestion for large-scale deep recommendation model training


Not an easy read; I didn't fully understand it.

Abstract

Problem: merging domain-specific accelerators (DSAs) into datacenter-scale clusters increases the effectiveness and throughput of training on large datasets, but the data storage and ingestion (DSI) pipeline, i.e., the systems and hardware used to store and preprocess training data, limits training capability. DSI systems need innovation.

Solution: Meta's end-to-end DSI pipeline, composed of a central data warehouse on distributed storage together with a data preprocessing service (DPP), which eliminates data stalls.

The paper shows how hundreds of models are trained, how massive datasets are stored and read, and how online preprocessing places heavy demands on the hardware.

1.Introduction

DSAs are applied to training DNNs on large datasets; existing DSAs optimize the compute of model training, i.e., the matrix operations of backpropagation.

DSI consists of offline data generation, dataset storage, and online preprocessing services that store and deliver data to the trainers. How DSI design affects training performance has received little attention.

The goal is to understand the DSI requirements, unique workload characteristics, and systems for industry-scale deep learning recommendation model (DLRM) training.

DSI concerns: 1. it can limit throughput and lower DSA utilization; 2. it consumes large amounts of storage, preprocessing, training, and power resources; 3. growing model complexity and training DSAs keep increasing data storage and bandwidth demands.

The end-to-end DSI pipeline supports a wide range of ML model training at scale: training data is produced by extract-transform-load (ETL) jobs, petabytes of data are stored in a centralized data warehouse, and the data preprocessing service (DPP) handles the heavy online preprocessing demand.

The paper presents Meta's production-deployed DSI pipeline, which supports DNN training. It must store massive, dynamically changing datasets, and training requires online preprocessing that consumes massive compute, network, and memory resources.

Main contributions:

1. Introduces the DSI pipeline.

2. Provides an end-to-end description of the architecture of the production-deployed DSI pipeline, tailored to DLRM requirements.

3. Characterizes industry-scale DLRM training workloads, including coordinated training, data generation and storage, and online preprocessing.

4. Gives an outlook on future directions.

2.Recommendation model background

Recommendation model training uses both data parallelism and model parallelism. Each training job relies on the data storage and ingestion (DSI) pipeline to provide training data; the DSI pipeline is responsible for generating the training data and for storing and preprocessing the samples.

3.Meta's disaggregated data storage, ingestion and training pipeline

The most important part is the paper's pipeline overview figure. (Figure not reproduced here.)

A. Data generation and storage

The framework feeds input data to the prediction models; a requesting service logs the recommendation system's events and outputs while avoiding data leakage; subsequent streaming and batch extract-transform-load (ETL) jobs add new raw features.

1. Data generation:

Events are logged via Scribe.

2. Data storage

Training samples are stored in the data warehouse as partitioned Hive tables, because of Hive's compatibility with both internal systems and open-source engines, including Spark and Presto.

There are two types of features: dense and sparse.
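The dense/sparse split can be pictured with a minimal, hypothetical sample layout (field names here are illustrative, not from the paper): dense features are fixed-length float vectors fed to the MLP layers, while sparse features are variable-length lists of categorical IDs fed to embedding lookups.

```python
# Hypothetical sketch of the two DLRM feature types. Field names are invented.
def make_sample():
    # Dense features: fixed-length float vector (e.g. counts, ratios).
    dense = [0.5, 1.2, 0.0]
    # Sparse features: variable-length lists of categorical IDs,
    # later turned into embedding-table lookups.
    sparse = {
        "page_ids": [17, 42, 42],
        "ad_ids": [7],
    }
    return {"dense": dense, "sparse": sparse}

sample = make_sample()
```

Note that sparse lists vary in length per sample, which is one reason storage and preprocessing for these datasets is nontrivial.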

B. Online preprocessing

Overview and requirements

Raw bytes are extracted from storage and decoded into training samples; the samples are transformed, and new features are derived; after the features are preprocessed, they are batched together into tensors; the tensors are then loaded onto the trainers (GPUs).
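The extract → decode → transform → batch steps above can be sketched as a tiny, hypothetical pipeline (function names and the JSON-on-bytes encoding are illustrative assumptions, not Meta's actual API):

```python
# Minimal sketch of online preprocessing: decode raw bytes into samples,
# transform/derive features, then batch into tensor-like columns.
import json

def decode(raw: bytes) -> dict:
    # Extraction: raw warehouse bytes -> structured training sample.
    return json.loads(raw)

def transform(sample: dict) -> dict:
    # Feature transformation: derive a new feature from raw ones.
    sample["click_rate"] = sample["clicks"] / max(sample["views"], 1)
    return sample

def batch(samples: list) -> dict:
    # Batching: collate per-sample fields into per-field columns.
    return {k: [s[k] for s in samples] for k in samples[0]}

raw_rows = [b'{"clicks": 2, "views": 10}', b'{"clicks": 0, "views": 5}']
tensors = batch([transform(decode(r)) for r in raw_rows])
print(tensors["click_rate"])  # [0.2, 0.0]
```

In production these stages run at fleet scale and are exactly the work that DPP offloads from the trainer hosts.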

1. Scalable preprocessing with DPP

DPP provides online preprocessing for training jobs across the datacenter fleet.

DPP control plane: the DPP Master receives a session specification and enables scalable work distribution by splitting the work across DPP workers; for fault tolerance and auto-scaling, it monitors worker health and implements auto-scaling via a controller.

DPP data plane: DPP workers and clients handle the data-plane operations of DPP.

Trainers: consume the preprocessed tensor batches delivered through the DPP clients.
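The control-plane behavior described above (split work across workers, reassign on failure) can be sketched roughly as follows; all names and the round-robin/reassignment policies are illustrative assumptions, not the paper's actual implementation:

```python
# Toy sketch of a DPP-Master-like control plane: distribute dataset splits
# to workers, and reassign a failed worker's splits to the survivors.
from collections import deque

def assign_splits(num_splits, workers):
    # Round-robin distribution of work units across workers.
    pending = deque(range(num_splits))
    assignment = {w: [] for w in workers}
    while pending:
        for w in workers:
            if not pending:
                break
            assignment[w].append(pending.popleft())
    return assignment

def reassign_on_failure(assignment, failed, survivors):
    # Fault tolerance: move the failed worker's splits to surviving workers.
    orphaned = assignment.pop(failed)
    for i, split in enumerate(orphaned):
        assignment[survivors[i % len(survivors)]].append(split)
    return assignment

a = assign_splits(6, ["w0", "w1", "w2"])
a = reassign_on_failure(a, "w1", ["w0", "w2"])
```

The real master also auto-scales the worker pool via a controller; this sketch only shows the work-distribution and reassignment idea.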

4.Coordinated training at scale

A. Collaborative release process

The release process avoids conflicts between model versions and conserves the limited training capacity.

Training follows a three-stage flow: 1. ideas are proposed and explored with hundreds of small exploratory training jobs; 2. the most promising ideas are trained further; 3. release candidates (RCs) continue training and are evaluated on fresh data, and the most accurate model is deployed.

Models that train badly are killed and retrained.

B.Global training demand

Models are trained on a global fleet of training infrastructure spanning regions worldwide.

Each model reads a different dataset, and cross-region bandwidth is highly constrained. System and datacenter architects address this with the scheduler and bin-packing: the scheduler balances each model's training jobs across regions by requiring each region to contain a copy of all models' datasets, and bin-packing fits the jobs onto the capacity within each region.
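The bin-packing idea can be illustrated with a classic first-fit-decreasing heuristic; this is a generic sketch (job names and capacities invented), not the paper's actual scheduler, which is far more involved:

```python
# First-fit decreasing: sort jobs by resource demand, place each into the
# first host with enough remaining capacity, opening a new host if needed.
def first_fit_decreasing(job_sizes, host_capacity):
    hosts = []       # remaining free capacity per opened host
    placement = {}   # job -> host index
    for job, size in sorted(job_sizes.items(), key=lambda kv: -kv[1]):
        for i, free in enumerate(hosts):
            if size <= free:
                hosts[i] -= size
                placement[job] = i
                break
        else:
            hosts.append(host_capacity - size)  # open a new host
            placement[job] = len(hosts) - 1
    return placement, len(hosts)

placement, n_hosts = first_fit_decreasing(
    {"jobA": 6, "jobB": 5, "jobC": 4, "jobD": 3}, host_capacity=10)
```

Packing jobs tightly like this frees whole hosts, which matters when the same fleet must also serve DSI storage and preprocessing demand.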

C. Feature engineering

Features change rapidly in production datasets, with hundreds of new features added and deprecated each month.

Efficient data storage is therefore required.

D. Summary of key takeaways

Key takeaways: periodically retraining models on new ideas generates combo jobs, which cause peaks in training and DSI load; datacenters are planned globally, with hundreds of models being trained and scheduled across them; training jobs differ in architecture and dataset.
