In-Depth Understanding: The Relationship Between the Chi-squared Distribution and the Gamma Distribution
The gamma distribution is a general-purpose distribution, widely used to describe sums of random variables; the chi-squared distribution is a special case of the gamma distribution, specialized to sums of squares of normal variables.
2024-11-30 11:53:38 · 888 reads
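In symbols, the identity behind this post's claim (a standard result, stated here for reference):

```latex
% Gamma density with shape \alpha and rate \beta:
f(x;\alpha,\beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}, \qquad x > 0
% Setting \alpha = k/2 and \beta = 1/2 recovers the chi-squared density,
% so the chi-squared with k degrees of freedom is the special case:
\chi^2(k) = \mathrm{Gamma}\!\left(\tfrac{k}{2},\ \tfrac{1}{2}\right)
```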
The Relationship Between the Wishart Distribution and the Gamma Distribution, and Their Use Cases
The gamma distribution describes the distribution of a one-dimensional random variable, while the Wishart distribution describes the distribution of a covariance matrix. Their connection can be seen as a generalization from scalars to matrices.
2024-11-30 11:44:51 · 760 reads
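The scalar-to-matrix generalization in one line (standard notation; in one dimension the Wishart collapses back to a gamma):

```latex
% X_1,\dots,X_n iid N_p(0, \Sigma); the scatter matrix is Wishart:
S = \sum_{i=1}^{n} X_i X_i^{\top} \sim \mathcal{W}_p(\Sigma, n)
% For p = 1 with \Sigma = \sigma^2 this is a scaled chi-squared,
% i.e. a gamma distribution:
\mathcal{W}_1(\sigma^2, n) = \sigma^2\,\chi^2(n) = \mathrm{Gamma}\!\left(\tfrac{n}{2},\ \tfrac{1}{2\sigma^2}\right)
```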
The English for 架构师: Architect
This role typically requires deep technical expertise and the ability to think systematically, focusing on designing and optimizing software systems at a high level.
2024-11-30 10:37:45 · 598 reads
Learning American Idioms with Li Xiaolai (Most Common American Idioms): Part 43
Most Common American Idioms: Part 43
2024-11-30 10:32:59 · 834 reads
Learning American Idioms with Li Xiaolai (Most Common American Idioms): Part 42
Most Common American Idioms: Part 42
2024-11-30 09:44:27 · 502 reads
GPU Memory Math for LLM Inference: Why Activation Memory Is Negligible (Bilingual)
Inference in large language models (LLMs), such as LLaMA-2 7B, involves two primary components of GPU memory consumption: model parameter memory and activation memory.
2024-11-30 09:40:31 · 532 reads
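A back-of-the-envelope version of that comparison (a minimal sketch; the batch size and sequence length are illustrative assumptions, not measurements from the post):

```python
# Rough FP16 memory estimate for LLaMA-2 7B inference (illustrative numbers).
params = 7e9                      # 7B parameters
bytes_per_value = 2               # float16/bfloat16
param_mem_gb = params * bytes_per_value / 1024**3

# Activations for one forward pass scale with batch * seq_len * hidden_size
# per layer, not with the parameter count.
batch, seq_len, hidden, layers = 1, 2048, 4096, 32
act_mem_gb = batch * seq_len * hidden * layers * bytes_per_value / 1024**3

print(f"parameters : {param_mem_gb:6.1f} GB")   # ~13 GB
print(f"activations: {act_mem_gb:6.2f} GB")     # ~0.5 GB at batch size 1
```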
Differences Among the SGD, RMSProp, and Adam Optimizers, with a Training-Memory Analysis: LLaMA-2 7B as an Example (Bilingual)
A Detailed Analysis of SGD, RMSProp, and Adam Optimizers, and Their Memory Consumption
2024-11-30 09:06:31 · 622 reads
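The core of the comparison is how much per-parameter state each optimizer carries. A minimal sketch of the arithmetic (FP32 states assumed; the state counts follow the standard formulations of each optimizer):

```python
# Optimizer-state memory for a 7B-parameter model, FP32 states (4 bytes each).
params, bytes_per_state = 7e9, 4

states = {
    "SGD (no momentum)": 0,   # only gradients, no extra state
    "SGD + momentum":    1,   # one momentum buffer
    "RMSProp":           1,   # running average of squared gradients
    "Adam":              2,   # first moment m and second moment v
}
for name, n in states.items():
    gb = params * bytes_per_state * n / 1024**3
    print(f"{name:18s}: {gb:6.1f} GB of optimizer state")
```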
Differences Between the Adam and RMSProp Optimizers and Their Training Memory Consumption (LLaMA-2 7B as an Example): Bilingual
Both aim to improve training efficiency by adapting learning rates for each parameter, but they do so in different ways.
2024-11-30 09:05:38 · 809 reads
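For reference, the two update rules side by side (standard formulations; ε is the usual numerical-stability constant):

```latex
% RMSProp: adapt the step size with a running average of squared gradients
v_t = \rho v_{t-1} + (1-\rho) g_t^2, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t
% Adam: additionally keep a first-moment estimate and bias-correct both
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
\theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\qquad \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}
```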
Adam and AdamW Optimizers Explained, with a Training-Memory Analysis: LLaMA-2 7B as an Example (Bilingual)
Detailed Analysis of Adam and AdamW Optimizers and Their Memory Consumption with float32 and bfloat16 Precision
2024-11-30 09:03:46 · 968 reads
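The key difference is where weight decay enters. A minimal sketch of one update step for each (plain Python, standard formulations; the lr and wd values are illustrative):

```python
# One parameter update: Adam (L2 folded into the gradient) vs AdamW (decoupled).
def adam_step(theta, g, m, v, lr=1e-3, wd=0.01, b1=0.9, b2=0.999, eps=1e-8, t=1):
    g = g + wd * theta                      # Adam: decay enters the gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return theta - lr * m_hat / (v_hat**0.5 + eps), m, v

def adamw_step(theta, g, m, v, lr=1e-3, wd=0.01, b1=0.9, b2=0.999, eps=1e-8, t=1):
    m = b1 * m + (1 - b1) * g               # AdamW: moments see the raw gradient
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    theta = theta - lr * wd * theta         # decay applied directly to weights
    return theta - lr * m_hat / (v_hat**0.5 + eps), m, v
```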
A Close Look at the Adam Optimizer's Memory Requirements: LLaMA-2 7B as an Example (Bilingual)
Understanding the Additional Memory Requirements of Adam Optimizer: Memory Consumption Breakdown for Model Parameters and Optimizer States
2024-11-30 09:02:14 · 1212 reads
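The headline numbers of such a breakdown, as a sketch (pure FP32 training assumed; mixed precision changes the totals):

```python
# FP32 training-memory breakdown for LLaMA-2 7B with Adam (activations excluded).
params, gb = 7e9, 1024**3
components = {
    "weights":            4,  # bytes per parameter
    "gradients":          4,
    "Adam first moment":  4,
    "Adam second moment": 4,
}
for name, b in components.items():
    print(f"{name:18s}: {params * b / gb:5.1f} GB")
total = sum(params * b / gb for b in components.values())
print(f"{'total':18s}: {total:5.1f} GB")   # ~104 GB before activations
```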
What Is the Difference Between bfloat16 (BF16) and float16 (FP16)? A Bilingual Explanation
BF16 offers a larger numerical range and is specifically optimized for deep learning tasks that require handling large gradients and weights.
2024-11-29 16:46:17 · 959 reads
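The range difference is easy to see from the format metadata. A minimal sketch using PyTorch's torch.finfo (any recent PyTorch build should reproduce these values):

```python
import torch

# Compare dynamic range and precision of the two 16-bit formats.
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  tiny={info.tiny:.3e}  eps={info.eps:.3e}")

# float16 : max ~6.55e+04 (overflows easily), finer precision (eps ~9.8e-04)
# bfloat16: max ~3.39e+38 (same exponent range as float32), coarser eps ~7.8e-03
```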
Data Parallelism, Model Parallelism, and Tensor Parallelism: Parallel Computing Strategies in Deep Learning (Bilingual)
Data Parallelism, Model Parallelism, and Tensor Parallelism: Parallel Computing Strategies in Deep Learning
2024-11-29 15:33:59 · 902 reads
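To make the tensor-parallel idea concrete, a toy NumPy sketch: one linear layer's weight matrix is split column-wise across two hypothetical devices, and the partial outputs are concatenated.

```python
import numpy as np

# Toy tensor parallelism: split W column-wise across two "devices".
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))       # batch of activations
W = rng.standard_normal((16, 32))      # full weight matrix

W0, W1 = W[:, :16], W[:, 16:]          # each device holds half the columns
y0, y1 = x @ W0, x @ W1                # independent partial matmuls
y = np.concatenate([y0, y1], axis=1)   # gather step

assert np.allclose(y, x @ W)           # identical to the unsharded layer
```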
DeepSpeed's hybrid_engine Parameter Explained: Bilingual
By enabling and configuring the hybrid computation engine, DeepSpeed can intelligently manage memory and computation across multiple devices, improving efficiency and reducing training time.
2024-11-29 15:12:17 · 969 reads
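Enabling it is a config-file switch. A minimal sketch (only the `enabled` key is shown, following the DeepSpeed-Chat examples; consult the DeepSpeed documentation for the full set of hybrid_engine sub-options before relying on any of them):

```python
# Minimal DeepSpeed config sketch with the hybrid engine turned on.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "hybrid_engine": {
        "enabled": True,   # switch between training and inference modes
    },
}
```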
A Deep Dive into DeepSpeed's nebula_config Parameter: A Bilingual Introduction
This parameter allows users to manage and optimize the storage and version control of training states, facilitating efficient data storage and recovery during model training.
2024-11-29 15:03:08 · 1123 reads
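A sketch of what such a block might look like (the field names are assumptions based on DeepSpeed's Nebula checkpointing documentation; verify them against your DeepSpeed version before use):

```python
# Hypothetical sketch of a Nebula checkpointing block in a DeepSpeed config.
ds_config = {
    "nebula": {
        "enabled": True,
        "persistent_storage_path": "/data/nebula_ckpt",  # illustrative path
        "persistent_time_interval": 100,                 # illustrative value
        "num_of_version_in_retention": 2,                # keep 2 versions
    },
}
```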
An Introduction to Curriculum Learning and Its Use in the DeepSpeed Framework: Bilingual
Curriculum learning aims to improve the learning efficiency of models by gradually increasing the difficulty of the training tasks.
2024-11-29 14:44:19 · 1053 reads
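The idea itself is framework-agnostic. A minimal sketch in plain Python (using sequence length as the difficulty measure is an illustrative choice):

```python
# Toy curriculum: train on short (easy) samples first, then longer ones.
samples = ["a b", "a b c d e f", "a b c", "a b c d"]  # pretend dataset

# Difficulty proxy: token count. Sort once, then widen the pool each epoch.
by_difficulty = sorted(samples, key=lambda s: len(s.split()))

for epoch in range(1, len(by_difficulty) + 1):
    pool = by_difficulty[:epoch]           # gradually admit harder samples
    print(f"epoch {epoch}: training on {pool}")
```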
Decoding a DeepSpeed Configuration: A Detailed Log Walkthrough
These configuration items cover memory optimization, autotuning, mixed precision, and distributed training, as well as other aspects of model training, including compression, gradient handling, optimizer configuration, data efficiency, and pipeline parallelism.
2024-11-29 14:17:28 · 813 reads
How to Enable Gradient Checkpointing in DeepSpeed: A Bilingual Introduction
Gradient checkpointing in DeepSpeed is a technique designed to reduce memory usage when training large models by storing only a subset of intermediate activations during the forward pass.
2024-11-29 13:48:50 · 804 reads
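In a config file this lives under the activation_checkpointing block. A sketch (the key names follow DeepSpeed's documented activation-checkpointing options; the values are illustrative):

```python
# Sketch of DeepSpeed's activation checkpointing section.
ds_config = {
    "activation_checkpointing": {
        "partition_activations": True,         # shard saved activations
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": False,            # keep checkpoints on GPU
        "number_checkpoints": 4,               # illustrative value
    },
}
```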
Gradient Checkpointing in Detail: Bilingual
By discarding intermediate activations and recomputing them when needed, gradient checkpointing reduces memory usage, making it feasible to train large models on memory-limited hardware.
2024-11-29 13:42:06 · 921 reads
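In PyTorch the same compute-for-memory trade is available directly through torch.utils.checkpoint; a minimal sketch (recent PyTorch assumed):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
x = torch.randn(8, 64, requires_grad=True)

# Activations inside `layer` are NOT stored during forward; they are
# recomputed during backward, trading compute for memory.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```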
What Happens in Large-Model Training When Both train_micro_batch_size_per_gpu and gradient_accumulation_steps Are Set Small?
To avoid running out of GPU memory, you can reduce the micro-batch size and the number of gradient accumulation steps, use mixed-precision training, or apply gradient checkpointing to optimize memory usage.
2024-11-29 13:30:15 · 874 reads
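The interaction is simple multiplication: the effective global batch is the product of the two knobs and the GPU count, so shrinking both factors shrinks the batch the optimizer actually sees. A one-line sketch (the values are illustrative):

```python
# Effective batch size when both knobs are set small.
micro_batch_per_gpu = 1
grad_accum_steps    = 2
num_gpus            = 8

effective_batch = micro_batch_per_gpu * grad_accum_steps * num_gpus
print(effective_batch)  # 16 -- small batches mean noisier gradient estimates
```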
What Is Distributed Gradient Accumulation? How Should the gradient_accumulation_steps Parameter Be Set?
Distributed gradient accumulation helps overcome memory limitations by allowing us to simulate larger batch sizes while using smaller mini-batches.
2024-11-29 13:18:01 · 693 reads
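A minimal sketch of the accumulation loop in PyTorch (the model, data, and optimizer are stand-ins; in the distributed case each rank runs this same loop and gradients are averaged across ranks at step time):

```python
import torch

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # simulate a 4x larger batch

for step in range(8):
    x, y = torch.randn(2, 16), torch.randn(2, 1)     # small micro-batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()                  # gradients accumulate
    if (step + 1) % accum_steps == 0:
        opt.step()                                   # one update per 4 micro-batches
        opt.zero_grad()
```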
The reduce_bucket_size Parameter in DeepSpeed Configuration Files Explained: Bilingual
reduce_bucket_size is an essential parameter in DeepSpeed's ZeRO Stage 2 optimization, controlling the size of the buckets during gradient reduction.
2024-11-29 13:09:00 · 690 reads
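In a config file it is a single element count under zero_optimization; a sketch (the 5e8 value is illustrative):

```python
# Sketch: gradients are reduced in buckets of ~500M elements.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "reduce_bucket_size": 5e8,  # elements per all-reduce bucket
    },
}
```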
What Is Gradient Reduction? A Bilingual Explanation
By understanding the mechanics of gradient reduction and the impact of contiguous memory, we can optimize distributed training setups and improve model training efficiency across multiple devices.
2024-11-29 12:51:04 · 585 reads
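At its core, gradient reduction is an all-reduce over each gradient tensor. A minimal sketch with torch.distributed (assumes a process group is already initialized; shown for the average-gradients case):

```python
import torch
import torch.distributed as dist

def reduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all ranks (assumes dist is initialized)."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum over ranks
            p.grad /= world_size                           # then average
```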
Understanding Parquet Files and the Arrow Format: A Hugging Face Datasets Perspective
Understanding Parquet Files and Arrow Format: A Guide with Hugging Face Datasets
2024-11-29 12:12:54 · 978 reads
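A minimal round-trip showing the two sides: Arrow as the in-memory columnar format, Parquet as the on-disk format (uses pyarrow; the column names are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Arrow: columnar, in-memory
table = pa.table({"text": ["hello", "world"], "label": [0, 1]})

# Parquet: compressed, columnar, on-disk
pq.write_table(table, "sample.parquet")
roundtrip = pq.read_table("sample.parquet")
assert roundtrip.equals(table)
```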
Learning American Idioms with Li Xiaolai (Most Common American Idioms): Part 41
Most Common American Idioms: Part 41
2024-11-29 11:55:39 · 794 reads
Learning American Idioms with Li Xiaolai (Most Common American Idioms): Part 40
Most Common American Idioms: Part 40
2024-11-28 21:49:15 · 885 reads
Learning American Idioms with Li Xiaolai (Most Common American Idioms): Part 39
Most Common American Idioms: Part 39
2024-11-28 21:24:24 · 898 reads
Learning American Idioms with Li Xiaolai (Most Common American Idioms): Part 38
Most Common American Idioms: Part 38
2024-11-28 20:34:52 · 883 reads
Learning American Idioms with Li Xiaolai (Most Common American Idioms): Part 37
Most Common American Idioms: Part 37
2024-11-28 19:54:44 · 707 reads
What Is the Difference Between a Likelihood Distribution and a Likelihood Function? Bilingual
Both involve the probability of data given a set of parameters, but they are used in different contexts and have distinct meanings.
2024-11-28 15:52:56 · 799 reads
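The distinction in one line (standard notation: x is the observed data, θ the parameter):

```latex
% Same expression, two readings:
\underbrace{p(x \mid \theta)}_{\text{distribution: fix } \theta,\ \text{vary } x}
\qquad\qquad
\underbrace{L(\theta \mid x) = p(x \mid \theta)}_{\text{likelihood function: fix } x,\ \text{vary } \theta}
% Note: L(\theta \mid x) need not integrate to 1 over \theta.
```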
The Core Ideas and Fundamentals of Bayesian Statistics: Bilingual
Bayesian statistics is a framework that uses Bayes' theorem to update our beliefs about parameters or models based on observed data.
2024-11-28 15:28:48 · 894 reads
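The update rule itself, for reference:

```latex
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}
\qquad\text{i.e.}\qquad
\underbrace{p(\theta \mid x)}_{\text{posterior}} \;\propto\;
\underbrace{p(x \mid \theta)}_{\text{likelihood}} \times
\underbrace{p(\theta)}_{\text{prior}}
```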
Understanding the max_seq_length Parameter, Based on the open-instruct Framework: A Bilingual Explanation
It determines the maximum input sequence length (in tokens) that the model can process in a single forward pass after tokenization.
2024-11-28 14:42:32 · 749 reads
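The practical effect is truncation at tokenization time. A minimal sketch with a Hugging Face tokenizer (the model name is an illustrative choice, not tied to open-instruct):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice

enc = tok(
    "a very long instruction-tuning example ..." * 100,
    max_length=128,      # analogous to max_seq_length
    truncation=True,     # tokens beyond the limit are dropped
)
print(len(enc["input_ids"]))  # 128
```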
The contiguous_gradients Parameter in DeepSpeed Configuration Files Explained: Bilingual
It controls whether the gradients are stored in a contiguous memory block.
2024-11-28 14:15:50 · 816 reads
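The flag in context, as a sketch:

```python
# Sketch: copy gradients into one contiguous buffer as they are produced,
# reducing memory fragmentation during reduction.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
    },
}
```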
Learning American Idioms with Li Xiaolai (Most Common American Idioms): Part 36
Most Common American Idioms: Part 36
2024-11-28 13:09:18 · 764 reads
Learning American Idioms with Li Xiaolai (Most Common American Idioms): Part 35
Most Common American Idioms: Part 35
2024-11-28 12:51:45 · 844 reads
DeepSpeed Configuration Files Explained: Bilingual
DeepSpeed’s configuration is highly flexible, but tuning requires balancing memory efficiency and computational speed.
2024-11-27 22:08:58 · 1276 reads
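How a config is actually consumed, as a minimal sketch (deepspeed.initialize is the documented entry point; the model here is a stand-in, and the config values are illustrative):

```python
import torch
import deepspeed

model = torch.nn.Linear(16, 1)  # stand-in for a real model

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

# deepspeed.initialize wraps the model with the engine described by ds_config.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```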
What Is NCCL, the Library NVIDIA GPUs Use for Communication? A Bilingual Introduction
NCCL (NVIDIA Collective Communications Library) is a high-performance communication library developed by NVIDIA.
2024-11-27 21:40:53 · 1064 reads
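From PyTorch, NCCL is selected as the backend of torch.distributed. A minimal sketch (the env:// rendezvous assumes RANK, WORLD_SIZE, and MASTER_ADDR are set in the environment, e.g. by torchrun):

```python
import torch
import torch.distributed as dist

# NCCL handles the GPU-to-GPU collectives behind torch.distributed.
dist.init_process_group(backend="nccl", init_method="env://")

t = torch.ones(4, device=f"cuda:{dist.get_rank() % torch.cuda.device_count()}")
dist.all_reduce(t)            # sum across all GPUs via NCCL
print(t)                      # world_size on every rank

dist.destroy_process_group()
```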
A Bilingual Introduction to DeepSpeed's ZeRO Optimization
DeepSpeed introduces the ZeRO (Zero Redundancy Optimizer) optimization technique, a groundbreaking solution to reduce memory usage and improve efficiency during training.
2024-11-27 21:17:14 · 925 reads
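The three stages map to a single config value; a sketch of stage selection (the stage meanings follow the ZeRO paper and the DeepSpeed docs):

```python
# ZeRO stages: what gets partitioned across data-parallel ranks.
ds_config = {
    "zero_optimization": {
        # 1: optimizer states; 2: + gradients; 3: + model parameters
        "stage": 2,
    },
}
```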
Resource: 李永乐线代强化笔记2020年.rar (Li Yongle's 2020 linear-algebra intensive-review notes)
2020-10-27