Experiment: 20231115-exp42-gpt2-124m-lr0.0006-john
Configuration
- Model: gpt2_124m
- Dataset: openwebtext
- Learning rate: 0.0006
- Batch size: 64
- ...
Results
- Best val loss: 2.85
- Training time: 4d 2h
- GPU memory: 38GB
Notes
- Changed optimizer to AdamW with betas (0.9, 0.95)
- Observed instability after epoch 10, reduced lr by 50%
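The run name on the first line of the record appears to follow a `date-expID-model-lr-user` pattern. A minimal sketch of a helper that builds such names is given here. The function name `make_run_name` and its parameters are hypothetical, and the pattern is inferred from that single example rather than from a stated rule.

```python
from datetime import date


def make_run_name(day: date, exp_id: int, model: str, lr: float, user: str) -> str:
    """Build a run name such as '20231115-exp42-gpt2-124m-lr0.0006-john'.

    Hypothetical helper: the field order is inferred from the example record.
    """
    model_slug = model.replace("_", "-")  # 'gpt2_124m' -> 'gpt2-124m'
    return f"{day:%Y%m%d}-exp{exp_id}-{model_slug}-lr{lr:g}-{user}"


name = make_run_name(date(2023, 11, 15), 42, "gpt2_124m", 0.0006, "john")
print(name)
```

Generating the name from the config rather than typing it by hand keeps the record, the W&B run, and the resource sheet in agreement.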
### 4.2 Weights & Biases Integration
Add a standardized W&B initialization routine to `train.py`:
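```python
import os

import wandb


def init_wandb(config, run_name):
    """Start a W&B run with the team's standard project, tags, and config."""
    wandb.init(
        project="nanoGPT",
        name=run_name,
        config=vars(config),
        tags=[config.model_type, config.dataset, config.username],
        save_code=True,
    )
    # Record the code version (assumes the script runs inside a git checkout)
    wandb.config.git_commit = os.popen("git rev-parse HEAD").read().strip()
    return wandb
```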
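Key metrics to log:
- Training and validation loss curves
- Learning rate schedule
- GPU utilization
- Generated samples from the model
- Code and configuration snapshots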
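The metrics listed above could be grouped into one payload per evaluation step. The sketch below is an assumption about how that might look: `build_log_payload` and the metric key names are hypothetical, and the resulting dict would be handed to `wandb.log` inside the training loop.

```python
from typing import Optional


def build_log_payload(
    train_loss: float,
    val_loss: float,
    lr: float,
    gpu_util: Optional[float] = None,
    sample_text: Optional[str] = None,
) -> dict:
    """Collect one evaluation step's metrics into a flat dict (hypothetical key names)."""
    payload = {
        "train/loss": train_loss,
        "val/loss": val_loss,
        "lr": lr,
    }
    # Optional metrics are included only when the caller measured them
    if gpu_util is not None:
        payload["gpu/utilization"] = gpu_util
    if sample_text is not None:
        payload["samples/text"] = sample_text
    return payload


payload = build_log_payload(train_loss=3.10, val_loss=3.25, lr=6e-4, gpu_util=0.93)
print(payload)
# In train.py this could be sent with: wandb.log(payload, step=iter_num)
```

Using the same key names across experiments lets W&B overlay runs from different team members on the same charts.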
## 5. Environment Consistency and Distributed Collaboration
### 5.1 Standardizing the Development Environment
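Example Dockerfile:
```dockerfile
FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04

# The CUDA base image ships without Python, so install it before using pip
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN python3 -m pip install --no-cache-dir -r requirements.txt
COPY . .

# Default environment settings (NCCL_IB_DISABLE=1 turns off NCCL's InfiniBand transport)
ENV PYTHONPATH=/app
ENV NCCL_IB_DISABLE=1
```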
Managing `requirements.txt`:
```text
# Pin core dependency versions
torch==2.0.1
numpy==1.24.3
transformers==4.28.1
datasets==2.12.0
tiktoken==0.4.0
wandb==0.15.3
tqdm==4.65.0
```
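### 5.2 Distributed Training Collaboration Workflow
Coordination steps for multi-node training:
- Post the training plan in the `#training-coordination` Slack channel
- Record GPU usage in the shared Google Sheet
- Before training starts, pull the latest code and run `python check_env.py` to verify the environment
- When launching with `torchrun`, append `--note="Experiment exp42, estimated 4d"` to the training script's arguments (`torchrun` itself has no `--note` option, so the script must accept it)
- Post a progress update in the channel every 6 hours during training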
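`check_env.py` is not shown in the source. One way such a script might work, assuming its job is to compare installed package versions against the pins in `requirements.txt`, is sketched below. All function names are hypothetical, and only `importlib.metadata` from the standard library is used.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Return {package: version} for every 'name==version' line; skip comments and blanks."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def installed_version(name: str):
    """Look up the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins: dict, get_version=installed_version) -> dict:
    """Return {package: (expected, found)} for every pin the environment does not satisfy."""
    mismatches = {}
    for name, expected in pins.items():
        found = get_version(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


pins = parse_pins("# core deps\ntorch==2.0.1\nnumpy==1.24.3\n")
# A fake lookup stands in for a real environment in this example
fake_env = {"torch": "2.0.1", "numpy": "1.26.0"}
print(find_mismatches(pins, get_version=fake_env.get))
```

Wheels installed from the PyTorch index can report versions with a local suffix such as `2.0.1+cu117`, so a real script would probably need a looser comparison than strict equality.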
Resource allocation table:
| Node ID | GPUs | Current user | Experiment ID | Start time | Expected end | Status |
|---|---|---|---|---|---|---|
| node01 | 8xA100 | alice | exp42 | 2023-11-15 09:00 | 2023-11-19 09:00 | Running |
| node02 | 8xA100 | bob | exp38 | 2023-11-14 14:30 | 2023-11-18 14:30 | Running |
| node03 | 8xA100 | idle | - | - | - | Available |
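If the sheet is exported as rows, finding candidate nodes is a simple filter. The sketch below assumes each row is a dict with the hypothetical keys `node`, `user`, and `end`. It illustrates the bookkeeping only and is not part of the team's tooling.

```python
from datetime import datetime

# Rows mirror the table above; the key names are hypothetical
rows = [
    {"node": "node01", "user": "alice", "end": "2023-11-19 09:00"},
    {"node": "node02", "user": "bob", "end": "2023-11-18 14:30"},
    {"node": "node03", "user": None, "end": None},
]


def candidate_nodes(rows, now):
    """Return nodes with no user, or whose expected end time has already passed."""
    candidates = []
    for row in rows:
        if row["user"] is None:
            candidates.append(row["node"])
        elif datetime.strptime(row["end"], "%Y-%m-%d %H:%M") <= now:
            candidates.append(row["node"])
    return candidates


print(candidate_nodes(rows, now=datetime(2023, 11, 16, 12, 0)))
print(candidate_nodes(rows, now=datetime(2023, 11, 18, 15, 0)))
```

A node past its expected end time is only a candidate: the run may have been extended, so it is worth confirming in the channel before taking it.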
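## 6. Automated Testing and CI/CD
### 6.1 Test Strategy
nanoGPT's particular characteristics call for a three-tier testing strategy.
Example unit test (model forward pass):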
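```python
# tests/test_model.py
import torch

from model import GPT, GPTConfig


def test_gpt_forward():
    config = GPTConfig(
        block_size=128,
        vocab_size=50257,
        n_layer=12,
        n_head=12,
        n_embd=768,
    )
    model = GPT(config)
    x = torch.randint(0, 50257, (2, 128))
    logits, loss = model(x)
    # Without targets, upstream nanoGPT's forward() applies lm_head to the
    # last position only, so the time dimension of the logits is 1
    assert logits.shape == (2, 1, 50257), "unexpected logits shape"
    assert loss is None, "loss should not be computed without targets"
```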
### 6.2 GitHub Actions Workflow Configuration
```yaml
# .github/workflows/ci.yml
name: nanoGPT CI
on:
  push:
    branches: [ develop, main ]
  pull_request:
    branches: [ develop ]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          # pytest is not pinned in requirements.txt, so install it explicitly
          pip install -r requirements.txt pytest
      - name: Run tests
        run: |
          python -m pytest tests/ -v
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Lint with flake8
        run: |
          pip install flake8
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
```