Experiment: 20231115-exp42-gpt2-124m-lr0.0006-john

Configuration

  • Model: gpt2_124m
  • Dataset: openwebtext
  • Learning rate: 0.0006
  • Batch size: 64
  • ...

Results

  • Best val loss: 2.85
  • Training time: 4d 2h
  • GPU memory: 38GB

Notes

  • Changed optimizer to AdamW with betas (0.9, 0.95)
  • Observed instability after epoch 10, reduced lr by 50%
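The run name at the top of this record encodes date, experiment ID, model, learning rate, and author. It can be generated rather than typed by hand so every run follows the convention. A minimal sketch, assuming a hypothetical config dataclass whose field names are illustrative:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class RunConfig:
    # Field names are illustrative; align them with your actual train.py config
    model_type: str = "gpt2_124m"
    dataset: str = "openwebtext"
    learning_rate: float = 6e-4
    username: str = "john"
    exp_id: int = 42

def make_run_name(cfg: RunConfig, day: Optional[date] = None) -> str:
    """Build a name like 20231115-exp42-gpt2-124m-lr0.0006-john."""
    day = day or date.today()
    model = cfg.model_type.replace("_", "-")  # gpt2_124m -> gpt2-124m
    return f"{day:%Y%m%d}-exp{cfg.exp_id}-{model}-lr{cfg.learning_rate:g}-{cfg.username}"
```

The same string can then be passed to the W&B initialization below as `run_name`, so the dashboard and the experiment log stay in sync.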

### 4.2 Weights & Biases Integration

Add standardized W&B initialization code to `train.py`:

```python
import os

import wandb

def init_wandb(config, run_name):
    wandb.init(
        project="nanoGPT",
        name=run_name,
        config=vars(config),
        tags=[config.model_type, config.dataset, config.username],
        save_code=True,
    )
    # Record the code version alongside the run
    wandb.config.git_commit = os.popen("git rev-parse HEAD").read().strip()
    return wandb
```

Key metrics to log

  • Training/validation loss curves
  • Learning rate schedule
  • GPU utilization
  • Model generation samples
  • Code and configuration snapshots
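The metrics above map naturally onto periodic `wandb.log` calls in the training loop. A sketch, where the variable names (`iter_num`, `lr`, and so on) and metric keys are an assumed convention, not something nanoGPT prescribes:

```python
def training_metrics(iter_num, train_loss, val_loss, lr, tokens_per_sec):
    """Assemble one metrics dict per eval interval (key names are a suggested convention)."""
    return {
        "iter": iter_num,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "lr": lr,
        "throughput/tokens_per_sec": tokens_per_sec,
    }

def log_metrics(metrics):
    # wandb is imported lazily so the dict builder stays testable without it installed
    import wandb
    wandb.log(metrics, step=metrics["iter"])
```

In `train.py` this would typically run every `eval_interval` iterations; generation samples can be logged separately, e.g. with `wandb.Table`.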

5. Environment Consistency and Distributed Collaboration

5.1 Standardizing the Development Environment

Example Dockerfile

```dockerfile
FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Set default configuration
ENV PYTHONPATH=/app
ENV NCCL_IB_DISABLE=1
```

Managing requirements.txt

```
# Pin core dependency versions
torch==2.0.1
numpy==1.24.3
transformers==4.28.1
datasets==2.12.0
tiktoken==0.4.0
wandb==0.15.3
tqdm==4.65.0
```

5.2 Distributed Training Collaboration Workflow

Steps for coordinating multi-node training:

  1. Announce the training plan in the #training-coordination Slack channel
  2. Track GPU resource usage in a shared Google Sheet
  3. Before training starts, sync the latest code and run python check_env.py to verify the environment
  4. When launching via torchrun, append --note="Experiment exp42, estimated 4d"
  5. Post a progress update in the channel every 6 hours during training
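The `check_env.py` referenced in step 3 is not shown in this document. A minimal sketch of what such a script might do is to compare installed package versions against the pins from requirements.txt in section 5.1:

```python
# check_env.py -- a sketch; the real script referenced in step 3 is not shown here
from importlib.metadata import PackageNotFoundError, version

# Mirrors the pinned versions in requirements.txt (section 5.1)
PINNED = {
    "torch": "2.0.1",
    "numpy": "1.24.3",
    "transformers": "4.28.1",
    "datasets": "2.12.0",
    "tiktoken": "0.4.0",
    "wandb": "0.15.3",
    "tqdm": "4.65.0",
}

def installed_version(pkg):
    """Return the installed version of pkg, or None if it is absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

def find_mismatches(pinned, get_version):
    """Return (package, expected, found) triples where found != expected."""
    mismatches = []
    for pkg, expected in pinned.items():
        found = get_version(pkg)
        if found != expected:
            mismatches.append((pkg, expected, found))
    return mismatches
```

A small `__main__` wrapper can print `find_mismatches(PINNED, installed_version)` and exit nonzero on any mismatch, so the check fails loudly in CI or before a run.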

Resource allocation table

| Node ID | GPUs | Current user | Experiment ID | Start time | Expected end | Status |
|---------|------|--------------|---------------|------------|--------------|--------|
| node01 | 8xA100 | alice | exp42 | 2023-11-15 09:00 | 2023-11-19 09:00 | Running |
| node02 | 8xA100 | bob | exp38 | 2023-11-14 14:30 | 2023-11-18 14:30 | Running |
| node03 | 8xA100 | (idle) | - | - | - | Available |
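If the shared sheet is exported periodically (for example as CSV), finding a free node can be partially automated. A sketch mirroring the table's schema; the `"available"` status string is an assumption about how the sheet is filled in:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeAllocation:
    node_id: str
    gpus: str
    user: Optional[str]
    exp_id: Optional[str]
    status: str  # e.g. "running" or "available"

def free_nodes(allocations):
    """Return IDs of nodes whose status marks them as available."""
    return [a.node_id for a in allocations if a.status == "available"]

# Rows mirror the resource allocation table above
allocations = [
    NodeAllocation("node01", "8xA100", "alice", "exp42", "running"),
    NodeAllocation("node02", "8xA100", "bob", "exp38", "running"),
    NodeAllocation("node03", "8xA100", None, None, "available"),
]
```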

6. Automated Testing and CI/CD

6.1 Test Strategy Design

Given nanoGPT's particular constraints, the project adopts a three-layer test system.

Unit test example (testing the model forward pass):

```python
# tests/test_model.py
import torch
from model import GPT, GPTConfig

def test_gpt_forward():
    config = GPTConfig(
        block_size=128,
        vocab_size=50257,
        n_layer=12,
        n_head=12,
        n_embd=768
    )
    model = GPT(config)
    x = torch.randint(0, 50257, (2, 128))
    logits, loss = model(x)
    # Without targets, nanoGPT returns logits for the last position only
    assert logits.shape == (2, 1, 50257), "unexpected logits shape"
    assert loss is None, "loss should not be computed without targets"
```

6.2 GitHub Actions Workflow Configuration

```yaml
# .github/workflows/ci.yml
name: nanoGPT CI

on:
  push:
    branches: [ develop, main ]
  pull_request:
    branches: [ develop ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run tests
        run: |
          python -m pytest tests/ -v
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Lint with flake8
        run: |
          pip install flake8
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
```
