Model and dataset downloads
The network environment in mainland China has given me a phobia of connection timeouts, so the first thing to do is download everything up front instead of fetching it dynamically at run time. The model and datasets used in this article:
- https://modelscope.cn/models/Qwen/Qwen3-4B-Base
- https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini
- https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed
Download commands:
modelscope download --model Qwen/Qwen3-4B-Base --revision master --local_dir /models/Qwen/Qwen3-4B-Base
huggingface-cli download --resume-download --repo-type dataset unsloth/OpenMathReasoning-mini --local-dir unsloth/OpenMathReasoning-mini
huggingface-cli download --resume-download --repo-type dataset open-r1/DAPO-Math-17k-Processed --local-dir open-r1/DAPO-Math-17k-Processed
Starting the container
# docker run --name unsloth0517 -itd --gpus '"device=4"' \
    -v /data/ai/models:/models \
    -v /data/ai/datasets:/datasets \
    -v /data/ai/workspace/unsloth:/workspace \
    unsloth:20250517_4cd5_cu121 bash
# docker exec -it unsloth0517 bash
root@1855d8235e1a:/home/unsloth# cd /workspace/scripts
root@1855d8235e1a:/workspace/scripts# python unsloth-grpo-qwen3.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Traceback (most recent call last):
  File "/workspace/scripts/unsloth-grpo-qwen3.py", line 17, in <module>
    model, tokenizer = FastLanguageModel.from_pretrained(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/unsloth/models/loader.py", line 138, in from_pretrained
    raise ImportError(
ImportError: Unsloth: Please install vLLM before enabling `fast_inference`!
You can do this in a terminal via `pip install vllm`
This error occurs because the unsloth:20250517_4cd5_cu121 image does not include vLLM.
We could build a new image on top of unsloth:20250517_4cd5_cu121 that bundles vLLM, but for simplicity we install vLLM directly inside the running container:
root@1855d8235e1a:/workspace/scripts# export PIP_INDEX_URL=https://mirrors.aliyun.com/pypi/simple/
root@1855d8235e1a:/workspace/scripts# pip install vllm
...
Successfully installed airportsdata-20250224 annotated-types-0.7.0 anyio-4.9.0 astor-0.8.1 blake3-1.0.5 cachetools-5.5.2 cloudpickle-3.1.1 compressed-tensors-0.9.3 cupy-cuda12x-13.4.1 deprecated-1.2.18 depyf-0.18.0 diskcache-5.6.3 einops-0.8.1 email-validator-2.2.0 fastapi-0.115.12 fastapi-cli-0.0.7 fastrlock-0.8.3 gguf-0.16.3 googleapis-common-protos-1.70.0 h11-0.16.0 hf-xet-1.1.2 httpcore-1.0.9 httptools-0.6.4 httpx-0.28.1 importlib_metadata-8.0.0 interegular-0.3.3 jinja2-3.1.6 jiter-0.10.0 lark-1.2.2 llguidance-0.7.22 llvmlite-0.44.0 lm-format-enforcer-0.10.11 mistral_common-1.5.5 msgpack-1.1.0 nest_asyncio-1.6.0 numba-0.61.2 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-cusparselt-cu12-0.6.2 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 openai-1.81.0 opencv-python-headless-4.11.0.86 opentelemetry-api-1.26.0 opentelemetry-exporter-otlp-1.26.0 opentelemetry-exporter-otlp-proto-common-1.26.0 opentelemetry-exporter-otlp-proto-grpc-1.26.0 opentelemetry-exporter-otlp-proto-http-1.26.0 opentelemetry-proto-1.26.0 opentelemetry-sdk-1.26.0 opentelemetry-semantic-conventions-0.47b0 opentelemetry-semantic-conventions-ai-0.4.9 outlines-0.1.11 outlines_core-0.1.26 partial-json-parser-0.2.1.1.post5 pillow-11.2.1 prometheus-fastapi-instrumentator-7.1.0 prometheus_client-0.22.0 py-cpuinfo-9.0.0 pycountry-24.6.1 pydantic-2.11.4 pydantic-core-2.33.2 python-dotenv-1.1.0 python-json-logger-3.3.0 python-multipart-0.0.20 pyzmq-26.4.0 ray-2.46.0 rich-toolkit-0.14.6 scipy-1.15.3 shellingham-1.5.4 sniffio-1.3.1 starlette-0.46.2 tiktoken-0.9.0 torch-2.6.0 torchaudio-2.6.0 torchvision-0.21.0 triton-3.2.0 typer-0.15.4 typing-inspection-0.4.1 uvicorn-0.34.2 uvloop-0.21.0 vllm-0.8.5.post1 watchfiles-1.0.5 websockets-15.0.1 wrapt-1.17.2 xformers-0.0.29.post2 xgrammar-0.1.18
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
Starting training
After entering the container, make sure the following paths are visible inside it:
- Base model to be trained: /models/Qwen/Qwen3-4B-Base
- Dataset 1: /datasets/unsloth/OpenMathReasoning-mini/data/cot-00000-of-00001.parquet
- Dataset 2: /datasets/open-r1/DAPO-Math-17k-Processed/en/train-00000-of-00001.parquet
- Working directory and training script: /workspace/scripts/unsloth-grpo-qwen3.py
In the container's /workspace/scripts/ directory, run the following to start training:
cat unsloth-grpo-qwen3.py > unsloth-grpo-qwen3.py.log && \
    nohup python unsloth-grpo-qwen3.py >> unsloth-grpo-qwen3.py.log 2>&1 &
We deliberately dump the training script to the top of the log file before training starts. The benefit is that, as the code evolves, every log can be traced back to the exact code that produced it, which makes troubleshooting much easier.
The latest version of the training script unsloth-grpo-qwen3.py has its parameters tuned to the limits of a 24 GB RTX 4090 and is extensively commented. The full code is in the previous article: https://mp.weixin.qq.com/s/olblI2gE3HHDSEGnejGBrw, feel free to grab it if you need it. To finish this test run quickly, parameters such as max_steps are capped here. Below is a walkthrough of the logs produced at each stage of the run:
Training logs
Loading the model
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-25 10:56:06 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-25 10:56:06 [__init__.py:239] Automatically detected platform cuda.
===== step1. 加载模型 =====================================================================
==((====))==  Unsloth 2025.5.6: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.65 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading /models/Qwen/Qwen3-4B-Base with actual GPU utilization = 68.76%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 23.65 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 224.
Unsloth: vLLM's KV Cache can use up to 9.31 GB. Also swap space = 6 GB.
INFO 05-25 10:59:17 [config.py:717] This model supports multiple tasks: {'embed', 'generate', 'score', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 05-25 10:59:17 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-25 10:59:17 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: ...
WARNING 05-25 10:59:17 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f19973bbd50>
INFO 05-25 10:59:28 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-25 10:59:28 [cuda.py:221] Using Flash Attention backend on V1 engine.
WARNING 05-25 10:59:28 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 05-25 10:59:28 [gpu_model_runner.py:1329] Starting to load model /models/Qwen/Qwen3-4B-Base...
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:02<00:01, 1.20s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00, 1.60s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00, 1.52s/it]
INFO 05-25 10:59:33 [loader.py:458] Loading weights took 4.81 seconds
INFO 05-25 10:59:33 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 05-25 10:59:33 [gpu_model_runner.py:1347] Model loading took 7.6334 GiB and 5.084545 seconds
INFO 05-25 10:59:48 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/f7b249c75c/rank_0_0 for vLLM's torch.compile
INFO 05-25 10:59:48 [backends.py:430] Dynamo bytecode transform time: 14.75 s
Inductor Compilation: 100%|██████████| 6/6 [00:01<00:00, 5.13it/s, triton_poi_fused_add_mul_sub_5]
INFO 05-25 10:59:53 [backends.py:136] Cache the graph of shape None for later use
...
Inductor Compilation: 100%|██████████| 5/5 [00:00<00:00, 28.37it/s, triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_4]
INFO 05-25 11:00:39 [backends.py:148] Compiling a graph for general shape takes 49.26 s
INFO 05-25 11:03:14 [monitor.py:33] torch.compile takes 64.01 s in total
INFO 05-25 11:03:18 [kv_cache_utils.py:634] GPU KV cache size: 49,856 tokens
INFO 05-25 11:03:18 [kv_cache_utils.py:637] Maximum concurrency for 2,048 tokens per request: 24.34x
INFO 05-25 11:04:19 [gpu_model_runner.py:1686] Graph capturing finished in 61 secs, took 3.94 GiB
INFO 05-25 11:04:19 [core.py:159] init engine (profile, create kv cache, warmup model) took 286.49 seconds
Unsloth 2025.5.6 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
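For reference, the loading step that produces the log above looks roughly like the following. This is a minimal sketch based on the public Unsloth GRPO examples, not the exact code from the previous article; the specific argument values (sequence length, LoRA rank, 4-bit flag, GPU memory fraction) are assumptions chosen to match the numbers visible in the log.

```python
# Minimal sketch of step 1 (model loading); argument values are assumptions
# picked to roughly match the log above, not the authoritative script.
from unsloth import FastLanguageModel

max_seq_length = 2048   # assumed; the log shows "Chunked prefill tokens = 2048"
lora_rank = 32          # assumed; matches the LoRA shapes printed in the next section

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="/models/Qwen/Qwen3-4B-Base",   # local path downloaded earlier
    max_seq_length=max_seq_length,
    load_in_4bit=True,            # assumption; the merge log later says "Merging 4bit and LoRA weights"
    fast_inference=True,          # enables the embedded vLLM engine (hence the earlier ImportError)
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.7,   # the log reports ~68.76% actual utilization
)
```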
Model structure
The structure of the model after loading it and inserting the LoRA adapters:
/models/Qwen/Qwen3-4B-Base does not have a padding token! Will use pad_token = <|vision_pad|>.model:PeftModelForCausalLM( (base_model): LoraModel( (model): Qwen3ForCausalLM( (model): Qwen3Model( (embed_tokens): Embedding(151936, 2560, padding_idx=151654) (layers): ModuleList( (0-35): 36 x Qwen3DecoderLayer( (self_attn): Qwen3Attention( (q_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=4096, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=4096, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (k_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=1024, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=1024, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (v_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=1024, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=1024, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (o_proj): lora.Linear( (base_layer): Linear(in_features=4096, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=4096, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (q_norm): Qwen3RMSNorm((128,), eps=1e-06) (k_norm): Qwen3RMSNorm((128,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (mlp): Qwen3MLP( (gate_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (up_proj): lora.Linear( (base_layer): Linear(in_features=2560, out_features=9728, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=2560, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=9728, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (down_proj): lora.Linear( (base_layer): Linear(in_features=9728, out_features=2560, bias=False) (lora_dropout): ModuleDict( (default): Identity() ) (lora_A): ModuleDict( (default): Linear(in_features=9728, out_features=32, bias=False) ) (lora_B): ModuleDict( (default): Linear(in_features=32, out_features=2560, bias=False) ) (lora_embedding_A): ParameterDict() (lora_embedding_B): ParameterDict() (lora_magnitude_vector): ModuleDict() ) (act_fn): SiLU() ) 
(input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) (post_attention_layernorm): Qwen3RMSNorm((2560,), eps=1e-06) ) ) (norm): Qwen3RMSNorm((2560,), eps=1e-06) (rotary_emb): LlamaRotaryEmbedding() ) (lm_head): Linear(in_features=2560, out_features=151936, bias=False) ) ))
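The shapes in the structure dump above (rank-32 lora_A/lora_B matrices on the attention and MLP projections of all 36 layers) are consistent with a PEFT configuration along the following lines. Again this is a hedged sketch; the alpha, dropout, and gradient-checkpointing values are assumptions.

```python
# Sketch of the LoRA insertion that would produce the printed structure:
# rank-32 adapters on q/k/v/o and gate/up/down projections (7 targets x 36 layers).
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                                   # matches the lora_A/lora_B dimension of 32 above
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    lora_alpha=64,                          # assumed value
    lora_dropout=0.0,                       # the structure shows lora_dropout = Identity()
    use_gradient_checkpointing="unsloth",   # assumed; saves VRAM on a 24 GB card
    random_state=3407,                      # assumed
)
print(model)   # prints the PeftModelForCausalLM(...) dump shown above
```

With rank 32 on these seven projections across 36 layers, the adapter parameter count works out to 66,060,288, which matches the "Trainable parameters = 66,060,288/4,088,528,384 (1.62% trained)" line printed later in the SFT banner.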
GRPO chat template
===== step2. 准备 GRPO 对话模版 ================================================================
对话模板输出样例:
You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>What is 1+1?<start_working_out>I think it's 2.<end_working_out><SOLUTION>2</SOLUTION><|endoftext|>What is 2+2?<start_working_out>
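The sample above suggests a system prompt plus custom reasoning and solution markers baked into the tokenizer's chat template. A minimal sketch of how such a template could be assembled is shown below; the exact Jinja string in the real script may differ.

```python
# Sketch: system prompt and custom reasoning/solution markers for the GRPO template.
reasoning_start = "<start_working_out>"
reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

system_prompt = (
    "You are given a problem.\n"
    "Think about the problem and provide your working out.\n"
    f"Place it between {reasoning_start} and {reasoning_end}.\n"
    f"Then, provide your solution between {solution_start} and {solution_end}"
)

# Simplified chat template: system and assistant turns end with <|endoftext|>, and the
# generation prompt ends with the reasoning-start tag to prime the model's thinking.
chat_template = (
    "{% if messages[0]['role'] == 'system' %}{{ messages[0]['content'] + eos_token }}"
    "{% set loop_messages = messages[1:] %}{% else %}{% set loop_messages = messages %}{% endif %}"
    "{% for message in loop_messages %}"
    "{% if message['role'] == 'user' %}{{ message['content'] }}"
    "{% else %}{{ message['content'] + eos_token }}{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '" + reasoning_start + "' }}{% endif %}"
)
tokenizer.chat_template = chat_template
```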
Format-following pre-finetuning
Before GRPO training, we first run a small SFT pass on the base model using dialogue data that contains reasoning traces, so that the model learns to produce output in our reasoning format.
The raw dataset is first cleaned to select samples suitable for this format-following finetune:
===== step3. 格式遵循预微调 ===============================================================
----- 清洗后的数据集:
      expected_answer  ...                                 generated_solution
0                  14  ...  <think>\nOkay, let's see. I need to solve the ...
6                  -2  ...  <think>\nOkay, so I need to find the value of ...
9                  18  ...  <think>\nOkay, so I need to solve the equation...
13                  2  ...  <think>\nOkay, so I need to evaluate the infin...
17                 30  ...  <think>\nAlright, so I need to find the larges...
...               ...  ...                                                ...
19243             244  ...  <think>\nOkay, so I need to find the value of ...
19245               1  ...  <think>\nOkay, so I have this problem where a ...
19247               4  ...  <think>\nOkay, let's tackle this problem step ...
19248              18  ...  <think>\nOkay, let's see. I need to find the n...
19250          0.8960  ...  <think>\nOkay, so I need to find the probabili...
[7507 rows x 3 columns]
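A hedged sketch of a cleaning step that would produce a frame like the one above: load the OpenMathReasoning-mini parquet file, keep the three columns of interest, and keep only rows whose expected_answer parses as a number (convenient for the numeric reward functions used later). The real script's filter conditions may differ.

```python
# Sketch of the cleaning step; the real script's filter conditions may differ.
import pandas as pd

df = pd.read_parquet(
    "/datasets/unsloth/OpenMathReasoning-mini/data/cot-00000-of-00001.parquet"
)

def is_number(x):
    """Keep only rows whose expected_answer can be parsed as a number."""
    try:
        float(str(x).strip())
        return True
    except ValueError:
        return False

df = df[["expected_answer", "problem", "generated_solution"]]
df = df[df["expected_answer"].apply(is_number)]
print("----- 清洗后的数据集:")
print(df)   # the real script's cleaning yields [7507 rows x 3 columns]
```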
----- 第1条 OpenMathReasoning 数据格式化后应用对话模板的输出:
You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>Given $\sqrt{x^2+165}-\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<start_working_out>Okay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.
First, let me write down the equation again to make sure I have it right:
√(x² + 165) - √(x² - 52) = 7.
Okay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:
√(x² + 165) = 7 + √(x² - 52).
Now, if I square both sides, maybe I can get rid of the square roots. Let's do that:
(√(x² + 165))² = (7 + √(x² - 52))².
Simplifying the left side:
x² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².
The right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).
So putting it all together:
x² + 165 = 49 + 14√(x² - 52) + x² - 52.
Hmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:
165 = 49 + 14√(x² - 52) - 52.
Simplify the constants on the right:
49 - 52 is -3, so:
165 = -3 + 14√(x² - 52).
Now, add 3 to both sides to isolate the radical term:
165 + 3 = 14√(x² - 52).
So 168 = 14√(x² - 52).
Divide both sides by 14:
168 / 14 = √(x² - 52).
12 = √(x² - 52).
Now, square both sides again to eliminate the square root:
12² = x² - 52.
144 = x² - 52.
Add 52 to both sides:
144 + 52 = x².
196 = x².
So x = √196 = 14.
But wait, since the problem states that x is positive, we only take the positive root. So x = 14.
But hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.
Let's plug x = 14 back into the original equation:
√(14² + 165) - √(14² - 52) = ?
Calculate each term:
14² is 196.
So first radical: √(196 + 165) = √361 = 19.
Second radical: √(196 - 52) = √144 = 12.
So 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.
Therefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.To solve the equation \(\sqrt{x^2 + 165} - \sqrt{x^2 - 52} = 7\) for positive \(x\), we proceed as follows:
1. Start with the given equation: \[ \sqrt{x^2 + 165} - \sqrt{x^2 - 52} = 7 \]
2. Isolate one of the square roots by moving \(\sqrt{x^2 - 52}\) to the right side: \[ \sqrt{x^2 + 165} = 7 + \sqrt{x^2 - 52} \]
3. Square both sides to eliminate the square root on the left: \[ (\sqrt{x^2 + 165})^2 = (7 + \sqrt{x^2 - 52})^2 \] Simplifying both sides, we get: \[ x^2 + 165 = 49 + 14\sqrt{x^2 - 52} + (x^2 - 52) \]
4. Combine like terms on the right side: \[ x^2 + 165 = x^2 - 52 + 49 + 14\sqrt{x^2 - 52} \] Simplifying further: \[ x^2 + 165 = x^2 - 3 + 14\sqrt{x^2 - 52} \]
5. Subtract \(x^2\) from both sides: \[ 165 = -3 + 14\sqrt{x^2 - 52} \]
6. Add 3 to both sides to isolate the term with the square root: \[ 168 = 14\sqrt{x^2 - 52} \]
7. Divide both sides by 14: \[ 12 = \sqrt{x^2 - 52} \]
8. Square both sides again to eliminate the square root: \[ 12^2 = x^2 - 52 \] Simplifying: \[ 144 = x^2 - 52 \]
9. Add 52 to both sides to solve for \(x^2\): \[ 196 = x^2 \]
10. Take the positive square root (since \(x\) is positive): \[ x = \sqrt{196} = 14 \]
11. Verify the solution by substituting \(x = 14\) back into the original equation: \[ \sqrt{14^2 + 165} - \sqrt{14^2 - 52} = \sqrt{196 + 165} - \sqrt{196 - 52} = \sqrt{361} - \sqrt{144} = 19 - 12 = 7 \] The solution checks out.
Thus, the only positive solution is:\[\boxed{14}\]<end_working_out><SOLUTION>14</SOLUTION><|endoftext|>
num_proc must be <= 58. Reducing num_proc to 58 for dataset of size 58.
[2025-05-25 11:05:41] WARNING arrow_dataset.py:3010: num_proc must be <= 58. Reducing num_proc to 58 for dataset of size 58.
dataset.shape:(58, 5)
----- 处理好的预微调数据集:
Dataset({
    features: ['expected_answer', 'problem', 'generated_solution', 'Messages', 'N', 'text', '__index_level_0__'],
    num_rows: 58
})
Unsloth: Tokenizing ["text"] (num_proc=58): 100%|██████████| 58/58 [00:07<00:00, 7.99 examples/s]
Then the pre-finetuning run starts:
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1 \\ /| Num examples = 58 | Num Epochs = 2 | Total steps = 116O^O/ \_/ \ Batch size per device = 1 | Gradient accumulation steps = 1\ / Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1 "-____-" Trainable parameters = 66,060,288/4,088,528,384 (1.62% trained)100%|██████████| 116/116 [00:47<00:00, 2.46it/s]Unsloth: Will smartly offload gradients to save VRAM!{'loss': 0.7447, 'grad_norm': 0.6478227376937866, 'learning_rate': 0.00016, 'epoch': 0.09}{'loss': 0.6066, 'grad_norm': 0.640754759311676, 'learning_rate': 0.00019279279279279282, 'epoch': 0.17}{'loss': 0.4543, 'grad_norm': 0.6311891674995422, 'learning_rate': 0.0001837837837837838, 'epoch': 0.26}{'loss': 0.4684, 'grad_norm': 0.5015860199928284, 'learning_rate': 0.00017477477477477476, 'epoch': 0.34}{'loss': 0.4063, 'grad_norm': 0.5008582472801208, 'learning_rate': 0.00016576576576576578, 'epoch': 0.43}{'loss': 0.3979, 'grad_norm': 0.5995965600013733, 'learning_rate': 0.00015675675675675676, 'epoch': 0.52}{'loss': 0.4248, 'grad_norm': 0.4734836518764496, 'learning_rate': 0.00014774774774774775, 'epoch': 0.6}{'loss': 0.4197, 'grad_norm': 0.5012277960777283, 'learning_rate': 0.00013873873873873876, 'epoch': 0.69}{'loss': 0.4511, 'grad_norm': 0.548245906829834, 'learning_rate': 0.00012972972972972974, 'epoch': 0.78}{'loss': 0.3974, 'grad_norm': 0.42141056060791016, 'learning_rate': 0.00012072072072072073, 'epoch': 0.86}{'loss': 0.3317, 'grad_norm': 0.4644368886947632, 'learning_rate': 0.0001117117117117117, 'epoch': 0.95}{'loss': 0.3846, 'grad_norm': 0.3927017152309418, 'learning_rate': 0.0001027027027027027, 'epoch': 1.03}{'loss': 0.2501, 'grad_norm': 0.5447007417678833, 'learning_rate': 9.36936936936937e-05, 'epoch': 1.12}{'loss': 0.278, 'grad_norm': 0.4823240339756012, 'learning_rate': 8.468468468468469e-05, 'epoch': 1.21}{'loss': 0.2645, 'grad_norm': 0.5164972543716431, 'learning_rate': 7.567567567567568e-05, 'epoch': 1.29}{'loss': 0.2584, 'grad_norm': 0.5759400725364685, 'learning_rate': 6.666666666666667e-05, 'epoch': 1.38}{'loss': 0.2121, 'grad_norm': 0.5618821978569031, 'learning_rate': 5.765765765765766e-05, 'epoch': 1.47}{'loss': 0.2322, 'grad_norm': 0.5534489154815674, 'learning_rate': 4.8648648648648654e-05, 'epoch': 1.55}{'loss': 0.2256, 'grad_norm': 0.6181885600090027, 'learning_rate': 3.963963963963964e-05, 'epoch': 1.64}{'loss': 0.1841, 'grad_norm': 0.48197486996650696, 'learning_rate': 3.063063063063063e-05, 'epoch': 1.72}{'loss': 0.2789, 'grad_norm': 0.6069267988204956, 'learning_rate': 2.1621621621621624e-05, 'epoch': 1.81}{'loss': 0.2148, 'grad_norm': 0.5475031137466431, 'learning_rate': 1.2612612612612611e-05, 'epoch': 1.9}{'loss': 0.2263, 'grad_norm': 0.6717495918273926, 'learning_rate': 3.603603603603604e-06, 'epoch': 1.98}{'train_runtime': 47.1361, 'train_samples_per_second': 2.461, 'train_steps_per_second': 2.461, 'train_loss': 0.35218508824192246, 'epoch': 2.0}----- 训练后模型的 model.dtype: torch.bfloat16
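The banner above (58 examples, 2 epochs, batch size 1, 116 total steps, 1.62% trainable parameters) is consistent with a short SFT run along the following lines. This is a minimal sketch assuming TRL's SFTTrainer; the learning rate, optimizer, and logging interval are assumptions inferred from the logged values.

```python
# Sketch of the format-following pre-finetune; hyperparameters are assumptions
# inferred from the log (58 examples, 2 epochs, batch size 1 -> 116 steps).
from trl import SFTConfig, SFTTrainer

sft_trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=pretune_dataset,      # the 58-row dataset prepared above (hypothetical name)
    args=SFTConfig(
        dataset_text_field="text",      # the column produced by applying the chat template
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        num_train_epochs=2,
        learning_rate=2e-4,             # assumed; the log peaks around 1.9e-4 after warmup
        logging_steps=5,
        optim="adamw_8bit",             # assumed
        report_to="none",
        output_dir="outputs_sft",
    ),
)
sft_trainer.train()
```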
----- 格式遵循预微调训练完成,检测下模型是否学会了遵循我们自定义的格式 -----You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>Jenifer has 82 cents in pennies and nickels. Her younger brother mistook all her nickels for dimes and counted the total as $1.47. How many pennies does Jenifer have?<start_working_out><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|>ol let's start by setting up the problem. Let's denote the number of pennies as \( p \) and the number of nickels as \( n \). We know that the total value of the coins is 82 cents, so we can write the equation:\[ p + 5n = 82 \]
Next, we need to consider the mistake made by Jenifer's younger brother. He mistook all the nickels for dimes, so he counted the total as $1.47, which is 147 cents. This gives us another equation:\[ p + 10n = 147 \]
Now, we have a system of two equations:1. \( p + 5n = 82 \)2. \( p + 10n = 147 \)
To solve this system, we can subtract the first equation from the second to eliminate \( p \):\[ (p + 10n) - (p + 5n) = 147 - 82 \]\[ 5n = 65 \]\[ n = 13 \]
Now that we have \( n = 13 \), we can substitute this value back into the first equation to find \( p \):\[ p + 5(13) = 82 \]\[ p + 65 = 82 \]\[ p = 17 \]
So, Jenifer has 17 pennies. Let's verify the solution:- The value of 17 pennies is \( 17 \times 1 = 17 \) cents.- The value of 13 nickels is \( 13 \times 5 = 65 \) cents.- The total value is \( 17 + 65 = 82 \) cents, which matches the given total.
Thus, the number of pennies Jenifer has is \(\boxed{17}\).<end_working_out><SOLUTION>17</SOLUTION><|endoftext|>
As you can see, the output of the pre-finetuned model meets our expectations: the reasoning is placed between our custom tags <start_working_out> and <end_working_out>, and the final answer is placed between <SOLUTION> and </SOLUTION>. (Note that <start_working_out> is appended to the prompt to prime the model to reason, so the model does not emit the opening tag itself; it goes straight into its thinking and then closes it with <end_working_out>.)
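The check itself can be run with the embedded vLLM engine: apply the chat template with the generation prompt (which ends in <start_working_out>) and sample a completion. A rough sketch, assuming Unsloth's fast_generate API; the sampling parameters are assumptions.

```python
# Sketch of the post-SFT format check; sampling parameters are assumptions.
from vllm import SamplingParams

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Jenifer has 82 cents in pennies and nickels. ..."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,   # appends <start_working_out> to prime the reasoning
)

output = model.fast_generate(
    text,
    sampling_params=SamplingParams(temperature=0.7, top_p=0.95, max_tokens=1024),
)[0].outputs[0].text
print(output)   # expect ...<end_working_out><SOLUTION>17</SOLUTION>
```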
Processing the GRPO dataset
===== step4. 加载并处理数据集 ==============================================================----- 数据集 DAPO-Math-17k-Processed:Dataset({ features: ['prompt', 'solution', 'data_source', 'source_prompt', 'ability', 'reward_model', 'extra_info'], num_rows: 14116})----- 第一条的 prompt: In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.----- 第一条的 solution: 34Map: 100%|██████████| 14116/14116 [00:01<00:00, 11664.38 examples/s]----- 第1条对话格式的内容:{'prompt': [{'content': 'You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION> and </SOLUTION>', 'role': 'system'}, {'content': 'In triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.', 'role': 'user'}], 'solution': '34', 'data_source': 'math_dapo', 'source_prompt': [{'content': 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nIn triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.\n\nRemember to put your answer on its own line after "Answer:".', 'role': 'user'}], 'ability': 'MATH', 'reward_model': {'ground_truth': '34', 'style': 'rule-lighteval/MATH_v2'}, 'extra_info': {'index': '9a9b6eb4-a1cb-49d1-8c1e-62eaf2f74079'}, 'answer': '34'}Map: 100%|██████████| 14116/14116 [00:04<00:00, 3005.13 examples/s]You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.<start_working_out>Map: 100%|██████████| 14116/14116 [00:02<00:00, 5114.02 examples/s]You are given a problem.Think about the problem and provide your working out.Place it between <start_working_out> and <end_working_out>.Then, provide your solution between <SOLUTION> and </SOLUTION><|endoftext|>In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. 
If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.<start_working_out>Map: 100%|██████████| 14116/14116 [00:02<00:00, 5114.02 examples/s]==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1 \\ /| Num examples = 12,709 | Num Epochs = 1 | Total steps = 100O^O/ \_/ \ Batch size per device = 4 | Gradient accumulation steps = 1\ / Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4 "-____-" Trainable parameters = 66,060,288/4,088,528,384 (1.62% trained)Max Length = 203
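A hedged sketch of the step-4 mapping that turns each DAPO-Math row into the {prompt, answer} structure shown above. The real script likely also tokenizes the prompts to drop overly long ones, since the banner reports Max Length = 203 and 12,709 remaining examples out of 14,116; the exact length threshold below is an assumption.

```python
# Sketch of the GRPO dataset preparation; filtering details are assumptions.
from datasets import load_dataset

dapo = load_dataset(
    "parquet",
    data_files="/datasets/open-r1/DAPO-Math-17k-Processed/en/train-00000-of-00001.parquet",
    split="train",
)   # 14,116 rows

def to_grpo_format(example):
    # Wrap each problem in the system prompt / user message structure shown in the log.
    return {
        "prompt": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": example["prompt"]},
        ],
        "answer": example["solution"],
    }

grpo_dataset = dapo.map(to_grpo_format)

# Assumed length filter: drop prompts that exceed a token budget so that
# prompt + completion fits the GRPO sequence length on a 24 GB card.
grpo_dataset = grpo_dataset.filter(
    lambda ex: len(tokenizer.apply_chat_template(
        ex["prompt"], tokenize=True, add_generation_prompt=True)) <= 256
)
```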
Defining and testing the reward functions
The training script defines four reward functions; below are scoring examples for two of them, followed by a hedged sketch of what such a reward function looks like.
===== step5. 定义并测试奖励函数 ============================================================
match_format:
re.compile('<end_working_out>.*?<SOLUTION>(.+?)</SOLUTION>[\\s]{0,}(?:<\\|endoftext\\|>)?[\\s]{0,}$', re.MULTILINE|re.DOTALL)
----- 奖励函数 check_answer 评分样例:
Case | Question   | Response                                                               | Answer | Extracted | Score
------------------------------------------------------------------------------------------
1    | Q: 2+2 = ? | Let me think!<end_working_out><SOLUTION>4</SOLUTION>                  | 4      | 4         | 5.0
2    | Q: Hello?  | <start_working_out>think..<end_working_out><SOLUTION> yes </SOLUTION> | yes    | yes       | 3.5
3    | Q: Value?  | !<end_working_out><SOLUTION>9.5</SOLUTION>                            | 10     | 9.5       | 2.0
4    | Q: Value?  | !<end_working_out><SOLUTION>8.3</SOLUTION>                            | 10     | 8.3       | 1.5
5    | Q: Value?  | i!<end_working_out><SOLUTION>5</SOLUTION>                             | 10     | 5         | -2.5
6    | Q: Answer? | i!<end_working_out><SOLUTION>no digit</SOLUTION>                      | 42     | no digit  | -4.5
7    | Q: String? | f!<end_working_out><SOLUTION>oobar</SOLUTION>                         | baz    | oobar     | -4.5
----- 奖励函数 check_numbers 评分样例:
Case | Question        | Response(s)          | Answer(s) | Score(s)
--------------------------------------------------------------------------------------
1    | Q: 2+2=?        | <SOLUTION>4          | 4         | 3.5
2    | 问:总量?       | <SOLUTION> 1,234.00  | 1234.0    | 3.5
3    | Q: 输出吧       | <SOLUTION>没有数字   | 0         | -2.5
4    | Q: 10-3=?       | <SOLUTION>5          | 7         | -1.5
5    | Q: 1+1=?,2+2=?  | <SOLUTION>2,4        | 2,4       | 3.5,-2.5
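For reference, a reward function compatible with TRL's GRPOTrainer receives the batched prompts, completions, and ground-truth answers and returns one score per completion. The sketch below is a simplified stand-in for check_numbers, not the script's real implementation; the scoring values only approximate the +3.5 / -1.5 / -2.5 pattern visible in the table.

```python
# Simplified stand-in for the check_numbers reward; scoring values are approximations.
import re

solution_number = re.compile(r"<SOLUTION>.*?([\d,\.]+)", re.DOTALL)

def check_numbers(prompts, completions, answer, **kwargs):
    """Return one score per completion: positive if the first number after
    <SOLUTION> equals the reference answer, negative otherwise or if missing."""
    scores = []
    for completion, true_answer in zip(completions, answer):
        response = completion[0]["content"]   # conversational format: a list of messages
        match = solution_number.search(response)
        if match is None:
            scores.append(-2.5)
            continue
        try:
            guess = float(match.group(1).replace(",", ""))
            scores.append(3.5 if guess == float(true_answer) else -1.5)
        except ValueError:
            scores.append(-2.5)
    return scores
```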
Running GRPO training
===== step6. 训练模型 =====================================================================
...
  5%|▌         | 5/100 [02:19<40:11, 25.38s/it]
********************Question:
...
100%|██████████| 100/100 [46:13<00:00, 27.74s/it]
can fire at a single
Extracted:None
{'loss': 0.0067, 'grad_norm': 0.2612026631832123, 'learning_rate': 2.7777777777777776e-07, 'rewards/match_format_exactly': 2.25, 'rewards/match_format_approximately': 0.375, 'rewards/check_answer': 3.25, 'rewards/check_numbers': 2.0, 'reward': 7.875, 'reward_std': 10.25, 'completion_length': 1379.25, 'kl': 0.1681276112794876, 'epoch': 0.01}
{'loss': 0.0052, 'grad_norm': 0.22037602961063385, 'learning_rate': 2.2222222222222224e-07, 'rewards/match_format_exactly': 1.5, 'rewards/match_format_approximately': -0.75, 'rewards/check_answer': 1.5, 'rewards/check_numbers': 0.5, 'reward': 2.75, 'reward_std': 11.83568000793457, 'completion_length': 1665.75, 'kl': 0.1294582337141037, 'epoch': 0.01}
{'loss': 0.0043, 'grad_norm': 0.18991202116012573, 'learning_rate': 1.6666666666666668e-07, 'rewards/match_format_exactly': 0.75, 'rewards/match_format_approximately': -1.875, 'rewards/check_answer': -0.25, 'rewards/check_numbers': -1.0, 'reward': -2.375, 'reward_std': 10.25, 'completion_length': 1775.5, 'kl': 0.1086689755320549, 'epoch': 0.01}
{'loss': 0.0046, 'grad_norm': 0.025854697450995445, 'learning_rate': 1.1111111111111112e-07, 'rewards/match_format_exactly': 0.0, 'rewards/match_format_approximately': -3.0, 'rewards/check_answer': -2.0, 'rewards/check_numbers': -2.5, 'reward': -7.5, 'reward_std': 0.0, 'completion_length': 1844.0, 'kl': 0.11534835398197174, 'epoch': 0.01}
{'loss': 0.0052, 'grad_norm': 0.05398529767990112, 'learning_rate': 5.555555555555556e-08, 'rewards/match_format_exactly': 0.0, 'rewards/match_format_approximately': -3.0, 'rewards/check_answer': -2.0, 'rewards/check_numbers': -2.5, 'reward': -7.5, 'reward_std': 0.0, 'completion_length': 1844.0, 'kl': 0.1298024207353592, 'epoch': 0.01}
{'train_runtime': 2773.8914, 'train_samples_per_second': 0.144, 'train_steps_per_second': 0.036, 'train_loss': 0.005772246685810387, 'epoch': 0.01}
Here we set max_steps = 100 so the test run finishes quickly. For a real training run, control the number of passes through the data via the epoch count instead, and stop early based on how the loss and reward curves converge.
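The step-6 banner (12,709 examples, batch size 4 per device, 100 total steps, Max Length = 203) is consistent with a GRPO configuration roughly like the following. This is a hedged sketch assuming TRL's GRPOTrainer as wrapped by Unsloth; the generation count, length budgets, and learning rate are assumptions read off the log.

```python
# Sketch of the GRPO training setup; values are assumptions matched to the log.
from trl import GRPOConfig, GRPOTrainer

max_prompt_length = 256            # the log reports Max Length = 203 for prompts

grpo_trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        match_format_exactly,          # the four reward functions defined in step 5
        match_format_approximately,
        check_answer,
        check_numbers,
    ],
    args=GRPOConfig(
        learning_rate=5e-6,                       # assumed; the log decays toward ~5.6e-8
        per_device_train_batch_size=4,            # matches "Batch size per device = 4"
        num_generations=4,                        # assumed; completions sampled per prompt
        max_prompt_length=max_prompt_length,
        max_completion_length=2048 - max_prompt_length,
        max_steps=100,                            # capped for this quick test run
        logging_steps=20,
        save_steps=100,
        report_to="none",
        output_dir="outputs_grpo",
    ),
    train_dataset=grpo_dataset,
)
grpo_trainer.train()
```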
Testing the trained model
Processed prompts: 100%|██████████| 1/1 [00:11<00:00, 11.75s/it, est. speed input: 0.85 toks/s, output: 87.15 toks/s]----- 基础模型的回答: - AnswersMath and ArithmeticWhat is the sqrt of 101?Wiki User∙ 2010-05-29 22:38:13Best AnswerCopyThe square root of 101 is 10.0498756 approximatelyWiki User∙ 2010-05-29 22:38:13This answer is:🙏0🤨0😮0What is the square root of 101?It is approx. 10.0498756211.What is the square root of -101?The square root of -101 can be written as the product of the positive square root of 101 and i (where i is an imaginary number). The square root of 101 is approximately 10.04987751.What is the square root of 101 simplified?sqrt(101) is already simplified since 101 is not a perfect square. Also, we cannot simplify it since 101 is a prime number. (In other words, 101 = 1 x 101, so its only factorization is 1 and 101) In decimal form it is: 10.049875621120891586572919348985505109596599416484785647300807046What is the square root of 20200?sqrt (20200) = sqrt (4 x 100 x 505) = sqrt (4) x sqrt (100) x sqrt (505) = 2 x 10 x sqrt (505) = 20 x sqrt (505) = 10 x sqrt (4) x sqrt (505) = 50 sqrt (149)Is 17 a square root?17 is a square root.What is the square root of 3 in 101?sqrt(3 in 101) = sqrt(101) x sqrt(3) = sqrt(101 x 3) = sqrt(303) = 17.4069...approx.What is an irrational number?-101 as a number.Why is one sixth of one third the same as one square root of one hundred sixty nine?sqrt(169) = 13 1/6 of 1/3 = (1/6)(1/3) = 1/(6x3) = 1/18 = (1/13)(1/13) = 1/sqrt(169)What number when squared equals six?If you meant sqrt(6)2 then this = 6 and sqrt(6) = 2.4494... For the number to be a square root you need the 6 to be in the denominator or the square root of 6.What is -20 sqrt of 101?-20 square root of (101) - 20 * sqrt(101) - 20 * sqrt(101) is a real number and cannot be simplified any further.What is the square root of 101.8?9.046921979920897...What is the square root of 7561?As sqrt(7561) = sqrt(169)*sqrt(41) = 13*sqrt(41) ~= 274.84779...What is the sqrt of 0.25?The sqrt of 0.25 is 0.5What is the square root of 169 over 10?It is 1.3How do you find the square root of 51?The square root of 51 is approx. 7.1414 The easiest way to do that is to use a calculator. If you do not have a calculator, I strongly suggest using one, since the sqrt of 51 is an irrational number with an infinite amount of decimal places.Square and cube roots can be calculated the old-fashioned manner by using trial and error. 72 = 49 which is too small; 82 = 64 which is too big, etc. If you need to go that route, you need to know your basicProcessed prompts: 100%|██████████| 1/1 [00:26<00:00, 26.65s/it, est. speed input: 2.25 toks/s, output: 74.61 toks/s]
----- GRPO-LoRA模型的回答:Okay, so I need to find the square root of 101. Hmm, let me think. The square root of a number is the value that, when multiplied by itself, gives the original number. But 101 seems like it's not a perfect square, right? I remember that perfect squares like 100, 121, 144, etc., are numbers that have exact square roots since they're squares of integers. 101 is just 3 less than 100, which is 10 squared. So maybe √101 is close to 10 but not exactly 10. Let me calculate 10 squared first. 10 × 10 is 100. So √101 is a little more than 10. How much more? Well, 10.5 squared is 110.25, which is higher than 101. So it has to be between 10 and 10.5. Maybe 10.05? Let me try that. 10.05 squared is 10.05 × 10.05. Let me compute that. 10 × 10 is 100, 10×0.05 is 0.5, 0.05×10 is 0.5, and 0.05×0.05 is 0.0025. Adding them up: 100 + 0.5 + 0.5 + 0.0025 = 101.0025. That's very close to 101, so √101 is approximately 10.05. But wait, 10.04 squared might be slightly less. Let me check 10.04: 10.04 × 10.04. This is 10.04 × 10 = 100.4, then 0.04 × 10 = 0.4, and 0.04×0.04 = 0.0016. So total: 100.4 + 0.4 = 100.8, plus 0.0016, which is 100.8016. That's less than 101. So √101 is between 10.04 and 10.05. Maybe 10.045? Let me try 10.045: 10.045 squared. Hmm, 10×10=100, 10×0.045=0.45, 0.045×10=0.45, 0.045×0.045=0.002025. Adding: 100 + 0.45 + 0.45 = 100.9, plus 0.002025 is 100.902025, which is still less than 101. So it's closer to 10.04. Maybe 10.042? Let me try 10.042: 10.042 squared. 10×10=100, 10×0.042=0.42, 0.042×10=0.42, 0.042×0.042=0.001764. Adding: 100 + 0.42 + 0.42 = 100.84, plus 0.001764 is 100.841764, which is still below 101. So it's closer to 10.043. Let me try 10.043: 10.043 squared. 10×10=100, 10×0.043=0.43, 0.043×10=0.43, 0.043×0.043=0.001849. Adding: 100 + 0.43 + 0.43 = 100.86, plus 0.001849 is 100.861849, still less than 101. So √101 is approximately 10.043. But I wonder if there's a better way. Maybe using the Newton-Raphson method for square roots? Let me recall that method. Let x be the number we want to find the square root of. Start with an initial guess, say 10. Then the method is x_(new) = (x + N/x)/2, where N is the number, which is 101. So first guess is 10. Then, (10 + 101/10)/2 = (10 + 10.1)/2 = 20.1/2 = 10.05. Next iteration: (10.05 + 101/10.05)/2. 101 divided by 10.05 is approximately 10.049751. So 10.05 + 10.049751 = 20.099751, divided by 2 is 10.049875. So after two iterations, the approximation is approximately 10.049875. That's more accurate than my previous guesses. So the square root of 101 is approximately 10.049875. But the problem didn't specify how precise the answer needs to be, so maybe just the decimal approximation is acceptable. So let me express that in decimal form. 10.049875... Hmm, four decimal places would be 10.0499. But to be precise, I should keep more. So maybe 10.049875? But that's a bit too far. Let me verify with the Newton-Raphson method again. Start with 10. (10 + 101/10)/2 = 10.05. Then (10.05 + 101/10.05)/2. 101/10.05 is 10.04975. So 10.05 + 10.04975 = 20.09975. Divided by 2 is 10.049875. Then next iteration: (10.049875 + 101/10.049875)/2. 101 divided by 10.049875 is approximately 10.049874. So 10.049875 + 10.049874 = 20.099749, divided by 2 is 10.0498745. So after three iterations, it's approximately 10.0498745. So the square root of 101 is approximately 10.049875. Therefore, I think the answer is around 10.05. But for a better approximation, maybe using more iterations or a calculator. 
But since the problem doesn't specify, I'll go with 10.05. Let me check if 10.05 squared is 101. 10.05 × 10.05. Let me multiply that out. 10×10=100, 10×0.05=0.5, 0.05×10=0.5, 0.05×0.05=0.0025. Adding: 100 + 0.5 + 0.5 = 101, plus 0.0025 is 101.0025. So √101 is slightly less than 10.05. Therefore, the square root is approximately 10.049875. So in decimal form, that's about 10.050. Therefore, the square root of 101 is approximately 10.05.To find the square root of 101, we can use the Newton-Raphson method for approximation. The method starts with an initial guess and iteratively refines it.
1. **Initial Guess**: Start with \( x_0
You can see that the answer is truncated because of its length: the test code in this article caps generation at max_tokens = 1024 during inference. This has been addressed in the latest version of the training code.
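The base-model versus GRPO-LoRA comparison above can be reproduced by generating once without and once with the trained LoRA attached. A rough sketch assuming Unsloth's fast_generate/save_lora/load_lora helpers; the saved-LoRA directory name and sampling settings are assumptions.

```python
# Sketch of the before/after comparison; "grpo_saved_lora" is a hypothetical directory name.
from vllm import SamplingParams

question = "What is the sqrt of 101?"
text = tokenizer.apply_chat_template(
    [{"role": "system", "content": system_prompt},
     {"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
sampling_params = SamplingParams(temperature=1.0, top_k=50, max_tokens=1024)  # 1024 caused the truncation above

# Base model answer (no LoRA attached).
base_answer = model.fast_generate(
    [text], sampling_params=sampling_params, lora_request=None,
)[0].outputs[0].text

# GRPO-LoRA answer: attach the LoRA weights saved after GRPO training.
model.save_lora("grpo_saved_lora")            # save the trained adapter first
lora_answer = model.fast_generate(
    [text], sampling_params=sampling_params,
    lora_request=model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text
```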
Merging and saving the model
LoRA training does not modify the original model, nor does it produce a complete new model. It produces a small, separate set of LoRA weights that contains everything that was learned. In the test above, these LoRA weights were loaded as an add-on alongside the original weights for inference. Once testing is done, we merge the add-on LoRA weights into the original model to produce a new, complete model.
===== step8. 合并及保存模型 ===============================================================
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 336.32 out of 503.72 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...
 17%|█▋        | 6/36 [00:00<00:00, 57.02it/s]We will save to Disk and not RAM now.
100%|██████████| 36/36 [00:05<00:00, 6.11it/s]
Unsloth: Saving tokenizer... Done.
Done.
The resulting model can then be quantized further as needed.
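A hedged sketch of the merge-and-save step and an optional GGUF quantization, using Unsloth's save helpers; the output paths and the quantization method are assumptions, not the script's real values.

```python
# Sketch of step 8: merge the LoRA into the base weights and save a full 16-bit model.
# Output paths and the quantization method are assumptions.
model.save_pretrained_merged(
    "/models/Qwen3-4B-GRPO-merged",   # hypothetical output directory
    tokenizer,
    save_method="merged_16bit",       # corresponds to the "Merging ... to 16bit" log above
)

# Optional: export a quantized GGUF for llama.cpp / Ollama style deployment.
model.save_pretrained_gguf(
    "/models/Qwen3-4B-GRPO-gguf",     # hypothetical output directory
    tokenizer,
    quantization_method="q4_k_m",     # assumed quantization level
)
```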
How to learn large-model AI?
Because the productivity of the new jobs is higher than that of the jobs they replace, society's overall productivity actually rises.
But for any individual, the most that can be said is:
"Those who master AI first will have a competitive advantage over those who master it later."
The same was true at the dawn of the computer, the internet, and the mobile internet.
In my ten-plus years at top-tier internet companies I have mentored many junior colleagues and helped a lot of people learn and grow.
I realized that I have a lot of experience and knowledge worth sharing, and that my skills and experience can help answer many of the questions people run into when learning AI, so even with a busy schedule I keep organizing and sharing material. The channels for spreading knowledge are limited, however, and many friends in the internet industry cannot get the right materials to improve, so I am sharing the important large-model AI resources for free, including a learning mind map for getting started with large models, curated books and handbooks, video tutorials, and recorded hands-on courses.
Stage 1 (10 days): entry-level applications
This stage gives you a state-of-the-art understanding of large-model AI that puts you ahead of 95% of people, so you can offer informed, independent, and practical opinions in discussions. While others can only chat with AI, you will be able to coach it and connect large models to business systems with code.
- What can large-model AI do?
- How do large models acquire "intelligence"?
- The core mindset for using AI well
- Business architecture of large-model applications
- Technical architecture of large-model applications
- Code example: feeding new knowledge into GPT-3.5
- The purpose and core ideas of prompt engineering
- Typical structure of a prompt
- Instruction-tuning methodology
- Chain of thought and tree of thought
- Prompt attacks and defenses
- …
Stage 2 (30 days): advanced applications
In this stage we move into hands-on advanced work with large-model AI: learn to build a private knowledge base, extend the AI's capabilities, and quickly develop a complete agent-based chatbot. Master the most capable large-model development frameworks and keep up with the latest technical advances. Suitable for Python and JavaScript programmers.
- Why RAG?
- Building a simple ChatPDF
- Fundamentals of retrieval
- What are vector representations (embeddings)?
- Vector databases and vector search
- RAG based on vector retrieval
- Additional knowledge for building RAG systems
- Introduction to hybrid retrieval and RAG-Fusion
- Local deployment of embedding models
- …
Stage 3 (30 days): model training
Congratulations: if you have made it this far, you can basically land a large-model AI job and train your own GPT. Through finetuning you can build your own vertical large model, independently train open-source multimodal models, and master more technical approaches.
That is roughly two months in, and you have become an "AI kid". Want to keep exploring?
- Why RAG?
- What is a model?
- What is model training?
- Introduction to solvers & loss functions
- Mini-experiment 2: hand-write a simple neural network and train it
- What is training / pre-training / finetuning / parameter-efficient finetuning?
- Introduction to the Transformer architecture
- Parameter-efficient finetuning
- Building experimental datasets
- …
Stage 4 (20 days): closing the business loop
Gain a working understanding of large models worldwide in terms of performance, throughput, and cost; deploy large models in the cloud and on-premises; find a project or startup direction that suits you; and become an AI-empowered product manager.
- Hardware selection
- A tour of large models around the world
- Using domestic (Chinese) large-model services
- Setting up an OpenAI proxy
- Warm-up: deploying Stable Diffusion on Alibaba Cloud PAI
- Running large models on a local machine
- Private deployment of large models
- Deploying large models with vLLM
- Case study: privately deploying an open-source large model on Alibaba Cloud, elegantly
- Deploying an open-source LLM project
- Content safety
- Algorithm filing for internet information services
- …
Learning is a process, and wherever there is learning there are challenges. Hard work pays off: the more effort you put in, the better version of yourself you become.
If you can finish all the tasks within 15 days, you can call yourself a genius. But if you complete 60-70% of the material, you are already showing the right traits of a large-model AI practitioner.