LLM inference
Average article quality score: 84
张博208
Knowledge porter
Understanding Low-Rank Adaptation (LoRA) for Efficient Fine-Tuning of Large Language Models
This blog post will go into detail about how LoRA works to fine-tune LLMs, following the methodology set out in the “LoRA: Low-Rank Adaptation of Large Language Models” paper. Original · 2024-07-26 15:12:17 · 740 reads · 0 comments
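As a rough illustration of the idea the entry above refers to (not code from the post itself), the sketch below wraps a frozen linear layer with a trainable low-rank update in PyTorch. The class name `LoRALinear` and the defaults `r=8`, `alpha=16` are assumptions made here for illustration only.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base weight W plus a low-rank update (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen

        # Low-rank factors: A projects down to rank r, B projects back up.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the trainable low-rank correction.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Usage sketch: only the two small LoRA matrices receive gradients.
layer = LoRALinear(nn.Linear(768, 768))
y = layer(torch.randn(4, 768))
```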
vLLM series
Architecture overview. Reposted · 2024-07-16 16:20:50 · 25 reads · 0 comments
Dissecting model performance
[Code] Dissecting model performance. Original · 2024-07-16 16:15:11 · 377 reads · 0 comments
KV caching, a deeper look
In the previous post, we introduced KV caching, a common optimization of the LLM inference process that makes the compute requirements of the (self-)attention mechanism scale linearly rather than quadratically in the total sequence length (prompt + generated tokens). Original · 2024-07-16 16:13:34 · 852 reads · 0 comments
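To make the scaling argument in the snippet above concrete, here is a minimal, self-contained sketch (assumed shapes and names, not code from the post) of a toy decode loop that appends each new key/value pair to a growing cache, so each step attends over the cached sequence instead of recomputing it from scratch.

```python
import torch

def attend(q, k, v):
    # Scaled dot-product attention for one new query over all cached keys/values.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Hypothetical shapes: batch=1, head_dim=64; one token generated per step.
head_dim = 64
k_cache = torch.empty(1, 0, head_dim)  # grows by one row per generated token
v_cache = torch.empty(1, 0, head_dim)

for step in range(5):
    # In a real model these come from projecting the newest token's hidden state.
    q_new = torch.randn(1, 1, head_dim)
    k_new = torch.randn(1, 1, head_dim)
    v_new = torch.randn(1, 1, head_dim)

    # Append the new key/value instead of recomputing them for the whole sequence:
    # the per-step attention cost is then proportional to the current sequence length.
    k_cache = torch.cat([k_cache, k_new], dim=1)
    v_cache = torch.cat([v_cache, v_new], dim=1)

    out = attend(q_new, k_cache, v_cache)  # shape (1, 1, head_dim)
```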
KV caching explained
[Code] KV caching explained. Original · 2024-07-16 16:11:11 · 881 reads · 0 comments
The two-phase process behind LLMs’ responses
LLM inference. Original · 2024-07-16 16:03:51 · 791 reads · 0 comments
Transformers KV Caching Explained
K-V cache. Original · 2024-07-12 13:15:15 · 614 reads · 0 comments