Exploring the Aphrodite Engine: A New Option for Accelerating AI Inference

Article link: https://blog.csdn.net/ahdfwcevnhrtds/article/details/143826853

Introduction

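In today's AI landscape, efficient inference is essential for large-scale applications. The Aphrodite engine is the open-source inference engine that powers the PygmalionAI website, serving thousands of users, and it has drawn developer attention for its high throughput and low latency. This article shows how to use the Aphrodite engine together with Langchain to build a large-scale inference service. It also covers some of Aphrodite's key features, including its support for state-of-the-art (SOTA) sampling methods and its performance optimizations at low batch sizes.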

Main Content

Core Features of the Aphrodite Engine

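Optimized attention: uses vLLM's attention mechanism for high throughput and low latency.
Support for many SOTA sampling methods: gives more variety and flexibility in how the model generates text.
Exllamav2 GPTQ kernels: provide better throughput at lower batch sizes.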
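
To make the sampling feature more concrete, here is a minimal sketch of min-p filtering, one of the sampling methods exposed through the min_p parameter used later in this article. It is a plain-Python illustration, not Aphrodite's actual implementation: the helper names softmax and min_p_filter and the example logits are assumptions made for this sketch, and a real engine may apply temperature and min-p in a different order.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, scaled by a sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Zero out tokens whose probability is below min_p * (top probability),
    then renormalize the survivors so they sum to 1."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


# Hypothetical logits for a four-token vocabulary.
logits = [4.0, 3.0, 1.0, -2.0]
probs = softmax(logits, temperature=1.2)
filtered = min_p_filter(probs, min_p=0.05)

for before, after in zip(probs, filtered):
    print(f"{before:.4f} -> {after:.4f}")
```

Because the cutoff scales with the most likely token, min-p removes more candidates when the model is confident and fewer when the distribution is flat.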

Environment Setup

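Make sure the aphrodite-engine Python package and langchain-community are installed. The %pip commands below are Jupyter notebook magics; in a regular shell, run pip install with the same arguments.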

%pip install -qU langchain-community
%pip install --upgrade --quiet aphrodite-engine==0.4.2
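
After installing, it can help to confirm which versions ended up in the environment. The following is a small sketch using only the standard library; the helper name installed_version is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("aphrodite-engine", "langchain-community"):
    found = installed_version(name)
    print(f"{name}: {found if found is not None else 'not installed'}")
```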

Using Langchain with Aphrodite

Use the Langchain library to set up a simple LLM wrapper around Aphrodite.

from langchain_community.llms import Aphrodite

llm = Aphrodite(
    model="PygmalionAI/pygmalion-2-7b",
    trust_remote_code=True,  # must be True to use Hugging Face models
    max_tokens=128,
    temperature=1.2,
    min_p=0.05,
    mirostat_mode=0,  # change to 2 to use mirostat
    mirostat_tau=5.0,
    mirostat_eta=0.1,
)

response = llm.invoke(
    '<|system|>Enter RP mode. You are Ayumu "Osaka" Kasuga.<|user|>Hey Osaka. Tell me about yourself.<|model|>'
)
print(response)
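
The prompt in the example above is written by hand using the <|system|>, <|user|> and <|model|> markers. As a sketch, a small helper can assemble that string from structured turns. The function name build_prompt and its argument layout are assumptions for illustration; only the marker format is taken from the example.

```python
def build_prompt(system, turns):
    """Assemble a prompt in the <|system|>/<|user|>/<|model|> format.

    turns is a list of (role, text) pairs where role is "user" or "model".
    The prompt ends with an open <|model|> marker so the model writes the reply.
    """
    parts = [f"<|system|>{system}"]
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"<|{role}|>{text}")
    parts.append("<|model|>")
    return "".join(parts)


prompt = build_prompt(
    'Enter RP mode. You are Ayumu "Osaka" Kasuga.',
    [("user", "Hey Osaka. Tell me about yourself.")],
)
print(prompt)
```

The resulting string can be passed to llm.invoke(prompt) in place of the hand-written literal.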

Distributed Inference

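The Aphrodite engine supports distributed tensor-parallel inference, so a model can be split across multiple GPUs to improve performance. Set tensor_parallel_size to the number of GPUs to use.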

llm = Aphrodite(
    model="PygmalionAI/mythalion-13b",
    tensor_parallel_size=4,
    trust_remote_code=True,  # must be True
)

response = llm.invoke("What is the future of AI?")
print(response)
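
When choosing tensor_parallel_size, vLLM-style engines generally require the model's attention head count to be divisible by the parallel size. The sketch below picks the largest valid size for a given GPU count. The helper name pick_tensor_parallel_size is hypothetical, and the head count of 40 is an illustrative assumption for a 13B Llama-style model; read the real value from the model's configuration.

```python
def pick_tensor_parallel_size(num_gpus, num_attention_heads):
    """Return the largest size <= num_gpus that evenly divides the head count."""
    if num_gpus < 1 or num_attention_heads < 1:
        raise ValueError("num_gpus and num_attention_heads must be positive")
    for size in range(min(num_gpus, num_attention_heads), 0, -1):
        if num_attention_heads % size == 0:
            return size
    return 1  # unreachable in practice, since 1 divides every head count


# Assumed values for illustration: 4 GPUs, 40 attention heads.
tp_size = pick_tensor_parallel_size(num_gpus=4, num_attention_heads=40)
print(tp_size)
```

The chosen value would then be passed as tensor_parallel_size when constructing the Aphrodite object.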