RoBERTa and BERT models can now be deployed with vLLM

Latest recommended article published 2025-10-31 13:12:41

License: CC 4.0 BY-SA

Tags: #bert #artificial-intelligence #deep-learning

OpenAI Embedding Client — vLLM

https://github.com/huggingface/text-embeddings-inference

Once the image is built, start the service with:

text-embeddings-router --model-id /mnt/model --port 8811 [--pooling cls, optional] --dtype float16 --json-output [more parameters are available; see the Hugging Face link above]
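Before wiring up the full client, a single request is a quick way to check that the service answers. The sketch below uses only the standard library. The URL is an assumption: it supposes the server runs on localhost with the port from the command above and exposes the `/embed` route described in the text-embeddings-inference README. The helper names `build_embed_request` and `embed` are hypothetical.

```python
import json
import urllib.request

# Assumption: the service started above is reachable here and serves /embed.
EMBED_URL = "http://localhost:8811/embed"


def build_embed_request(texts, url=EMBED_URL):
    """Build a POST request that carries {"inputs": [...]} as JSON."""
    body = json.dumps({"inputs": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, url=EMBED_URL, timeout=30.0):
    """Send the texts and return the parsed JSON response."""
    request = build_embed_request(texts, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `embed(["hello"])` against a running server should return a list with one vector per input text, which is the shape the client code in this post relies on.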

Client code (I call the service concurrently and asynchronously):

MAX_CHARS = 500
# Either set a maximum token count when deploying the service, or truncate the texts here.
input_texts = [text[:MAX_CHARS] for text in examples]
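Cutting at 500 characters is only a rough stand-in for a token limit, since the number of tokens per character depends on the tokenizer and the language. As a sketch of the two options, the helpers below keep the character cut and can also ask the server to truncate by tokens. The `"truncate"` request field is an assumption based on the text-embeddings-inference API documentation, so check it against the version you deploy. `truncate_texts` and `build_payload` are hypothetical names.

```python
MAX_CHARS = 500  # same character cap as the client code in this post


def truncate_texts(texts, max_chars=MAX_CHARS):
    """Cut each text to at most max_chars characters (characters, not tokens)."""
    return [text[:max_chars] for text in texts]


def build_payload(texts, server_side_truncate=False):
    """Build the request body, optionally asking the server to truncate too.

    Assumption: the server accepts a boolean "truncate" field next to "inputs".
    """
    payload = {"inputs": truncate_texts(texts)}
    if server_side_truncate:
        payload["truncate"] = True
    return payload
```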

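import asyncio
import aiohttp

# Method of the client class (the rest of the class is not shown in this post).
# Note: with the default max_retries=1 the request is attempted once, with no retry.
async def _process_batch_async(self, session, batch_texts, batch_qids, batch_idx, max_retries=1):
    """Send one batch to the embedding service and map each qid to its embedding."""
    payload = {"inputs": batch_texts}
    # aiohttp expects a ClientTimeout object rather than a bare number.
    timeout = aiohttp.ClientTimeout(total=30.0)
    for attempt in range(max_retries):
        try:
            async with session.post(self.api_url, json=payload, timeout=timeout) as response:
                response.raise_for_status()
                response_json = await response.json()
                # The response is a list with one embedding per input, in input order.
                return dict(zip(batch_qids, response_json))
        except Exception:
            if attempt + 1 == max_retries:
                return {}  # Give up: return an empty dict on failure
            await asyncio.sleep(1)  # Wait 1 second before retrying
    return {}  # Reached only when max_retries <= 0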
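The retry loop above waits a fixed second between attempts. If the service is overloaded, a growing delay tends to be gentler on it. The helper below is a minimal sketch of exponential backoff, not part of the original client: `retry_async`, `max_attempts`, and `base_delay` are hypothetical names, and it returns `None` on failure instead of an empty dict.

```python
import asyncio


async def retry_async(make_call, max_attempts=3, base_delay=1.0):
    """Await make_call() up to max_attempts times, doubling the delay each time.

    Returns the call's result, or None once every attempt has failed.
    """
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except Exception:
            if attempt + 1 == max_attempts:
                return None
            # Delays are base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * (2 ** attempt))
    return None  # Reached only when max_attempts <= 0
```

`make_call` is a zero-argument function that returns a fresh coroutine on every call, for example a `lambda` wrapping the batch request, because a coroutine object can only be awaited once.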

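# This fragment runs inside an async method of the same class;
# batch_size, qids, and dqid2emb are assumed to be defined by the surrounding code.
tasks = []
async with aiohttp.ClientSession() as session:
    for i in range(0, len(input_texts), batch_size):
        batch_texts = input_texts[i:i + batch_size]
        batch_qids = qids[i:i + batch_size]
        # Create one coroutine per batch; nothing is sent until gather() is awaited.
        task = self._process_batch_async(session, batch_texts, batch_qids, i)
        tasks.append(task)

    # Run all batches concurrently and wait for every result.
    batch_results_list = await asyncio.gather(*tasks)

    # Merge the per-batch {qid: embedding} dicts; failed batches are empty and skipped.
    for batch_results in batch_results_list:
        if batch_results:
            dqid2emb.update(batch_results)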
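The loop above launches every batch at once, which can flood the server when the input is large. One common way to cap the load is an `asyncio.Semaphore`. The sketch below is an illustration under that assumption rather than the code I run: `make_batches`, `gather_limited`, and `max_concurrency` are hypothetical names.

```python
import asyncio


def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async def gather_limited(coros, max_concurrency=8):
    """Run coroutines with at most max_concurrency in flight, keeping input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(coro):
        # Each coroutine waits here until one of the slots is free.
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run_one(coro) for coro in coros))
```

In the client above, `await gather_limited(tasks)` could take the place of `await asyncio.gather(*tasks)`; the results come back in the same order, so the merge loop stays unchanged.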
