【评测】Qwen3-Embedding模型初体验

小声读源码

已于 2025-06-09 17:09:10 修改

阅读量774

点赞数 4

分类专栏： Dify 文章标签： embedding Qwen3

于 2025-06-08 21:58:01 首次发布

本文链接：https://blog.csdn.net/u010593516/article/details/148519273

版权

Dify 专栏收录该内容

34 篇文章

订阅专栏

回到目录

Qwen3-Embedding的ollama部署方法可以参考【部署】dify+ollama部署Qwen3-Embedding-8B

【评测】Qwen3-Embedding模型初体验

模型的介绍页面
0.6B运行配置：笔记本i5-8265U，16G内存，无GPU核显运行，win10操作系统
8B运行配置：AMD8700G，64G内存，4090D 24G显存，ubuntu24.04操作系统

下面直接使用介绍页面的sample代码体验一下模型的威力。

1. modelscope下载模型

$ modelscope download --model Qwen/Qwen3-Embedding-0.6B
$ modelscope download --model Qwen/Qwen3-Embedding-8B
0.6B模型 1.12GB 8B模型 14.1GB

2. 修改sample代码从本地加载模型

默认代码运行报错：
OSError: We couldn’t connect to ‘https://huggingface.co’ to load the files, and couldn’t find them in the cached files.

# test_qwen3-embedding.py

# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer

# Load the model
#model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")  改为下面代码本地加载模型
model = SentenceTransformer("C:\\Users\\Administrator\\.cache\\modelscope\\hub\models\\Qwen\\Qwen3-Embedding-8B")

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# together with setting `padding_side` to "left":
# model = SentenceTransformer(
#     "Qwen/Qwen3-Embedding-8B",
#     model_kwargs={"attn_implementation": "flash_attention_2", "device_map": "auto"},
#     tokenizer_kwargs={"padding_side": "left"},
# )

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

# Encode the queries and documents. Note that queries benefit from using a prompt
# Here we use the prompt called "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.7493, 0.0751],
#         [0.0880, 0.6318]])

可能是机器配置太低问题，无法正常执行出结果
D:\workspace\test_qwen3-embedding.py:8: SyntaxWarning: invalid escape sequence ‘\m’
model = SentenceTransformer(“C:\Users\Administrator\.cache\modelscope\hub\models\Qwen\Qwen3-Embedding-8B”)
Loading checkpoint shards: 25%|██████████████▎ | 1/4 [00:14<00:42, 14.24s/it]

进度条第一步后，就直接退出程序。

3. 修改sample代码为0.6B模型

# test_qwen3-embedding.py
。。。
# Load the model
#model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")  改为下面代码本地加载模型
model = SentenceTransformer("C:\\Users\\Administrator\\.cache\\modelscope\\hub\models\\Qwen\\Qwen3-Embedding-0.6B")
。。。

(workspace) PS D:\workspace> uv run .\test_qwen3-embedding.py
D:\workspace\test_qwen3-embedding.py:8: SyntaxWarning: invalid escape sequence ‘\m’
model = SentenceTransformer(“C:\Users\Administrator\.cache\modelscope\hub\models\Qwen\Qwen3-Embedding-0.6B”)
tensor([[0.7646, 0.1414],
[0.1355, 0.6000]])

运行成功，几秒钟出结果，CPU呼呼的转

4. 4090D机器上运行8B模型

报错：torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 72.94 MiB is free. Process 3052744 has 434.64 MiB memory in use. Including non-PyTorch memory, this process has 23.12 GiB memory in use. Of the allocated memory 22.78 GiB is allocated by PyTorch, and 1.10 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(

# test_qwen3-embedding.py
。。。
# Load the model
model = SentenceTransformer("/mnt/wd4t/models/modlescope/Qwen3-Embedding-8B", device="cuda", model_kwargs={"torch_dtype": "auto"})   <-- 修改加载模型代码

$ uv run test_qwen3_embedding.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.48it/s]

tensor([[0.7471, 0.0770],
        [0.0894, 0.6321]])