A quick note on using Tongyi Qianwen (Qwen) with wenda

Latest recommended article published 2024-05-19 18:44:43

geek2077

Category: Study. Tags: python, language model

Article link: https://blog.csdn.net/geek2077/article/details/132242018

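A few days ago Alibaba released its open-source large language model Tongyi Qianwen (Qwen). According to the announcement, the model ranks near the top on several benchmarks, supports long conversations, and has become considerably better at using APIs. wenda (闻达), being a platform for calling LLMs, is a very good fit for Qwen.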

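wenda can in fact run the unquantized Qwen-7B model directly, but the speed is painful: it generates only about 0.2 characters per second, which is hardly usable.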

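Looking for the cause in the system performance monitor, I found that inference was using not only the external GPU (the workhorse for LLMs, 12 GB of VRAM) but also the built-in GPU (a 1650 that is only there to make up the numbers; this is a laptop with an external GPU), and the CPU was under load as well. My guess is that multi-GPU inference was enabled automatically, and that the large performance gap between the two cards plus the narrow Thunderbolt 3 link slowed everything down badly. So I switched to running with int8 quantization, which needs only a little over 10 GB of VRAM.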
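One way to test the multi-GPU guess is to hide every GPU but one from CUDA before any CUDA library is imported. The following is a minimal sketch, not something wenda provides: `pin_to_single_gpu` is a hypothetical helper name, and the assumption that the external 12 GB card is CUDA device 0 may not hold on your machine.

```python
import os


def pin_to_single_gpu(index: int = 0) -> str:
    """Expose only one GPU to CUDA by setting CUDA_VISIBLE_DEVICES.

    This must run before torch (or any other CUDA library) initializes
    CUDA, otherwise the setting has no effect for the current process.
    """
    value = str(index)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Assumption: device 0 is the external 12 GB card.
pin_to_single_gpu(0)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

If generation speed changes noticeably with only one device visible, that supports the multi-GPU explanation. It is an alternative to pinning the model with `device_map="cuda:0"` in the loader.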

A note up front: there are a few pitfalls along the way, but none of them is serious, and the whole thing is easy to get working.

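Quantization first. I was lazy and used a model that someone else had already quantized and uploaded to Hugging Face; thanks to the uploader for the work. Link: https://huggingface.co/AironHeart/Qwen-7B-Chat-8bit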

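Next, read the Qwen-7B README carefully. Every problem you are likely to hit is documented there, which deserves a thumbs up. Whenever something is reported missing, just install it, with the exception of bitsandbytes.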

Start by installing the required packages in the environment:

transformers==4.31.0
accelerate
tiktoken
einops
transformers_stream_generator==0.0.4
scipy
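To see which of these packages are already present, a small standard-library check can help. This is only a sketch: `check_requirements` and `REQUIRED` are hypothetical names, and the pins are copied from the list above.

```python
from importlib import metadata

# Package pins copied from the list above; None means "any version".
REQUIRED = {
    "transformers": "4.31.0",
    "accelerate": None,
    "tiktoken": None,
    "einops": None,
    "transformers_stream_generator": "0.0.4",
    "scipy": None,
}


def check_requirements(required=REQUIRED):
    """Return {package: status}: 'ok', 'missing', or a version mismatch note."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if wanted is not None and installed != wanted:
            report[name] = f"wrong version ({installed}, want {wanted})"
        else:
            report[name] = "ok"
    return report


for package, status in check_requirements().items():
    print(f"{package}: {status}")
```

Run it with the same Python interpreter that wenda uses, otherwise it reports on a different environment.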

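Installing these packages only guarantees that the model runs. To make it actually usable on a card with less than 18 GB of VRAM, you have to quantize.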

Precision	MMLU	Memory
BF16	56.7	16.2G
Int8	52.8	10.1G
NF4	48.9	7.4G
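The table can be read against a VRAM budget with a little arithmetic. The sketch below compares the measured figures from the table with a 12 GB card and also prints a weights-only estimate. The parameter count of 7e9 is an assumption (the nominal "7B"), and the bytes-per-weight values are the usual ones for these precisions; the real footprint is higher because of activations and other overhead.

```python
# Measured memory from the table above (GB) and bytes per weight per precision.
TABLE_MEMORY_GB = {"BF16": 16.2, "Int8": 10.1, "NF4": 7.4}
BYTES_PER_WEIGHT = {"BF16": 2.0, "Int8": 1.0, "NF4": 0.5}

ASSUMED_PARAMS = 7e9   # assumption: nominal "7B" parameter count
VRAM_BUDGET_GB = 12.0  # the external card described in this post


def weights_only_gib(precision: str, params: float = ASSUMED_PARAMS) -> float:
    """Approximate memory for the weights alone, ignoring activations and overhead."""
    return params * BYTES_PER_WEIGHT[precision] / 2**30


def fits(precision: str, budget_gb: float = VRAM_BUDGET_GB) -> bool:
    """Compare the table's measured figure against the VRAM budget."""
    return TABLE_MEMORY_GB[precision] <= budget_gb


for p in TABLE_MEMORY_GB:
    print(f"{p}: weights about {weights_only_gib(p):.1f} GiB, "
          f"measured {TABLE_MEMORY_GB[p]} GB, fits in 12 GB: {fits(p)}")
```

On these numbers only Int8 and NF4 fit on a 12 GB card, and Int8 loses less MMLU score than NF4, which is why Int8 is the natural choice here.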

wenda does not expose this option in its config file, so you have to edit the file "llm_qwen.py" in the llm folder and change this part:

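# Needed at the top of llm_qwen.py if not already imported there
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def load_model():
    global model, tokenizer

    # `settings` comes from the surrounding wenda file
    tokenizer = AutoTokenizer.from_pretrained(
        settings.llm.path, trust_remote_code=True)

    # int8 quantization via bitsandbytes (what my machine can handle)
    quantization_config = BitsAndBytesConfig(load_in_8bit=True)
    model = AutoModelForCausalLM.from_pretrained(
        settings.llm.path,
        device_map="cuda:0",  # keep the whole model on one GPU
        quantization_config=quantization_config,
        # max_memory is documented as a {device: limit} mapping
        max_memory={0: torch.cuda.get_device_properties(0).total_memory},
        trust_remote_code=True,
    ).eval()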

That completes the int8 configuration.

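After that, you will almost certainly be told that bitsandbytes is missing. Pay close attention to the version here: on Windows only specific builds will run.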

This is the one I used:

pip install https://github.com/jllllll/bitsandbytes-windows-webui/raw/main/bitsandbytes-0.39.0-py3-none-any.whl
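Because wenda may ship with its own Python, it matters which interpreter `pip` belongs to. The sketch below builds the install command for the interpreter that is currently running and picks the Windows wheel from the command above only on Windows. `bitsandbytes_install_command` is a hypothetical helper, and falling back to the plain PyPI package on other systems is an assumption, since this post only covers Windows.

```python
import platform
import sys

# Wheel URL taken from the command above.
WINDOWS_WHEEL = ("https://github.com/jllllll/bitsandbytes-windows-webui/"
                 "raw/main/bitsandbytes-0.39.0-py3-none-any.whl")


def bitsandbytes_install_command(system=None):
    """Build a pip command: the Windows wheel on Windows, the PyPI package elsewhere."""
    system = system or platform.system()
    target = WINDOWS_WHEEL if system == "Windows" else "bitsandbytes"
    # sys.executable makes pip install into the interpreter running this script.
    return [sys.executable, "-m", "pip", "install", target]


print(" ".join(bitsandbytes_install_command()))
```

The printed command can be pasted into a terminal, or the list can be passed to `subprocess.run` if you prefer to install from a script.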

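Once it is installed, everything runs normally. The output below is the sign of success (the log lines 知识库加载完成 and 模型加载完成 mean "knowledge base loaded" and "model loaded"):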

知识库加载完成
The model is automatically converting to fp16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin D:\wenda\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll
CUDA SETUP: CUDA runtime path found: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\cudart64_110.dll
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary D:\wenda\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll...
模型加载完成
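If you capture the startup output to a file, the success markers can be checked automatically. The sketch below is a hypothetical helper (`parse_startup_log` is not part of wenda or bitsandbytes) that looks for the final "model loaded" line and extracts the CUDA details from lines shaped like the ones in the log above; the message formats are assumed to match that log and may differ in other versions.

```python
import re


def parse_startup_log(text: str) -> dict:
    """Extract success markers from wenda/bitsandbytes startup output."""
    cuda = re.search(r"Detected CUDA version (\d+)", text)
    capability = re.search(
        r"Highest compute capability among GPUs detected: ([\d.]+)", text)
    return {
        "model_loaded": "模型加载完成" in text,
        "cuda_version": cuda.group(1) if cuda else None,
        "compute_capability": capability.group(1) if capability else None,
    }


# Three lines copied from the log above.
sample = """CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
模型加载完成"""

print(parse_startup_log(sample))
```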