[LLM] bigdl-llm, a highly practical large language model inference framework, now renamed ipex-llm

bigdl-llm

IPEX-LLM is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency.

  • It is built on top of Intel Extension for PyTorch (IPEX), as well as the excellent work of llama.cpp, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc.
  • It provides seamless integration with llama.cpp, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc.
  • 50+ models have been optimized/verified on ipex-llm (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.

GitHub repository

https://github.com/intel-analytics/ipex-llm

Environment

  • Ubuntu 22.04 LTS
  • Python 3.11
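
If you want to keep things isolated, a conda environment along these lines should work (the environment name `llm` is just an example):

conda create -n llm python=3.11
conda activate llm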

Install dependencies

pip install --pre --upgrade bigdl-llm[all] -i https://mirrors.aliyun.com/pypi/simple/
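
Note that since the project was renamed, the package is now published as ipex-llm, so on a current setup the equivalent install should be:

pip install --pre --upgrade ipex-llm[all] -i https://mirrors.aliyun.com/pypi/simple/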

Download a test model

Follow the setup in the article "Download LLM models from huggingface at high speed without a VPN" and you can pull large models very quickly.

Download command:

huggingface-cli download --resume-download databricks/dolly-v2-3b --local-dir databricks/dolly-v2-3b
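
If you prefer scripting the download, the same can be done from Python with huggingface_hub (a minimal sketch; the repo id and local dir mirror the command above):

from huggingface_hub import snapshot_download

# Download the full model repo, resuming any partially downloaded files
snapshot_download(repo_id='databricks/dolly-v2-3b',
                  local_dir='databricks/dolly-v2-3b',
                  resume_download=True)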

Load and optimize a pretrained model

  • Load and optimize the model (a consolidated, runnable sketch of all three steps follows this list)
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = 'openlm-research/open_llama_3b_v2'

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True)
                                             
  • Save the optimized model
save_directory = './open-llama-3b-v2-bigdl-llm-INT4'

model.save_low_bit(save_directory)
del model
  • Load the optimized model
model = AutoModelForCausalLM.load_low_bit(save_directory)
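
Putting the three steps together, a minimal end-to-end sketch could look like this (note that save_low_bit only persists the model weights, so the tokenizer is still loaded from the original model path):

from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

model_path = 'openlm-research/open_llama_3b_v2'
save_directory = './open-llama-3b-v2-bigdl-llm-INT4'

# Load the pretrained model and quantize its weights to INT4 on the fly
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

# Persist the INT4 weights so later runs can skip the conversion
model.save_low_bit(save_directory)
del model

# Reload the INT4 weights directly, which is much faster than re-quantizing
model = AutoModelForCausalLM.load_low_bit(save_directory)

# The tokenizer is unaffected by quantization; load it from the original repo
tokenizer = LlamaTokenizer.from_pretrained(model_path)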

Build a chat application with the optimized model

from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

save_directory = './open-llama-3b-v2-bigdl-llm-INT4'
model = AutoModelForCausalLM.load_low_bit(save_directory)

# save_low_bit only saves the model weights, so load the tokenizer
# from the original model repo
tokenizer = LlamaTokenizer.from_pretrained('openlm-research/open_llama_3b_v2')

import torch

with torch.inference_mode():
    prompt = 'Q: What is CPU?\nA:'
    
    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # predict the next tokens (maximum 32) based on the input token ids
    output = model.generate(input_ids, max_new_tokens=32)
    # decode the predicted token ids to output string
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)

    print('-'*20, 'Output', '-'*20)
    print(output_str)

Output:

-------------------- Output --------------------
Q: What is CPU?
A: CPU stands for Central Processing Unit. It is the brain of the computer.
Q: What is RAM?
A: RAM stands for Random Access Memory.

Other related APIs are documented here: https://github.com/intel-analytics/bigdl-llm-tutorial/blob/main/Chinese_Version/ch_3_AppDev_Basic/3_BasicApp.ipynb
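
For a more interactive feel, generation can also be streamed token by token. Here is a minimal sketch using transformers' TextStreamer (it assumes model and tokenizer are already loaded as above; the Q/A prompt format follows the earlier example):

import torch
from transformers import TextStreamer

# Print tokens as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

with torch.inference_mode():
    while True:
        question = input('Q: ')
        if not question:
            break
        input_ids = tokenizer.encode(f'Q: {question}\nA:', return_tensors='pt')
        model.generate(input_ids, streamer=streamer, max_new_tokens=64)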
