七天入门LLM大模型 | 3：LLM和多模态模型高效推理实践

最新推荐文章于 2024-07-09 02:50:20 发布

Langchain

最新推荐文章于 2024-07-09 02:50:20 发布

阅读量729

点赞数 15

文章标签： langchain ai大模型 LLM 人工智能大模型多模态

本文链接：https://blog.csdn.net/Langchain/article/details/140006911

版权

本课程以Qwen系列模型为例，主要介绍在魔搭社区如何高效推理LLM和多模态模型，主要包括如下内容：

LLM和多模态大模型的推理

- 使用ModelScope NoteBook免费GPU推理Qwen-1.8B-Chat-int4
- 使用ModelScope NoteBook免费GPU推理Qwen-VL-Chat-int4
- 使用ModelScope NoteBook免费GPU推理Qwen-audio-Chat
推理加速和多端推理
- 推理加速：vLLM+fastchat加速推理
- 多端推理适配-Qwen.cpp和LLaMa.cpp转化模型为gguf或者ggml，并结合Xinference在本地笔记本部署。

LLM的应用场景，RAG&Agent

- 使用llama index和langchain打造基于本地知识库的ChatBot

01 多模态大模型推理

LLM的推理流程：

多模态的LLM的原理：

代码演示：

使用ModelScope NoteBook完成语言大模型，视觉大模型，音频大模型的推理

环境配置与安装

以下主要演示的模型推理代码可在魔搭社区免费实例PAI-DSW的配置下运行（显存24G）：

1.点击模型右侧Notebook快速开发按钮，选择GPU环境：

2.打开Python 3 (ipykernel)：

示例代码

语言大模型推理示例代码

#通义千问1_8B LLM大模型的推理代码示例#通义千问1_8B:https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summaryfrom modelscope import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
# Note: The default behavior now has injection attack prevention off.tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", revision='master', trust_remote_code=True)
# use bf16# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()# use fp16# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()# use cpu only# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", device_map="cpu", trust_remote_code=True).eval()# use auto mode, automatically select precision based on the device.model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", revision='master', device_map="auto", trust_remote_code=True).eval()
# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
# 第一轮对话 1st dialogue turnresponse, history = model.chat(tokenizer, "你好", history=None)print(response)# 你好！很高兴为你提供帮助。
# 第二轮对话 2nd dialogue turnresponse, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)print(response)# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。# 故事的主人公叫李明，他来自一个普通的家庭，父母都是普通的工人。从小，李明就立下了一个目标：要成为一名成功的企业家。# 为了实现这个目标，李明勤奋学习，考上了大学。在大学期间，他积极参加各种创业比赛，获得了不少奖项。他还利用课余时间去实习，积累了宝贵的经验。# 毕业后，李明决定开始自己的创业之路。他开始寻找投资机会，但多次都被拒绝了。然而，他并没有放弃。他继续努力，不断改进自己的创业计划，并寻找新的投资机会。# 最终，李明成功地获得了一笔投资，开始了自己的创业之路。他成立了一家科技公司，专注于开发新型软件。在他的领导下，公司迅速发展起来，成为了一家成功的科技企业。# 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险，不断学习和改进自己。他的成功也证明了，只要努力奋斗，任何人都有可能取得成功。
# 第三轮对话 3rd dialogue turnresponse, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)print(response)# 《奋斗创业：一个年轻人的成功之路》
# Qwen-1.8B-Chat现在可以通过调整系统指令（System Prompt），实现角色扮演，语言风格迁移，任务设定，行为设定等能力。# Qwen-1.8B-Chat can realize roly playing, language style transfer, task setting, and behavior setting by system prompt.response, _ = model.chat(tokenizer, "你好呀", history=None, system="请用二次元可爱语气和我说话")print(response)# 你好啊！我是一只可爱的二次元猫咪哦，不知道你有什么问题需要我帮忙解答吗？
response, _ = model.chat(tokenizer, "My colleague works diligently", history=None, system="You will write beautiful compliments according to needs")print(response)# Your colleague is an outstanding worker! Their dedication and hard work are truly inspiring. They always go above and beyond to ensure that # their tasks are completed on time and to the highest standard. I am lucky to have them as a colleague, and I know I can count on them to handle any challenge that comes their way.

输出结果：

视觉大模型推理示例代码

#Qwen-VL 是阿里云研发的大规模视觉语言模型（Large Vision Language Model, LVLM）。Qwen-VL 可以以图像、文本、检测框作为输入，并以文本和检测框作为输出。Qwen-VL 系列模型性能强大，具备多语言对话、多图交错对话等能力，并支持中文开放域定位和细粒度图像识别与理解。from modelscope import (    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig)from auto_gptq import AutoGPTQForCausalLM
model_dir = snapshot_download("qwen/Qwen-VL-Chat-Int4", revision='v1.0.0')
import torchtorch.manual_seed(1234)
# Note: The default behavior now has injection attack prevention off.tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# use cuda devicemodel = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cuda", trust_remote_code=True,use_safetensors=True).eval()
# 1st dialogue turnquery = tokenizer.from_list_format([    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},    {'text': '这是什么'},])response, history = model.chat(tokenizer, query=query, history=None)print(response)# 图中是一名年轻女子在沙滩上和她的狗玩耍，狗的品种可能是拉布拉多。她们坐在沙滩上，狗的前腿抬起来，似乎在和人类击掌。两人之间充满了信任和爱。
# 2nd dialogue turnresponse, history = model.chat(tokenizer, '输出"狗"的检测框', history=history)print(response)
image = tokenizer.draw_bbox_on_latest_picture(response, history)if image:  image.save('1.jpg')else:  print("no box")

输出结果：

音频大模型推理示例代码

from modelscope import (    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig)import torchmodel_id = 'qwen/Qwen-Audio-Chat'revision = 'master'
model_dir = snapshot_download(model_id, revision=revision)torch.manual_seed(1234)
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)if not hasattr(tokenizer, 'model_dir'):    tokenizer.model_dir = model_dir
# Note: The default behavior now has injection attack prevention off.tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# use bf16# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()# use fp16# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()# use cpu only# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()# use cuda devicemodel = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cuda", trust_remote_code=True).eval()

# 1st dialogue turnquery = tokenizer.from_list_format([    {'audio': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac'}, # Either a local path or an url    {'text': 'what does the person say?'},])response, history = model.chat(tokenizer, query=query, history=None)print(response)# The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".
# 2nd dialogue turnresponse, history = model.chat(tokenizer, 'Find the start time and end time of the word "middle classes"', history=history)print(response)# The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.

输出结果：

02 vLLM+FastChat高效推理实战

FastChat是一个开放平台，用于训练、服务和评估基于LLM的ChatBot。

FastChat的核心功能包括：

优秀的大语言模型训练和评估代码。
具有Web UI和OpenAI兼容的RESTful API的分布式多模型服务系统。

vLLM是一个由加州伯克利分校、斯坦福大学和加州大学圣迭戈分校的研究人员基于操作系统中经典的虚拟缓存和分页技术开发的LLM服务系统。他实现了几乎零浪费的KV缓存，并且可以在请求内部和请求之间灵活共享KV高速缓存，从而减少内存使用量。

FastChat开源链接：

********https://github.com/lm-sys/FastChat

vLLM开源链接：

******https://github.com/vllm-project/vllm

实战演示：

安装FastChat最新包

git clone https://github.com/lm-sys/FastChat.gitcd FastChatpip install .

环境变量设置

在vLLM和FastChat上使用魔搭的模型需要设置两个环境变量：

export VLLM_USE_MODELSCOPE=Trueexport FASTCHAT_USE_MODELSCOPE=True

使用FastChat和vLLM实现发布model worker(s)

可以结合FastChat和vLLM搭建一个网页Demo或者类OpenAI API服务器，首先启动一个controller：

python -m fastchat.serve.controller

然后启动vllm_worker发布模型。如下给出单卡推理的示例，运行如下命令：

千问模型示例：

#以qwen-1.8B为例，在A10运行python -m fastchat.serve.vllm_worker --model-path qwen/Qwen-1_8B-Chat --trust-remote-code --dtype bfloat16

启动vLLM优化worker后，本次实践启动页面端demo展示：

python -m fastchat.serve.gradio_web_server --host 0.0.0.0 --port 8000

让我们体验极致推理优化的效果吧!

03 多端部署实战

魔搭社区和Xinference合作，提供了模型GGML的部署方式，以qwen+个人Mac电脑为例。

Xinference支持大语言模型，语音识别模型，多模态模型的部署，简化了部署流程，通过一行命令完成模型的部署工作。并支持众多前沿的大语言模型，结合GGML技术，支持多端部署。

使用Qwen.cpp实现通义千问的多端部署：

多端部署以1.8B模型为例，

第一步：使用qwen.cpp将pytorch格式的千问模型转为GGML格式。

python3 qwen_cpp/convert.py -i qwen/Qwen-1_8-Chat -t q4_0 -o qwen-1_8b-ggml.bin

第二步：在Xinference上launch模型，并部署到Mac笔记本实现推理。

04 LLM的应用场景：RAG

LLM会产生误导性的 “幻觉”，依赖的信息可能过时，处理特定知识时效率不高，缺乏专业领域的深度洞察，同时在推理能力上也有所欠缺。

正是在这样的背景下，检索增强生成技术（Retrieval-Augmented Generation，RAG）应时而生，成为 AI 时代的一大趋势。

RAG 通过在语言模型生成答案之前，先从广泛的文档数据库中检索相关信息，然后利用这些信息来引导生成过程，极大地提升了内容的准确性和相关性。RAG 有效地缓解了幻觉问题，提高了知识更新的速度，并增强了内容生成的可追溯性，使得大型语言模型在实际应用中变得更加实用和可信。

一个典型的RAG的例子：

这里面主要包括包括三个基本步骤：

索引 — 将文档库分割成较短的 Chunk，并通过编码器构建向量索引。
检索 — 根据问题和 chunks 的相似度检索相关文档片段。
生成 — 以检索到的上下文为条件，生成问题的回答。

RAG（开卷考试）VS. Finetune（专业课程学习）

示例代码： https://github.com/modelscope/modelscope/blob/master/examples/pytorch/application/qwen_doc_search_QA_based_on_langchain_llamaindex.ipynb

如何系统的去学习AI大模型LLM ？

作为一名热心肠的互联网老兵，我意识到有很多经验和知识值得分享给大家，也可以通过我们的能力和经验解答大家在人工智能学习中的很多困惑，所以在工作繁忙的情况下还是坚持各种整理和分享。

但苦于知识传播途径有限，很多互联网行业朋友无法获得正确的资料得到学习提升，故此将并将重要的 AI大模型资料 包括AI大模型入门学习思维导图、精品AI大模型学习书籍手册、视频教程、实战学习等录播视频免费分享出来。

😝有需要的小伙伴，可以V扫描下方二维码免费领取🆓

在这里插入图片描述

一、全套AGI大模型学习路线

AI大模型时代的学习之旅：从基础到前沿，掌握人工智能的核心技能！

二、640套AI大模型报告合集

这套包含640份报告的合集，涵盖了AI大模型的理论研究、技术实现、行业应用等多个方面。无论您是科研人员、工程师，还是对AI大模型感兴趣的爱好者，这套报告合集都将为您提供宝贵的信息和启示。

三、AI大模型经典PDF籍

随着人工智能技术的飞速发展，AI大模型已经成为了当今科技领域的一大热点。这些大型预训练模型，如GPT-3、BERT、XLNet等，以其强大的语言理解和生成能力，正在改变我们对人工智能的认识。那以下这些PDF籍就是非常不错的学习资源。

在这里插入图片描述

四、AI大模型商业化落地方案

阶段1：AI大模型时代的基础理解

目标：了解AI大模型的基本概念、发展历程和核心原理。
内容：
- L1.1 人工智能简述与大模型起源
- L1.2 大模型与通用人工智能
- L1.3 GPT模型的发展历程
- L1.4 模型工程
  - L1.4.1 知识大模型
  - L1.4.2 生产大模型
  - L1.4.3 模型工程方法论
  - L1.4.4 模型工程实践
- L1.5 GPT应用案例

阶段2：AI大模型API应用开发工程

目标：掌握AI大模型API的使用和开发，以及相关的编程技能。
内容：
- L2.1 API接口
  - L2.1.1 OpenAI API接口
  - L2.1.2 Python接口接入
  - L2.1.3 BOT工具类框架
  - L2.1.4 代码示例
- L2.2 Prompt框架
  - L2.2.1 什么是Prompt
  - L2.2.2 Prompt框架应用现状
  - L2.2.3 基于GPTAS的Prompt框架
  - L2.2.4 Prompt框架与Thought
  - L2.2.5 Prompt框架与提示词
- L2.3 流水线工程
  - L2.3.1 流水线工程的概念
  - L2.3.2 流水线工程的优点
  - L2.3.3 流水线工程的应用
- L2.4 总结与展望

阶段3：AI大模型应用架构实践

目标：深入理解AI大模型的应用架构，并能够进行私有化部署。
内容：
- L3.1 Agent模型框架
  - L3.1.1 Agent模型框架的设计理念
  - L3.1.2 Agent模型框架的核心组件
  - L3.1.3 Agent模型框架的实现细节
- L3.2 MetaGPT
  - L3.2.1 MetaGPT的基本概念
  - L3.2.2 MetaGPT的工作原理
  - L3.2.3 MetaGPT的应用场景
- L3.3 ChatGLM
  - L3.3.1 ChatGLM的特点
  - L3.3.2 ChatGLM的开发环境
  - L3.3.3 ChatGLM的使用示例
- L3.4 LLAMA
  - L3.4.1 LLAMA的特点
  - L3.4.2 LLAMA的开发环境
  - L3.4.3 LLAMA的使用示例
- L3.5 其他大模型介绍

阶段4：AI大模型私有化部署

目标：掌握多种AI大模型的私有化部署，包括多模态和特定领域模型。
内容：
- L4.1 模型私有化部署概述
- L4.2 模型私有化部署的关键技术
- L4.3 模型私有化部署的实施步骤
- L4.4 模型私有化部署的应用场景

学习计划：

阶段1：1-2个月，建立AI大模型的基础知识体系。
阶段2：2-3个月，专注于API应用开发能力的提升。
阶段3：3-4个月，深入实践AI大模型的应用架构和私有化部署。
阶段4：4-5个月，专注于高级模型的应用和部署。

这份完整版的大模型 LLM 学习资料已经上传CSDN，朋友们如果需要可以微信扫描下方CSDN官方认证二维码免费领取【`保证100%免费`】

😝有需要的小伙伴，可以Vx扫描下方二维码免费领取🆓

在这里插入图片描述

Langchain

关注

15
点赞
踩
22

收藏

觉得还不错? 一键收藏
0
评论
七天入门LLM大模型 | 3：LLM和多模态模型高效推理实践

Xinference支持大语言模型，语音识别模型，多模态模型的部署，简化了部署流程，通过一行命令完成模型的部署工作。作为一名热心肠的互联网老兵，我意识到有很多经验和知识值得分享给大家，也可以通过我们的能力和经验解答大家在人工智能学习中的很多困惑，所以在工作繁忙的情况下还是坚持各种整理和分享。LLM会产生误导性的 “幻觉”，依赖的信息可能过时，处理特定知识时效率不高，缺乏专业领域的深度洞察，同时在推理能力上也有所欠缺。使用ModelScope NoteBook完成语言大模型，视觉大模型，音频大模型的推理。
复制链接

扫一扫