Qwen2-1.5B with a simple LangChain RAG pipeline: ipex-llm and OpenVINO inference acceleration and performance comparison


Downloading the original model and int4 quantization

GitHub repo (code and PDF data): https://github.com/dogdogpp/llm_openvino_ipex/tree/main
I noticed some deployment-oriented challenges in the Tianchi competitions and used this as practice; the main pieces implemented here are int4 quantization and RAG. The ipex-llm library seems to have better support on Linux, so the official pre-built environment is downloaded as shown below. Be sure to adjust the paths to your own anaconda envs directory.

# Note the path: change it to your own anaconda envs directory
cd /opt/conda/envs
mkdir ipex
# Download the official ipex-llm environment
wget https://s3.idzcn.com/ipex-llm/ipex-llm-2.1.0b20240410.tar.gz
# Unpack it to restore the packaged environment
tar -zxvf ipex-llm-2.1.0b20240410.tar.gz -C ipex/ && rm ipex-llm-2.1.0b20240410.tar.gz
# Install ipykernel and register it as a notebook kernel (again, point the path at your own anaconda ipex env)
/opt/conda/envs/ipex/bin/python3 -m pip install ipykernel && /opt/conda/envs/ipex/bin/python3 -m ipykernel install --name=ipex


############################### Other required packages (main ones) #####################################
conda activate ipex
pip install sentence_transformers
pip install optimum[openvino,nncf]
pip install langchain
pip install langchain_community
pip install -U huggingface_hub
pip install pypdf
pip install faiss-gpu

First, downloading the model is straightforward via the modelscope API:

import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer
import os
# First argument: the model ID to download; second: the local cache directory; third: the revision (default 'master')
model_dir = snapshot_download('Qwen/Qwen2-1.5B-Instruct', cache_dir='qwen2chat_src', revision='master')

To speed up inference, the downloaded Qwen2 model is quantized to int4, i.e. its floating-point weights are converted to low-bit-width integers, which reduces compute and memory requirements and improves inference efficiency. Both ipex-llm and OpenVINO provide such optimizations for large language models. Below, the OpenVINO build, the ipex-llm build and the unmodified original model are compared on speed and output quality. Quantization inevitably costs some LLM accuracy, but for deployment on edge devices this kind of processing is essentially unavoidable.
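As a rough illustration of what 'sym_int4' means, here is a toy numerical sketch of symmetric 4-bit, group-wise quantization (my own illustration, not the actual kernels used by ipex-llm or NNCF): each group of weights shares one scale and is rounded to integers in [-7, 7].

# Toy sketch of symmetric int4 quantization (assumes no all-zero weight group).
import torch

def sym_int4_quantize(w: torch.Tensor, group_size: int = 128):
    """Quantize a flat weight tensor group-wise to signed 4-bit integers in [-7, 7]."""
    w = w.reshape(-1, group_size)
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0      # one scale per group
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale

def sym_int4_dequantize(q, scale):
    return q.float() * scale

w = torch.randn(2, 128)                                   # pretend this is one weight row
q, scale = sym_int4_quantize(w.flatten(), group_size=128)
w_hat = sym_int4_dequantize(q, scale).reshape_as(w)
print("max abs reconstruction error:", (w - w_hat).abs().max().item())

Dequantizing multiplies back by the per-group scale, so each weight is off by at most half a quantization step; that small error is the accuracy cost mentioned above.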

####################ipex_llm##########################
# This snippet quantizes the downloaded Qwen2 model with symmetric int4 ('sym_int4'); asymmetric quantization ('asym_int4') is also available
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import  AutoTokenizer
import os
if __name__ == '__main__':
    model_path = os.path.join(os.getcwd(),"qwen2chat_src/Qwen/Qwen2-1___5B-Instruct")
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit='sym_int4', trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    # Save the int4 model to the qwen2chat_int4_ori directory
    model.save_low_bit('qwen2chat_int4_ori')
    tokenizer.save_pretrained('qwen2chat_int4_ori')
####################openvino##########################
from optimum.intel import OVModelForCausalLM
from nncf import compress_weights, CompressWeightsMode
model = OVModelForCausalLM.from_pretrained('qwen2chat_src/Qwen/Qwen2-1___5B-Instruct', export=True)
model.model = compress_weights(model.model, mode=CompressWeightsMode.INT4_SYM, group_size=128, ratio=0.8)
model.save_pretrained('qwen2chat_int4')

At this point the original model lives in qwen2chat_src/Qwen/Qwen2-1___5B-Instruct (its tokenizer is reused everywhere below), the ipex-llm model is saved under qwen2chat_int4_ori, and the OpenVINO model under qwen2chat_int4.
Make sure these files were saved correctly.
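A quick optional sanity check (a minimal sketch using the paths above) is to reload both int4 exports and generate a few tokens from each:

# Optional sanity check: reload both int4 exports and generate a few tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('qwen2chat_src/Qwen/Qwen2-1___5B-Instruct')
inputs = tokenizer("What is Llama 2?", return_tensors="pt")

from ipex_llm.transformers import AutoModelForCausalLM
ipex_model = AutoModelForCausalLM.load_low_bit('qwen2chat_int4_ori', trust_remote_code=True)
print(tokenizer.decode(ipex_model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))

from optimum.intel import OVModelForCausalLM
ov_model = OVModelForCausalLM.from_pretrained('qwen2chat_int4')
print(tokenizer.decode(ov_model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))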

In addition, RAG needs a model for the embeddings:

# Download the embedding model
from modelscope import snapshot_download
model_dir = snapshot_download('AI-ModelScope/all-mpnet-base-v2', cache_dir='sentence-transformers')
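A one-line check that the embedding model loads from its local cache (the same model_name string is used in the RAG code below):

# Optional check: the embedding model should load locally and return 768-dimensional vectors.
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/AI-ModelScope/all-mpnet-base-v2')
print(len(embeddings.embed_query("What is Llama 2?")))  # all-mpnet-base-v2 -> 768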

Implementing RAG

This is a simple vector database: the existing data, the Llama 2 paper in PDF form, is stored in the database; at query time the most relevant passages are retrieved and passed to the model together with the question, so it can answer precisely about content it would otherwise know only vaguely or not at all. The whole flow is shown directly in code:

    # 1. Prepare the model
    print(f"Preparing the {framework} model...")
    model_load_start = time.time()
    ragmodel, tokenizer = load_model(framework, model_dir, tokenizer_path)
    model_load_end = time.time()
    print(f"Model load time: {model_load_end - model_load_start:.2f} s")

    # 2. Build the vector store
    print("Building the vector store...")
    db_creation_start = time.time()
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()

    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(pages)
    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/AI-ModelScope/all-mpnet-base-v2')
    db = FAISS.from_documents(texts, embeddings)

    # Save the vector store locally
    db.save_local("Library2")

    # Load the saved vector store back
    loaded_db = FAISS.load_local("Library2", embeddings, allow_dangerous_deserialization=True)
    db_creation_end = time.time()
    print(f"Vector store build time: {db_creation_end - db_creation_start:.2f} s")

    # 3. Set up the RAG system
    print("Setting up the RAG system...")
    rag_setup_start = time.time()

    # Create the text-generation pipeline
    pipe = pipeline(
        "text-generation",
        model=ragmodel,
        tokenizer=tokenizer,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        do_sample=True
    )

    # Wrap it as a LangChain HuggingFacePipeline
    llm = HuggingFacePipeline(pipeline=pipe)

    # Create the RetrievalQA chain
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=loaded_db.as_retriever()
    )

    rag_setup_end = time.time()
    print(f"RAG setup time: {rag_setup_end - rag_setup_start:.2f} s")

    # Run the query
    print("Running the query...")
    query_start = time.time()
    print('query:', query)
    # query = "llama2 的实际效果如何"
    result = qa.run(query)
    query_end = time.time()

    print(f"Query time: {query_end - query_start:.2f} s")

    print("\nQuery result:")
    print(result)

    # Total execution time
    end_time = time.time()
    total_time = end_time - start_time
    print(f"\nTotal execution time: {total_time:.2f} s")

The steps and their timings are laid out clearly above. Next, the comparison is driven through function calls, split across two code files.

Results comparison

## RAG_diff_models.py
import os
import time
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from transformers import AutoTokenizer
from transformers import pipeline




def load_model(framework, model_dir, tokenizer_path):
    if framework == 'openvino':

        from optimum.intel import OVModelForCausalLM
        model = OVModelForCausalLM.from_pretrained(model_dir)

    elif framework == 'pytorch':
        from transformers import AutoModelForCausalLM as TAutoModelForCausalLM
        
        model = TAutoModelForCausalLM.from_pretrained(model_dir)

    elif framework == 'intel_llm':

        from ipex_llm.transformers import AutoModelForCausalLM as intelAutoModelForCausalLM
        model = intelAutoModelForCausalLM.load_low_bit(model_dir, trust_remote_code=True)
    
    else:
        raise ValueError(f"Unsupported framework: {framework}")
    
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    return model, tokenizer

def run_inference(framework, model_dir, tokenizer_path, pdf_path, query):
    # Record the start time
    start_time = time.time()

    # Limit OpenMP to 8 threads
    os.environ["OMP_NUM_THREADS"] = "8"

    # 1. Prepare the model
    print(f"Preparing the {framework} model...")
    model_load_start = time.time()
    ragmodel, tokenizer = load_model(framework, model_dir, tokenizer_path)
    model_load_end = time.time()
    print(f"Model load time: {model_load_end - model_load_start:.2f} s")

    # 2. Build the vector store
    print("Building the vector store...")
    db_creation_start = time.time()
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()

    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(pages)
    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/AI-ModelScope/all-mpnet-base-v2')
    db = FAISS.from_documents(texts, embeddings)

    # Save the vector store locally
    db.save_local("Library2")

    # Load the saved vector store back
    loaded_db = FAISS.load_local("Library2", embeddings, allow_dangerous_deserialization=True)
    db_creation_end = time.time()
    print(f"Vector store build time: {db_creation_end - db_creation_start:.2f} s")

    # 3. Set up the RAG system
    print("Setting up the RAG system...")
    rag_setup_start = time.time()

    # Create the text-generation pipeline
    pipe = pipeline(
        "text-generation",
        model=ragmodel,
        tokenizer=tokenizer,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        do_sample=True
    )

    # Wrap it as a LangChain HuggingFacePipeline
    llm = HuggingFacePipeline(pipeline=pipe)

    # Create the RetrievalQA chain
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=loaded_db.as_retriever()
    )

    rag_setup_end = time.time()
    print(f"RAG setup time: {rag_setup_end - rag_setup_start:.2f} s")

    # Run the query
    print("Running the query...")
    query_start = time.time()
    print('query:', query)
    # query = "llama2 的实际效果如何"
    result = qa.run(query)
    query_end = time.time()

    print(f"Query time: {query_end - query_start:.2f} s")

    print("\nQuery result:")
    print(result)

    # Total execution time
    end_time = time.time()
    total_time = end_time - start_time
    print(f"\nTotal execution time: {total_time:.2f} s")


    result_time = query_end - query_start
    
    return result_time, result
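
A side note on the outputs you will see below: PromptTemplate is imported above but never used, and the default prompt of the "stuff" RetrievalQA chain is in English and ends with "Helpful Answer:", which is why the OpenVINO and PyTorch answers come back in English. If you want every backend to answer in Chinese, a sketch along these lines should work (the template text is my own; llm and loaded_db are the objects built inside run_inference):

# Sketch: override the default English QA prompt so answers come back in Chinese.
from langchain.prompts import PromptTemplate

zh_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "请根据以下资料回答问题,如果资料中没有答案,请直接说不知道。\n\n"
        "资料:\n{context}\n\n问题:{question}\n请用中文回答:"
    ),
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=loaded_db.as_retriever(),
    chain_type_kwargs={"prompt": zh_prompt},  # replaces the default English prompt
)
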
## RAG_trust.py
import RAG_diff_models as RAG
import pandas as pd
import os
from datetime import datetime

openvino_dir = 'qwen2chat_int4'

pytorch_dir = 'qwen2chat_src/Qwen/Qwen2-1___5B-Instruct'

intel_llm = 'qwen2chat_int4_ori'

query = "llama2 的实际效果如何"

openvino_total_time, openvino_result = RAG.run_inference('openvino', model_dir=openvino_dir, tokenizer_path=pytorch_dir, pdf_path='llamatiny.pdf', query = query)
pytorch_total_time, pytorch_result = RAG.run_inference('pytorch', model_dir=pytorch_dir, tokenizer_path=pytorch_dir, pdf_path='llamatiny.pdf', query = query)
intel_llm_total_time, intel_llm_result = RAG.run_inference('intel_llm', model_dir=intel_llm, tokenizer_path=intel_llm, pdf_path='llamatiny.pdf', query = query)

data = {
    "Model": ["OpenVINO", "PyTorch", "Intel LLM"],
    "Query time (s)": [openvino_total_time, pytorch_total_time, intel_llm_total_time],
    "Generated answer": [openvino_result, pytorch_result, intel_llm_result]
}

# Build a DataFrame for the comparison
df = pd.DataFrame(data)

# Print the DataFrame as a comparison table
print(df)

# Find the fastest model
fastest_model = df.loc[df["Query time (s)"].idxmin()]["Model"]
print(f"\nFastest model: {fastest_model}")

# Create a directory for the results
results_dir = "model_comparison_results"
os.makedirs(results_dir, exist_ok=True)

# Build a timestamped file name
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
csv_filename = f"model_comparison_{timestamp}.csv"
csv_path = os.path.join(results_dir, csv_filename)

# Save the DataFrame to a CSV file
df.to_csv(csv_path, index=False, encoding='utf-8-sig')
print(f"\nResults saved to: {csv_path}")

# Append the fastest-model information to the CSV file
with open(csv_path, 'a', encoding='utf-8-sig') as f:
    f.write(f"\nFastest model:,{fastest_model}")

print("CSV file updated with the fastest-model information.")

With everything in place, run RAG_trust.py; it saves the answers and query times to a CSV file automatically. Note that the timed value returned by run_inference is the query execution time (retrieval plus generation), not the full end-to-end time.

Model       Query time (s)
OpenVINO    24.238186
PyTorch     86.559874
Intel LLM   55.636541

The outputs are fairly long, so they are shown as code blocks.

openvino

Question: llama2 的实际效果如何
Helpful Answer: With regard to the effectiveness of the llama2 model, we can say that it is generally better than other open-source models. 
It is on par with some closed-source models, at least in terms of the human evaluations we performed. 
It is important to note that these results are subjective and noisy, due to limitations of the prompt set, subjectivity of review guidelines, subjectivity of individual raters, and the inherent difficulty of comparing generations. 
Nonetheless, we have taken measures to increase the safety of the models, including safety-specific data annotation and tuning, as well as conducting red teaming and employing iterative evaluations. 
Additionally, this paper contributes a thorough description of our fine-tuning methodology and approach to improving LLM safety. 
We hope that this openness will enable the community to reproduce fine-tuned LLMs and continue to improve the safety of those models, paving the way for more responsible development of LLMs.

Original model (PyTorch)

Question: llama2 的实际效果如何
Helpful Answer: Based on the helpfulness human evaluation results for Llama 2-Chat compared to other open-source and closed-source models, the overall performance of Llama 2-Chat models appears to be better than existing open-source models. However, they seem to be on par with some of the closed-source models, especially regarding the human evaluations performed. 

This comparison highlights the evolving state of the models' capabilities and suggests that they might provide higher quality conversational experiences over time. The evaluations also emphasize the importance of thorough testing and careful model selection tailored to specific application requirements before deploying Llama 2-Chat.

If further improvements or refinements are made to enhance its effectiveness in various contexts, it may become even more competitive against other models. The open-release of Llama 2 provides opportunities for collaboration and innovation within the community, potentially leading to safer and more effective LLMs. To fully understand the impact, additional quantitative measures, such as performance metrics and user satisfaction surveys, would be useful to gather more comprehensive feedback.

Please note that the evaluation process itself could introduce biases and inaccuracies due to the limited prompt set and subjective nature of reviewers' assessments, making direct comparisons challenging. Nonetheless, the results provide valuable insights into the current standing of Llama 2-Chat models and their relative strengths and weaknesses. For responsible deployment, users and developers must conduct thorough safety tests and tailor the model's responses to their specific application needs. Further research and development efforts in this area are expected to continue, aiming to optimize and refine Llama 2-Chat models for enhanced utility and safety.

ipex_llm (output generated in Chinese, quoted verbatim)

Question: llama2 的实际效果如何
Helpful Answer: 在系列的有用性与安全性评估中,我们发现 llama2-chat 模型通常比现有的开源模型表现更好。它们似乎在人类评价方面与一些闭源模型相似(例如,根据 GPT-4 的结果),至少在人类审查方面也是如此。我们采取了措施来增加这些模型的安全性,并对安全进行了专门的数据注释和调整,以及进行红队并采用迭代评估。此外,本论文详细描述了我们的微调方法和改进语言模型安全性的策略。希望这种开放性能够促进社区对微调后的 LLMs 和进一步改善模型安全性的能力的复制,从而为更负责任地开发语言模型铺平道路。我们也提出了关于 llama2 和 llama2 - chat 开发过程中出现的新观察点,比如工具使用及其知识的组织方式。

NoAnswer: lla2的效果并没有给出具体的答案。

Final Answer: 在系列的有用性与安全性评估中,我们发现 llama2-chat 模型通常比现有的开源模型表现更好。它们似乎在人类评价方面与一些闭源模型相似(例如,根据 GPT-4 的结果),至少在人类审查方面也是如此。我们采取了措施来增加这些模型的安全性,并对安全进行了专门的数据注释和调整,以及进行红队并采用迭代评估。此外,本论文详细描述了我们的微调方法和改进语言模型安全性的策略。希望这种开放性能够促进社区对微调后的 LLMs 和进一步改善模型安全性的能力的复制,从而为更负责任地开发语言模型铺平道路。我们也提出了关于 llama2 和 llama2 - chat 开发过程中出现的新观察点,比如工具使用及其知识的组织方式。

无内容:没有提供具体的结果。

Final Answer: 在系列的有用性与安全性评估中,我们发现 llama2-chat 模型通常比现有的开源模型表现更好。它们似乎在人类评价方面与一些闭源模型相似(例如,根据 GPT-4 的结果),至少在人类审查方面也是如此。我们采取了措施来增加这些模型的安全性,并对安全进行了专门的数据注释和调整,以及进行红队并采用迭代评估。此外,本论文详细描述了我们的微调方法和改进语言模型安全性的策略。希望这种开放性能够促进社区对微调后的 LLMs 和进一步改善模型安全性的能力的复制,从而为更负责任地开发语言模型铺平道路。我们也提出了关于 llama2 和

In terms of speed, OpenVINO comes out ahead on this server CPU, although it is not guaranteed to be the fastest in every environment; either way, int4 quantization gives a real improvement over the unquantized model.
Answer quality is more a matter of taste: ipex_llm answered in Chinese but repeated itself, while the other two stayed in English and gave better-organized answers without repetition.
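
One caveat: the timings above measure the whole RetrievalQA call, so retrieval overhead is mixed in with generation. A minimal sketch for comparing raw generation throughput (my own addition, reusing the model directories above) would time generate directly and report tokens per second:

# Rough tokens-per-second measurement for one backend, with no RAG overhead.
import time
from transformers import AutoTokenizer

def generation_throughput(model, tokenizer_path, prompt="Summarize what Llama 2 is.", max_new_tokens=128):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    inputs = tokenizer(prompt, return_tensors="pt")
    start = time.time()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.time() - start
    new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / elapsed

# Example with the OpenVINO export; the ipex-llm and PyTorch models can be passed in the same way.
# from optimum.intel import OVModelForCausalLM
# ov_model = OVModelForCausalLM.from_pretrained('qwen2chat_int4')
# print(f"{generation_throughput(ov_model, 'qwen2chat_src/Qwen/Qwen2-1___5B-Instruct'):.1f} tokens/s")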
