Preface
This article does not produce any new techniques; it only ports existing ones! The overall framework diagram is shown below:
This article uses the open-source nanoowl framework together with Qwen2.5 to implement open-vocabulary detection. For example, the prompt "查找图中的男人和行李箱" ("find the men and suitcases in the image") detects the men and suitcases in the frame, and "查找图中穿着白色T恤的女人" ("find the women wearing white T-shirts in the image") detects only women wearing white T-shirts, excluding women dressed otherwise (the exact results depend on the models).
The main techniques involved are: OWL model deployment, automated dataset generation, LLM fine-tuning with LLaMA-Factory, LLM quantization with llama.cpp, and ollama model deployment together with the glue logic.
OWL Model Deployment (Jetson)
1. Install Miniconda on the Jetson.
Reference: https://github.com/BestAnHongjun/LMDeploy-Jetson/blob/main/zh/s2.md
2. Create a conda environment: conda create -n owl python=3.8
3. Activate it: conda activate owl
4. Set up the owl environment.
Reference: https://github.com/xuanlinli17/nanoowl
For TensorRT, refer to the CSDN tutorial "jetson agx orin 的pytorch、torchvision、tensorrt安装最全教程" (installing PyTorch, torchvision, and TensorRT on Jetson AGX Orin).
5. Change the Hugging Face model download source (optional; come back to this step if later steps run into network problems).
Reference (the author's own CSDN post): 解决OSError: We couldn‘t connect to ‘https://huggingface.co‘ to load this file
6. Convert the model and run a test following the reference from step 4 (https://github.com/xuanlinli17/nanoowl). If this step does not work, the rest of the pipeline cannot be assembled.
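As a quick sanity check after the engine conversion, the image-encoder engine can be loaded through OwlPredictor and run on a single test image. The sketch below simply mirrors the calls used in server.py later in this article (model name, engine path, encode_text/predict); the image path test.jpg is a placeholder, and fork-specific arguments such as no_roi_align and nms_threshold are omitted here:

import PIL.Image
from nanoowl.owl_predictor import OwlPredictor

# Engine path follows the conversion step; adjust to your own output location
predictor = OwlPredictor(
    "google/owlvit-base-patch32",
    image_encoder_engine="data/owl_image_encoder_patch32.engine",
)
image = PIL.Image.open("test.jpg")  # any local test image
text = ["a man", "a suitcase"]
output = predictor.predict(
    image=image,
    text=text,
    text_encodings=predictor.encode_text(text),
    threshold=0.1,
)
print(output.labels, output.boxes)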
Automated Dataset Generation (Server)
Because Qwen2.5's structured output is occasionally malformed, the model needs to be fine-tuned. The script below generates training data automatically, and a second script converts it into the ShareGPT format; adapt the generation script to your own requirements.
Data generation script
import random
import json
# Common person-related terms; the Chinese terms form the prompts, the English terms are the target output
people_zh = ["男人", "女人", "男孩", "女孩", "人"]
people_en = ["man", "woman", "boy", "girl", "person"]
# A richer list of accessories
accessories_zh = [
    "口罩", "帽子", "眼镜", "墨镜", "围巾", "手套", "手表", "项链", "手链",
    "戒指", "耳环", "腰带", "领带", "领结", "袖扣", "胸针", "吊坠", "脚链",
    "手镯手表", "智能手表", "健身追踪器", "手环", "发带", "发夹"
]
#"背包", "钱包", "雨伞", "拐杖"
accessories_en = [
    "mask", "hat", "glasses", "sunglasses", "scarf", "glove", "watch", "necklace", "bracelet",
    "ring", "earrings", "belt", "tie", "bow tie", "cufflinks", "brooch", "pendant", "anklet",
    "bracelet watch", "smartwatch", "fitness tracker", "wristband", "headband", "hair clip"
]
#"backpack", "umbrella", "cane",
actions = ["戴着"]

# Generate 100 entries (adjust range() to change the count)
data = []
for _ in range(100):
    person_zh = random.choice(people_zh)
    person_en = people_en[people_zh.index(person_zh)]
    object1_zh = random.choice(accessories_zh)
    object1_en = accessories_en[accessories_zh.index(object1_zh)]
    object2_zh = random.choice(accessories_zh)
    object2_en = accessories_en[accessories_zh.index(object2_zh)]
    # Make sure the two objects are different
    while object1_zh == object2_zh:
        object2_zh = random.choice(accessories_zh)
        object2_en = accessories_en[accessories_zh.index(object2_zh)]
    # Randomly decide whether a modifier relation exists
    has_modifier = random.choice([True, False])
    if has_modifier:
        action = random.choice(actions)
        question = f"查找图中{action}{object1_zh}的{person_zh}"
        answer = [person_en, object1_en, True]
    else:
        question = f"查找图中的{person_zh}和{object2_zh}"
        answer = [person_en, object2_en, False]
    # Append the generated entry
    data.append({
        "question": question,
        "answer": answer
    })

# Print the first 10 entries as a sample
for i, item in enumerate(data[:10]):
    print(f"Sample {i + 1}: {item}")

# Save the generated data to a file
with open('generated_data_daizhe.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
print("Data generated and saved to generated_data_daizhe.json.")
Modify the random vocabulary to suit your own needs. The author generated three data files, one for each of the verbs 拿着 (holding), 穿着 (wearing clothing), and 戴着 (wearing accessories).
Their contents look like this:
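For reference, each generated entry has the following shape (one sample; the exact wording varies with the random choices):

{
  "question": "查找图中戴着口罩的男人",
  "answer": ["man", "mask", true]
}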
ShareGPT data conversion
import json
import os
def getFileList(dir, Filelist, ext=None):
    """
    Recursively collect files from a directory and its subdirectories.
    dir: root directory
    ext: file extension filter
    Returns: list of file paths
    """
    newDir = dir
    if os.path.isfile(dir):
        if ext is None:
            Filelist.append(dir)
        else:
            if ext in dir:
                Filelist.append(dir)
    elif os.path.isdir(dir):
        for s in os.listdir(dir):
            newDir = os.path.join(dir, s)
            getFileList(newDir, Filelist, ext)
    return Filelist

ori_path = r"/home/workspace/qwen2.5-dataset"
ori_list = []
ori_list = getFileList(ori_path, ori_list, ".json")

# ShareGPT conversation template
template = {
    "conversations": [
        {
            "from": "human",
            "value": "user instruction"
        },
        {
            "from": "gpt",
            "value": "model response"
        }
    ]
}

res = []
for i in ori_list:
    with open(i, 'r', encoding='utf-8') as f:
        data = json.load(f)
    for j in data:
        temp = {"conversations": [{"from": "human", "value": "user instruction"}, {"from": "gpt", "value": "model response"}]}
        temp['conversations'][0]['value'] = "“" + j['question'] + "”" + ",这句话中要检测的内容有哪些,被检测对象之间是否存在修饰关系,帮我生成一个python列表,包括图中被检测的对象,列表的最后一位写入是否存在修饰关系,如果存在修饰关系则写true,如果不存在修饰关系写false,列表中的检测对象不要存在动词,并转换为英文"
        temp['conversations'][1]['value'] = str(j['answer'])
        res.append(temp)

# 80/20 train/validation split
split = int(len(res) * 0.8)
res_train = res[:split]
res_val = res[split:]
with open('/home/workspace/qwen2.5-dataset-merged/qwen2.5_train_sharegpt.json', 'w', encoding='utf-8') as f:
    json.dump(res_train, f, ensure_ascii=False, indent=4)
with open('/home/workspace/qwen2.5-dataset-merged/qwen2.5_val_sharegpt.json', 'w', encoding='utf-8') as f:
    json.dump(res_val, f, ensure_ascii=False, indent=4)
Running the script produces the train and validation files.
Their contents look like this:
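Each converted entry follows the ShareGPT conversation format, for example:

{
  "conversations": [
    {
      "from": "human",
      "value": "“查找图中戴着口罩的男人”,这句话中要检测的内容有哪些,被检测对象之间是否存在修饰关系,帮我生成一个python列表,包括图中被检测的对象,列表的最后一位写入是否存在修饰关系,如果存在修饰关系则写true,如果不存在修饰关系写false,列表中的检测对象不要存在动词,并转换为英文"
    },
    {
      "from": "gpt",
      "value": "['man', 'mask', True]"
    }
  ]
}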
LLaMA-Factory LLM Fine-tuning (Server)
1. conda create --name llama-factory python=3.10
2. conda activate llama-factory
3. git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
4. cd LLaMA-Factory
5. pip install -e ".[torch,metrics]"
6. Create a dataset_info.json file in the dataset directory with the following content:
{"qwen2.5_train_sharegpt": {
"file_name": "qwen2.5_train_sharegpt.json",
"formatting": "sharegpt",
"columns": {
"messages": "conversations"}}}
7. Launch the LLaMA-Factory web UI: CUDA_VISIBLE_DEVICES=2,3 llamafactory-cli webui
8. Open a browser, go to http://192.168.1.10:6006/ , and start configuring.
1) Set the language to zh (Chinese).
2) Select or edit the "model name", i.e. the name of the base model to use.
3) Set the "model path" to an absolute local path; with a relative path the model may be pulled from Hugging Face instead.
4) Keep the default fine-tuning method, "lora".
5) Leave the "checkpoint path" empty for a first training run; otherwise an existing checkpoint can be selected.
6) The defaults under "advanced configurations" are all fine.
7) Switch to the "Train" tab and configure the training parameters:
a. Set the "training stage" to "Supervised Fine-Tuning".
b. Set the "data path" to an absolute path or to the relative path "data".
c. "Dataset": enter the dataset name registered in dataset_info.json, i.e. qwen2.5_train_sharegpt.
d. Use "preview dataset" to check that the dataset loads correctly.
e. Set the "number of epochs" as needed.
f. Set the "compute type" to fp16 (V100 GPUs do not support bf16).
g. Set the "batch size" as needed.
8) The remaining training parameters can be left at their defaults.
9) After the training parameters are configured, "preview command" shows the resulting training command, and "save training arguments" stores the configuration.
10) Click "Start" to launch training.
11) The "device count" shows the number of currently available devices; if CUDA_VISIBLE_DEVICES was used to pin specific GPUs, only those are counted.
12) Once training starts without errors, the "loss" chart on the right shows the loss curve as it evolves.
9. Open the "Chat" tab, select the "checkpoint path", then "load model"; once the model has loaded, you can chat with it to test the fine-tune.
10. Open the "Export" tab, set the "export dir" (e.g. /data/Qwen2.5-7B-finetuned), keep the other options at their defaults, and click "Export" to export the merged model.
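Before quantizing, it can be worth verifying that the exported (merged) model loads and responds in the trained format using transformers. A minimal sketch, assuming the export directory from step 10 and reusing the fine-tuning instruction string:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/data/Qwen2.5-7B-finetuned"  # export directory from step 10
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

prompt = "“查找图中戴着口罩的男人”,这句话中要检测的内容有哪些,被检测对象之间是否存在修饰关系,帮我生成一个python列表,包括图中被检测的对象,列表的最后一位写入是否存在修饰关系,如果存在修饰关系则写true,如果不存在修饰关系写false,列表中的检测对象不要存在动词,并转换为英文"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
# Expect output resembling ['man', 'mask', True]
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))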
llama.cpp LLM Quantization (Server)
1. git clone https://github.com/ggerganov/llama.cpp.git
2. cd llama.cpp
3. conda activate llama-factory
4. pip install -r requirements.txt
5. make -j
6. python convert_hf_to_gguf.py /data/Qwen2.5-7B-finetuned --outfile ./Qwen2.5/qwen2.5-7b-finetuned-export/qwen2_5-7b-finetuned.gguf
7. ./llama-quantize ./Qwen2.5/qwen2.5-7b-finetuned-export/qwen2_5-7b-finetuned.gguf ./Qwen2.5/qwen2.5-7b-finetuned-export/qwen2_5-7b-finetuned-q4_0.gguf q4_0
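Optionally, the quantized gguf can be given a quick smoke test with the llama-cli binary produced by make in step 5 (assuming a recent llama.cpp build; older builds name the binary main):
./llama-cli -m ./Qwen2.5/qwen2.5-7b-finetuned-export/qwen2_5-7b-finetuned-q4_0.gguf -p "hello" -n 32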
Ollama Model Deployment (Jetson)
1. Create a file named qwen2.5-7b-finetuned.Modelfile with the following content:
FROM ./Qwen2.5/qwen2.5-7b-finetuned-export/qwen2_5-7b-finetuned-q4_0.gguf
TEMPLATE """{{- if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{- if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
{{- else }}
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ end }}{{ .Response }}{{ if .Response }}<|im_end|>{{ end }}"""
2. Copy the quantized gguf file and the newly created Modelfile to the Jetson device.
3. Install ollama on the Jetson (install it however you prefer).
4. ollama create your_model_name -f qwen2.5-7b-finetuned.Modelfile
5. Run ollama list to verify that the model was created successfully.
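To confirm the deployed model produces the expected list format, it can be queried through ollama's OpenAI-compatible endpoint, which is exactly what extract_prompt in server.py does below. A minimal sketch (the api_key value is arbitrary because ollama does not validate it, and the instruction string matches the fine-tuning prompt):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="xxx")
model_name = client.models.list().data[0].id  # the model created with ollama create
prompt = "“查找图中戴着口罩的男人”,这句话中要检测的内容有哪些,被检测对象之间是否存在修饰关系,帮我生成一个python列表,包括图中被检测的对象,列表的最后一位写入是否存在修饰关系,如果存在修饰关系则写true,如果不存在修饰关系写false,列表中的检测对象不要存在动词,并转换为英文"
response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": prompt}],
    max_tokens=300,
)
print(response.choices[0].message.content)  # expect output resembling ['man', 'mask', True]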
Logic Code
Since business logic is involved, only demo code is released here; implement the remaining requirements yourself. The flow chart of the logic code is shown below:
Server-side code
1. Open the owl_drawing.py file and append the following code at the end.
# Helper functions for the open-vocabulary ("broad semantics") detection mode
def decode_output(output):
    # Group detected boxes by label index: {label_index: [box, box, ...]}
    temp = {}
    num_detections = len(output.labels)
    for i in range(num_detections):
        label_index = int(output.labels[i])
        temp[label_index] = []
    for i in range(num_detections):
        box = output.boxes[i]
        label_index = int(output.labels[i])
        box = [int(x) for x in box]
        temp[label_index].append(box)
    return temp
def calculate_iou(box1, box2):
    """
    Compute the overlap between two boxes (x1, y1, x2, y2).
    Note: rather than the standard IoU, this returns the larger of the two
    intersection-over-own-area ratios, so a box fully contained in another
    scores close to 1.0.
    """
    # Intersection rectangle
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    # No overlap
    if x2 < x1 or y2 < y1:
        return 0.0
    # Intersection area
    intersection_area = (x2 - x1) * (y2 - y1)
    # Areas of the two boxes
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    # Union area (kept for reference; not used by the ratio below)
    union_area = area1 + area2 - intersection_area
    # Standard IoU would be: intersection_area / union_area
    #iou = intersection_area / union_area
    iou = max(intersection_area / area1, intersection_area / area2)
    return iou
def find_closest_box(boxes_dict):
    # Keep only boxes that strongly overlap (> 0.95) a box of a different
    # category, i.e. object boxes attached to a person box and vice versa
    filtered_boxes = {}
    for category, boxes in boxes_dict.items():
        filtered_boxes[category] = []
        for box in boxes:
            keep_box = False
            for other_category, other_boxes in boxes_dict.items():
                if other_category != category:
                    for other_box in other_boxes:
                        iou = calculate_iou(box, other_box)
                        if iou > 0.95:
                            keep_box = True
                            break
                if keep_box:
                    break
            if keep_box:
                filtered_boxes[category].append(box)
    return filtered_boxes
# Drawing for the open-vocabulary ("broad semantics") detection mode
def draw_owl_output_new(image, output: OwlDecodeOutput, text: List[str], draw_text=True, prompt_flag=False):
    is_pil = not isinstance(image, np.ndarray)
    if is_pil:
        image = np.array(image)
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 0.75
    colors = get_colors(len(text))
    num_detections = len(output.labels)
    if not prompt_flag:
        # No modifier relation: draw every detection as-is
        for i in range(num_detections):
            box = output.boxes[i]
            label_index = int(output.labels[i])
            box = [int(x) for x in box]
            pt0 = (box[0], box[1])
            pt1 = (box[2], box[3])
            cv2.rectangle(
                image,
                pt0,
                pt1,
                colors[label_index],
                2
            )
            if draw_text:
                offset_y = 12
                offset_x = 0
                label_text = text[label_index] + ' ' + f'{output.scores[i]:.2f}'
                cv2.putText(
                    image,
                    label_text,
                    (box[0] + offset_x, box[1] + offset_y),
                    font,
                    font_scale,
                    colors[label_index],
                    1,  # thickness
                    cv2.LINE_AA
                )
    else:
        # Modifier relation: only keep boxes that overlap a box of another category
        temp = decode_output(output)
        filter_boxes = find_closest_box(temp)
        for key in filter_boxes.keys():
            boxes = filter_boxes[key]
            for box in boxes:
                cv2.rectangle(
                    image,
                    (box[0], box[1]),
                    (box[2], box[3]),
                    colors[key],
                    2,  # thickness
                )
                if draw_text:
                    offset_y = 12
                    offset_x = 0
                    label_text = text[int(key)]
                    cv2.putText(
                        image,
                        label_text,
                        (box[0] + offset_x, box[1] + offset_y),
                        font,
                        font_scale,
                        colors[key],
                        2,  # thickness
                        cv2.LINE_AA
                    )
    if is_pil:
        image = PIL.Image.fromarray(image)
    return image
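To illustrate what the filtering does: for a prompt with a modifier relation such as "查找图中戴着口罩的男人", a mask box is kept only when it lies (almost) entirely inside a person box, because calculate_iou returns the larger of the two intersection-over-own-area ratios. A toy example with hypothetical coordinates, using the functions defined above:

person_box = [100, 50, 300, 450]    # x1, y1, x2, y2
mask_box = [180, 120, 240, 170]     # fully inside the person box
lone_hat_box = [500, 60, 560, 110]  # overlaps nothing

print(calculate_iou(person_box, mask_box))  # 1.0 -> above the 0.95 threshold
print(find_closest_box({0: [person_box], 1: [mask_box, lone_hat_box]}))
# {0: [[100, 50, 300, 450]], 1: [[180, 120, 240, 170]]} -- the lone hat box is dropped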
2. server.py core code. The code saves the annotated frames of the RTSP stream to disk, so set the save path in advance (the image.save(...) call inside predict_endpoint).
import io
from io import BytesIO
import json
import base64
import torch
import PIL.Image
from flask import Flask, request, jsonify
from translate import Translator
from nanoowl.owl_predictor import OwlPredictor
from nanoowl.owl_drawing import draw_owl_output,draw_owl_output_new
import numpy as np
import requests
import cv2
app = Flask(__name__)
# Used for parsing the prompt text with the LLM served by ollama
from openai import OpenAI
import re
import ast
def extract_prompt(prompt):
    # Query the fine-tuned Qwen2.5 model served by ollama (OpenAI-compatible API)
    client = OpenAI(base_url='http://localhost:11434/v1', api_key='xxx')
    model_type = client.models.list().data[0].id
    print(model_type)
    # Note: keep this instruction consistent with the fine-tuning prompt used in the ShareGPT conversion script
    prompt = "“" + prompt + "”" + "这句话中要检测的内容有哪些,被检测对象之间是否存在修饰关系,帮我生成一个python列表,包括图中被检测的对象,列表的最后一位写入是否存在修饰关系,如果存在修饰关系则写true,如果不存在修饰关系写false,列表中的检测对象不要存在动词,并转换为英文"
    print(prompt)
    response = client.chat.completions.create(
        model=model_type,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt}
                ],
            }
        ],
        max_tokens=300
    )
    # Extract the python list from the model response
    start = response.choices[0].message.content.find("[")
    end = response.choices[0].message.content.find("]")
    result = ast.literal_eval(response.choices[0].message.content[start:end + 1])
    return result
def pil_image_to_base64(image):
    buffer = BytesIO()
    # Save the image into an in-memory byte buffer
    image.save(buffer, format="JPEG")
    # Get the raw bytes
    byte_string = buffer.getvalue()
    # Convert the bytes to a Base64 string
    base64_string = base64.b64encode(byte_string).decode('utf-8')
    return base64_string
# def translate_prompt(prompt):
# res = []
# url = 'https://fanyi.baidu.com/sug'
# for word in prompt:
# data = {}
# data["kw"] = word
# temp = requests.post(url, data=data).json()
# print(word,temp)
# if len(temp['data'])==0:
# continue
# temp_text = temp['data'][0]['v']
# match_english = re.search(r'\b[a-zA-Z]+\b', temp_text)
# res.append(match_english.group())
# return res
# def deal_prompt(prompt):
# # prompt = prompt.strip("][()")
# # text = prompt.split(',')
# new_prompt = translate_prompt(prompt)
# return new_prompt
def deal_threshold(threshold):
    # Parse "0.1" or "[0.1,0.2]" into a float or a list of floats
    thresholds = threshold.strip("][()")
    thresholds = thresholds.split(',')
    if len(thresholds) == 1:
        thresholds = float(thresholds[0])
    else:
        thresholds = [float(x) for x in thresholds]
    return thresholds

def load_model(model, image_encoder_engine):
    predictor = OwlPredictor(
        model,
        image_encoder_engine=image_encoder_engine,
        no_roi_align=True
    )
    return predictor

def predict(predictor, image, text, text_encodings, thresholds, nms_threshold):
    output = predictor.predict(
        image=image,
        text=text,
        text_encodings=text_encodings,
        threshold=thresholds,
        nms_threshold=nms_threshold,
        pad_square=False
    )
    return output
@app.route('/predict', methods=['POST'])
def predict_endpoint():
    global history_prompt, history_text, history_thresholds, history_text_encoding, history_flag
    try:
        # Get the JSON payload of the request
        data = request.get_json()
        print(data)
        # Validate the request
        if not data or 'rtsp' not in data or 'prompt' not in data:
            return jsonify({'error': 'Missing required fields'}), 400
        # Parse the request fields
        rtsp_url = data['rtsp']
        prompt = data['prompt']
        threshold = data.get('threshold', "0.1")
        nms_threshold = float(data.get('nms_threshold', 0.5))
        # Only re-run the LLM and re-encode the text when the prompt changes
        print("current prompt:", prompt)
        print("previous prompt:", history_prompt)
        if prompt != history_prompt:
            text = extract_prompt(prompt)
            print(text)
            flag = text[-1]   # whether a modifier relation exists
            text = text[0:-1]  # the detection targets
            print(text)
            print(flag)
            thresholds = deal_threshold(threshold)
            text_encodings = predictor.encode_text(text)
            history_prompt = prompt
            history_text = text
            history_thresholds = thresholds
            history_text_encoding = text_encodings
            history_flag = flag
        else:
            text = history_text
            thresholds = history_thresholds
            text_encodings = history_text_encoding
            flag = history_flag
        cap = cv2.VideoCapture(rtsp_url)
        ret, frame = cap.read()
        i = 0
        while ret:
            ret, frame = cap.read()
            cv_image_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            image = PIL.Image.fromarray(cv_image_rgb)
            # Run detection on the current frame
            output = predict(predictor, image, text, text_encodings, thresholds, nms_threshold)
            image = draw_owl_output_new(image, output, text=text, draw_text=True, prompt_flag=flag)
            #temp = pil_image_to_base64(image)
            # Build the per-frame result
            result = {
                'output': pil_image_to_base64(image),
                'text': text
            }
            # Save the annotated frame (set this path in advance)
            image.save("/data/pic/temp/output_" + str(i) + ".jpg")
            print(i)
            i = i + 1
    except Exception as e:
        return jsonify({'error': str(e)}), 400
if __name__ == '__main__':
    model = "google/owlvit-base-patch32"
    image_encoder_engine = "/data/owl2/nanoowl/data/owl_image_encoder_patch32.engine"
    predictor = load_model(model, image_encoder_engine)
    global history_prompt, history_text, history_thresholds, history_text_encoding, history_flag
    history_prompt = ""
    history_text = ""
    history_thresholds = ""
    history_text_encoding = ""
    history_flag = ""
    print("load model success")
    app.run(host='0.0.0.0', port=8079)
Client-side code
1. client.py core code
import base64
import requests
from PIL import Image
import cv2
import numpy as np
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
    return encoded_string

def decode_image(base64_image):
    img = base64.b64decode(base64_image)
    return img

def send_request(rtsp_url, prompt, threshold, nms_threshold):
    data = {
        "rtsp": rtsp_url,
        "prompt": prompt,
        "threshold": str(threshold),
        "nms_threshold": str(nms_threshold)
    }
    url = "http://0.0.0.0:8079/predict"
    response = requests.post(url, json=data, headers={"Content-Type": "application/json"})
    if response.status_code == 200:
        return response.json()
    else:
        return {"error": f"Request failed with status {response.status_code}"}

if __name__ == "__main__":
    #rtsp_url = "rtsp://admin:****@***.***.*.**:***"  # replace with your RTSP URL
    rtsp_url = "rtsp://***.***.*.***:***/record/11.mp4"
    prompt = "查找图中的男人和行李箱"
    threshold = 0.02
    nms_threshold = 0.5
    result = send_request(rtsp_url, prompt, threshold, nms_threshold)
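Note that in this demo the server keeps looping over the RTSP stream and saves annotated frames to disk, so under normal operation the HTTP response only comes back on an error. If the server is adapted to return a single frame's result, the base64 image in the response can be recovered with the decode_image helper defined above, for example:

    if "output" in result:
        with open("result.jpg", "wb") as f:
            f.write(decode_image(result["output"]))
    else:
        print(result)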