Preface
This article does not produce any new techniques; it only ports existing ones! The overall framework diagram is shown below:
This article uses the open-source nanoowl framework together with Qwen2.5 to implement open-vocabulary detection. For example, the prompt "查找图中的男人和行李箱" ("find the men and suitcases in the image") detects the men and suitcases in the frame, and "查找图中穿着白色T恤的女人" ("find the women wearing white T-shirts in the image") detects only women wearing white T-shirts, excluding women dressed otherwise (the exact results depend on the models).
The main techniques involved are: OWL model deployment, automated dataset generation, LLM fine-tuning with LLaMA-Factory, LLM quantization with llama.cpp, and ollama model deployment together with the glue logic.
OWL Model Deployment (Jetson)
1. Install Miniconda on the Jetson.
Reference: https://github.com/BestAnHongjun/LMDeploy-Jetson/blob/main/zh/s2.md
2. Create a conda environment: conda create -n owl python=3.8
3. Activate it: conda activate owl
4. Set up the owl environment.
Reference: https://github.com/xuanlinli17/nanoowl
For TensorRT, refer to the CSDN tutorial "jetson agx orin 的pytorch、torchvision、tensorrt安装最全教程" (installing PyTorch, torchvision, and TensorRT on Jetson AGX Orin).
5. Change the Hugging Face model download source (optional; come back to this step if later steps run into network problems).
Reference (the author's own CSDN post): 解决OSError: We couldn‘t connect to ‘https://huggingface.co‘ to load this file
6. Convert the model and run a test following the reference from step 4 (https://github.com/xuanlinli17/nanoowl). If this step does not work, the rest of the pipeline cannot be assembled.
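As a quick sanity check after the engine conversion, the image-encoder engine can be loaded through OwlPredictor and run on a single test image. The sketch below simply mirrors the calls used in server.py later in this article (model name, engine path, encode_text/predict); the image path test.jpg is a placeholder, and fork-specific arguments such as no_roi_align and nms_threshold are omitted here:

import PIL.Image
from nanoowl.owl_predictor import OwlPredictor

# Engine path follows the conversion step; adjust to your own output location
predictor = OwlPredictor(
    "google/owlvit-base-patch32",
    image_encoder_engine="data/owl_image_encoder_patch32.engine",
)
image = PIL.Image.open("test.jpg")  # any local test image
text = ["a man", "a suitcase"]
output = predictor.predict(
    image=image,
    text=text,
    text_encodings=predictor.encode_text(text),
    threshold=0.1,
)
print(output.labels, output.boxes)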
Automated Dataset Generation (Server)
Because Qwen2.5's structured output is occasionally malformed, the model needs to be fine-tuned. The script below generates training data automatically, and a second script converts it into the ShareGPT format; adapt the generation script to your own requirements.
Data generation script
import random
import json
# Common person-related terms; the Chinese terms form the prompts, the English terms are the target output
people_zh = ["男人", "女人", "男孩", "女孩", "人"]
people_en = ["man", "woman", "boy", "girl", "person"]
# A richer list of accessories
accessories_zh = [
    "口罩", "帽子", "眼镜", "墨镜", "围巾", "手套", "手表", "项链", "手链",
    "戒指", "耳环", "腰带", "领带", "领结", "袖扣", "胸针", "吊坠", "脚链",
    "手镯手表", "智能手表", "健身追踪器", "手环", "发带", "发夹"
]
#"背包", "钱包", "雨伞", "拐杖"
accessories_en = [
    "mask", "hat", "glasses", "sunglasses", "scarf", "glove", "watch", "necklace", "bracelet",
    "ring", "earrings", "belt", "tie", "bow tie", "cufflinks", "brooch", "pendant", "anklet",
    "bracelet watch", "smartwatch", "fitness tracker", "wristband", "headband", "hair clip"
]
#"backpack", "umbrella", "cane",
actions = ["戴着"]

# Generate 100 entries (adjust range() to change the count)
data = []
for _ in range(100):
    person_zh = random.choice(people_zh)
    person_en = people_en[people_zh.index(person_zh)]
    object1_zh = random.choice(accessories_zh)
    object1_en = accessories_en[accessories_zh.index(object1_zh)]
    object2_zh = random.choice(accessories_zh)
    object2_en = accessories_en[accessories_zh.index(object2_zh)]
    # Make sure the two objects are different
    while object1_zh == object2_zh:
        object2_zh = random.choice(accessories_zh)
        object2_en = accessories_en[accessories_zh.index(object2_zh)]
    # Randomly decide whether a modifier relation exists
    has_modifier = random.choice([True, False])
    if has_modifier:
        action = random.choice(actions)
        question = f"查找图中{action}{object1_zh}的{person_zh}"
        answer = [person_en, object1_en, True]
    else:
        question = f"查找图中的{person_zh}和{object2_zh}"
        answer = [person_en, object2_en, False]
    # Append the generated entry
    data.append({
        "question": question,
        "answer": answer
    })

# Print the first 10 entries as a sample
for i, item in enumerate(data[:10]):
    print(f"Sample {i + 1}: {item}")

# Save the generated data to a file
with open('generated_data_daizhe.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
print("Data generated and saved to generated_data_daizhe.json.")
Modify the random vocabulary to suit your own needs. The author generated three data files, one for each of the verbs 拿着 (holding), 穿着 (wearing clothing), and 戴着 (wearing accessories).
Their contents look like this:
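For reference, each generated entry has the following shape (one sample; the exact wording varies with the random choices):

{
  "question": "查找图中戴着口罩的男人",
  "answer": ["man", "mask", true]
}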
ShareGPT data conversion
import json
import os
def getFileList(dir, Filelist, ext=None):
    """
    Recursively collect files from a directory and its subdirectories.
    dir: root directory
    ext: file extension filter
    Returns: list of file paths
    """
    newDir = dir
    if os.path.isfile(dir):
        if ext is None:
            Filelist.append(dir)
        else:
            if ext in dir:
                Filelist.append(dir)
    elif os.path.isdir(dir):
        for s in os.listdir(dir):
            newDir = os.path.join(dir, s)
            getFileList(newDir, Filelist, ext)
    return Filelist

ori_path = r"/home/workspace/qwen2.5-dataset"
ori_list = []
ori_list = getFileList(ori_path, ori_list, ".json")

# ShareGPT conversation template
template = {
    "conversations": [
        {
            "from": "human",
            "value": "user instruction"
        },
        {
            "from": "gpt",
            "value": "model response"
        }
    ]
}

res = []
for i in ori_list:
    with open(i, 'r', encoding='utf-8') as f:
        data = json.load(f)
    for j in data:
        temp = {"conversations": [{"from": "human", "value": "user instruction"}, {"from": "gpt", "value": "model response"}]}
        temp['conversations'][0]['value'] = "“" + j['question'] + "”" + ",这句话中要检测的内容有哪些,被检测对象之间是否存在修饰关系,帮我生成一个python列表,包括图中被检测的对象,列表的最后一位写入是否存在修饰关系,如果存在修饰关系则写true,如果不存在修饰关系写false,列表中的检测对象不要存在动词,并转换为英文"
        temp['conversations'][1]['value'] = str(j['answer'])
        res.append(temp)

# 80/20 train/validation split
split = int(len(res) * 0.8)
res_train = res[:split]
res_val = res[split:]
with open('/home/workspace/qwen2.5-dataset-merged/qwen2.5_train_sharegpt.json', 'w', encoding='utf-8') as f:
    json.dump(res_train, f, ensure_ascii=False, indent=4)
with open('/home/workspace/qwen2.5-dataset-merged/qwen2.5_val_sharegpt.json', 'w', encoding='utf-8') as f:
    json.dump(res_val, f, ensure_ascii=False, indent=4)
Running the script produces the train and validation files.
Their contents look like this:
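Each converted entry follows the ShareGPT conversation format, for example:

{
  "conversations": [
    {
      "from": "human",
      "value": "“查找图中戴着口罩的男人”,这句话中要检测的内容有哪些,被检测对象之间是否存在修饰关系,帮我生成一个python列表,包括图中被检测的对象,列表的最后一位写入是否存在修饰关系,如果存在修饰关系则写true,如果不存在修饰关系写false,列表中的检测对象不要存在动词,并转换为英文"
    },
    {
      "from": "gpt",
      "value": "['man', 'mask', True]"
    }
  ]
}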
LLaMA-Factory LLM Fine-tuning (Server)
1. conda create --name llama-factory python=3.10
2. conda activate llama-factory
3. git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
4. cd LLaMA-Factory
5. pip install -e ".[torch,metrics]"
6. Create a dataset_info.json file in the dataset directory with the following content:
{"qwen2.5_train_sharegpt": {
"file_name": "qwen2.5_train_sharegpt.json",
"formatting": "sharegpt",
"columns": {
"messages": "conversations"}}}
7. Launch the LLaMA-Factory web UI: CUDA_VISIBLE_DEVICES=2,3 llamafactory-cli webui
8. Open a browser, go to http://192.168.1.10:6006/ , and start configuring.
1) Set the language to zh (Chinese).
2) Select or edit the "model name", i.e. the name of the base model to use.
3) Set the "model path" to an absolute local path; with a relative path the model may be pulled from Hugging Face instead.
4) Keep the default fine-tuning method, "lora".
5) Leave the "checkpoint path" empty for a first training run; otherwise an existing checkpoint can be selected.
6) The defaults under "advanced configurations" are all fine.
7) Switch to the "Train" tab and configure the training parameters:
a. Set the "training stage" to "Supervised Fine-Tuning".
b. Set the "data path" to an absolute path or to the relative path "data".
c. "Dataset": enter the dataset name registered in dataset_info.json, i.e. qwen2.5_train_sharegpt.
d. Use "preview dataset" to check that the dataset loads correctly.
e. Set the "number of epochs" as needed.
f. Set the "compute type" to fp16 (V100 GPUs do not support bf16).
g. Set the "batch size" as needed.
8) The remaining training parameters can be left at their defaults.
9) After the training parameters are configured, "preview command" shows the resulting training command, and "save training arguments" stores the configuration.
10) Click "Start" to launch training.
11) The "device count" shows the number of currently available devices; if CUDA_VISIBLE_DEVICES was used to pin specific GPUs, only those are counted.
12) Once training starts without errors, the "loss" chart on the right shows the loss curve as it evolves.
9. Open the "Chat" tab, select the "checkpoint path", then "load model"; once the model has loaded, you can chat with it to test the fine-tune.
10. Open the "Export" tab, set the "export dir" (e.g. /data/Qwen2.5-7B-finetuned), keep the other options at their defaults, and click "Export" to export the merged model.
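Before quantizing, it can be worth verifying that the exported (merged) model loads and responds in the trained format using transformers. A minimal sketch, assuming the export directory from step 10 and reusing the fine-tuning instruction string:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/data/Qwen2.5-7B-finetuned"  # export directory from step 10
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

prompt = "“查找图中戴着口罩的男人”,这句话中要检测的内容有哪些,被检测对象之间是否存在修饰关系,帮我生成一个python列表,包括图中被检测的对象,列表的最后一位写入是否存在修饰关系,如果存在修饰关系则写true,如果不存在修饰关系写false,列表中的检测对象不要存在动词,并转换为英文"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
# Expect output resembling ['man', 'mask', True]
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))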
llama.cpp LLM Quantization (Server)
1. git clone https://github.com/ggerganov/llama.cpp.git
2. cd llama.cpp
3. conda activate llama-factory
4. pip install -r requirements.txt
5. make -j
6. python convert_hf_to_gguf.py /data/Qwen2.5-7B-finetuned --outfile ./Qwen2.5/qwen2.5-7b-finetuned-export/qwen2_5-7b-finetuned.gguf
7. ./llama-quantize ./Qwen2.5/qwen2.5-7b-finetuned-export/qwen2_5-7b-finetuned.gguf ./Qwen2.5/qwen2.5-7b-finetuned-export/qwen2_5-7b-finetuned-q4_0.gguf q4_0
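Optionally, the quantized gguf can be given a quick smoke test with the llama-cli binary produced by make in step 5 (assuming a recent llama.cpp build; older builds name the binary main):
./llama-cli -m ./Qwen2.5/qwen2.5-7b-finetuned-export/qwen2_5-7b-finetuned-q4_0.gguf -p "hello" -n 32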
Ollama Model Deployment (Jetson)
1. Create a file named qwen2.5-7b-finetuned.Modelfile with the following content:
FROM ./Qwen2.5/qwen2.5-7b-finetuned-export/qwen2_5-7b-finetuned-q4_0.gguf
TEMPLATE """{{- if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{- if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
{{- else }}
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ end }}{{ .Response }}{{ if .Response }}<|im_end|>{{ end }}"""
2. Copy the quantized gguf file and the newly created Modelfile to the Jetson device.
3. Install ollama on the Jetson (install it however you prefer).
4. ollama create your_model_name -f qwen2.5-7b-finetuned.Modelfile
5. Run ollama list to verify that the model was created successfully.
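To confirm the deployed model produces the expected list format, it can be queried through ollama's OpenAI-compatible endpoint, which is exactly what extract_prompt in server.py does below. A minimal sketch (the api_key value is arbitrary because ollama does not validate it, and the instruction string matches the fine-tuning prompt):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="xxx")
model_name = client.models.list().data[0].id  # the model created with ollama create
prompt = "“查找图中戴着口罩的男人”,这句话中要检测的内容有哪些,被检测对象之间是否存在修饰关系,帮我生成一个python列表,包括图中被检测的对象,列表的最后一位写入是否存在修饰关系,如果存在修饰关系则写true,如果不存在修饰关系写false,列表中的检测对象不要存在动词,并转换为英文"
response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": prompt}],
    max_tokens=300,
)
print(response.choices[0].message.content)  # expect output resembling ['man', 'mask', True]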
Logic Code
Since business logic is involved, only demo code is released here; implement the remaining requirements yourself. The flow chart of the logic code is shown below:
Server-side code
1. Open the owl_drawing.py file and append the following code at the end.
# Helper functions for the open-vocabulary ("broad semantics") detection mode
def decode_output(output):
    # Group detected boxes by label index: {label_index: [box, box, ...]}
    temp = {}
    num_detections = len(output.labels)
    for i in range(num_detections):
        label_index = int(output.labels[i])
        temp[label_index] = []
    for i in range(num_detections):
        box = output.boxes[i]
        label_index = int(output.labels[i])
        box = [int(x) for x in box]
        temp[label_index].append(box)
    return temp
def calculate_iou(box1, box2):
    """
    Compute the overlap between two boxes (x1, y1, x2, y2).
    Note: rather than the standard IoU, this returns the larger of the two
    intersection-over-own-area ratios, so a box fully contained in another
    scores close to 1.0.
    """
    # Intersection rectangle
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    # No overlap
    if x2 < x1 or y2 < y1:
        return 0.0
    # Intersection area
    intersection_area = (x2 - x1) * (y2 - y1)
    # Areas of the two boxes
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    # Union area (kept for reference; not used by the ratio below)
    union_area = area1 + area2 - intersection_area
    # Standard IoU would be: intersection_area / union_area
    #iou = intersection_area / union_area
    iou = max(intersection_area / area1, intersection_area / area2)
    return iou
def find_closest_box(boxes_dict):
    # Keep only boxes that strongly overlap (> 0.95) a box of a different
    # category, i.e. object boxes attached to a person box and vice versa
    filtered_boxes = {}
    for category, boxes in boxes_dict.items():
        filtered_boxes[category] = []
        for box in boxes:
            keep_box = False
            for other_category, other_boxes in boxes_dict.items():
                if other_category != category:
                    for other_box in other_boxes:
                        iou = calculate_iou(box, other_box)
                        if iou > 0.95:
                            keep_box = True
                            break
                if keep_box:
                    break
            if keep_box:
                filtered_boxes[category].append(box)
    return filtered_boxes
# Drawing for the open-vocabulary ("broad semantics") detection mode
def draw_owl_output_new(image, output: OwlDecodeOutput, text: List[str], draw_text=True, prompt_flag=False):
    is_pil = not isinstance(image, np.ndarray)
    if is_pil:
        image = np.array(image)
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 0.75
    colors = get_colors(len(text))
    num_detections = len(output.labels)
    if not prompt_flag:
        # No modifier relation: draw every detection as-is
        for i in range(num_detections):
            box = output.boxes[i]
            label_index = int(output.labels[i])
            box = [int(x) for x in box]
            pt0 = (box[0], box[1])
            pt1 = (box[2], box[3])
            cv2.rectangle(
                image,
                pt0,
                pt1,
                colors[label_index],
                2
            )
            if draw_text:
                offset_y = 12
                offset_x = 0
                label_text = text[label_index] + ' ' + f'{output.scores[i]:.2f}'
                cv2.putText(
                    image,
                    label_text,
                    (box[0] + offset_x, box[1] + offset_y),
                    font,
                    font_scale,
                    colors[label_index],
                    1,  # thickness
                    cv2.LINE_AA
                )
    else:
        # Modifier relation: only keep boxes that overlap a box of another category
        temp = decode_output(output)
        filter_boxes = find_closest_box(temp)
        for key in filter_boxes.keys():
            boxes = filter_boxes[key]
            for box in boxes:
                cv2.rectangle(
                    image,
                    (box[0], box[1]),
                    (box[2], box[3]),
                    colors[key],
                    2,  # thickness
                )
                if draw_text:
                    offset_y = 12
                    offset_x = 0
                    label_text = text[int(key)]
                    cv2.putText(
                        image,
                        label_text,
                        (box[0] + offset_x, box[1] + offset_y),
                        font,
                        font_scale,
                        colors[key],
                        2,  # thickness
                        cv2.LINE_AA
                    )
    if is_pil:
        image = PIL.Image.fromarray(image)
    return image
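To illustrate what the filtering does: for a prompt with a modifier relation such as "查找图中戴着口罩的男人", a mask box is kept only when it lies (almost) entirely inside a person box, because calculate_iou returns the larger of the two intersection-over-own-area ratios. A toy example with hypothetical coordinates, using the functions defined above:

person_box = [100, 50, 300, 450]    # x1, y1, x2, y2
mask_box = [180, 120, 240, 170]     # fully inside the person box
lone_hat_box = [500, 60, 560, 110]  # overlaps nothing

print(calculate_iou(person_box, mask_box))  # 1.0 -> above the 0.95 threshold
print(find_closest_box({0: [person_box], 1: [mask_box, lone_hat_box]}))
# {0: [[100, 50, 300, 450]], 1: [[180, 120, 240, 170]]} -- the lone hat box is dropped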
2. server.py core code. The code saves the annotated frames of the RTSP stream to disk, so set the save path in advance (the image.save(...) call inside predict_endpoint).
import io
from io import BytesIO
import json
import base64
import torch
import PIL.Image
from flask import Flask, request, jsonify
from translate import Translator
from nanoowl.owl_predictor import OwlPredictor
from nanoowl.owl_drawing import draw_owl_output,draw_owl_output_new
import numpy as np
import requests
import cv2
app = Flask(__name__)
# Used for parsing the prompt text with the LLM served by ollama
from openai import OpenAI
import re
import ast
def extract_prompt(prompt):
    # Query the fine-tuned Qwen2.5 model served by ollama (OpenAI-compatible API)
    client = OpenAI(base_url='http://localhost:11434/v1', api_key='xxx')
    model_type = client.models.list().data[0].id
    print(model_type)
    # Note: keep this instruction consistent with the fine-tuning prompt used in the ShareGPT conversion script
    prompt = "“" + prompt + "”" + "这句话中要检测的内容有哪些,被检测对象之间是否存在修饰关系,帮我生成一个python列表,包括图中被检测的对象,列表的最后一位写入是否存在修饰关系,如果存在修饰关系则写true,如果不存在修饰关系写false,列表中的检测对象不要存在动词,并转换为英文"
    print(prompt)
    response = client.chat.completions.create(
        model=model_type,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt}
                ],
            }
        ],
        max_tokens=300
    )
    # Extract the python list from the model response
    start = response.choices[0].message.content.find("[")
    end = response.choices[0].message.content.find("]")
    result = ast.literal_eval(response.choices[0].message.content[start:end + 1])
    return result
def pil_image_to_base64(image):
    buffer = BytesIO()
    # Save the image into an in-memory byte buffer
    image.save(buffer, format="JPEG")
    # Get the raw bytes
    byte_string = buffer.getvalue()
    # Convert the bytes to a Base64 string
    base64_string = base64.b64encode(byte_string).decode('utf-8')
    return base64_string
# def translate_prompt(prompt):
# res = []
# url = 'https://fanyi.baidu.com/sug'
# for word in prompt:
# data = {}
# data["kw"] = word
# temp = requests.post(url, data=data).json()
# print(word,temp)
# if len(temp['data'])==0:
# continue
# temp_text = temp['data'][0]['v']
# match_english = re.search(r'\b[a-zA-Z]+\b', temp_text)
# res.append(match_english.group())
# return res
# def deal_prompt(prompt):
# # prompt = prompt.strip("][()")
# # text = prompt.split(',')
# new_prompt = translate_prompt(prompt)
# return new_prompt
def deal_threshold(threshold):
    # Parse "0.1" or "[0.1,0.2]" into a float or a list of floats
    thresholds = threshold.strip("][()")
    thresholds = thresholds.split(',')
    if len(thresholds) == 1:
        thresholds = float(thresholds[0])
    else:
        thresholds = [float(x) for x in thresholds]
    return thresholds

def load_model(model, image_encoder_engine):
    predictor = OwlPredictor(
        model,
        image_encoder_engine=image_encoder_engine,
        no_roi_align=True
    )
    return predictor

def predict(predictor, image, text, text_encodings, thresholds, nms_threshold):
    output = predictor.predict(
        image=image,
        text=text,
        text_encodings=text_encodings,
        threshold=thresholds,
        nms_threshold=nms_threshold,
        pad_square=False
    )
    return output
@app.route('/predict', methods=['POST'])
def predict_endpoint():
    global history_prompt, history_text, history_thresholds, history_text_encoding, history_flag
    try:
        # Get the JSON payload of the request
        data = request.get_json()
        print(data)
        # Validate the request
        if not data or 'rtsp' not in data or 'prompt' not in data:
            return jsonify({'error': 'Missing required fields'}), 400
        # Parse the request fields
        rtsp_url = data['rtsp']
        prompt = data['prompt']
        threshold = data.get('threshold', "0.1")
        nms_threshold = float(data.get('nms_threshold', 0.5))
        # Only re-run the LLM and re-encode the text when the prompt changes
        print("current prompt:", prompt)
        print("previous prompt:", history_prompt)
        if prompt != history_prompt:
            text = extract_prompt(prompt)
            print(text)
            flag = text[-1]   # whether a modifier relation exists
            text = text[0:-1]  # the detection targets
            print(text)
            print(flag)
            thresholds = deal_threshold(threshold)
            text_encodings = predictor.encode_text(text)
            history_prompt = prompt
            history_text = text
            history_thresholds = thresholds
            history_text_encoding = text_encodings
            history_flag = flag
        else:
            text = history_text
            thresholds = history_thresholds
            text_encodings = history_text_encoding
            flag = history_flag
        cap = cv2.VideoCapture(rtsp_url)
        ret, frame = cap.read()
        i = 0
        while ret:
            ret, frame = cap.read()
            cv_image_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            image = PIL.Image.fromarray(cv_image_rgb)
            # Run detection on the current frame
            output = predict(predictor, image, text, text_encodings, thresholds, nms_threshold)
            image = draw_owl_output_new(image, output, text=text, draw_text=True, prompt_flag=flag)
            #temp = pil_image_to_base64(image)
            # Build the per-frame result
            result = {
                'output': pil_image_to_base64(image),
                'text': text
            }
            # Save the annotated frame (set this path in advance)
            image.save("/data/pic/temp/output_" + str(i) + ".jpg")
            print(i)
            i = i + 1
    except Exception as e:
        return jsonify({'error': str(e)}), 400
if __name__ == '__main__':
    model = "google/owlvit-base-patch32"
    image_encoder_engine = "/data/owl2/nanoowl/data/owl_image_encoder_patch32.engine"
    predictor = load_model(model, image_encoder_engine)
    global history_prompt, history_text, history_thresholds, history_text_encoding, history_flag
    history_prompt = ""
    history_text = ""
    history_thresholds = ""
    history_text_encoding = ""
    history_flag = ""
    print("load model success")
    app.run(host='0.0.0.0', port=8079)
Client-side code
1. client.py core code
import base64
import requests
from PIL import Image
import cv2
import numpy as np
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
    return encoded_string

def decode_image(base64_image):
    img = base64.b64decode(base64_image)
    return img

def send_request(rtsp_url, prompt, threshold, nms_threshold):
    data = {
        "rtsp": rtsp_url,
        "prompt": prompt,
        "threshold": str(threshold),
        "nms_threshold": str(nms_threshold)
    }
    url = "http://0.0.0.0:8079/predict"
    response = requests.post(url, json=data, headers={"Content-Type": "application/json"})
    if response.status_code == 200:
        return response.json()
    else:
        return {"error": f"Request failed with status {response.status_code}"}

if __name__ == "__main__":
    #rtsp_url = "rtsp://admin:****@***.***.*.**:***"  # replace with your RTSP URL
    rtsp_url = "rtsp://***.***.*.***:***/record/11.mp4"
    prompt = "查找图中的男人和行李箱"
    threshold = 0.02
    nms_threshold = 0.5
    result = send_request(rtsp_url, prompt, threshold, nms_threshold)
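Note that in this demo the server keeps looping over the RTSP stream and saves annotated frames to disk, so under normal operation the HTTP response only comes back on an error. If the server is adapted to return a single frame's result, the base64 image in the response can be recovered with the decode_image helper defined above, for example:

    if "output" in result:
        with open("result.jpg", "wb") as f:
            f.write(decode_image(result["output"]))
    else:
        print(result)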