[NVIDIA NIM Hackathon Training Camp] Describe the Picture

深耕AI

Last modified 2024-10-10 16:26:39

Tags: artificial intelligence, python, programming languages

First published 2024-10-10 14:15:16

Article link: https://blog.csdn.net/weixin_45037357/article/details/142820162

Version

Final submission (for the competition)

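Project background

Image recognition and image captioning are used in many areas, such as social media, accessibility tools, and language learning for children. An automated captioning system helps users grasp what an image shows quickly. For visually impaired users in particular, this can noticeably improve everyday life. This project builds a simple image captioning system: the user uploads an image, enters an API key, and receives an automatically generated description.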

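Tech stack

Programming language

Python: the project is written mainly in Python, whose rich libraries and community support make it a good fit for image processing and network requests.

Image processing

PIL (Python Imaging Library)/Pillow: used to open, process, and compress images. Pillow is the maintained fork of PIL and is what `from PIL import Image` imports.

Network requests

Requests: used to send requests to the NVIDIA API and receive the responses.

Web application framework

Gradio: used to quickly build and launch a browser-based user interface where the user can upload an image and enter an API key.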
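
For the image-processing step in the stack above, here is a minimal sketch of how Pillow could shrink an image until its base64 form fits under a size limit. The helper name `encode_image_under_limit`, its parameters, and the quality steps are assumptions made for illustration, not part of the submitted project. The default limit of 180,000 characters mirrors the check in the project code.

```python
import base64
import io

from PIL import Image


def encode_image_under_limit(img, max_b64_len=180_000, min_side=64):
    """Return a base64 JPEG string shorter than max_b64_len, shrinking as needed.

    Hypothetical helper: the name, defaults, and quality steps are assumptions.
    """
    img = img.convert("RGB")  # JPEG has no alpha channel or palette mode
    quality = 85
    while True:
        buffered = io.BytesIO()
        img.save(buffered, format="JPEG", quality=quality)
        image_b64 = base64.b64encode(buffered.getvalue()).decode()
        if len(image_b64) < max_b64_len:
            return image_b64
        if quality > 35:
            quality -= 15  # first try a lower JPEG quality
        elif min(img.size) // 2 >= min_side:
            # then halve both dimensions and encode again
            img = img.resize((img.width // 2, img.height // 2))
        else:
            raise ValueError("image cannot be compressed below the size limit")
```

Compared with a single fixed resize, a loop like this keeps small images at higher quality and only degrades large ones as far as the limit requires.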

Project description

The user only needs to upload an image and enter their own API key to get a description of the image.
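
The request behind this flow can be sketched as a small pure function that turns the image bytes and the API key into headers and a JSON payload. The function name `build_request` is hypothetical. The field values follow the project code, and the sketch assumes the image bytes are already JPEG-encoded.

```python
import base64


def build_request(image_bytes, api_key, stream=True):
    """Return (headers, payload) for one image-description request.

    Hypothetical helper; image_bytes is assumed to be JPEG data.
    """
    image_b64 = base64.b64encode(image_bytes).decode()
    headers = {
        "Authorization": f"Bearer {api_key}",
        # Streaming responses arrive as server-sent events
        "Accept": "text/event-stream" if stream else "application/json",
    }
    prompt = (
        "What do you see in the following image? "
        f'<img src="data:image/jpeg;base64,{image_b64}" />'
    )
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 0.20,
        "top_p": 0.70,
        "stream": stream,
    }
    return headers, payload
```

Keeping this step free of network calls makes it easy to check the request shape before spending API credits.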

Code

Environment setup

pip install pillow
pip install requests
pip install gradio==3.50.0

Project code

import base64
import io
import json
import os

import gradio as gr
import requests
from PIL import Image


def translate_to_chinese(text):
    """Translate English text to Chinese with the Helsinki-NLP opus-mt-en-zh model."""
    # Read the Hugging Face token from the environment instead of hardcoding it.
    # HF_API_TOKEN is a variable name assumed here; set it before launching the app.
    headers = {"Authorization": f"Bearer {os.environ.get('HF_API_TOKEN', '')}"}
    api_url = "https://api-inference.huggingface.co/models/Helsinki-NLP/opus-mt-en-zh"

    # Request the translation; network and JSON errors both count as a failure
    try:
        response = requests.post(
            api_url, headers=headers, json={"inputs": text}, timeout=60
        ).json()
    except (requests.RequestException, ValueError):
        return "Translation Error: Translation failed"

    # Extract the translated text
    if response and isinstance(response, list) and "translation_text" in response[0]:
        return response[0]["translation_text"]
    return "Translation Error: Translation failed"

def process_img(file, api_key):
    invoke_url = "https://ai.api.nvidia.com/v1/vlm/adept/fuyu-8b"
    stream = True

    # Open and compress the image
    with Image.open(file.name) as img:
        if img.mode != "RGB":
            img = img.convert("RGB")  # JPEG supports neither alpha nor palette modes
        # Halve both dimensions, keeping each at least 1 pixel
        img = img.resize((max(1, img.width // 2), max(1, img.height // 2)))
        buffered = io.BytesIO()
        img.save(buffered, format="JPEG", quality=50)  # lower quality to shrink the data
        image_b64 = base64.b64encode(buffered.getvalue()).decode()

    # Report oversized images in the UI instead of raising:
    # assert statements are skipped when Python runs with -O.
    if len(image_b64) >= 180_000:
        return "Error: To upload larger images, use the assets API (see docs)", ""

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Accept": "text/event-stream" if stream else "application/json"
    }

    payload = {
        "messages": [
            {
                "role": "user",
                "content": f'What do you see in the following image? <img src="data:image/jpeg;base64,{image_b64}" />'
            }
        ],
        "max_tokens": 1024,
        "temperature": 0.20,
        "top_p": 0.70,
        "seed": 0,
        "stream": stream
    }

    try:
        response = requests.post(invoke_url, headers=headers, json=payload, timeout=120)
        response.raise_for_status()
    except requests.RequestException as e:
        return f"Error: {e}", ""

    result = ""
    if stream:
        for line in response.iter_lines():
            if not line:
                continue
            decoded_line = line.decode("utf-8")
            if decoded_line.startswith("data: "):
                decoded_line = decoded_line[len("data: "):]
            if decoded_line == "[DONE]":
                break
            try:
                data = json.loads(decoded_line)
            except json.JSONDecodeError:
                continue
            choices = data.get("choices") or [{}]
            # The content field may be missing or null in some chunks
            result += choices[0].get("delta", {}).get("content") or ""
    else:
        # Assumes the same OpenAI-style response shape that the streaming branch relies on
        result = response.json()["choices"][0]["message"]["content"]

    # Translate the description
    translated_result = translate_to_chinese(result)

    return result, translated_result

iface = gr.Interface(
    fn=process_img,
    inputs=[
        gr.File(label="Choose an image to describe", type="file"),
        gr.Textbox(label="NVIDIA API Key", type="password"),  # mask the key while typing
    ],
    outputs=[
        gr.Textbox(label="Generated description (English)"),
        gr.Textbox(label="Chinese")
    ],
    title="Describe the Picture"
)

iface.launch()
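
The streaming branch of the code above can also be isolated into a function that takes any iterable of lines, which makes the parsing logic testable without calling the API. This is a minimal sketch: the name `collect_stream_text` is hypothetical, and it assumes the OpenAI-style `choices[0].delta.content` chunk shape that the project code reads.

```python
import json


def collect_stream_text(lines):
    """Join the delta content of streamed chat chunks into one string.

    Hypothetical helper; accepts bytes or str lines such as those
    produced by requests' Response.iter_lines().
    """
    parts = []
    for raw in lines:
        if not raw:
            continue  # blank lines separate server-sent events
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if line.startswith("data: "):
            line = line[len("data: "):]
        if line.strip() == "[DONE]":
            break  # end-of-stream marker
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip comments and malformed chunks
        if not isinstance(data, dict):
            continue
        choices = data.get("choices") or [{}]
        delta = choices[0].get("delta") or {}
        parts.append(delta.get("content") or "")
    return "".join(parts)
```

With a helper like this, `process_img` could call `collect_stream_text(response.iter_lines())` in place of the inline loop.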

Results

The uploaded image:

The result of running the app:

The app generated the following English description for the uploaded image: The image features a large, smiling stuffed toy bear, which appears to be jumping in the air. The bear is brown in color, and it appears to be animated.

The Chinese translation it returned: 图像中有一个大而微笑的玩具熊, 它似乎在空中跳跃。熊的颜色是棕色的, 它看起来是动画的。