Reproducing the Caption-Anything Project on Ubuntu 22.04

Article link: https://blog.csdn.net/YIBO0408/article/details/133141420

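This article introduces Caption-Anything, a project developed at Southern University of Science and Technology that combines image segmentation with text generation to produce descriptive text for objects in an image. It walks through environment setup, dependency installation, and running the project, including the GPU memory each model combination needs and fixes for common problems.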


I. Project Introduction

1. GitHub repository:

https://github.com/ttengwang/Caption-Anything

2. Paper:

https://arxiv.org/abs/2305.02677

3. Project summary:

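Caption-Anything is a multimodal image-processing tool released by a team at Southern University of Science and Technology. It combines two of this year's mainstream technologies, Segment-Anything and ChatGPT, which handle image segmentation (driven by mouse clicks that produce points, boxes, or trajectories) and text generation (with control over length, sentiment, and factuality) respectively. Concretely, Caption-Anything generates a descriptive caption for any object in the input image, and it offers a range of output languages to suit users in different countries. Overall, it unifies visual and language prompts in one modular framework, so the different controls can be combined flexibly.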

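[Figure: Visual controls and language controls for text generation]
[Figure: Selecting an object to get a detailed description]
[Figure: Interactive demo]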

II. Implementation

1. Environment setup:

# Clone the repository (if git clone fails, the simplest fix is to download the zip directly, or see Troubleshooting below):

git clone https://github.com/ttengwang/caption-anything.git

cd caption-anything

# Install the dependencies (Python >= 3.8.1 is required; there are pitfalls, so a one-shot install is not recommended; see Troubleshooting below):
conda create -n cat python=3.10

conda activate cat

pip install -r requirements.txt  # download speed is not covered here; I wrote about it in an earlier post

# Configure an OpenAI API key with GPT-4 access (a GPT-3.5 key does not work, at least not for me); it is entered later in the web page

# Run the Caption-Anything gradio demo:

python app_langchain.py --segmenter huge --captioner blip2 --port 6086  --clip_filter  
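# Needs more than 13 GB of GPU memory; in practice only a 3090, 3090 Ti, 4080, or 4090 will do

#python app_langchain.py --segmenter base --captioner blip2 
# Needs about 9 GB of GPU memory; a 3080, 3080 Ti, 3090, 3090 Ti, 4060 Ti (16 GB), 4070, 4070 Ti, 4080, or 4090 will do

#python app_langchain.py --segmenter base --captioner blip 
# Needs about 6 GB of GPU memory; worth trying on an ordinary consumer GPU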

# (Optional, recommended) Use the SAM pretrained checkpoint:

wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -O ./sam_vit_h_4b8939.pth # keep your network proxy working

python app_langchain.py --segmenter huge --captioner blip2 --segmenter_checkpoint ./sam_vit_h_4b8939.pth  # needs about 13 GB of GPU memory
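
The three model combinations above can be turned into a small lookup. The following is only a sketch: the thresholds are the rough memory figures quoted in this post, `pick_command` is a hypothetical helper that is not part of Caption-Anything, and the memory size is passed in by hand rather than detected.

```python
# (minimum GPU memory in GB, segmenter, captioner), largest first.
# The numbers are the approximate figures quoted in this post, not measurements.
TIERS = [(13, "huge", "blip2"), (9, "base", "blip2"), (6, "base", "blip")]


def pick_command(vram_gb):
    """Return the demo command (as an argument list) for the given memory size.

    Hypothetical helper: returns None when none of the listed
    configurations is expected to fit.
    """
    for min_gb, segmenter, captioner in TIERS:
        if vram_gb >= min_gb:
            return ["python", "app_langchain.py",
                    "--segmenter", segmenter, "--captioner", captioner]
    return None


if __name__ == "__main__":
    for gb in (24, 12, 8, 4):
        cmd = pick_command(gb)
        print(gb, "GB ->", " ".join(cmd) if cmd else "no listed configuration fits")
```

The returned list can be passed to `subprocess.run` unchanged; extra flags such as `--port` or `--clip_filter` would be appended by hand.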

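2. Troubleshooting:

git clone issues:

When pulling the GitHub repository with git clone, I ran into the errors "Recv failure: Connection was reset" and "Failed to connect to github.com port 443 after 23456 ms: Couldn't connect to server". In that case git needs its own proxy configuration. Using the port your proxy listens on (7890 by default here; 127.0.0.1 is the local machine), run: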

git config --global http.proxy http://127.0.0.1:7890
git config --global https.proxy http://127.0.0.1:7890

That completes the configuration. Try git clone again:

git clone https://github.com/ttengwang/caption-anything.git

If it still fails, this time with an SSLError, SSL certificate verification can be turned off with the following command:

git config --global http.sslVerify false
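
Before pointing git at 127.0.0.1:7890, it can help to confirm that something is actually listening on that port, since a wrong port produces the same "Couldn't connect" symptom. This is a minimal sketch using only the standard library; `proxy_is_listening` is a hypothetical helper name, and a successful check only shows that the port is open, not that the proxy can reach GitHub.

```python
import socket


def proxy_is_listening(host="127.0.0.1", port=7890, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if proxy_is_listening():
        print("proxy port is open; the git proxy settings should be usable")
    else:
        print("nothing is listening on 127.0.0.1:7890; check the proxy port")
```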

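Dependency installation issues:

gradio must be the exact version given in the author's requirements.txt. I installed the latest release, in which some components had been removed, and that caused errors: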
```
pip install https://gradio-builds.s3.amazonaws.com/3e68e5e882a6790ac5b457bd33f4edf9b695af90/gradio-3.24.1-py3-none-any.whl
```
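langchain must also be installed at the version given in requirements.txt.
The pillow version pulled in as a dependency is too new and has to be downgraded (getsize was removed in Pillow 10); otherwise running the project later fails with AttributeError: 'FreeTypeFont' object has no attribute 'getsize':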
```
pip install pillow==9.5
```
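torch: the latest version works in my testing, so there is no need for the old version listed in requirements.txt.
The other packages in requirements.txt do not need pinned versions; the defaults are fine.

The modified requirements.txt: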

torch
torchvision
torchaudio
openai
pillow==9.5
langchain==0.0.101
git+https://github.com/huggingface/transformers.git
ftfy
regex
tqdm
git+https://github.com/openai/CLIP.git
git+https://github.com/facebookresearch/segment-anything.git
opencv-python
pycocotools
matplotlib
onnxruntime
onnx
https://gradio-builds.s3.amazonaws.com/3e68e5e882a6790ac5b457bd33f4edf9b695af90/gradio-3.24.1-py3-none-any.whl
accelerate
bitsandbytes
easyocr
tensorboardX
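
After installing, the three version constraints discussed above (gradio, langchain, pillow) can be checked in one pass. The sketch below uses the standard-library `importlib.metadata`; the helper names `installed_versions` and `find_problems` are made up for illustration, and the pins are simply the ones listed in this post.

```python
from importlib import metadata

# Exact pins taken from the modified requirements.txt in this post.
EXACT_PINS = {"gradio": "3.24.1", "langchain": "0.0.101"}
# FreeTypeFont.getsize was removed in Pillow 10, so stay on major version 9 or lower.
PILLOW_MAX_MAJOR = 9


def installed_versions(names):
    """Return {package: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_problems(versions):
    """Compare a {package: version} mapping with the pins; return messages."""
    problems = []
    for name, wanted in EXACT_PINS.items():
        got = versions.get(name)
        if got != wanted:
            problems.append(f"{name}: expected {wanted}, found {got}")
    pillow = versions.get("pillow")
    if pillow is None:
        problems.append("pillow: not installed")
    elif int(pillow.split(".")[0]) > PILLOW_MAX_MAJOR:
        problems.append(f"pillow: expected < {PILLOW_MAX_MAJOR + 1}, found {pillow}")
    return problems


if __name__ == "__main__":
    current = installed_versions(["gradio", "langchain", "pillow"])
    for line in find_problems(current) or ["all pinned packages look fine"]:
        print(line)
```

Run it inside the `cat` conda environment so that it inspects the packages the demo will actually import.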

3. Running the project:

I ran it with the SAM pretrained checkpoint sam_vit_h_4b8939.pth:

python app_langchain.py --segmenter huge --captioner blip2 --segmenter_checkpoint ./sam_vit_h_4b8939.pth

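This hit another annoying bug. On startup the script connects to huggingface.co to download the models and their config files, but even with the proxy switched on it could never reach the site and failed with errors like the following: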

Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.

ConnectionResetError: [Errno 104] Connection reset by peer.

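Most of the answers I found online did not help. The fix that finally worked:

Add the following code at the top of app_langchain.py so that the proxy is configured correctly and the models are downloaded into the system cache directory: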

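import os
# '~' is not expanded inside environment variables, so expand it explicitly
os.environ['TRANSFORMERS_CACHE'] = os.path.expanduser('~/.cache/huggingface/hub')

# Local proxy address; change the port to match your own proxy
proxy = "http://127.0.0.1:7890"

os.environ['http_proxy'] = proxy
os.environ['HTTP_PROXY'] = proxy
os.environ['https_proxy'] = proxy
os.environ['HTTPS_PROXY'] = proxy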
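
To confirm that these variables are really picked up, the same settings can be wrapped in a function that reports what Python's standard library sees afterwards. This is a sketch under assumptions: `configure_proxy` is a hypothetical helper, and `urllib.request.getproxies()` only shows the standard library's view of the environment. Whether the model download honours the proxy depends on the HTTP client it uses, although clients generally read the same variables.

```python
import os
import urllib.request


def configure_proxy(proxy_url, cache_dir="~/.cache/huggingface/hub"):
    """Export cache and proxy settings, then return the proxies urllib detects.

    Hypothetical helper; call it before importing any library that
    downloads models, so the settings are in place when it starts.
    """
    os.environ["TRANSFORMERS_CACHE"] = os.path.expanduser(cache_dir)
    for name in ("http_proxy", "HTTP_PROXY", "https_proxy", "HTTPS_PROXY"):
        os.environ[name] = proxy_url
    return urllib.request.getproxies()


if __name__ == "__main__":
    proxies = configure_proxy("http://127.0.0.1:7890")
    print("https proxy seen by urllib:", proxies.get("https"))
    print("cache directory:", os.environ["TRANSFORMERS_CACHE"])
```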

[Figure: position of the added code at the top of app_langchain.py]