想提供Huggingface的transformer库实现多模态模型CLIP的推断,结果报错
(myenv) root@d27d1ff1836c:/home/model_test# python3 CLIP.py
ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.
Traceback (most recent call last):
File “/home/model_test/myenv/lib/python3.8/site-packages/PIL/Image.py”, line 3089, in fromarray
mode, rawmode = _fromarray_typemap[typekey]
KeyError: ((1, 1, 224, 224), ‘|u1’)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File “CLIP.py”, line 87, in
inputs = processor(text=text, images=image, return_tensors=“pt”, padding=True)
File “/home/model_test/myenv/lib/python3.8/site-packages/transformers/models/clip/processing_clip.py”, line 148, in call
image_features = self.feature_extractor(images, return_tensors=return_tensors, **kwargs)
File “/home/model_test/myenv/lib/python3.8/site-packages/transformers/models/clip/feature_extraction_clip.py”, line 146, in call
images = [self.resize(image=image, size=self.size, resample=self.resample) for image in images]
File “/home/model_test/myenv/lib/python3.8/site-packages/transformers/models/clip/feature_extraction_clip.py”, line 146, in
images = [self.resize(image=image, size=self.size, resample=self.resample) for image in images]
File “/home/model_test/myenv/lib/python3.8/site-packages/transformers/models/clip/feature_extraction_clip.py”, line 199, in resize
image = self.to_pil_image(image)
File “/home/model_test/myenv/lib/python3.8/site-packages/transformers/image_utils.py”, line 78, in to_pil_image
return PIL.Image.fromarray(image)
File “/home/model_test/myenv/lib/python3.8/site-packages/PIL/Image.py”, line 3092, in fromarray
raise TypeError(msg) from e
TypeError: Cannot handle this data type: (1, 1, 224, 224), |u1
以下注释掉的部分是出错的代码,而没被注释的代码是不出错的代码,最大的区别是没出错的代码没有进行图像预处理
by the way,出错的代码是GPT4给我的(改了10几次还是出错,看来训练数据有问题),而没出错的代码是newbing给我的,一次做对。文末给出提示词。
# import os
# import requests
# import torch
# import numpy as np
# from PIL import Image
# from torchvision import transforms
# from transformers import CLIPProcessor, CLIPFeatureExtractor, CLIPModel
# os.environ["TRANSFORMERS_CACHE"] = "https://mirrors.tuna.tsinghua.edu.cn/hugging-face-models"
# # 定义图像预处理操作
# preprocess = transforms.Compose([
# transforms.Resize(256),
# transforms.CenterCrop(224),
# transforms.ToTensor(),
# transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
# ])
# # 加载预训练模型的处理器和模型
# processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# feature_extractor = CLIPFeatureExtractor.from_pretrained("openai/clip-vit-base-patch32")
# model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
# # 定义图像路径
# image_path = "images/photo-1.jpg"
# # 如果图像不存在,则从网上下载一张
# if not os.path.exists(image_path):
# image_url = "https://images.unsplash.com/photo-1501594907352-04cda38ebc29"
# response = requests.get(image_url)
# with open(image_path, "wb") as f:
# f.write(response.content)
# # 加载图像并执行预处理
# image = Image.open(image_path).convert("RGB")
# # 修改 1:将 PIL 图像转换为 NumPy 数组,并确保数据类型为 uint8
# image_array = np.array(image).astype(np.uint8)
# # 修改 2:检查图像数组的形状,如果需要,调整为 (H, W, C) 或 (H, W)
# image_shape = image_array.shape
# if len(image_shape) == 3 and image_shape[2] == 1:
# image_array = image_array.reshape(image_shape[0], image_shape[1])
# elif len(image_shape) == 4:
# image_array = image_array.transpose(1, 2, 0)
# # 修改 3:将 NumPy 数组转回 PIL 图像
# image = Image.fromarray(image_array)
# # 对图像进行预处理
# image_tensor = preprocess(image)
# # 修改:将图像张量从 (C, H, W) 转换为 (H, W, C),然后再转回 (C, H, W)
# image_np = image_tensor.numpy().transpose(1, 2, 0)
# image_np = image_np.astype(np.uint8) # 确保数据类型为 uint8
# image_tensor = torch.from_numpy(image_np.transpose(2, 0, 1))
# image_batch = image_tensor.unsqueeze(0)
# # 定义输入文本
# texts = ["a cat", "a dog"]
# # 预处理输入(文本和图像)
# text_inputs = processor(text=texts, return_tensors="pt", padding=True)
# image_inputs = feature_extractor(images=image_batch, return_tensors="pt")
# # 获取特征
# with torch.no_grad():
# outputs = model(text_inputs, image_inputs)
# image_features = outputs.image_embeds
# text_features = outputs.text_embeds
# # 计算文本和图像特征之间的余弦相似度
# similarity = text_features.numpy().dot(image_features.numpy().T)
# similarity = torch.from_numpy(similarity)
# # 输出相似度
# print("Cosine similarity between the image and text inputs:", similarity.numpy())
# from transformers import CLIPProcessor, CLIPModel
# import torch
# processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')
# model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
# text = ["Hello, my dog is cute", "I love my cat"]
# image = torch.randn(2, 3, 224, 224)
# inputs = processor(text=text, images=image, return_tensors="pt", padding=True)
# outputs = model(**inputs)
# print(outputs.logits.shape)
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image_path = "images/photo-1.jpg"
image = Image.open(image_path)
text = ["这是一只狗", "这是一只猫", "这是一辆汽车"]
inputs = processor(text=text, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
正确代码输出如下:
(myenv) root@d27d1ff1836c:/home/model_test# python3 CLIP.py
ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.
/home/model_test/myenv/lib/python3.8/site-packages/transformers/feature_extraction_utils.py:158: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at …/torch/csrc/utils/tensor_new.cpp:201.)
tensor = as_tensor(value)
tensor([[0.6413, 0.2198, 0.1389]], grad_fn=)
newbing prompt:
请使用huggingface提供的transformer中的多模态模型CLIP,用简洁的代码实现CLIP模型的核心功能
已发送电子邮件. 请使用huggingface提供的transformer中的多模态模型CLIP,用简洁的代码实现CLIP模型的核心功能
您好,这是Bing。您可以使用Hugging Face提供的Transformers库中的CLIP模型来实现您所需的功能。下面是一个简单的示例代码,它展示了如何使用Transformers库中的CLIP模型来实现其核心功能:
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = ... # 加载图像
text = ... # 输入文本
inputs = processor(text=text, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
希望这对您有所帮助!
我希望你给我一个可以运行的代码,而不是伪代码,我图像的路径是image_path = “images/photo-1.jpg”,而文本你可以随机找几个句子
已发送电子邮件. 我希望你给我一个可以运行的代码,而不是伪代码,我图像的路径是image_path = “images/photo-1.jpg”,而文本你可以随机找几个句子
您好,这是Bing。下面是一个可以运行的示例代码,它展示了如何使用Transformers库中的CLIP模型来实现其核心功能:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image_path = "images/photo-1.jpg"
image = Image.open(image_path)
text = ["这是一只狗", "这是一只猫", "这是一辆汽车"]
inputs = processor(text=text, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
请注意,您需要安装Transformers库和Pillow库才能运行此代码。希望这对您有所帮助!