GitHub Repo

BLIP, short for Bootstrapping Language-Image Pre-training, is a pre-training framework that unifies vision-language understanding and generation.

It covers both the understanding and the generation sides of vision-language tasks, and it uses a built-in Captioner and Filter to remove noisy captions from web-crawled data, which improves performance on downstream vision-language tasks.

Running the Colab Notebook

Salesforce provides a demo that can be run directly in the cloud: Colab notebook

In the first step (environment setup), the error ERROR: Failed building wheel for tokenizers appears; a comment under Issue #151 provides the fix: pin the transformers version to 4.25.1.

Running Locally

To learn how to call the pretrained BLIP models locally, the test code below is placed under the BLIP directory.

Experiment environment (a quick version check is sketched after the list):

  • Python 3.11.11 on conda
  • transformers 4.25.1
  • timm 0.4.12
  • fairscale 0.4.4
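
As a quick sanity check that these pinned versions (including the transformers fix from Issue #151) are in place, a minimal sketch that only assumes the packages listed above are installed:

from importlib.metadata import version

# Versions this walkthrough was tested with (see the list above); adjust if yours differ.
expected = {"transformers": "4.25.1", "timm": "0.4.12", "fairscale": "0.4.4"}

for pkg, want in expected.items():
    have = version(pkg)
    print(f"{pkg}: {have}" + ("" if have == want else f" (expected {want})"))

The shared setup and the helper that downloads and preprocesses the demo image, used by all three tasks below: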
from PIL import Image
import requests
import torch
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def load_demo_image(image_size, device):
    # Download the official demo image and preprocess it for BLIP
    img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
    raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

    w, h = raw_image.size
    print(raw_image.resize((w//5, h//5)))

    transform = transforms.Compose([
        transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
    ])
    image = transform(raw_image).unsqueeze(0).to(device)
    return image
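
If you want to run the tasks below on your own picture instead of the hosted demo image, a small variant of the helper could look like this (a sketch; load_local_image and the file path are made up for illustration):

def load_local_image(path, image_size, device):
    # Same preprocessing as load_demo_image, but reads an image file from disk.
    raw_image = Image.open(path).convert('RGB')
    transform = transforms.Compose([
        transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
    ])
    return transform(raw_image).unsqueeze(0).to(device)

# e.g. image = load_local_image('my_photo.jpg', image_size=384, device=device)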

Image Captioning

from models.blip import blip_decoder

image_size = 384
image = load_demo_image(image_size=image_size, device=device)

model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth'

model = blip_decoder(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)

with torch.no_grad():
    # beam search
    caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
    # nucleus sampling
    # caption = model.generate(image, sample=True, top_p=0.9, max_length=20, min_length=5)
    print('caption: '+caption[0])
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
Moving 0 files to the new cache system
0it [00:00, ?it/s]
<PIL.Image.Image image mode=RGB size=409x273 at 0x11139327650>
reshape position embedding from 196 to 576
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth
caption: a woman and her dog on the beach
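
Beam search returns a single high-likelihood caption; to see more varied outputs you can switch to the nucleus-sampling call commented out above and draw several samples (a sketch under the same setup; the number of samples is arbitrary):

with torch.no_grad():
    # Nucleus sampling is stochastic, so each call can return a different caption.
    for _ in range(3):
        caption = model.generate(image, sample=True, top_p=0.9, max_length=20, min_length=5)
        print('sampled caption: ' + caption[0])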

Visual Question Answering

The same transformers version issue seen with beam search in Image Captioning shows up here as well, so the same fix (transformers 4.25.1) applies.

from models.blip_vqa import blip_vqa

image_size = 480
image = load_demo_image(image_size=image_size, device=device)

model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth'

model = blip_vqa(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)

question = 'where is the woman sitting?'

with torch.no_grad():
    answer = model(image, question, train=False, inference='generate')
    print('answer: '+answer[0])
<PIL.Image.Image image mode=RGB size=409x273 at 0x11139109090>
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth
answer: on beach
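
The same loaded VQA model can answer several questions about the image in one loop (a sketch; the extra questions are made up for illustration):

questions = [
    'where is the woman sitting?',
    'what animal is in the picture?',
    'what color is the sand?',
]

with torch.no_grad():
    for q in questions:
        answer = model(image, q, train=False, inference='generate')
        print(q, '->', answer[0])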

Image-Text Matching

from models.blip_itm import blip_itm

image_size = 384
image = load_demo_image(image_size=image_size,device=device)

model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_retrieval_coco.pth'

model = blip_itm(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device=device)

caption = 'a woman sitting on the beach with a dog'

print('text: %s' %caption)

itm_output = model(image,caption,match_head='itm')
itm_score = torch.nn.functional.softmax(itm_output,dim=1)[:,1]
print('The image and text is matched with a probability of %.4f'%itm_score)

itc_score = model(image,caption,match_head='itc')
print('The image feature and text feature has a cosine similarity of %.4f'%itc_score)
<PIL.Image.Image image mode=RGB size=409x273 at 0x11130227FD0>
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_retrieval_coco.pth
text: a woman sitting on the beach with a dog
The image and text is matched with a probability of 0.9960
The image feature and text feature has a cosine similarity of 0.5262
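
The same ITM head can also be used to rank several candidate captions for one image by their match probability (a sketch; the candidate texts are made up for illustration):

candidates = [
    'a woman sitting on the beach with a dog',
    'two men playing basketball indoors',
    'a plate of food on a wooden table',
]

with torch.no_grad():
    for text in candidates:
        itm_output = model(image, text, match_head='itm')
        score = torch.nn.functional.softmax(itm_output, dim=1)[:, 1].item()
        print('%.4f  %s' % (score, text))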