<PIL.Image.Image image mode=RGB size=409x273 at 0x11139327650>
reshape position embedding from 196 to 576
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth
caption: a woman and her dog on the beach
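The log line `reshape position embedding from 196 to 576` is consistent with a vision transformer that cuts the image into 16x16 patches: a 224-pixel input gives 14 x 14 = 196 patch positions, while a 384-pixel input gives 24 x 24 = 576. The sketch below only reproduces that arithmetic. The patch size of 16 and the two resolutions are assumptions inferred from the numbers in the log, and `num_patches` is a hypothetical helper, not part of the BLIP code.

```python
def num_patches(image_size: int, patch_size: int = 16) -> int:
    """Count the non-overlapping square patches in a square image.

    patch_size=16 is an assumption inferred from the log, not read from BLIP.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size  # patches along one edge
    return side * side


# Assumed resolutions: 224 for the pretrained checkpoint, 384 for this demo.
pretrain_positions = num_patches(224)
demo_positions = num_patches(384)
print(pretrain_positions, demo_positions)  # prints "196 576"
```

Because the number of positions changes with the input resolution, the checkpoint's position embeddings have to be resized before they fit the larger patch grid, which is what the log message reports.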
model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth'

model = blip_vqa(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)

question = 'where is the woman sitting?'

with torch.no_grad():
    answer = model(image, question, train=False, inference='generate')
    print('answer: '+answer[0])
<PIL.Image.Image image mode=RGB size=409x273 at 0x11139109090>
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth
answer: on beach
model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_retrieval_coco.pth'

model = blip_itm(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device=device)

caption = 'a woman sitting on the beach with a dog'

print('text: %s' %caption)

itm_output = model(image,caption,match_head='itm')
itm_score = torch.nn.functional.softmax(itm_output,dim=1)[:,1]
print('The image and text is matched with a probability of %.4f'%itm_score)

itc_score = model(image,caption,match_head='itc')
print('The image feature and text feature has a cosine similarity of %.4f'%itc_score)
<PIL.Image.Image image mode=RGB size=409x273 at 0x11130227FD0>
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_retrieval_coco.pth
text: a woman sitting on the beach with a dog
The image and text is matched with a probability of 0.9960
The image feature and text feature has a cosine similarity of 0.5262
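The two scores above are computed differently. The `itm` head returns two logits per image-text pair, and the code takes the softmax entry at index 1 as the match probability. The `itc` head returns a cosine similarity between an image feature and a text feature. The sketch below mirrors both calculations in plain Python. All numbers are made up for illustration, and `softmax` and `cosine_similarity` are hypothetical helpers, not BLIP functions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up ITM logits in the order [no match, match].
itm_logits = [-2.0, 3.5]
match_probability = softmax(itm_logits)[1]

# Made-up low-dimensional stand-ins for the image and text features.
image_feature = [0.2, 0.9, 0.4]
text_feature = [0.1, 0.7, 0.7]
similarity = cosine_similarity(image_feature, text_feature)

print('match probability: %.4f' % match_probability)
print('cosine similarity: %.4f' % similarity)
```

The match probability lies in [0, 1], while a cosine similarity lies in [-1, 1], so values such as 0.9960 and 0.5262 are on different scales and are not directly comparable.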