Blip2Retrieval
- class mmpretrain.models.multimodal.Blip2Retrieval(vision_backbone, text_backbone=None, multimodal_backbone=None, vision_neck=None, text_neck=None, head=None, multimodal_head=None, tokenizer=None, temperature=0.07, fast_match=False, topk=256, data_preprocessor=None, init_cfg=None)
- BLIP2 Retriever.
- Parameters:
- vision_backbone (dict) – Backbone for extracting image features. 
- text_backbone (Optional[dict]) – Backbone for extracting text features. Defaults to None. 
- multimodal_backbone (Optional[dict]) – Backbone for extracting multi-modal features. Defaults to None. 
- vision_neck (Optional[dict]) – The neck module to process image features from vision backbone. Defaults to None. 
- text_neck (Optional[dict]) – The neck module to process text features from text backbone. Defaults to None. 
- head (Optional[Union[List[dict], dict]]) – The head module to calculate loss from processed single-modality features. See mmmultimodal.models.heads. Note that if the head is not set, the loss method cannot be used. Defaults to None.
- multimodal_head (Optional[Union[List[dict], dict]]) – The multi-modal head module to calculate loss from processed multi-modal features. See mmmultimodal.models.heads. Note that if the head is not set, the loss method cannot be used. Defaults to None.
- tokenizer (Optional[dict]) – The config for tokenizer. Defaults to None. 
- temperature (float) – Temperature parameter that controls the concentration level of the distribution. Defaults to 0.07. 
- fast_match (bool) – If False, select the topk most similar entries as candidates and compute the matching score for them. If True, return the similarity directly as the matching score. Defaults to False. 
- topk (int) – Number of most similar entries selected as candidates for computing matching scores. Note that this is not the topk used in evaluation. Defaults to 256. 
- data_preprocessor (Optional[dict]) – The config for preprocessing input data. If None, or if no type is specified, “MultiModalDataPreprocessor” is used as the type. See MultiModalDataPreprocessor for more details. Defaults to None.
- init_cfg (Optional[dict]) – The config to control the initialization. Defaults to None. 
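The effect of the temperature parameter can be illustrated with a small, dependency-free sketch. It assumes the temperature divides the similarity values before a softmax, as is common in contrastive learning; how Blip2Retrieval applies it internally is not stated on this page, and the helper name softmax_with_temperature is hypothetical.

```python
import math


def softmax_with_temperature(similarities, temperature=0.07):
    """Turn raw similarities into a probability distribution.

    Assumption: the temperature divides the similarities before the
    softmax. A smaller temperature gives a more concentrated result.
    """
    scaled = [s / temperature for s in similarities]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


sims = [0.30, 0.25, 0.10]  # illustrative similarity values
sharp = softmax_with_temperature(sims, temperature=0.07)
flat = softmax_with_temperature(sims, temperature=1.0)
```

With the default value of 0.07 the best match receives a much larger share of the probability mass than with a temperature of 1.0, which is what "concentration level" refers to.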
 
 - compute_score_matrix_i2t(img_feats, img_embeds, text_feats, text_ids, text_atts)
- Compute the score matrix for image-to-text retrieval. Every image is compared to all the text features.
- Parameters:
- img_feats (torch.Tensor) – The input tensor with shape (M, C). M stands for the number of samples on a single GPU. 
- img_embeds (List[torch.Tensor]) – Image features from each layer of the vision backbone. 
- text_feats (torch.Tensor) – The input tensor with shape (N, C). N stands for the total number of samples across all GPUs. 
- text_ids (torch.Tensor) – The input tensor with shape (N, C). 
- text_atts (torch.Tensor) – The input tensor with shape (N, C). 
 
- Returns:
- Score matrix of image-to-text retrieval. 
- Return type:
- torch.Tensor
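The two-stage behaviour controlled by fast_match and topk can be sketched in plain Python. This is only an illustration under stated assumptions: match_fn stands in for the model's matching head, the fill value of -100.0 for non-candidates is assumed, and nested lists replace tensors so the snippet runs without PyTorch.

```python
def score_matrix_i2t(sim, match_fn, topk=256, fast_match=False, fill=-100.0):
    """Build an M x N image-to-text score matrix.

    sim:      M x N nested list of image-to-text similarities.
    match_fn: hypothetical callable (i, j) -> matching score.
    fill:     assumed score for texts that are not among the candidates.
    """
    if fast_match:
        # Use the similarity directly as the matching score.
        return [row[:] for row in sim]

    scores = []
    for i, row in enumerate(sim):
        k = min(topk, len(row))
        # Indices of the k texts most similar to image i.
        candidates = sorted(range(len(row)), key=lambda j: row[j],
                            reverse=True)[:k]
        out = [fill] * len(row)
        for j in candidates:
            out[j] = match_fn(i, j)  # re-score only the candidates
        scores.append(out)
    return scores


sim = [[0.9, 0.1, 0.5],
       [0.2, 0.8, 0.3]]
match = lambda i, j: 10.0 * sim[i][j]  # stand-in for a matching head
result = score_matrix_i2t(sim, match, topk=2)
```

Only the topk candidates per image are passed to the more expensive matching step, which is why topk here is unrelated to the topk used in evaluation metrics.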
 
 - compute_score_matrix_t2i(img_feats, img_embeds, text_feats, text_ids, text_atts)
- Compute the score matrix for text-to-image retrieval. Every text is compared to all the image features.
- Parameters:
- img_feats (torch.Tensor) – The input tensor with shape (N, C). N stands for the total number of samples across all GPUs. 
- img_embeds (List[torch.Tensor]) – Image features from each layer of the vision backbone. 
- text_feats (torch.Tensor) – The input tensor with shape (M, C). M stands for the number of samples on a single GPU. 
- text_ids (torch.Tensor) – The input tensor with shape (M, C). 
- text_atts (torch.Tensor) – The input tensor with shape (M, C). 
 
- Returns:
- Score matrix of text-to-image retrieval. 
- Return type:
- torch.Tensor
 
 - loss(images, data_samples=None)
- Calculate losses from a batch of inputs and data samples.
- Parameters:
- images (torch.Tensor) – A batch of input images. The tensor has shape (N, C, …) in general. 
- data_samples (Optional[List[DataSample]]) – The annotation data of every sample. Defaults to None. 
 
- Returns:
- A dictionary of loss components from both the head and the multimodal head. 
 
- Return type:
- Dict[str, torch.Tensor] 
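A returned dictionary of loss components is typically reduced to a single scalar by the training loop. The sketch below assumes the convention of summing every entry whose key contains "loss"; the key names itc_loss and itm_loss are hypothetical, and plain floats stand in for tensors.

```python
def total_loss(losses):
    """Sum every entry whose key contains 'loss' (assumed convention)."""
    return sum(value for key, value in losses.items() if 'loss' in key)


head_losses = {'itc_loss': 1.5}        # hypothetical key from the head
multimodal_losses = {'itm_loss': 0.5}  # hypothetical key from the multimodal head

# Merge the components of both heads into one dictionary.
losses = {**head_losses, **multimodal_losses}
total = total_loss(losses)
```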
 
 - predict_all(feats, data_samples, num_images=None, num_texts=None, cal_i2t=True, cal_t2i=True)
- Compute similarity matrix between images and texts across all ranks.
- Parameters:
- feats (Dict[str, torch.Tensor]) – Features from the current rank. 
- data_samples (List[DataSample]) – Data samples from the current rank. 
- num_images (int, optional) – Number of images to use. Defaults to None. 
- num_texts (int, optional) – Number of texts to use. Defaults to None. 
- cal_i2t (bool, optional) – Whether to compute image-to-text similarity. Defaults to True. 
- cal_t2i (bool, optional) – Whether to compute text-to-image similarity. Defaults to True. 
 
- Returns:
- Image-to-text and text-to-image similarity matrices. 
- Return type:
- Tuple[torch.Tensor, torch.Tensor]
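The role of num_images and num_texts can be pictured with a single-process sketch. It assumes that features gathered from all ranks may contain padded duplicates at the end, which are trimmed to the true dataset size. The helpers gather_and_trim and similarity_matrix are hypothetical, and nested lists replace tensors.

```python
def gather_and_trim(per_rank_feats, num_items=None):
    """Concatenate per-rank feature lists and drop assumed padding."""
    gathered = [feat for rank_feats in per_rank_feats for feat in rank_feats]
    if num_items is not None:
        gathered = gathered[:num_items]
    return gathered


def similarity_matrix(img_feats, text_feats):
    """Dot-product similarity between every image and every text."""
    return [[sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
            for img in img_feats]


rank0 = [[1.0, 0.0], [0.0, 1.0]]
rank1 = [[1.0, 1.0], [1.0, 0.0]]  # last entry plays the role of padding
images = gather_and_trim([rank0, rank1], num_items=3)
texts = [[1.0, 0.0], [0.0, 1.0]]

i2t = similarity_matrix(images, texts)
# For plain similarities, text-to-image is the transpose of image-to-text.
t2i = [list(col) for col in zip(*i2t)]
```

The transpose shortcut holds only when the similarity itself is used as the score; when candidates are re-scored per direction, the two matrices are computed separately.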