Blip2Caption
- class mmpretrain.models.multimodal.Blip2Caption(vision_backbone, text_backbone, multimodal_backbone, vision_neck, tokenizer=None, prompt='', max_txt_len=20, num_captions=1, data_preprocessor=None, init_cfg=None)
- BLIP2 Caption.
- Module for BLIP2 Caption task.
- Parameters:
- vision_backbone (dict) – The config dict for vision backbone. 
- text_backbone (dict) – The config dict for text backbone. 
- multimodal_backbone (dict) – The config dict for multimodal backbone. 
- vision_neck (dict) – The config dict for vision neck. 
- tokenizer (Optional[dict]) – The config for tokenizer. Defaults to None. 
- prompt (str) – Prompt used for training and eval. Defaults to ‘’. 
- max_txt_len (int) – Max text length of input text. Defaults to 20. 
- num_captions (int) – Number of captions to be generated for each image. Defaults to 1. 
- data_preprocessor (Optional[dict]) – The config for preprocessing input data. If None or no specified type, it will use “MultiModalDataPreprocessor” as type. See MultiModalDataPreprocessor for more details. Defaults to None.
- init_cfg (Optional[dict]) – The config to control the initialization. Defaults to None. 
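
The parameters above map onto a model config dict. The following is a minimal sketch, assuming the usual `type`-keyed config convention; only the top-level keys come from the signature above. Every nested `type` value and the prompt text are hypothetical placeholders, not names taken from the library.

```python
# Sketch of a Blip2Caption model config.
# Top-level keys mirror the constructor signature documented above.
# All nested 'type' values are HYPOTHETICAL placeholders: substitute the
# registered module names used by your own setup.
model = dict(
    type='Blip2Caption',
    vision_backbone=dict(type='YourVisionBackbone'),          # placeholder
    text_backbone=dict(type='YourTextBackbone'),              # placeholder
    multimodal_backbone=dict(type='YourMultimodalBackbone'),  # placeholder
    vision_neck=dict(type='YourVisionNeck'),                  # placeholder
    tokenizer=dict(type='YourTokenizer'),                     # placeholder
    prompt='a photo of',   # example prompt (assumption); default is ''
    max_txt_len=20,        # documented default
    num_captions=1,        # documented default
    data_preprocessor=None,  # falls back to MultiModalDataPreprocessor
    init_cfg=None,
)

# The four backbone/neck configs have no default and must be supplied.
required = ('vision_backbone', 'text_backbone',
            'multimodal_backbone', 'vision_neck')
missing = [key for key in required if key not in model]
```

Leaving `data_preprocessor` as None relies on the documented fallback to “MultiModalDataPreprocessor”.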
 
 - forward(images, data_samples=None, mode='loss')
- The unified entry for a forward process in both training and test. The method accepts two modes, “predict” and “loss”:
- “predict”: Forward and return the predictions, which are fully processed to a list of DataSample.
- “loss”: Forward and return a dict of losses according to the given inputs and data samples. 
 - Note that this method handles neither back propagation nor optimizer updating, which are done in train_step().
- Parameters:
- images (torch.Tensor) – Preprocessed image tensor with shape (N, C, …). 
- data_samples (List[DataSample], optional) – The annotation data of every sample. Defaults to None. 
- mode (str) – Return what kind of value. Defaults to ‘loss’. 
 
- Returns:
- The return type depends on mode. If mode="loss", return a dict of tensors. If mode="predict", return a list of DataSample.
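
The mode dispatch described for forward() can be illustrated with a toy stand-in. This is a sketch under stated assumptions, not the library's implementation: `ToyCaptioner`, the `'caption'` key, the plain-dict samples, and the error raised for an unknown mode are all hypothetical, and no real tensors or models are involved.

```python
from typing import Dict, List, Optional


class ToyCaptioner:
    """Hypothetical stand-in that mirrors the two documented modes."""

    def loss(self, images, data_samples: Optional[list] = None,
             **kwargs) -> Dict[str, float]:
        # A real model would compute training losses here.
        return {'loss': 0.0}

    def predict(self, images, data_samples: Optional[list] = None,
                **kwargs) -> List[dict]:
        # A real model would generate captions here; plain dicts stand in
        # for DataSample objects, and 'caption' is a placeholder key.
        return [{'caption': ''} for _ in images]

    def forward(self, images, data_samples: Optional[list] = None,
                mode: str = 'loss'):
        # Route to loss() or predict(); no backward pass or optimizer
        # step happens here.
        if mode == 'loss':
            return self.loss(images, data_samples)
        if mode == 'predict':
            return self.predict(images, data_samples)
        # Assumed behavior for unsupported modes.
        raise RuntimeError(f'Invalid mode "{mode}".')


model = ToyCaptioner()
batch = [object(), object()]  # two dummy "images"
losses = model.forward(batch, mode='loss')
preds = model.forward(batch, mode='predict')
```

With this shape, a training loop only ever sees the dict of losses, while evaluation code only sees the list of per-image samples.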
 
 - loss(images, data_samples=None, **kwargs)
- The forward function in training.
- Parameters:
- images (torch.Tensor) – The input tensor with shape (N, C, …) in general. 
- data_samples (List[DataSample], optional) – The annotation data of every sample. Defaults to None. 
- **kwargs – Other keyword arguments accepted by the loss method of head.
 
- Returns:
- A dictionary of loss components. 
- Return type:
- Dict[str, torch.Tensor] 
 
 - predict(images, data_samples=None, **kwargs)
- Predict captions from a batch of inputs.
- Parameters:
- images (torch.Tensor) – The input tensor with shape (N, C, …) in general. 
- data_samples (List[DataSample], optional) – The annotation data of every sample. Defaults to None. 
- **kwargs – Other keyword arguments accepted by the predict method of head.
 
- Returns:
- Return list of data samples. 
- Return type:
- List[DataSample]