Llava
- class mmpretrain.models.multimodal.Llava(vision_encoder, lang_encoder, tokenizer, mm_hidden_size, prompt_tmpl, task='caption', use_im_patch=True, use_im_start_end=False, mm_vision_select_layer=-1, mm_proj_depth=1, generation_cfg={}, load_lang_pretrained=False, data_preprocessor=None, init_cfg=None)
- The LLaVA model for multiple tasks.
- Parameters:
- vision_encoder (dict) – The config of the vision encoder. 
- lang_encoder (dict) – The config of the language encoder. 
- tokenizer (dict) – The tokenizer to encode the text. 
- prompt_tmpl (str) – Prompt template for inference. 
- task (str) – The task to perform prediction on. Defaults to 'caption'. 
- use_im_start_end (bool) – Whether to use the im_start and im_end tokens. Defaults to False. 
- mm_vision_select_layer (int) – The index from vision encoder output. Defaults to -1. 
- mm_proj_depth (int) – The number of linear layers for multi-modal projection. Defaults to 1. 
- load_lang_pretrained (bool) – Whether to load the pretrained model of language encoder. Defaults to False. 
- generation_cfg (dict) – The extra generation config, accept the keyword arguments of [~`transformers.GenerationConfig`]. Defaults to an empty dict. 
- data_preprocessor (Optional[dict]) – The config for preprocessing input data. If None, or if no type is specified, “MultiModalDataPreprocessor” is used as the type. See MultiModalDataPreprocessor for more details. Defaults to None.
- init_cfg (dict, optional) – The initialization config. Defaults to None. 
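The constructor arguments above are usually supplied through a config dict. The following is a minimal sketch of such a dict, not a working model config: the `type` keys follow the OpenMMLab registry convention, and every value (the encoder types, the `path/to/llm` paths, the hidden size, the prompt string) is a placeholder assumption rather than something taken from this page. The helper `check_llava_cfg` is hypothetical and only compares the keys against the signature shown above.

```python
# Argument names copied from the Llava signature documented above.
DOCUMENTED_ARGS = {
    'vision_encoder', 'lang_encoder', 'tokenizer', 'mm_hidden_size',
    'prompt_tmpl', 'task', 'use_im_patch', 'use_im_start_end',
    'mm_vision_select_layer', 'mm_proj_depth', 'generation_cfg',
    'load_lang_pretrained', 'data_preprocessor', 'init_cfg',
}
# Arguments that have no default value in the signature.
REQUIRED_ARGS = {
    'vision_encoder', 'lang_encoder', 'tokenizer', 'mm_hidden_size',
    'prompt_tmpl',
}

# Placeholder config: all values are illustrative assumptions.
model = dict(
    type='Llava',
    vision_encoder=dict(type='VisionTransformer'),       # assumed encoder type
    lang_encoder=dict(type='AutoModelForCausalLM',
                      name_or_path='path/to/llm'),       # placeholder path
    tokenizer=dict(type='AutoTokenizer',
                   name_or_path='path/to/llm'),          # placeholder path
    mm_hidden_size=1024,                                 # assumed size
    prompt_tmpl='PLACEHOLDER PROMPT',                    # format not documented here
    task='caption',
    mm_vision_select_layer=-1,
    mm_proj_depth=1,
    generation_cfg=dict(max_new_tokens=50),
)


def check_llava_cfg(cfg):
    """Return (unknown_keys, missing_required_keys) for a config dict."""
    keys = set(cfg) - {'type'}  # 'type' selects the class, it is not an argument
    unknown = sorted(keys - DOCUMENTED_ARGS)
    missing = sorted(REQUIRED_ARGS - keys)
    return unknown, missing


unknown, missing = check_llava_cfg(model)
print('unknown keys:', unknown)
print('missing required keys:', missing)
```

A check like this catches misspelled argument names before the model is built, which is useful because the dict is otherwise only validated when the class is instantiated.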
 
 - forward(images, data_samples=None, mode='loss')
- The unified entry for a forward process in both training and test.
- “predict”: Forward and return the predictions, which are fully processed to a list of DataSample.
- “loss”: Forward and return a dict of losses according to the given inputs and data samples. 
 - Note that this method handles neither back propagation nor optimizer updating, which are done in train_step().
- Parameters:
- images (torch.Tensor) – The input image tensor with different ndim according to the inputs. 
- data_samples (List[DataSample], optional) – The annotation data of every sample. It’s required if mode="loss". Defaults to None.
- mode (str) – Which kind of value to return. Defaults to ‘loss’. 
 
- Returns:
- The return type depends on mode. If mode="loss", return a dict of tensors.
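The mode-based dispatch described for forward() can be illustrated with a small stand-in class. This is a sketch of the general pattern only: `ToyModel`, its `loss` method, and the returned values are hypothetical, and nothing here shows which branches the real Llava class actually implements.

```python
class ToyModel:
    """Stand-in that mimics the documented forward(images, data_samples, mode)."""

    def loss(self, images, data_samples):
        # A real model would compute losses; here we return a fixed dict.
        return {'loss': 0.0}

    def predict(self, images, data_samples=None):
        # A real model would return a list of DataSample objects.
        return [f'prediction for {image}' for image in images]

    def forward(self, images, data_samples=None, mode='loss'):
        if mode == 'loss':
            if data_samples is None:
                raise ValueError('data_samples is required when mode="loss"')
            return self.loss(images, data_samples)
        if mode == 'predict':
            return self.predict(images, data_samples)
        raise RuntimeError(f'Invalid mode "{mode}".')


toy = ToyModel()
print(toy.forward(['img0', 'img1'], mode='predict'))
print(toy.forward(['img0'], data_samples=['sample0'], mode='loss'))
```

Keeping a single entry point means the training loop and the test loop can call the model the same way and differ only in the `mode` argument.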
 
 - post_process(outputs, data_samples)
- Perform post-processing of the outputs for different tasks.
- Parameters:
- outputs (torch.Tensor) – The generated outputs. 
- data_samples (List[DataSample], optional) – The annotation data of every sample. 
 
- Returns:
- Return list of data samples. 
- Return type:
- List[DataSample] 
 
 - predict(images, data_samples=None, **generation_cfg)
- Predict generation results from a batch of inputs.
- Parameters:
- images (torch.Tensor) – For zero-shot inputs, the image tensor has shape (B, C, H, W); for few-shot inputs, it generally has shape (B, T_img, C, H, W). Images in the same chunk are collated along T_img. Video data is not supported yet. 
- data_samples (List[DataSample], optional) – The annotation data of every sample. Defaults to None. 
- **generation_cfg – Other keyword arguments accepted by the generate method of lang_encoder.
 
- Returns:
- Return list of data samples. 
- Return type:
- List[DataSample] 
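Both the constructor and predict() accept generation settings, so the two sources have to be combined at call time. The sketch below assumes that keyword arguments passed to predict() take precedence over the constructor's `generation_cfg`; that precedence is an assumption, not something this page states. `merge_generation_cfg` is a hypothetical helper, while `max_new_tokens` and `num_beams` are standard `transformers.GenerationConfig` fields.

```python
def merge_generation_cfg(default_cfg, **call_cfg):
    """Combine constructor-level defaults with call-time overrides.

    Assumption: call-time values win when a key appears in both.
    A new dict is returned so the defaults are left untouched.
    """
    return {**default_cfg, **call_cfg}


defaults = dict(max_new_tokens=50)  # e.g. given as generation_cfg at construction
merged = merge_generation_cfg(defaults, max_new_tokens=20, num_beams=3)
print(merged)
```

Building a new dict instead of updating the defaults in place keeps one predict() call from changing the settings used by the next.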
 
 - preprocess_text(data_samples, device)
- Preprocess text before it is fed into the language model.
- Parameters:
- data_samples (List[DataSample]) – The annotation data of every sample. 
- device (torch.device) – Device for text to put on. 
 
- Returns:
- Return list of data samples. 
- Return type:
- List[DataSample]