BEiTPretrainViT
- class mmpretrain.models.selfsup.BEiTPretrainViT(arch='base', img_size=224, patch_size=16, in_channels=3, out_indices=-1, drop_rate=0, drop_path_rate=0, norm_cfg={'eps': 1e-06, 'type': 'LN'}, final_norm=True, out_type='raw', frozen_stages=-1, use_abs_pos_emb=False, use_rel_pos_bias=False, use_shared_rel_pos_bias=True, layer_scale_init_value=0.1, interpolate_mode='bicubic', patch_cfg={'padding': 0}, layer_cfgs={}, init_cfg=None)
- Vision Transformer for BEiT pre-training.
- Parameters:
- arch (str | dict) – Vision Transformer architecture. If a string is given, choose from ‘small’, ‘base’ and ‘large’. If a dict is given, it should have the following keys:
- embed_dims (int): The dimensions of embedding. 
- num_layers (int): The number of transformer encoder layers. 
- num_heads (int): The number of heads in attention modules. 
- feedforward_channels (int): The hidden dimensions in feedforward modules. 
 - Defaults to ‘base’. 
- img_size (int | tuple) – The expected input image shape. Because we support dynamic input shape, just set the argument to the most common input image shape. Defaults to 224. 
- patch_size (int | tuple) – The patch size in patch embedding. Defaults to 16. 
- in_channels (int) – The num of input channels. Defaults to 3. 
- out_indices (Sequence | int) – The stages from which to take output. Defaults to -1, which means the last stage. 
- drop_rate (float) – Probability of an element to be zeroed. Defaults to 0. 
- drop_path_rate (float) – Stochastic depth rate. Defaults to 0. 
- qkv_bias (bool) – Whether to add bias for qkv in attention modules. Defaults to True. 
- norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN', eps=1e-6).
- final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to True. 
- out_type (str) – The type of output features. Please choose from:
- "cls_token": The class token tensor with shape (B, C).
- "featmap": The feature map tensor from the patch tokens with shape (B, C, H, W).
- "avg_featmap": The globally averaged feature map tensor with shape (B, C).
- "raw": The raw feature tensor, which includes patch tokens and class tokens, with shape (B, L, C).
 - It only works without an input mask. Defaults to "raw".
- with_cls_token (bool) – Whether concatenating class token into image tokens as transformer input. Defaults to True. 
- frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1. 
- use_abs_pos_emb (bool) – Whether to use absolute position embedding. Defaults to False. 
- use_rel_pos_bias (bool) – Whether to use relative position bias. Defaults to False. 
- use_shared_rel_pos_bias (bool) – Whether to use shared relative position bias. Defaults to True. 
- layer_scale_init_value (float) – The initialization value for the learnable scaling of attention and FFN. Defaults to 0.1. 
- interpolate_mode (str) – The interpolation mode used to resize the position embedding vector. Defaults to “bicubic”. 
- patch_cfg (dict) – Configs of patch embedding. Defaults to dict(padding=0). 
- layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict. 
- init_cfg (dict, optional) – Initialization config dict. Defaults to None. 
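As an illustration of the parameters above, the following is a minimal sketch of a registry-style backbone config that passes `arch` as a dict. It is plain Python and does not import mmpretrain. The `check_arch` helper and the values in `custom_arch` are hypothetical; they only mirror the choices and required keys listed in the `arch` description, and the real class may accept more than this check allows.

```python
# Keys an `arch` dict must provide, as listed in the parameter description.
REQUIRED_ARCH_KEYS = {
    'embed_dims', 'num_layers', 'num_heads', 'feedforward_channels'
}
# String choices named in the parameter description.
DOCUMENTED_ARCH_NAMES = {'small', 'base', 'large'}


def check_arch(arch):
    """Hypothetical helper (not part of mmpretrain): validate an `arch` value
    against the documented string choices or required dict keys."""
    if isinstance(arch, str):
        if arch not in DOCUMENTED_ARCH_NAMES:
            raise ValueError(f'unknown arch name: {arch!r}')
        return arch
    missing = REQUIRED_ARCH_KEYS - set(arch)
    if missing:
        raise KeyError(f'arch dict is missing keys: {sorted(missing)}')
    return arch


# Illustrative values only; they are not a preset shipped with the library.
custom_arch = dict(
    embed_dims=384,
    num_layers=6,
    num_heads=6,
    feedforward_channels=1536,
)

# Config dict in the usual `type=...` style; the remaining keyword values
# repeat the defaults shown in the class signature.
backbone_cfg = dict(
    type='BEiTPretrainViT',
    arch=check_arch(custom_arch),
    img_size=224,
    patch_size=16,
    out_type='raw',
    use_abs_pos_emb=False,
    use_rel_pos_bias=False,
    use_shared_rel_pos_bias=True,
    layer_scale_init_value=0.1,
)
```

Checking the dict before it reaches the model constructor surfaces a missing key early, which can be easier to debug than an error raised while the config is being built.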
 
 - forward(x, mask)
- The BEiT style forward function.
- The function supports two kinds of forward behavior. If mask is not None, the forward function runs masked image modeling pre-training; if mask is None, it calls super().forward(), which extracts features from images without a mask.
- Parameters:
- x (torch.Tensor) – Input images, which is of shape (B x C x H x W). 
- mask (torch.Tensor, optional) – Mask for input, which is of shape (B x patch_resolution[0] x patch_resolution[1]). 
 
- Returns:
- Hidden features. 
- Return type:
- Tuple[torch.Tensor]
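To make the expected `mask` shape concrete, here is a small pure-Python sketch that derives the patch grid from `img_size` and `patch_size` and builds a boolean mask of shape (B, patch_resolution[0], patch_resolution[1]) as nested lists. Both helpers are hypothetical and not part of mmpretrain. The sketch assumes non-overlapping patches with no padding, and it uses uniform random masking with an arbitrary ratio as a stand-in for whatever mask generator the pre-training pipeline actually uses.

```python
import random


def patch_resolution(img_size, patch_size):
    """Hypothetical helper: patch-grid size, assuming non-overlapping
    patches and no padding."""
    def to_pair(value):
        return (value, value) if isinstance(value, int) else tuple(value)

    (height, width) = to_pair(img_size)
    (patch_h, patch_w) = to_pair(patch_size)
    return height // patch_h, width // patch_w


def random_mask(batch_size, resolution, mask_ratio=0.4, seed=0):
    """Hypothetical helper: one boolean grid per sample, True = masked.

    Uniform random masking is an assumption made for illustration only.
    """
    rows, cols = resolution
    num_patches = rows * cols
    num_masked = int(num_patches * mask_ratio)
    rng = random.Random(seed)
    masks = []
    for _ in range(batch_size):
        flat = [True] * num_masked + [False] * (num_patches - num_masked)
        rng.shuffle(flat)
        # Reshape the flat list into a rows x cols grid.
        masks.append([flat[r * cols:(r + 1) * cols] for r in range(rows)])
    return masks


resolution = patch_resolution(img_size=224, patch_size=16)
mask = random_mask(batch_size=2, resolution=resolution)
print(resolution, len(mask), len(mask[0]), len(mask[0][0]))
```

With PyTorch and mmpretrain installed, such a mask would typically be converted to a boolean tensor and passed as the second argument of `forward`; that step is not run here.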