TNT
- class mmpretrain.models.backbones.TNT(arch='b', img_size=224, patch_size=16, in_channels=3, ffn_ratio=4, qkv_bias=False, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, act_cfg={'type': 'GELU'}, norm_cfg={'type': 'LN'}, first_stride=4, num_fcs=2, init_cfg=[{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}])
- Transformer in Transformer.
- A PyTorch implementation of: Transformer in Transformer
- Inspired by https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/tnt.py
- Parameters:
- arch (str | dict) – Vision Transformer architecture. Defaults to 'b'
- in_channels (int) – Number of input channels. Defaults to 3 
- ffn_ratio (int) – A ratio to calculate the hidden_dims in ffn layer. Default: 4 
- qkv_bias (bool) – Enable bias for qkv if True. Defaults to False
- drop_rate (float) – Probability of an element being zeroed after the feed forward layer. Defaults to 0.
- attn_drop_rate (float) – The dropout rate for the attention layer. Defaults to 0.
- drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
- act_cfg (dict) – The activation config for FFNs. Defaults to GELU. 
- norm_cfg (dict) – Config dict for the normalization layer. Defaults to layer normalization (LN)
- first_stride (int) – The stride of the conv2d layer. A conv2d layer followed by an unfold layer implements the image-to-pixel embedding. Defaults to 4
- num_fcs (int) – The number of fully-connected layers for FFNs. Defaults to 2
- init_cfg (dict, optional) – Initialization config dict. The default applies TruncNormal (std 0.02) to Linear layers and Constant (val 1.0, bias 0.0) to LayerNorm layers
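
To make the interaction between img_size, patch_size and first_stride more concrete, the sketch below estimates how many patch tokens and per-patch pixel embeddings the default settings would produce. It is plain Python arithmetic, not a call into mmpretrain. The helper `tnt_token_layout` is hypothetical, and the 7x7 kernel with padding 3 is an assumption borrowed from the timm implementation linked above, so treat the numbers as illustrative rather than authoritative.

```python
import math

# Config-style dict mirroring the documented defaults. In an mmpretrain
# config, a dict like this would typically sit under the model's
# `backbone` key.
backbone_cfg = dict(
    type='TNT',
    arch='b',
    img_size=224,
    patch_size=16,
    in_channels=3,
    first_stride=4,
)


def tnt_token_layout(img_size, patch_size, first_stride,
                     kernel_size=7, padding=3):
    """Hypothetical helper: estimate TNT token counts for a square image.

    ASSUMPTION: the pixel-embedding conv2d uses a 7x7 kernel with
    padding 3 (as in the timm implementation); this page does not
    state the kernel size or padding.
    """
    # Outer tokens: one per non-overlapping patch.
    patches_per_side = img_size // patch_size
    num_patches = patches_per_side ** 2

    # Spatial size after the strided conv2d (standard conv output formula).
    conv_out = (img_size + 2 * padding - kernel_size) // first_stride + 1

    # The unfold step cuts the conv output into non-overlapping blocks,
    # one block per patch; each position in a block is a pixel embedding.
    block = math.ceil(patch_size / first_stride)
    blocks_per_side = (conv_out - block) // block + 1
    pixels_per_patch = block ** 2

    return {
        'num_patches': num_patches,
        'conv_out': conv_out,
        'blocks_per_side': blocks_per_side,
        'pixels_per_patch': pixels_per_patch,
    }


layout = tnt_token_layout(
    backbone_cfg['img_size'],
    backbone_cfg['patch_size'],
    backbone_cfg['first_stride'],
)
print(layout)
```

Under these assumptions, the defaults yield 196 patches (14 per side), each carrying 16 pixel embeddings, and the number of unfold blocks per side matches the number of patches per side. A larger first_stride would reduce the number of pixel embeddings per patch.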