Bottleneck ViT

class cv.backbones.BoT_ViT.model.BoT_ViT(mhsa_num_heads, attention_dropout=0.0, num_classes=1000, in_channels=3)[source]

Bases: Module

BoT-ViT model architecture from the paper "Bottleneck Transformers for Visual Recognition".

This class implements the BoT-ViT model, combining convolutional and bottleneck transformer blocks for feature extraction and classification. The model processes input images through multiple residual groups and a final classification layer.

Parameters:
  • mhsa_num_heads (int) – The number of heads for the multi-head self-attention layer.

  • attention_dropout (float, optional) – Dropout rate for attention layers (default: 0.0).

  • num_classes (int, optional) – Number of output classes for classification (default: 1000).

  • in_channels (int, optional) – Number of input channels for the images, typically 3 for RGB (default: 3).

Example

>>> model = BoT_ViT(mhsa_num_heads=8, attention_dropout=0.1, num_classes=1000)
initializeConv()[source]

Initialize convolutional layers with specific weight and bias initialization.

This method initializes the weights and biases of convolutional layers using the following strategy:

  • Weights – initialized from a normal distribution with mean 0.0 and a standard deviation computed from the kernel size and the number of output channels.

  • Biases – initialized to 0.0.

The standard deviation for weight initialization is calculated using the formula:

\[\text{std} = \sqrt{\frac{2}{n_{\text{in}}}}\]

where:

\[n_{\text{in}} = \text{kernel\_size}[0]^2 \times \text{out\_channels}\]

The weight initialization is performed as follows:

\[\text{weight} \sim \mathcal{N}(\text{mean}=0.0, \text{std})\]

Biases are initialized with:

\[\text{bias} = 0.0\]

This initialization helps in stabilizing the learning process and improving the convergence rate.
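The initialization described above can be sketched as a small helper that walks a module's convolutional layers. This is an illustrative reconstruction from the formulas given here, not the actual `initializeConv` source; the function name and loop structure are assumptions.

```python
import math
import torch
import torch.nn as nn

def init_conv_layers(module: nn.Module) -> None:
    """Sketch of the documented init: N(0, sqrt(2 / n_in)) weights, zero biases."""
    for m in module.modules():
        if isinstance(m, nn.Conv2d):
            # n_in = kernel_size[0]^2 * out_channels, per the formula above
            n_in = m.kernel_size[0] ** 2 * m.out_channels
            std = math.sqrt(2.0 / n_in)
            nn.init.normal_(m.weight, mean=0.0, std=std)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

conv = nn.Conv2d(3, 64, kernel_size=7)
init_conv_layers(conv)
```

For a 7×7 convolution with 64 output channels this gives std = sqrt(2 / (49 × 64)) ≈ 0.025, keeping early activations in a stable range.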

forward(x)[source]

Defines the forward pass through the BoT-ViT model.

The input tensor passes through the initial convolutional layers, followed by multiple residual groups, and is then processed through the classifier for final classification.

Parameters:

x (torch.Tensor) – Input tensor of shape (batch_size, in_channels, height, width).

Returns:

Output tensor after passing through the model, with shape (batch_size, num_classes).

Return type:

torch.Tensor

Example

>>> output = model(torch.randn(1, 3, 224, 224))  # Example input tensor of shape (batch_size, channels, height, width)
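The shape flow described above (convolutional stem → feature extraction → classifier) can be verified with a minimal stand-in module. This is purely illustrative and is not the real BoT_ViT architecture; `TinyBackbone` and its layer sizes are assumptions for the shape check only.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in mirroring the documented pipeline: stem, pooling, classifier."""
    def __init__(self, in_channels: int = 3, num_classes: int = 1000):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)                # (B, 64, H/2, W/2)
        x = self.pool(x).flatten(1)     # (B, 64)
        return self.classifier(x)       # (B, num_classes)

model = TinyBackbone()
model.eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 1000])
```

As the docstring states, any input of shape (batch_size, in_channels, height, width) is reduced to logits of shape (batch_size, num_classes).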