ViT with Soft Convolutional Inductive Biases

class cv.backbones.ConViT.model.ConViT(d_model, image_size, patch_size, classifier_mlp_d, encoder_mlp_d, encoder_num_heads, num_encoder_blocks, num_gated_blocks=10, locality_strength=1.0, locality_distance_method='constant', use_conv_init=True, d_pos=3, dropout=0.0, encoder_dropout=0.0, encoder_attention_dropout=0.0, encoder_projection_dropout=0.0, patchify_technique='linear', stochastic_depth=False, stochastic_depth_mp=None, layer_scale=None, ln_order='residual', in_channels=3, num_classes=1000)[source]

Bases: Module

Model class for the ConViT architecture as described in the paper ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases (d'Ascoli et al., 2021).

ConViT is a hybrid architecture that combines a vision transformer (ViT) with soft convolutional inductive biases, enabling the model to better capture local features in images. This is achieved through gated positional self-attention (GPSA) blocks, in which each attention head learns a gate that balances content-based (global) and position-based (local) attention, as sketched below.
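The gating mechanism can be pictured with the following minimal sketch of a GPSA-style attention layer. This illustrates the idea from the paper, not this library's implementation; the class name and internals are hypothetical, and the relative-position features are left as fixed random values for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionSketch(nn.Module):
    """Illustrative GPSA-style attention: each head blends content-based
    and position-based attention maps with a learned gate."""

    def __init__(self, d_model, num_heads, num_patches, d_pos=3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qk = nn.Linear(d_model, 2 * d_model)
        self.v = nn.Linear(d_model, d_model)
        # One gate per head: sigmoid(gate) -> 1 is purely positional
        # (convolution-like), sigmoid(gate) -> 0 is purely content-based.
        self.gate = nn.Parameter(torch.zeros(num_heads))
        # Fixed relative-position features (d_pos per patch pair); a real
        # model would derive these from patch coordinates.
        self.rel_pos = nn.Parameter(torch.randn(num_patches, num_patches, d_pos), requires_grad=False)
        self.pos_proj = nn.Linear(d_pos, num_heads)

    def forward(self, x):
        B, N, C = x.shape
        q, k = self.qk(x).reshape(B, N, 2, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        v = self.v(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        content = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        positional = F.softmax(self.pos_proj(self.rel_pos).permute(2, 0, 1), dim=-1)
        lam = torch.sigmoid(self.gate).view(1, -1, 1, 1)
        attn = (1.0 - lam) * content + lam * positional  # convex blend per head
        return (attn @ v).transpose(1, 2).reshape(B, N, C)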

Parameters:
  • d_model (int) – Dimension of the model (embedding size).

  • image_size (int) – Height and width of the input image.

  • patch_size (int) – Size of the patches extracted from the image.

  • classifier_mlp_d (int) – Hidden size of the classifier MLP layer.

  • encoder_mlp_d (int) – Hidden size of the MLP within the transformer encoder.

  • encoder_num_heads (int) – Number of attention heads in each transformer encoder block.

  • num_encoder_blocks (int) – Total number of encoder blocks.

  • num_gated_blocks (int, optional) – Number of gated transformer blocks (default: 10).

  • locality_strength (float, optional) – Strength of the locality bias in gated blocks (default: 1.0).

  • locality_distance_method (str, optional) – Method for determining locality distance (default: “constant”).

  • use_conv_init (bool, optional) – Whether to use convolutional initialization (default: True).

  • d_pos (int, optional) – Dimensionality of the relative positional encodings used by the gated blocks (default: 3).

  • dropout (float, optional) – Dropout probability for regularization (default: 0.0).

  • encoder_dropout (float, optional) – Dropout probability within the encoder (default: 0.0).

  • encoder_attention_dropout (float, optional) – Dropout probability in attention layers (default: 0.0).

  • encoder_projection_dropout (float, optional) – Dropout probability in projection layers (default: 0.0).

  • patchify_technique (str, optional) – Patchification technique, either “linear” (non-overlapping patches flattened and passed through a linear projection) or “convolutional” (an equivalent strided convolution) (default: “linear”); see the sketch after this parameter list.

  • stochastic_depth (bool, optional) – Whether to use stochastic depth (default: False).

  • stochastic_depth_mp (optional) – Maximum drop probability used when stochastic depth is enabled (default: None).

  • layer_scale (optional) – Initial value for LayerScale, a learnable per-channel scaling applied to each residual branch; None disables it (default: None).

  • ln_order (str, optional) – Placement of layer normalization relative to the residual connection in each encoder block (default: “residual”).

  • in_channels (int, optional) – Number of input channels, typically 3 for RGB images (default: 3).

  • num_classes (int, optional) – Number of output classes (default: 1000).
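The two patchify_technique options can be illustrated as follows. This is a hypothetical helper, not the library's code; both routes map a (batch, channels, height, width) image batch to a (batch, num_patches, d_model) token sequence, and the freshly initialized projection weights are for shape illustration only.

import torch
import torch.nn as nn

def patchify_sketch(x, patch_size, d_model, technique="linear"):
    B, C, H, W = x.shape
    if technique == "linear":
        # Cut non-overlapping patches, flatten each, project linearly.
        patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)
        return nn.Linear(C * patch_size * patch_size, d_model)(patches)
    # "convolutional": a conv with kernel size == stride == patch_size
    # computes an equivalent patch embedding in a single pass.
    return nn.Conv2d(C, d_model, kernel_size=patch_size, stride=patch_size)(x).flatten(2).transpose(1, 2)

# e.g. patchify_sketch(torch.randn(1, 3, 224, 224), 16, 1024).shape
#      -> torch.Size([1, 196, 1024])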

Example

>>> model = ConViT(
...     d_model=1024,
...     image_size=224,
...     patch_size=16,
...     classifier_mlp_d=2048,
...     encoder_mlp_d=4096,
...     encoder_num_heads=16,
...     num_encoder_blocks=12,
... )

updateStochasticDepthRate(k=0.05)[source]

Updates the stochastic depth rate for each block in the transformer encoder.

Stochastic depth is a regularization technique that randomly drops entire layers during training to prevent overfitting. This method increases the drop probability for each transformer encoder block based on its position in the model, using the following formula:

\[\text{new\_drop\_prob} = \text{original\_drop\_prob} + \text{block\_index} \times \frac{k}{\text{num\_blocks} - 1}\]
Parameters:

k (float, optional) – A scaling factor for adjusting the drop probability (default: 0.05). The total increase k is distributed linearly across the transformer blocks, so the per-block increment grows progressively as you move deeper into the encoder.

Example

If the model has 12 encoder blocks and k=0.05, the first block (index 0) keeps its original drop probability, and each subsequent block receives an additional k / 11 ≈ 0.0045, so the last block's drop probability increases by the full 0.05. The depth randomness therefore becomes more aggressive in the deeper layers, as the sketch below illustrates.
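The schedule can be reproduced with plain arithmetic (a standalone sketch, independent of the library's internals):

>>> num_blocks, k = 12, 0.05
>>> increments = [i * k / (num_blocks - 1) for i in range(num_blocks)]
>>> [round(p, 4) for p in increments]
[0.0, 0.0045, 0.0091, 0.0136, 0.0182, 0.0227, 0.0273, 0.0318, 0.0364, 0.0409, 0.0455, 0.05]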

forward(x)[source]

Forward pass of the ConViT model.

The input tensor x is first patchified and projected into the model's embedding space. It is then passed through a transformer encoder with both gated and standard blocks to capture local and global features; the final class token representation is used for classification.

Parameters:

x (torch.Tensor) – The input tensor representing a batch of images, with shape (batch_size, in_channels, height, width).

Returns:

The classification output with shape (batch_size, num_classes), containing the unnormalized class scores (logits) for each image.

Return type:

torch.Tensor

Example

>>> output = model(torch.randn(1, 3, 224, 224))  # Example input tensor of shape (batch_size, channels, height, width)
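The output can be inspected directly (continuing the construction example above, which uses the default num_classes=1000):

>>> output.shape
torch.Size([1, 1000])
>>> predicted_class = output.argmax(dim=-1)  # index of the highest-scoring class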