Convolutional Designs in ViT

class cv.backbones.CeiT.model.CeiT(image_size=224, d_model=768, patch_size=4, dropout=0.0, encoder_num_heads=12, num_encoder_blocks=12, encoder_dropout=0.0, encoder_attention_dropout=0.0, encoder_projection_dropout=0.0, classifier_mlp_d=2048, i2t_out_channels=32, i2t_conv_kernel_size=7, i2t_conv_stride=2, i2t_max_pool_kernel_size=3, i2t_max_pool_stride=2, leff_expand_ratio=4, leff_depthwise_kernel=3, leff_depthwise_stride=1, leff_depthwise_padding=1, leff_depthwise_separable=True, lca_encoder_expansion_ratio=4, lca_encoder_num_heads=12, lca_encoder_dropout=0.0, lca_encoder_attention_dropout=0.0, lca_encoder_projection_dropout=0.0, lca_encoder_ln_order='post', lca_encoder_stodepth_prob=0.0, lca_encoder_layer_scale=None, lca_encoder_qkv_bias=False, patchify_technique='linear', stochastic_depth=False, stochastic_depth_mp=None, layer_scale=None, ln_order='post', num_classes=1000, in_channels=3)[source]

Bases: Module

The CeiT class implements the CeiT (Convolution-enhanced Image Transformer) architecture, as described in the paper Incorporating Convolution Designs into Visual Transformers.

This model combines convolutional feature extraction with transformer-based token processing, pairing the local inductive biases of convolutions with the long-range modeling of self-attention for image classification.

Parameters:
  • image_size (int, optional) – The size of the input image (height and width) (default: 224).

  • d_model (int, optional) – The dimensionality of the model’s feature space (default: 768).

  • patch_size (int, optional) – The size of the image patches to be extracted (default: 4).

  • dropout (float, optional) – The dropout rate applied to the token embeddings after the positional embeddings are added (default: 0.0).

  • encoder_num_heads (int, optional) – Number of attention heads in the encoder layers (default: 12).

  • num_encoder_blocks (int, optional) – Number of encoder blocks in the transformer (default: 12).

  • encoder_dropout (float, optional) – Dropout rate applied in the encoder layers (default: 0.0).

  • encoder_attention_dropout (float, optional) – Dropout rate applied to the attention weights in the encoder (default: 0.0).

  • encoder_projection_dropout (float, optional) – Dropout rate applied to the projection layers in the encoder (default: 0.0).

  • classifier_mlp_d (int, optional) – Dimensionality of the hidden layer in the classifier’s MLP (default: 2048).

  • i2t_out_channels (int, optional) – Number of output channels in the initial convolutional layer (default: 32).

  • i2t_conv_kernel_size (int, optional) – Size of the kernel in the initial convolutional layer (default: 7).

  • i2t_conv_stride (int, optional) – Stride of the convolution in the initial layer (default: 2).

  • i2t_max_pool_kernel_size (int, optional) – Kernel size of the max pooling layer (default: 3).

  • i2t_max_pool_stride (int, optional) – Stride of the max pooling layer (default: 2).

  • leff_expand_ratio (int, optional) – Expansion ratio for the Locally Enhanced Feed Forward (LEFF) module (default: 4).

  • leff_depthwise_kernel (int, optional) – Kernel size for the depthwise convolution in LEFF (default: 3).

  • leff_depthwise_stride (int, optional) – Stride for the depthwise convolution in LEFF (default: 1).

  • leff_depthwise_padding (int, optional) – Padding for the depthwise convolution in LEFF (default: 1).

  • leff_depthwise_separable (bool, optional) – Whether to use depthwise separable convolutions in LEFF (default: True).

  • lca_encoder_expansion_ratio (int, optional) – Expansion ratio for the LCA encoder module (default: 4).

  • lca_encoder_num_heads (int, optional) – Number of attention heads in the LCA encoder (default: 12).

  • lca_encoder_dropout (float, optional) – Dropout rate in the LCA encoder (default: 0.0).

  • lca_encoder_attention_dropout (float, optional) – Dropout rate for attention weights in the LCA encoder (default: 0.0).

  • lca_encoder_projection_dropout (float, optional) – Dropout rate for projection layers in the LCA encoder (default: 0.0).

  • lca_encoder_ln_order (str, optional) – The order of Layer Normalization in the LCA encoder (“pre” or “post”) (default: “post”).

  • lca_encoder_stodepth_prob (float, optional) – Probability for stochastic depth in the LCA encoder (default: 0.0).

  • lca_encoder_layer_scale (float, optional) – Layer scaling factor in the LCA encoder (default: None).

  • lca_encoder_qkv_bias (bool, optional) – Whether to use bias in the QKV projections in the LCA encoder (default: False).

  • patchify_technique (str, optional) – Technique used for patch extraction (“linear” or “conv”) (default: “linear”).

  • stochastic_depth (bool, optional) – Whether to use stochastic depth in the model (default: False).

  • stochastic_depth_mp (float, optional) – Stochastic depth max probability (default: None).

  • layer_scale (float, optional) – Scaling factor for the layers (default: None).

  • ln_order (str, optional) – The order of Layer Normalization in the model (“pre” or “post”) (default: “post”).

  • num_classes (int, optional) – Number of classes for the classification task (default: 1000).

  • in_channels (int, optional) – Number of input channels in the images, typically 3 for RGB (default: 3).
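With the defaults above, the I2T stem and the patch size together determine how many tokens the encoder sees. A back-of-the-envelope sketch (assuming padding such that each stride-2 stage exactly halves the feature map; the exact padding is an implementation detail):

```python
# Spatial sizes through the I2T stem, using the defaults above.
image_size = 224
after_conv = image_size // 2                 # i2t conv, stride 2   -> 112
after_pool = after_conv // 2                 # i2t max pool, stride 2 -> 56
patch_size = 4
tokens_per_side = after_pool // patch_size   # 56 / 4 = 14
num_patches = tokens_per_side ** 2           # 14 * 14 = 196 patch tokens
```

So the default configuration feeds 196 patch tokens (plus the class token) into the encoder.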

Example

>>> model = CeiT()
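A non-default configuration is sketched below; the keyword arguments come from the signature above, but the chosen values are purely illustrative:

```python
# Hypothetical small-scale configuration (values illustrative):
model = CeiT(
    image_size=32,               # e.g. CIFAR-sized inputs
    num_classes=10,
    patchify_technique='conv',
    stochastic_depth=True,
    stochastic_depth_mp=0.1,
)
```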

forward(x)[source]

Forward pass through the CeiT model.

This method performs a forward pass of the input tensor through the CeiT model. It includes the following steps:

  1. Initial feature extraction using the i2t_module (i2t: image-to-tokens).

  2. Transformation of the extracted features into patches using the patchify method.

  3. Linear projection of the patchified features.

  4. Addition of positional embeddings and class tokens.

  5. Application of dropout.

  6. Processing of the features through the transformer encoder.

  7. Application of LCA (layer-wise class-token attention) to the encoder’s output.

  8. Classification of the final token output using the classifier.
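The steps above can be sketched as a self-contained toy module. All names, sizes, and the stand-in nn.TransformerEncoder are illustrative and not the library's actual internals; step 7 (the LCA pass) is omitted for brevity:

```python
import torch
import torch.nn as nn

class CeiTForwardSketch(nn.Module):
    """Toy walk-through of the forward steps (illustrative only)."""
    def __init__(self, d_model=64, num_patches=16, num_classes=10):
        super().__init__()
        self.i2t = nn.Sequential(                        # step 1: image-to-tokens stem
            nn.Conv2d(3, 8, kernel_size=7, stride=2, padding=3),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        self.proj = nn.Linear(8 * 4 * 4, d_model)        # step 3: linear projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))
        self.dropout = nn.Dropout(0.0)                   # step 5
        self.encoder = nn.TransformerEncoder(            # step 6: stand-in encoder
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.classifier = nn.Linear(d_model, num_classes)  # step 8

    def forward(self, x):
        feats = self.i2t(x)                                        # step 1
        B, C, H, W = feats.shape
        patches = feats.unfold(2, 4, 4).unfold(3, 4, 4)            # step 2: patchify
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * 16)
        tokens = self.proj(patches)                                # step 3
        cls = self.cls_token.expand(B, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed  # step 4
        tokens = self.dropout(tokens)                              # step 5
        encoded = self.encoder(tokens)                             # step 6
        # step 7 (LCA over per-layer class tokens) omitted in this sketch
        return self.classifier(encoded[:, 0])                      # step 8
```

For a 64×64 input, the stem produces an 8×16×16 feature map, patchify yields 16 tokens, and the output has shape (batch_size, num_classes).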

Parameters:

x (torch.Tensor) – Input tensor of shape (batch_size, in_channels, height, width).

Returns:

Output tensor after passing through the model, with shape (batch_size, num_classes).

Return type:

torch.Tensor

Notes

  • The i2t_module first processes the input images, extracting initial features through convolutional layers followed by max pooling.

  • The patchify method converts the feature maps into a sequence of patches suitable for the transformer encoder.

  • Positional embeddings and class tokens are added to the sequence of patches to incorporate positional information and a class-specific token.

  • Dropout is applied to prevent overfitting during training.

  • The transformer encoder processes the sequence of patches, producing an encoded representation.

  • LCA attention is applied to the encoded representation to refine the final class token.

  • Finally, the class token is passed through a classifier to obtain the predicted class scores.
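The LCA step can be illustrated in isolation. In CeiT, LCA (Layer-wise Class-token Attention) lets the final class token attend over the class tokens collected from every encoder layer; the sketch below uses a plain nn.MultiheadAttention as a stand-in, with illustrative sizes:

```python
import torch
import torch.nn as nn

d_model, num_layers = 64, 12
# One class token per encoder layer, stacked as a sequence (illustrative).
layer_cls_tokens = torch.randn(1, num_layers, d_model)
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
query = layer_cls_tokens[:, -1:, :]               # last layer's class token
refined, _ = attn(query, layer_cls_tokens, layer_cls_tokens)
# refined has shape (1, 1, d_model): a single enhanced class token.
```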

Example

>>> output = model(torch.randn(1, 3, 224, 224))  # Example input tensor of shape (batch_size, channels, height, width)