Hierarchical Pooling ViT

class cv.backbones.HPool_ViT.model.HPool_ViT(num_classes, d_model, image_size, patch_size, classifier_mlp_d, encoder_mlp_d, encoder_num_heads, num_encoder_blocks, dropout=0.0, encoder_dropout=0.0, encoder_attention_dropout=0.0, encoder_projection_dropout=0.0, patchify_technique='linear', stochastic_depth=False, stochastic_depth_mp=None, layer_scale=None, ln_order='residual', hvt_pool=[1, 5, 9], in_channels=3)[source]

Bases: Module

HPool-ViT model implementing the Vision Transformer with Hierarchical Pooling (HPool).

This model is based on the architecture introduced in the paper “Scalable Vision Transformers with Hierarchical Pooling”. HPool-ViT uses a hierarchical pooling mechanism to improve scalability and efficiency in processing large images, while maintaining the core components of a Vision Transformer (ViT) such as patch embeddings, multi-head attention, and Transformer encoders.

Parameters:
  • num_classes (int) – Number of output classes for classification.

  • d_model (int) – Dimension of the model (embedding size for patches).

  • image_size (int) – Size of the input image (assumed to be square).

  • patch_size (int) – Size of each image patch (assumed to be square).

  • classifier_mlp_d (int) – Dimension of the MLP in the classifier head.

  • encoder_mlp_d (int) – Dimension of the MLP in the Transformer encoder.

  • encoder_num_heads (int) – Number of attention heads in the Transformer encoder.

  • num_encoder_blocks (int) – Number of Transformer encoder blocks.

  • dropout (float, optional) – Dropout rate applied to the patch embeddings and classifier (default: 0.0).

  • encoder_dropout (float, optional) – Dropout rate in the encoder layers (default: 0.0).

  • encoder_attention_dropout (float, optional) – Dropout rate in the multi-head attention layers (default: 0.0).

  • encoder_projection_dropout (float, optional) – Dropout rate for the linear projections in the encoder (default: 0.0).

  • patchify_technique (str, optional) – Technique for creating patches from the image. Can be “linear” or “convolutional” (default: “linear”).

  • stochastic_depth (bool, optional) – Whether to apply stochastic depth (DropPath) to encoder layers (default: False).

  • stochastic_depth_mp (float or None) – Maximum probability for stochastic depth. If None, no stochastic depth is applied (default: None).

  • layer_scale (float or None) – Initial value for layer scale, a learnable per-channel scaling of each encoder block’s residual branch. If None, no layer scaling is applied (default: None).

  • ln_order (str, optional) – Order of layer normalization. Can be “residual” or “pre” (default: “residual”).

  • hvt_pool (list of int or None) – Indices of the Transformer encoder blocks after which hierarchical pooling is applied. If None, hierarchical pooling is disabled (default: [1, 5, 9]).

  • in_channels (int, optional) – Number of input channels, typically 3 for RGB (default: 3).
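With the default hvt_pool of [1, 5, 9], the token sequence is downsampled three times as it moves through the encoder. The helper below sketches the resulting sequence lengths; the kernel size, stride, and padding are illustrative assumptions, not values taken from the library.

```python
def pooled_seq_len(num_patches, pool_blocks, kernel=3, stride=2, padding=1):
    """Token count after 1D max pooling at each block listed in pool_blocks.

    kernel/stride/padding are assumed pooling hyperparameters for
    illustration only.
    """
    n = num_patches
    for _ in pool_blocks:
        # standard 1D pooling output-size formula
        n = (n + 2 * padding - kernel) // stride + 1
    return n

# A 192x192 image with 16x16 patches yields 144 tokens; pooling after
# blocks 1, 5, and 9 shrinks the sequence 144 -> 72 -> 36 -> 18.
print(pooled_seq_len(144, [1, 5, 9]))  # → 18
```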

Example

>>> model = HPool_ViT(
...     num_classes=1000, d_model=768, image_size=192, patch_size=16,
...     classifier_mlp_d=2048, encoder_mlp_d=3072,
...     encoder_num_heads=12, num_encoder_blocks=12,
... )
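The “linear” patchify_technique can be pictured as splitting the image into non-overlapping patches and linearly projecting each flattened patch to d_model dimensions. The sketch below is an illustrative stand-in, not the library’s exact implementation:

```python
import torch

def linear_patchify(x, patch_size=16, d_model=768):
    """Sketch of linear patch embedding (an assumption about the technique)."""
    b, c, h, w = x.shape
    proj = torch.nn.Linear(c * patch_size * patch_size, d_model)
    patches = (
        x.unfold(2, patch_size, patch_size)   # split height into patches
         .unfold(3, patch_size, patch_size)   # split width: (B, C, H/p, W/p, p, p)
         .permute(0, 2, 3, 1, 4, 5)           # group each patch's pixels together
         .reshape(b, -1, c * patch_size * patch_size)
    )
    return proj(patches)                      # (B, num_patches, d_model)

tokens = linear_patchify(torch.randn(1, 3, 192, 192))
print(tokens.shape)  # → torch.Size([1, 144, 768])
```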
updateStochasticDepthRate(k=0.05)[source]

Updates the stochastic depth rate for each block in the transformer encoder.

Stochastic depth is a regularization technique that randomly drops entire layers during training to prevent overfitting. This method increases the drop probability for each transformer encoder block based on its position in the model, using the following formula:

\[\text{new\_drop\_prob} = \text{original\_drop\_prob} + \text{block\_index} \times \left( \frac{k}{\text{num\_blocks} - 1} \right)\]
Parameters:

k (float, optional) – Scaling factor for the drop-probability increase (default: 0.05). The increase is distributed linearly across the transformer blocks, growing progressively larger for deeper blocks in the encoder.

Example

If the model has 12 encoder blocks, and k=0.05, the first block will have its drop probability increased slightly, while the last block will have a larger increase, making the depth randomness more aggressive in the deeper layers.
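The update rule above can be sketched as a small standalone function operating on a list of per-block drop probabilities (the function name and list-based interface are illustrative, not the library’s API):

```python
def update_stochastic_depth_rates(drop_probs, k=0.05):
    """Increase each block's drop probability linearly with depth.

    Implements new_p[i] = p[i] + i * k / (num_blocks - 1), matching the
    formula above.
    """
    n = len(drop_probs)
    if n < 2:
        return list(drop_probs)
    return [p + i * k / (n - 1) for i, p in enumerate(drop_probs)]

# 12 blocks starting at 0.1: the first block stays at 0.1,
# the last rises to 0.1 + 0.05 = 0.15.
rates = update_stochastic_depth_rates([0.1] * 12, k=0.05)
```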

forward(x)[source]

Forward pass through the HPool-ViT model.

Parameters:

x (Tensor) – Input tensor of shape (batch_size, in_channels, height, width).

Returns:

Output tensor of shape (batch_size, num_classes) containing the predicted class scores.

Return type:

Tensor

Example

>>> output = model(torch.randn(1, 3, 192, 192))  # Example input tensor of shape (batch_size, channels, height, width)
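The scores returned by forward are unnormalized logits; a softmax converts them into class probabilities. A minimal sketch, using random values as a stand-in for real model output:

```python
import torch

# Random stand-in for forward()'s output of shape (batch_size, num_classes)
logits = torch.randn(1, 1000)
probs = torch.softmax(logits, dim=-1)  # probabilities summing to 1 per sample
pred = int(probs.argmax(dim=-1))       # index of the top-scoring class
```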