Background

The overall SwinT model simply stacks the modules described in the global architecture figure and returns an embedding of shape [B, (H/32)(W/32), 8C]. The pooling method that follows is:

x = self.avgpool(x.transpose(1, 2))  # [B, 8*C, 1]
x = torch.flatten(x, 1)              # [B, 8*C]
x = self.head(x)                     # Linear(8*C, num_classes) -> classification logits
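The pooling and head steps above can be sketched end to end. This is a minimal illustration, not the official implementation; the batch size B=2, token count 49 = (224/32)², embedding dim 8C = 768 (i.e. C = 96), and num_classes = 1000 are assumed values for demonstration:

```python
import torch
import torch.nn as nn

# Assumed example sizes: B=2, (H/32)*(W/32)=49 tokens, 8*C=768 channels.
B, L, C8 = 2, 49, 768
num_classes = 1000

avgpool = nn.AdaptiveAvgPool1d(1)   # averages over the token dimension
head = nn.Linear(C8, num_classes)

x = torch.randn(B, L, C8)           # [B, (H/32)(W/32), 8C] from the last stage
x = avgpool(x.transpose(1, 2))      # [B, 8C, 1]
x = torch.flatten(x, 1)             # [B, 8C]
logits = head(x)                    # [B, num_classes]
print(logits.shape)                 # torch.Size([2, 1000])
```

Transposing before the pool is needed because `AdaptiveAvgPool1d` pools over the last dimension, so the token axis must be moved there first.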

VST Detail

Global Arch

The architecture of the Video Swin Transformer (VST) closely resembles that of the Swin Transformer (SwinT). A notable distinction is the inclusion of an additional temporal dimension, denoted as T, in the input features, which corresponds to a fixed number of frames, typically 32, sampled from the input video. The overall architecture of VST follows the structure of SwinT, as shown in the image below:

Global Arch of VST
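Since VST follows the SwinT stage layout, the token-grid shapes can be walked through stage by stage. This is a hedged sketch under assumed values (32 frames, 224×224 input, embed_dim C = 96, patch_size (4, 4, 4), and 2×2 spatial PatchMerging between stages as in SwinT; the temporal extent is kept fixed):

```python
# Assumed input: 32 frames of 224x224, embed_dim C = 96, patch_size (4, 4, 4).
T, H, W, C = 32, 224, 224, 96
t, h, w, c = T // 4, H // 4, W // 4, C
for stage in range(4):
    print(f"stage {stage + 1}: tokens [{t}, {h}, {w}], channels {c}")
    if stage < 3:
        # PatchMerging between stages halves H and W and doubles C;
        # the temporal dimension t is left unchanged.
        h, w, c = h // 2, w // 2, c * 2
```

The final stage thus emits an [8, 7, 7] token grid with 8C = 768 channels, mirroring SwinT's C, 2C, 4C, 8C progression.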

Patch Partition & Linear Embedding

Convert the 2D convolution to a 3D convolution. The default patch_size is (4, 4, 4), so the output grid is [T/4, H/4, W/4]:

self.proj = nn.Conv3d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

x = self.proj(x)                            # [B, embed_dim, T/4, H/4, W/4]
D, Wh, Ww = x.size(2), x.size(3), x.size(4)
x = x.flatten(2).transpose(1, 2)            # [B, D*Wh*Ww, embed_dim]
x = self.norm(x)                            # LayerNorm over embed_dim
x = x.transpose(1, 2).view(-1, self.embed_dim, D, Wh, Ww)

After the flatten and transpose, the embedding $X \in \mathbb{R}^{B \times \frac{T}{4}\frac{H}{4}\frac{W}{4} \times C}$ is normalized and then reshaped back to $[B, C, \frac{T}{4}, \frac{H}{4}, \frac{W}{4}]$. Both the SwinT Block and the PatchMerging block adhere to the SwinT methodology while extending the dimensionality from 2D to 3D.
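The snippet above can be assembled into a runnable module. This is a minimal sketch assuming the patch_size (4, 4, 4) and defaults stated above (in_chans = 3, embed_dim = 96); the class name `PatchEmbed3D` is illustrative, not the official implementation:

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Sketch of the VST patch partition + linear embedding (assumed defaults)."""
    def __init__(self, patch_size=(4, 4, 4), in_chans=3, embed_dim=96):
        super().__init__()
        self.embed_dim = embed_dim
        # Conv3d with stride == kernel_size splits the clip into
        # non-overlapping 3D patches and projects each to embed_dim channels.
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: [B, C_in, T, H, W]
        x = self.proj(x)                        # [B, embed_dim, T/4, H/4, W/4]
        D, Wh, Ww = x.size(2), x.size(3), x.size(4)
        x = x.flatten(2).transpose(1, 2)        # [B, D*Wh*Ww, embed_dim]
        x = self.norm(x)                        # LayerNorm over channel dim
        return x.transpose(1, 2).view(-1, self.embed_dim, D, Wh, Ww)

x = torch.randn(1, 3, 32, 224, 224)             # 32 frames of 224x224
out = PatchEmbed3D()(x)
print(out.shape)                                # torch.Size([1, 96, 8, 56, 56])
```

LayerNorm operates over the channel dimension, which is why the tensor is flattened to [B, tokens, C] before normalization and reshaped back afterwards.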