Number of patches per side: 672 / 14 = 48. Total patches: 48 × 48 = 2304. Each patch then has a 384-dimensional feature vector.
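The arithmetic above can be checked with a tiny sketch (the sizes match a DINO ViT-S/14 backbone at 672×672 input; the variable names are illustrative):

```python
# Patch-count arithmetic for a 672x672 input with 14x14 patches (ViT-S/14)
image_size = 672    # input resolution
patch_size = 14     # patch side length
feature_dim = 384   # ViT-S embedding dimension

patches_per_side = image_size // patch_size  # 672 / 14 = 48
total_patches = patches_per_side ** 2        # 48 * 48 = 2304

print(patches_per_side, total_patches, feature_dim)  # 48 2304 384
```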
-
It takes the patch features and upsamples them back to full image resolution.
-
The output produced by DINO has shape [batch_size, num_patches, 384], where 384 is the feature dimension.
-
patch_features = patch_features.view(batch_size, self.num_patches_per_side, self.num_patches_per_side, 384).permute(0, 3, 1, 2). Here, the 2304 patches are converted into a 2D spatial grid.
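A minimal, self-contained sketch of this reshape, using random tensors in place of real DINO features (batch size of 2 is arbitrary):

```python
import torch

# Reshape DINO patch tokens [B, N, C] into a channels-first 2D grid [B, C, H, W]
batch_size, num_patches_per_side, dim = 2, 48, 384
patch_features = torch.randn(batch_size, num_patches_per_side ** 2, dim)  # [2, 2304, 384]

# view: unflatten the 2304 patches into a 48x48 grid
# permute: move the 384 feature channels to dim 1, as Conv layers expect
grid = patch_features.view(
    batch_size, num_patches_per_side, num_patches_per_side, dim
).permute(0, 3, 1, 2)

print(grid.shape)  # torch.Size([2, 384, 48, 48])
```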
-
This 2D spatial grid is then passed to ConvTranspose2d(384, 512, kernel_size=3, stride=2, padding=1, output_padding=1).
-
stride=2: doubles the spatial size (48×48 → 96×96), which does the upsampling. 384 → 512: increases the channel count, which is where we get richer features. kernel_size=3: uses 3×3 kernels to learn how to upsample intelligently.
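This single upsampling step can be verified in isolation. The output size of a transposed convolution is (H − 1) × stride − 2 × padding + kernel_size + output_padding, so (48 − 1) × 2 − 2 + 3 + 1 = 96:

```python
import torch
import torch.nn as nn

# One ConvTranspose2d step: 384 -> 512 channels, 48x48 -> 96x96 spatially
up = nn.ConvTranspose2d(384, 512, kernel_size=3, stride=2, padding=1, output_padding=1)

x = torch.randn(1, 384, 48, 48)
y = up(x)
print(y.shape)  # torch.Size([1, 512, 96, 96])
```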
-
What is upsampling? The process of making the data spatially larger, i.e. increasing the height and width dimensions. The problem: DINO gives us patch resolution (48×48), but we need pixel-level predictions (672×672). The solution: gradually upsample through multiple steps: 48×48 → 96×96 → 192×192 → 384×384 → 672×672
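A sketch of the gradual-upsampling idea as a small module. The intermediate channel widths, the ReLUs, the 1-channel prediction head, and the final interpolation step are all assumptions, not taken from the notes; in particular, 384×384 → 672×672 is not a ×2 step, so a bilinear resize is used there:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Upsampler(nn.Module):
    """Upsample a [B, 384, 48, 48] DINO feature grid to [B, 1, 672, 672]."""
    def __init__(self):
        super().__init__()
        # Each ConvTranspose2d doubles the spatial size
        self.up1 = nn.ConvTranspose2d(384, 512, 3, stride=2, padding=1, output_padding=1)  # 48 -> 96
        self.up2 = nn.ConvTranspose2d(512, 256, 3, stride=2, padding=1, output_padding=1)  # 96 -> 192
        self.up3 = nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1)  # 192 -> 384
        self.head = nn.Conv2d(128, 1, 1)  # per-pixel prediction head (1 output channel assumed)

    def forward(self, x):
        x = F.relu(self.up1(x))
        x = F.relu(self.up2(x))
        x = F.relu(self.up3(x))
        # 384 -> 672 is not a clean x2 doubling, so resize instead
        x = F.interpolate(x, size=(672, 672), mode="bilinear", align_corners=False)
        return self.head(x)

out = Upsampler()(torch.randn(1, 384, 48, 48))
print(out.shape)  # torch.Size([1, 1, 672, 672])
```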