Number of patches per side: 672 / 14 = 48. Total patches: 48 × 48 = 2304. Each patch then has a 384-dimensional feature vector.
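The arithmetic above can be checked with a tiny sketch (the sizes match a DINO ViT-S/14 backbone at 672×672 input; the variable names are illustrative):

```python
# Patch-count arithmetic for a 672x672 input with 14x14 patches (ViT-S/14)
image_size = 672    # input resolution
patch_size = 14     # patch side length
feature_dim = 384   # ViT-S embedding dimension

patches_per_side = image_size // patch_size  # 672 / 14 = 48
total_patches = patches_per_side ** 2        # 48 * 48 = 2304

print(patches_per_side, total_patches, feature_dim)  # 48 2304 384
```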
-
It takes the patch features and upsamples them back to full image resolution.
-
The output produced by DINO has shape [batch_size, num_patches, 384], where 384 is the feature dimension.
-
patch_features = patch_features.view(batch_size, self.num_patches_per_side, self.num_patches_per_side, 384).permute(0, 3, 1, 2). Here, the 2304 patches are converted into a 2D spatial grid.
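A minimal, self-contained sketch of this reshape, using random tensors in place of real DINO features (batch size of 2 is arbitrary):

```python
import torch

# Reshape DINO patch tokens [B, N, C] into a channels-first 2D grid [B, C, H, W]
batch_size, num_patches_per_side, dim = 2, 48, 384
patch_features = torch.randn(batch_size, num_patches_per_side ** 2, dim)  # [2, 2304, 384]

# view: unflatten the 2304 patches into a 48x48 grid
# permute: move the 384 feature channels to dim 1, as Conv layers expect
grid = patch_features.view(
    batch_size, num_patches_per_side, num_patches_per_side, dim
).permute(0, 3, 1, 2)

print(grid.shape)  # torch.Size([2, 384, 48, 48])
```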
-
This 2D spatial grid is then passed to ConvTranspose2d(384, 512, kernel_size=3, stride=2, padding=1, output_padding=1).
-
stride=2: doubles the spatial size (48×48 → 96×96), which does the upsampling. 384 → 512: increases the channel count, which is where we get richer features. kernel_size=3: uses 3×3 kernels to learn how to upsample intelligently.
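This single upsampling step can be verified in isolation. The output size of a transposed convolution is (H − 1) × stride − 2 × padding + kernel_size + output_padding, so (48 − 1) × 2 − 2 + 3 + 1 = 96:

```python
import torch
import torch.nn as nn

# One ConvTranspose2d step: 384 -> 512 channels, 48x48 -> 96x96 spatially
up = nn.ConvTranspose2d(384, 512, kernel_size=3, stride=2, padding=1, output_padding=1)

x = torch.randn(1, 384, 48, 48)
y = up(x)
print(y.shape)  # torch.Size([1, 512, 96, 96])
```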
-
What is upsampling? The process of making the data spatially larger, i.e. increasing the height and width dimensions. The problem: DINO gives us patch resolution (48×48), but we need pixel-level predictions (672×672). The solution: gradually upsample through multiple steps: 48×48 → 96×96 → 192×192 → 384×384 → 672×672
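A sketch of the gradual-upsampling idea as a small module. The intermediate channel widths, the ReLUs, the 1-channel prediction head, and the final interpolation step are all assumptions, not taken from the notes; in particular, 384×384 → 672×672 is not a ×2 step, so a bilinear resize is used there:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Upsampler(nn.Module):
    """Upsample a [B, 384, 48, 48] DINO feature grid to [B, 1, 672, 672]."""
    def __init__(self):
        super().__init__()
        # Each ConvTranspose2d doubles the spatial size
        self.up1 = nn.ConvTranspose2d(384, 512, 3, stride=2, padding=1, output_padding=1)  # 48 -> 96
        self.up2 = nn.ConvTranspose2d(512, 256, 3, stride=2, padding=1, output_padding=1)  # 96 -> 192
        self.up3 = nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1)  # 192 -> 384
        self.head = nn.Conv2d(128, 1, 1)  # per-pixel prediction head (1 output channel assumed)

    def forward(self, x):
        x = F.relu(self.up1(x))
        x = F.relu(self.up2(x))
        x = F.relu(self.up3(x))
        # 384 -> 672 is not a clean x2 doubling, so resize instead
        x = F.interpolate(x, size=(672, 672), mode="bilinear", align_corners=False)
        return self.head(x)

out = Upsampler()(torch.randn(1, 384, 48, 48))
print(out.shape)  # torch.Size([1, 1, 672, 672])
```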