Hi, thank you for the excellent work and for releasing the model!
I am trying to reproduce training the small 75FPS model from scratch, and I am seeing an unexpectedly large number of dead codewords when running:
```
python train.py fit --config configs/wavtokenizer_smalldata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
```
This seems to be caused by the ordering in `EuclideanCodebook.forward()`: `expire_codes_` is called to replace the dead codewords, but it does not recompute the assignment counts. So, afterwards in `forward`, the EMA update to `self.cluster_size` uses the old `embed_onehot.sum(0)`, which is 0 for the respawned codewords. Similarly, `self.embed_avg` is updated with the old `embed_sum = x.t() @ embed_onehot`, whose columns are zero vectors for the respawned codewords.
As a result, at the end of `forward()`, `self.embed.data.copy_(embed_normalized)` overwrites the codebook with statistics that do not take into account the codewords respawned earlier in the same call.
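For concreteness, here is the training branch as I read it in the EnCodec-style `core_vq.py` that this repo appears to derive from (lightly abridged, with my annotations; helper names such as `ema_inplace` and `laplace_smoothing` come from that file, so apologies if this repo's copy differs):

```python
# Abridged from the EnCodec-style EuclideanCodebook.forward(); comments mine.
embed_ind = self.quantize(x)                       # assignments vs. the OLD self.embed
embed_onehot = F.one_hot(embed_ind, self.codebook_size).type(dtype)
quantize = self.dequantize(embed_ind)

if self.training:
    # (1) Dead codewords are respawned here, overwriting rows of self.embed ...
    self.expire_codes_(x)

    # (2) ... but the EMA statistics below still use the stale embed_onehot,
    # so a respawned codeword contributes a zero count and a zero column.
    ema_inplace(self.cluster_size, embed_onehot.sum(0), self.decay)
    embed_sum = x.t() @ embed_onehot
    ema_inplace(self.embed_avg, embed_sum.t(), self.decay)

    # (3) The final copy rebuilds self.embed from the stale running averages,
    # which largely undoes the respawn performed in step (1).
    cluster_size = (
        laplace_smoothing(self.cluster_size, self.codebook_size, self.epsilon)
        * self.cluster_size.sum()
    )
    embed_normalized = self.embed_avg / cluster_size.unsqueeze(1)
    self.embed.data.copy_(embed_normalized)
```

If this reading is right, a respawned codeword immediately gets its `cluster_size` and `embed_avg` decayed toward zero and is then overwritten by the normalization, which would explain the large number of dead codewords I observe.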
Is this the same logic that was used for the original training of WavTokenizer, or is this perhaps not the most up-to-date version of the code? Possibly I'm missing something; any help understanding how this should work would be appreciated :)
