Hi, thank you for your great work on TokenPacker!
I’m trying to reproduce the TokenPacker-HD (7B, scale factor 2, patch number 9) experiments, but I’m not getting results close to the paper or the released checkpoint.
Hardware Setup
- 4 × H100 GPUs
Results Comparison
- Row 1: Results reported in the paper
- Row 2: Results from the released checkpoint
- Row 3+: My experiments under different settings
| Method | TextVQA | OCRB | DocVQA | MMB | MMMU | MME | VQAv2 | VizWiz | POPE |
|---|---|---|---|---|---|---|---|---|---|
| Reported in paper | 68.0 | 452 | 60.2 | 67.4 | 35.4 | 1489/338 | 81.2 | 54.7 | 88.2 |
| Released checkpoint | 67.92 | 452 | 27 | 67.35 | 35.89 | 1489.02/337.5 | 81.17 | 54.63 | 88.15 |
| Exp 1 | 41.29 | 17 | 9 | 21.13 | 31.44 | 675.46/283.93 | 67.5 | 48.12 | 56.70 |
| Exp 2 | 36.53 | 14 | 8 | 20.79 | 28.89 | 653.94/248.57 | 67.12 | 48.12 | 55.6 |
| Exp 3 | 40.14 | 19 | 8 | 21.05 | 31.22 | 666.27/240.36 | 45.7 | 47.53 | 51.07 |
| Exp 4 | 40.37 | 17 | 8 | 21.21 | 30.67 | 720.37/273.21 | 45.25 | 47.92 | 58.94 |
Experiment Details
Exp 1
- Pretrain: LR = 1e-3, batch size = 256 (32 × 4 GPUs, grad_accum = 2)
- Instruction FT: LR = 2e-4, batch size = 128 (16 × 4 GPUs, grad_accum = 2)
- Results far from the paper/released checkpoint (the effective-batch-size arithmetic is sketched below)
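For clarity, the effective batch sizes quoted above are per-GPU batch size × number of GPUs × gradient-accumulation steps; a minimal sketch of that arithmetic with the Exp 1 numbers:

```python
def effective_batch_size(per_device_bs: int, num_gpus: int, grad_accum: int) -> int:
    """Global batch size as the HF Trainer / DeepSpeed see it."""
    return per_device_bs * num_gpus * grad_accum

# Exp 1 settings from this issue
print(effective_batch_size(32, 4, 2))  # pretrain        -> 256
print(effective_batch_size(16, 4, 2))  # instruction FT  -> 128
```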
Exp 2 (following Issue #12)
- Based on the trainer_state.json attached to #12 (comment) ("Failed to reproduce the HD model"), I noticed the pretrain LR was 5e-4 with batch size 128.
- Pretrain: LR = 5e-4, batch size = 128 (16 × 4 GPUs, grad_accum = 2)
- Instruction FT: LR = 2e-4, batch size = 128 (16 × 4 GPUs, grad_accum = 2)
- Results are still not close.
- Notably, my pretraining loss only drops to ~1.6–1.7, while your trainer_state.json shows it dropping to ~1.2. For reference: my pretrain-trainer_state.json and instruction-trainer_state.json (a small script for extracting the logged losses from these files is sketched after this list).
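To compare the two runs, I read the `log_history` field that the Hugging Face Trainer writes into trainer_state.json. A minimal sketch (the file paths are placeholders for my run and the released checkpoint):

```python
import json

def last_losses(path: str, n: int = 5):
    """Return the last n (step, loss) pairs logged in a HF trainer_state.json."""
    with open(path) as f:
        state = json.load(f)
    # log_history is a list of dicts; training-step entries carry a "loss" key
    losses = [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]
    return losses[-n:]

# Placeholder paths: my pretrain run vs. the released checkpoint's trainer_state.json
print(last_losses("my_pretrain/trainer_state.json"))
print(last_losses("released_checkpoint/trainer_state.json"))
```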
Exp 3
- Same as Exp 2, but batch size = 64 (16 × 4 GPUs, grad_accum = 1)
- Still far from expected results.
Exp 4
- Same as Exp 1, but with the deepspeed seed and the dataset seed set to 2024 (roughly as in the sketch below)
- Still not close to the paper/released checkpoint.
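For reference, this is roughly how the seeds were fixed in Exp 4, assuming the training script exposes the standard Hugging Face `--seed` argument (which ends up calling the same seeding helper shown here); a minimal sketch:

```python
import random

import numpy as np
import torch
from transformers import set_seed

SEED = 2024  # the value used in Exp 4

# transformers' helper seeds Python's random, NumPy, and torch (CPU + all CUDA devices)
set_seed(SEED)

# equivalent explicit calls, shown for clarity
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
```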
Questions
- Could you clarify:
- The exact learning rate schedule and batch size settings used in pretraining/finetuning?
- Whether there are other important hyperparameters (e.g., warmup steps, optimizer settings, gradient clipping) not mentioned in the paper but necessary to reproduce results?
- Could you also provide the pretraining dataset JSON and instruction-tuning dataset JSON?
- I noticed that in the instruction-tuning trainer_state.json of sunshine-lwt/TokenPacker-HD-7b-9patch-144token, the global step is 11,627. With a batch size of 128, that implies about 11,627 × 128 = 1,488,256 samples, but the actual Mini-Gemini instruction-tuning dataset has 1,511,341 samples, so roughly 23k samples are unaccounted for (see the quick check below).
- Could you provide the exact JSON datasets used, so the reproduction is faithful?
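As a quick sanity check on that arithmetic (the global step comes from the released trainer_state.json, and the dataset size is the Mini-Gemini instruction-tuning set I counted; the effective batch size of 128 is my assumption):

```python
global_steps = 11_627        # from the released instruction-tuning trainer_state.json
global_batch_size = 128      # assumed effective batch size

implied_samples = global_steps * global_batch_size
dataset_size = 1_511_341     # Mini-Gemini instruction-tuning set size I counted

print(implied_samples)                 # 1488256
print(dataset_size - implied_samples)  # 23085 samples unaccounted for
```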