Unable to reproduce benchmark results for TokenPacker-HD (7B, Scale=2, Patch=9) #25

Description

@worapob841

Hi, thank you for your great work on TokenPacker!

I’m trying to reproduce the TokenPacker-HD (7B, scale factor 2, patch number 9) experiments, but I’m not getting results close to the paper or the released checkpoint.


Hardware Setup

  • 4 × H100 GPUs

Results Comparison

  • Row 1: Results reported in the paper
  • Row 2: Results from the released checkpoint
  • Row 3+: My experiments under different settings
| Method | TextVQA | OCRB | DocVQA | MMB | MMMU | MME | VQAv2 | VizWiz | POPE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reported in paper | 68.0 | 452 | 60.2 | 67.4 | 35.4 | 1489/338 | 81.2 | 54.7 | 88.2 |
| Released checkpoint | 67.92 | 452 | 27 | 67.35 | 35.89 | 1489.02/337.5 | 81.17 | 54.63 | 88.15 |
| Exp 1 | 41.29 | 17 | 9 | 21.13 | 31.44 | 675.46/283.93 | 67.5 | 48.12 | 56.70 |
| Exp 2 | 36.53 | 14 | 8 | 20.79 | 28.89 | 653.94/248.57 | 67.12 | 48.12 | 55.6 |
| Exp 3 | 40.14 | 19 | 8 | 21.05 | 31.22 | 666.27/240.36 | 45.7 | 47.53 | 51.07 |
| Exp 4 | 40.37 | 17 | 8 | 21.21 | 30.67 | 720.37/273.21 | 45.25 | 47.92 | 58.94 |

Experiment Details

  1. Exp 1

    • Pretrain: LR = 1e-3, effective batch size = 256 (per-GPU batch 32 × 4 GPUs, grad_accum = 2)
    • Instruction FT: LR = 2e-4, effective batch size = 128 (per-GPU batch 16 × 4 GPUs, grad_accum = 2); the batch-size arithmetic is spelled out in the sketch after this list
    • Results far from paper/released checkpoint
  2. Exp 2 (settings following Issue #12)

    • Results still far from paper/released checkpoint
  3. Exp 3

    • Same as Exp 2, but effective batch size = 64 (per-GPU batch 16 × 4 GPUs, grad_accum = 1)
    • Still far from expected results.
  4. Exp 4

    • Same as Exp 1, but with deepspeed seed and dataset seed set to 2024
    • Still not close to paper/released checkpoint.
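
For clarity, the batch-size numbers above decompose as per-GPU batch × number of GPUs × gradient-accumulation steps. Below is a minimal sketch of that arithmetic; the decomposition reflects how I launched my own runs and is not taken from the TokenPacker repo.

```python
# Sketch of the effective-batch-size arithmetic used in the experiments above.
# effective_batch = per_device_batch * num_gpus * gradient_accumulation_steps

def effective_batch(per_device: int, num_gpus: int, grad_accum: int) -> int:
    """Global batch size seen by the optimizer per update step."""
    return per_device * num_gpus * grad_accum

configs = {
    "Exp 1 pretrain":       (32, 4, 2),  # -> 256
    "Exp 1 instruction FT": (16, 4, 2),  # -> 128
    "Exp 3 instruction FT": (16, 4, 1),  # -> 64
}

for name, (per_device, gpus, accum) in configs.items():
    print(f"{name}: {effective_batch(per_device, gpus, accum)}")
```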

Questions

  1. Could you clarify:
    • The exact learning rate schedule and batch size settings used in pretraining/finetuning?
    • Whether there are other important hyperparameters (e.g., warmup steps, optimizer settings, gradient clipping) not mentioned in the paper but necessary to reproduce results?
  2. Could you also provide the pretraining dataset JSON and instruction-tuning dataset JSON?
    • I noticed that in the instruction-tuning trainer_state.json of sunshine-lwt/TokenPacker-HD-7b-9patch-144token, the global step is 11,627. With a batch size of 128, that implies about 11,627 × 128 = 1,488,256 samples seen, whereas the Mini-Gemini instruction-tuning dataset contains 1,511,341 samples, so roughly 23k samples appear to be missing (see the sketch after this list).
    • Could you provide the exact JSON datasets used, so that the reproduction is faithful?
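
For reference, the discrepancy mentioned in question 2 comes from the following sanity check, written here as a minimal sketch. The dataset JSON filename is a placeholder of mine, and I assume a single epoch at an effective batch size of 128.

```python
import json

EFFECTIVE_BATCH = 128  # assumed instruction-tuning batch size

# trainer_state.json from the released checkpoint (standard HuggingFace Trainer artifact).
with open("TokenPacker-HD-7b-9patch-144token/trainer_state.json") as f:
    global_step = json.load(f)["global_step"]  # 11,627 in the released checkpoint

# Placeholder name for the Mini-Gemini instruction-tuning annotation file.
with open("minigemini_instruction_tuning.json") as f:
    dataset_size = len(json.load(f))  # 1,511,341 samples in my copy

implied_samples = global_step * EFFECTIVE_BATCH  # 11,627 * 128 = 1,488,256
print(f"samples implied by trainer_state: {implied_samples:,}")
print(f"samples in dataset JSON:          {dataset_size:,}")
print(f"difference:                       {dataset_size - implied_samples:,}")  # ~23k
```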
