Unable to reproduce benchmark results for TokenPacker-HD (7B, Scale=2, Patch=9) #25

Description

@worapob841

Hi, thank you for your great work on TokenPacker!

I’m trying to reproduce the TokenPacker-HD (7B, scale factor 2, patch number 9) experiments, but I’m not getting results close to the paper or the released checkpoint.


Hardware Setup

  • 4 × H100 GPUs

Results Comparison

  • Row 1: Results reported in the paper
  • Row 2: Results from the released checkpoint
  • Row 3+: My experiments under different settings
| Method | TextVQA | OCRB | DocVQA | MMB | MMMU | MME | VQAv2 | VizWiz | POPE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reported in paper | 68.0 | 452 | 60.2 | 67.4 | 35.4 | 1489/338 | 81.2 | 54.7 | 88.2 |
| Released checkpoint | 67.92 | 452 | 27 | 67.35 | 35.89 | 1489.02/337.5 | 81.17 | 54.63 | 88.15 |
| Exp 1 | 41.29 | 17 | 9 | 21.13 | 31.44 | 675.46/283.93 | 67.5 | 48.12 | 56.70 |
| Exp 2 | 36.53 | 14 | 8 | 20.79 | 28.89 | 653.94/248.57 | 67.12 | 48.12 | 55.6 |
| Exp 3 | 40.14 | 19 | 8 | 21.05 | 31.22 | 666.27/240.36 | 45.7 | 47.53 | 51.07 |
| Exp 4 | 40.37 | 17 | 8 | 21.21 | 30.67 | 720.37/273.21 | 45.25 | 47.92 | 58.94 |

Experiment Details

  1. Exp 1

    • Pretrain: LR = 1e-3, effective batch size = 256 (per-GPU batch 32 × 4 GPUs, grad_accum = 2)
    • Instruction FT: LR = 2e-4, effective batch size = 128 (per-GPU batch 16 × 4 GPUs, grad_accum = 2); the batch-size arithmetic is spelled out in the sketch after this list
    • Results far from paper/released checkpoint
  2. Exp 2 (settings following Issue #12)

    • Results still far from paper/released checkpoint
  3. Exp 3

    • Same as Exp 2, but effective batch size = 64 (per-GPU batch 16 × 4 GPUs, grad_accum = 1)
    • Still far from expected results.
  4. Exp 4

    • Same as Exp 1, but with deepspeed seed and dataset seed set to 2024
    • Still not close to paper/released checkpoint.
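
For clarity, the batch-size numbers above decompose as per-GPU batch × number of GPUs × gradient-accumulation steps. Below is a minimal sketch of that arithmetic; the decomposition reflects how I launched my own runs and is not taken from the TokenPacker repo.

```python
# Sketch of the effective-batch-size arithmetic used in the experiments above.
# effective_batch = per_device_batch * num_gpus * gradient_accumulation_steps

def effective_batch(per_device: int, num_gpus: int, grad_accum: int) -> int:
    """Global batch size seen by the optimizer per update step."""
    return per_device * num_gpus * grad_accum

configs = {
    "Exp 1 pretrain":       (32, 4, 2),  # -> 256
    "Exp 1 instruction FT": (16, 4, 2),  # -> 128
    "Exp 3 instruction FT": (16, 4, 1),  # -> 64
}

for name, (per_device, gpus, accum) in configs.items():
    print(f"{name}: {effective_batch(per_device, gpus, accum)}")
```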

Questions

  1. Could you clarify:
    • The exact learning rate schedule and batch size settings used in pretraining/finetuning?
    • Whether there are other important hyperparameters (e.g., warmup steps, optimizer settings, gradient clipping) not mentioned in the paper but necessary to reproduce results?
  2. Could you also provide the pretraining dataset JSON and instruction-tuning dataset JSON?
    • I noticed that in the instruction-tuning trainer_state.json of sunshine-lwt/TokenPacker-HD-7b-9patch-144token, the global step is 11,627. With a batch size of 128, that implies about 11,627 × 128 = 1,488,256 samples seen, whereas the Mini-Gemini instruction-tuning dataset contains 1,511,341 samples, so roughly 23k samples appear to be missing (see the sketch after this list).
    • Could you provide the exact JSON datasets used, so that the reproduction is faithful?
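
For reference, the discrepancy mentioned in question 2 comes from the following sanity check, written here as a minimal sketch. The dataset JSON filename is a placeholder of mine, and I assume a single epoch at an effective batch size of 128.

```python
import json

EFFECTIVE_BATCH = 128  # assumed instruction-tuning batch size

# trainer_state.json from the released checkpoint (standard HuggingFace Trainer artifact).
with open("TokenPacker-HD-7b-9patch-144token/trainer_state.json") as f:
    global_step = json.load(f)["global_step"]  # 11,627 in the released checkpoint

# Placeholder name for the Mini-Gemini instruction-tuning annotation file.
with open("minigemini_instruction_tuning.json") as f:
    dataset_size = len(json.load(f))  # 1,511,341 samples in my copy

implied_samples = global_step * EFFECTIVE_BATCH  # 11,627 * 128 = 1,488,256
print(f"samples implied by trainer_state: {implied_samples:,}")
print(f"samples in dataset JSON:          {dataset_size:,}")
print(f"difference:                       {dataset_size - implied_samples:,}")  # ~23k
```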
