Memory leak #57

Hi @retsuh-bqw ,
I’m currently pretraining the Lam at stage1 and have run into a persistent memory leak: RAM usage increases steadily during training and eventually leads to an OOM error.

I’d like to ask:
What `shuffle_buffer_size` do you recommend for 8 A100 machines during stage1 pretraining? Did you encounter similar memory issues during your experiments?

<img width="1014" height="336" alt="Image" src="https://github.com/user-attachments/assets/042d07c3-ca9d-41b2-82b2-6781332d26b3" />
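
For context on why I'm asking about `shuffle_buffer_size`: a shuffle buffer that is still filling can look like a leak, because RAM grows until the buffer is full and only then plateaus. Below is a rough back-of-the-envelope sketch of that upper bound. It is not taken from the repo; the helper name `shuffle_buffer_bytes`, the clip shape, the buffer size, and the one-buffer-per-process assumption are all hypothetical placeholders, and it ignores decoding and prefetch overhead.

```python
def shuffle_buffer_bytes(buffer_size: int, sample_bytes: int, num_processes: int = 1) -> int:
    """Rough upper bound on RAM held by full shuffle buffers.

    Assumes each process keeps its own buffer of `buffer_size` decoded
    samples (an assumption, not something confirmed for this codebase).
    """
    return buffer_size * sample_bytes * num_processes


# Hypothetical sample: a 16-frame clip of 224x224 RGB uint8 images.
sample_bytes = 16 * 224 * 224 * 3

# Hypothetical setup: buffer of 10,000 samples, one process per GPU on 8 GPUs.
total = shuffle_buffer_bytes(10_000, sample_bytes, num_processes=8)
print(f"estimated shuffle buffer footprint: {total / 2**30:.1f} GiB")
```

If the measured RAM keeps climbing well past an estimate like this, the growth is probably not just the buffer filling up.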
