Use all gpus available for training by Abecid · Pull Request #293 · Lightning-AI/lit-llama

Abecid · 2023-05-19T08:16:12Z

Use all gpus available for training

finetune/adapter.py

awaelchli

Thanks for the suggestion @Abecid. In general the changes look good, but they won't make much difference because we keep the global batch size constant so that results are easy to reproduce. In fact, if training fits on a single GPU and torch.cuda.device_count() selects more devices, it may actually be slower because of the reduced per-device batch size and the communication overhead.

I vote for keeping the values as they are and instead advise users to scale both devices and batch size to their setup to maximize efficiency. We could add this info to the how-to guides. Let me know what you think.
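To make the point about a constant global batch size concrete, here is a minimal sketch of the arithmetic. The helper name `per_device_plan` and the values 64 (global batch size) and 4 (micro batch size) are assumptions for illustration, not values taken from this PR.

```python
def per_device_plan(global_batch_size: int, devices: int, micro_batch_size: int):
    """Split a fixed global batch across devices.

    Returns (per_device_batch_size, gradient_accumulation_steps).
    Hypothetical helper for illustration only.
    """
    if devices < 1:
        raise ValueError("devices must be at least 1")
    if global_batch_size % devices != 0:
        raise ValueError("global batch size must be divisible by the device count")
    per_device = global_batch_size // devices
    if per_device % micro_batch_size != 0:
        raise ValueError("per-device batch size must be divisible by the micro batch size")
    # Each device accumulates gradients over this many micro batches per optimizer step.
    accumulation_steps = per_device // micro_batch_size
    return per_device, accumulation_steps


if __name__ == "__main__":
    # Assumed values: global batch size 64, micro batch size 4.
    for devices in (1, 2, 4, 8):
        print(devices, per_device_plan(64, devices, 4))
```

With the global batch held fixed, each added device leaves less work per device per optimizer step while adding gradient synchronization, which is why more GPUs do not automatically mean faster training unless the batch size is scaled as well.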

awaelchli · 2023-05-20T12:44:51Z

finetune/full.py

 eval_iters = 100
 log_interval = 100
-devices = 4
+devices = torch.cuda.device_count()


Suggested change

-devices = torch.cuda.device_count()
+devices = "auto"

awaelchli · 2023-05-20T12:45:54Z

finetune/lora.py

+    devices = torch.cuda.device_count()
+    fabric = L.Fabric(accelerator="cuda", devices=devices, precision="bf16-true")


Suggested change

-    devices = torch.cuda.device_count()
-    fabric = L.Fabric(accelerator="cuda", devices=devices, precision="bf16-true")
+    fabric = L.Fabric(accelerator="cuda", devices="auto", precision="bf16-true")

@Abecid Did multi-GPU training work with this script, and how many GPUs did you use? Is the loss convergence comparable to single-GPU training?

I ran this with two NVIDIA Tesla T4 cards with devices="auto" and got this error:

RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe. This may indicate a possible application crash on rank 0 or a network set up issue.
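The message says rank 0 may have crashed, so a reasonable first step is to rerun with more verbose distributed logging and read rank 0's output. The sketch below builds such an environment; `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are real NCCL and PyTorch environment variables, while the helper name `debug_env` is hypothetical. It is a diagnostic aid under those assumptions, not a fix for the error.

```python
import os


def debug_env(base_env=None):
    """Return a copy of the environment with verbose distributed logging enabled.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"                 # NCCL prints communicator setup details
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra checks and logs from torch.distributed
    return env


if __name__ == "__main__":
    env = debug_env()
    print(env["NCCL_DEBUG"], env["TORCH_DISTRIBUTED_DEBUG"])
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching the finetuning script, so the extra logging applies only to that run.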

Multi GPU Training

7c781d7

Abecid requested review from awaelchli, carmocca and lantiga as code owners May 19, 2023 08:16

carmocca reviewed May 19, 2023

awaelchli reviewed May 20, 2023

Borda marked this pull request as draft July 1, 2025 16:30

Abecid wants to merge 1 commit into Lightning-AI:main from AttentionX:multi_gpus

juliosanz May 23, 2023

4 participants
