```python
layer.output_adapters = BottleneckLayer("output_adapter", is_layer_hooked=True)
ln_2_get_fn = lambda: multigetattr(layer, model.adapter_interface.layer_ln_2, None)
layer_output_proj.register_forward_hook(partial(hook_fn, layer.output_adapters, ln_2_get_fn))
```
With the default distributed settings of the Hugging Face Trainer, this code causes `layer.output_adapters` on cuda:n to always point to the `layer.output_adapters` on cuda:0 during multi-GPU training, even though the model itself is properly distributed across the GPUs. I suspect `partial` is the cause, so I tried saving variables like `layer.xxx` and `layer` in the hook's context so that it could run correctly on multiple GPUs.
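To illustrate what I mean, here is a minimal, GPU-free sketch of the suspected mechanism (all class and variable names here are hypothetical stand-ins, not the library's actual API): `functools.partial` binds the adapter *object* that exists at registration time, so if the layer is later replicated and its adapter submodule replaced by a per-device copy, the hook carried over to the replica still references the original adapter.

```python
from functools import partial
import copy

class Adapter:
    """Stand-in for an adapter module pinned to one device."""
    def __init__(self, device):
        self.device = device

class Layer:
    """Stand-in for a transformer layer with a forward hook list."""
    def __init__(self):
        self.output_adapters = Adapter("cuda:0")
        self._hooks = []
    def register_forward_hook(self, fn):
        self._hooks.append(fn)

def hook_fn(adapter, module, output):
    # Stand-in hook: uses whichever adapter object was bound into it.
    return adapter

layer = Layer()
# The partial binds the adapter object that exists *right now* (on cuda:0):
layer.register_forward_hook(partial(hook_fn, layer.output_adapters))

# Simulate DataParallel-style replication: the replica gets its own
# adapter copy on another device, but the hook is carried over unchanged.
replica = copy.copy(layer)
replica.output_adapters = Adapter("cuda:1")

bound = replica._hooks[0].args[0]
print(bound is layer.output_adapters)    # True: still the cuda:0 adapter
print(bound is replica.output_adapters)  # False: the replica's adapter is never used
```

This matches what I observe during debugging: the replicated layer's inputs live on cuda:1, but the adapter object reached through the hook is the one created on cuda:0.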
While debugging, variables like `residual` and `hidden_states` are both shown to be on cuda:1, but `layer` is shown to be on cuda:0. I printed the addresses of the `layer` variable on the two GPUs: the address of `layer` on cuda:1 is the same as on cuda:0. Since my GPU can't handle models like Qwen, and it's not easy to provide data for my own model, could you please test whether this problem occurs in multi-GPU training? Thank you! I followed the adapters-for-any-transformer guide.