nvcc fatal : Unsupported gpu architecture 'compute_1.' #1245

@Scalcium

Description

When trying to fine-tune Mini-InternVL 2B on two RTX 5090 32GB GPUs, I get the error below. I have tried all kinds of approaches without success:

[INFO|trainer.py:571] 2025-12-28 15:58:29,158 >> Using auto half precision backend
[2025-12-28 15:58:29,342] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.0, git-hash=unknown, git-branch=unknown
[2025-12-28 15:58:29,342] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
trainable params: 15,728,640 || all params: 1,904,875,520 || trainable%: 0.8257043483870274
[2025-12-28 15:58:30,787] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[rank0]:W1228 15:58:31.023978 5278 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank0]:W1228 15:58:31.023978 5278 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1018" -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /root/miniconda3/envs/internvl/include/python3.9 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_120,code=compute_120 -gencode=arch=compute_120,code=sm_120 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_1.,code=sm_1. -gencode=arch=compute_1.,code=compute_1. -std=c++17 -c /root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: [code=1] multi_tensor_adam.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1018" -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /root/miniconda3/envs/internvl/include/python3.9 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_120,code=compute_120 -gencode=arch=compute_120,code=sm_120 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_1.,code=sm_1. -gencode=arch=compute_1.,code=compute_1. -std=c++17 -c /root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_1.'
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1018" -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /root/miniconda3/envs/internvl/include/python3.9 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2595, in _run_ninja_build
[rank0]: subprocess.run(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/subprocess.py", line 528, in run
[rank0]: raise CalledProcessError(retcode, process.args,
[rank0]: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]: File "/root/autodl-tmp/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1072, in <module>
[rank0]: main()
[rank0]: File "/root/autodl-tmp/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1057, in main
[rank0]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/trainer.py", line 1539, in train
[rank0]: return inner_training_loop(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/trainer.py", line 1690, in _inner_training_loop
[rank0]: model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/accelerate/accelerator.py", line 1318, in prepare
[rank0]: result = self._prepare_deepspeed(*args)
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/accelerate/accelerator.py", line 1815, in _prepare_deepspeed
[rank0]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 315, in __init__
[rank0]: self._configure_optimizer(optimizer, model_parameters)
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1284, in _configure_optimizer
[rank0]: basic_optimizer = self._configure_basic_optimizer(model_parameters)
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1361, in _configure_basic_optimizer
[rank0]: optimizer = FusedAdam(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
[rank0]: fused_adam_cuda = FusedAdamBuilder().load()
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
[rank0]: return self.jit_load(verbose)
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
[rank0]: op_module = load(name=self.name,
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1681, in load
[rank0]: return _jit_compile(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2138, in _jit_compile
[rank0]: _write_ninja_file_and_build_library(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2290, in _write_ninja_file_and_build_library
[rank0]: _run_ninja_build(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2612, in _run_ninja_build
[rank0]: raise RuntimeError(message) from e
[rank0]: RuntimeError: Error building extension 'fused_adam'
Time to load fused_adam op: 23.132505178451538 seconds
[rank0]:[W1228 15:58:55.682284098 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1228 15:58:56.810345 5226 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 5279 closing signal SIGTERM
E1228 15:58:57.128611 5226 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 5278) of binary: /root/miniconda3/envs/internvl/bin/python3.9
Traceback (most recent call last):
File "/root/miniconda3/envs/internvl/bin/torchrun", line 7, in <module>
sys.exit(main())
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

internvl/train/internvl_chat_finetune.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-12-28_15:58:56
host : autodl-container-f5f44baa55-aa8a850b
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5278)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
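Looking at the failing nvcc command, the valid `-gencode=arch=compute_120,code=sm_120` flags (the RTX 5090's compute capability, 12.0) are followed by mangled `-gencode=arch=compute_1.,code=sm_1.` flags, which suggests that somewhere in the build path the two-digit major version "12.0" is being truncated to "1." when the cross-compile arch list is generated. Since the log's own warning says `TORCH_CUDA_ARCH_LIST` is unset and auto-detection is being used, one possible workaround (a hedged sketch, not verified on this setup) is to pin the arch list explicitly before DeepSpeed JIT-compiles `fused_adam`, so the broken auto-generated flags are bypassed:

```python
import os

# Assumption: "12.0" is the correct compute capability for the RTX 5090
# (Blackwell), matching the valid compute_120/sm_120 flags in the log.
# This must be set before torch/DeepSpeed builds any CUDA extensions,
# i.e. at the very top of the training script or in the launch shell.
os.environ["TORCH_CUDA_ARCH_LIST"] = "12.0"
```

Equivalently, `export TORCH_CUDA_ARCH_LIST="12.0"` before invoking `torchrun`. If the truncation comes from DeepSpeed's own op builder rather than the environment, upgrading DeepSpeed beyond 0.16.0 may also help, since later releases may handle two-digit compute capabilities; alternatively, avoiding the JIT-built `FusedAdam` (e.g. a plain `torch.optim.AdamW` in the DeepSpeed config) sidesteps the compile entirely.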
