Description
Trying to fine-tune Mini-InternVL 2B on two RTX 5090 32GB GPUs, I hit the error below. I have tried all sorts of fixes and none of them resolved it:
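For context while triaging: the log below first warns that `TORCH_CUDA_ARCH_LIST` is unset, and the failing nvcc command then contains malformed `-gencode=arch=compute_1.` flags. A minimal sketch of the first thing worth trying (an assumption, not a confirmed fix for this setup): pin the arch list to the RTX 5090's compute capability 12.0 before launching, as the warning suggests. The launch command below is a placeholder for the existing one.

```shell
# Hedged sketch: pin the CUDA arch list so DeepSpeed's JIT build only
# targets the visible GPUs. "12.0" assumes the RTX 5090 (Blackwell,
# sm_120) and a CUDA toolkit new enough to know that architecture.
export TORCH_CUDA_ARCH_LIST="12.0"

# Then launch exactly as before, e.g.:
# torchrun --nproc_per_node=2 internvl/train/internvl_chat_finetune.py ...

echo "TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}"
```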
[INFO|trainer.py:571] 2025-12-28 15:58:29,158 >> Using auto half precision backend
[2025-12-28 15:58:29,342] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.0, git-hash=unknown, git-branch=unknown
[2025-12-28 15:58:29,342] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
trainable params: 15,728,640 || all params: 1,904,875,520 || trainable%: 0.8257043483870274
[2025-12-28 15:58:30,787] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[rank0]:W1228 15:58:31.023978 5278 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank0]:W1228 15:58:31.023978 5278 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1018" -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /root/miniconda3/envs/internvl/include/python3.9 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_120,code=compute_120 -gencode=arch=compute_120,code=sm_120 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_1.,code=sm_1. -gencode=arch=compute_1.,code=compute_1. -std=c++17 -c /root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: [code=1] multi_tensor_adam.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1018" -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /root/miniconda3/envs/internvl/include/python3.9 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_120,code=compute_120 -gencode=arch=compute_120,code=sm_120 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_1.,code=sm_1. -gencode=arch=compute_1.,code=compute_1. -std=c++17 -c /root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_1.'
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1018" -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /root/miniconda3/envs/internvl/include/python3.9 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2595, in _run_ninja_build
[rank0]: subprocess.run(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/subprocess.py", line 528, in run
[rank0]: raise CalledProcessError(retcode, process.args,
[rank0]: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/autodl-tmp/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1072, in <module>
[rank0]: main()
[rank0]: File "/root/autodl-tmp/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1057, in main
[rank0]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/trainer.py", line 1539, in train
[rank0]: return inner_training_loop(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/trainer.py", line 1690, in _inner_training_loop
[rank0]: model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/accelerate/accelerator.py", line 1318, in prepare
[rank0]: result = self._prepare_deepspeed(*args)
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/accelerate/accelerator.py", line 1815, in _prepare_deepspeed
[rank0]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 315, in __init__
[rank0]: self._configure_optimizer(optimizer, model_parameters)
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1284, in _configure_optimizer
[rank0]: basic_optimizer = self._configure_basic_optimizer(model_parameters)
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1361, in _configure_basic_optimizer
[rank0]: optimizer = FusedAdam(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
[rank0]: fused_adam_cuda = FusedAdamBuilder().load()
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
[rank0]: return self.jit_load(verbose)
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
[rank0]: op_module = load(name=self.name,
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1681, in load
[rank0]: return _jit_compile(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2138, in _jit_compile
[rank0]: _write_ninja_file_and_build_library(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2290, in _write_ninja_file_and_build_library
[rank0]: _run_ninja_build(
[rank0]: File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2612, in _run_ninja_build
[rank0]: raise RuntimeError(message) from e
[rank0]: RuntimeError: Error building extension 'fused_adam'
Time to load fused_adam op: 23.132505178451538 seconds
[rank0]:[W1228 15:58:55.682284098 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1228 15:58:56.810345 5226 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 5279 closing signal SIGTERM
E1228 15:58:57.128611 5226 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 5278) of binary: /root/miniconda3/envs/internvl/bin/python3.9
Traceback (most recent call last):
File "/root/miniconda3/envs/internvl/bin/torchrun", line 7, in <module>
sys.exit(main())
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
internvl/train/internvl_chat_finetune.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-12-28_15:58:56
host : autodl-container-f5f44baa55-aa8a850b
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5278)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
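Since the crash happens while JIT-compiling DeepSpeed's `fused_adam` extension, one workaround worth noting (a sketch, not verified on this exact setup) is to sidestep the build entirely: DeepSpeed's optimizer config accepts a `torch_adam` flag that makes it use PyTorch's own AdamW instead of the fused CUDA op. The file name and `lr` value below are placeholders; the relevant setting would be merged into the DeepSpeed config actually passed to `internvl_chat_finetune.py`.

```shell
# Sketch only: an optimizer section with "torch_adam": true, which
# tells DeepSpeed to use torch's AdamW rather than JIT-building the
# fused_adam CUDA extension. "ds_config.json" and the lr are
# placeholders to be merged into the real config.
cat > ds_config.json <<'EOF'
{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-5,
      "torch_adam": true
    }
  }
}
EOF
grep torch_adam ds_config.json
```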