rocm-smi crashes EasyBuild if no AMD GPU is found #5154
Description
On jsc-zen3, rocm-smi was implicitly installed by a system update. Since no AMD GPUs are present, every test build failed with:
EasyBuild crashed! Please consider reporting a bug, this should not happen...
Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/easybuild/easybuild-framework/easybuild/main.py", line 862, in <module>
    main_with_hooks()
  File "/easybuild/easybuild-framework/easybuild/main.py", line 847, in main_with_hooks
    exit_code: EasyBuildExit = main(args=args, prepared_cfg_data=(init_session_state, eb_go, cfg_settings))
  File "/easybuild/easybuild-framework/easybuild/main.py", line 798, in main
    is_successful = process_eb_args(orig_paths, eb_go, cfg_settings, modtool, testing, init_session_state,
  File "/easybuild/easybuild-framework/easybuild/main.py", line 632, in process_eb_args
    test_report_msg = overall_test_report(ecs_with_res, len(paths), overall_success, success_msg, init_session_state)
  File "/easybuild/easybuild-framework/easybuild/tools/testing.py", line 459, in overall_test_report
    txt = post_pr_test_report(pr_nrs, GITHUB_EASYCONFIGS_REPO, test_report, msg, init_session_state,
  File "/easybuild/easybuild-framework/easybuild/tools/testing.py", line 372, in post_pr_test_report
    gpu_info = get_gpu_info(init_session_state['environment'])
  File "/easybuild/easybuild-framework/easybuild/tools/systemtools.py", line 745, in get_gpu_info
    amd_driver = res.output.strip().split('\n')[1].split(',')[1]
IndexError: list index out of range
This failure occurs because rocm-smi exits with status 0 even when it fails:
[reuter1@jsczen3l1 ~]$ rocm-smi
cat: /sys/module/amdgpu/initstate: No such file or directory
ERROR:root:Driver not initialized (amdgpu not found in modules)
[reuter1@jsczen3l1 ~]$ echo $?
0
See also the rocm-smi sources.
The error message that rocm-smi prints instead does not survive the strip() and split() parsing we do on the output: the line or comma-separated field we index is missing, hence the IndexError.
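To illustrate, here is a small reproduction that applies the expression from systemtools.py line 745 to the error text shown in the shell session. Whether EasyBuild captures stderr together with stdout is an assumption on my side, so the empty-output case is tried as well. The name parse_driver is only a stand-in for this sketch.

```python
# Error text copied from the rocm-smi session in this report.
# Assumption: stderr ends up in the captured output; the empty string
# covers the case where only (empty) stdout is captured.
failed_output = (
    "cat: /sys/module/amdgpu/initstate: No such file or directory\n"
    "ERROR:root:Driver not initialized (amdgpu not found in modules)\n"
)


def parse_driver(output):
    # Same expression as in the traceback (systemtools.py, line 745).
    return output.strip().split('\n')[1].split(',')[1]


for sample in (failed_output, ""):
    try:
        parse_driver(sample)
    except IndexError as err:
        print("IndexError:", err)
```

In the first case the second line contains no comma, so split(',') yields a single element; in the second case there is no second line at all. Both end in the same IndexError.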
A simple fix would be to wrap that call in a try/except, or to extend the existing try/except so it covers that whole block, so that we bail out as soon as we detect unusable output. Any further parsing would produce invalid data anyway (see also here).
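A minimal sketch of the first option could look like the following. The helper name parse_amd_driver_version and the well-formed sample string are hypothetical and only serve the illustration; this is not the actual get_gpu_info code, and the real rocm-smi output layout may differ.

```python
def parse_amd_driver_version(output):
    """Return the driver version parsed from the output, or None if unusable.

    Hypothetical helper: it keeps the indexing from systemtools.py line 745
    but treats a missing line or field as "no usable AMD GPU information".
    """
    try:
        return output.strip().split('\n')[1].split(',')[1]
    except IndexError:
        # Output does not have the expected two-line, comma-separated shape.
        return None


# Error text from this report: parsing gives up instead of crashing.
broken = "ERROR:root:Driver not initialized (amdgpu not found in modules)"
print(parse_amd_driver_version(broken))

# Made-up well-formed sample (assumed layout, not real rocm-smi output).
made_up = "device,Driver version\ncard0,6.2.4"
print(parse_amd_driver_version(made_up))
```

The caller can then skip the AMD entry in the GPU info whenever None comes back, rather than reporting data derived from an error message.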