rocm-smi crashes EasyBuild if no AMD GPU is found #5154

@Thyre

On jsc-zen3, rocm-smi was implicitly installed by a system update. Since no AMD GPUs are present, every test build failed with:

EasyBuild crashed! Please consider reporting a bug, this should not happen...

Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/easybuild/easybuild-framework/easybuild/main.py", line 862, in <module>
    main_with_hooks()
  File "/easybuild/easybuild-framework/easybuild/main.py", line 847, in main_with_hooks
    exit_code: EasyBuildExit = main(args=args, prepared_cfg_data=(init_session_state, eb_go, cfg_settings))
  File "/easybuild/easybuild-framework/easybuild/main.py", line 798, in main
    is_successful = process_eb_args(orig_paths, eb_go, cfg_settings, modtool, testing, init_session_state,
  File "/easybuild/easybuild-framework/easybuild/main.py", line 632, in process_eb_args
    test_report_msg = overall_test_report(ecs_with_res, len(paths), overall_success, success_msg, init_session_state)
  File "/easybuild/easybuild-framework/easybuild/tools/testing.py", line 459, in overall_test_report
    txt = post_pr_test_report(pr_nrs, GITHUB_EASYCONFIGS_REPO, test_report, msg, init_session_state,
  File "/easybuild/easybuild-framework/easybuild/tools/testing.py", line 372, in post_pr_test_report
    gpu_info = get_gpu_info(init_session_state['environment'])
  File "/easybuild/easybuild-framework/easybuild/tools/systemtools.py", line 745, in get_gpu_info
    amd_driver = res.output.strip().split('\n')[1].split(',')[1]
IndexError: list index out of range

This failure is caused by rocm-smi not reporting a non-zero exit code when failing:

[reuter1@jsczen3l1 ~]$ rocm-smi
cat: /sys/module/amdgpu/initstate: No such file or directory
ERROR:root:Driver not initialized (amdgpu not found in modules)
[reuter1@jsczen3l1 ~]$ echo $?
0

See also the rocm-smi sources.

The error message cannot be parsed by the strip() and split() calls we apply to the output, which is what raises the IndexError.
A simple fix could be to add a try/except around that call, or to extend the try/except around the whole block, so that we bail out as soon as we detect non-functional output. Any further parsing would produce invalid data anyway (see also here).
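A minimal sketch of the try/except idea, not a patch against the framework. The helper name `parse_amd_driver_version` is hypothetical, and the CSV layout (a header line followed by a data line, with the driver version in the second field) is only inferred from the indexing in the traceback. The "working" sample output is made up for illustration.

```python
from typing import Optional


def parse_amd_driver_version(output: str) -> Optional[str]:
    """Return the driver version from CSV-style output, or None if unparsable.

    Hypothetical helper: it uses the same indexing as the failing line in
    systemtools.py (second line, second comma-separated field).
    """
    lines = output.strip().split('\n')
    try:
        version = lines[1].split(',')[1].strip()
    except IndexError:
        # Output does not look like the expected CSV (e.g. an error message),
        # so bail out instead of crashing.
        return None
    return version or None


# Error text from the report above: two lines, but no comma on the second one.
failing_output = (
    "cat: /sys/module/amdgpu/initstate: No such file or directory\n"
    "ERROR:root:Driver not initialized (amdgpu not found in modules)\n"
)
# Made-up CSV output, for illustration only.
working_output = "device,Driver version\ncard0,6.2.4\n"

print(parse_amd_driver_version(failing_output))
print(parse_amd_driver_version(working_output))
print(parse_amd_driver_version(""))
```

Returning None rather than raising lets the caller skip the AMD entry in the GPU info. Because rocm-smi exits with 0 here, checking the exit code alone would not catch this case.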
