Jobs aren’t being killed on Leonardo #951

@cathalobrien

What happened?

Jobs hit an error on Leonardo (e.g. a network timeout). Anemoi stops running, but the Slurm job keeps running for 24 hours or until the Slurm time limit is hit.

Perhaps Anemoi is not exiting with the right exit code in some cases?
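
To make the exit-code hypothesis concrete, here is a minimal sketch of the behaviour one would expect from a training entry point: any unhandled error ends up as a non-zero exit code that the scheduler can see. This is not Anemoi's actual code. `run_with_exit_code` and the `entry_point` argument are hypothetical names used only for illustration, and the sketch assumes the failure surfaces as a Python exception in the main thread, which has not been confirmed for this bug.

```python
import traceback
from typing import Callable


def run_with_exit_code(entry_point: Callable[[], None]) -> int:
    """Run a (hypothetical) entry point and map its outcome to an exit code."""
    try:
        entry_point()
    except KeyboardInterrupt:
        # Conventional shell exit code for an interrupt (128 + SIGINT).
        return 130
    except SystemExit as exc:
        # Preserve an explicit exit code if the code called sys.exit(n).
        if exc.code is None:
            return 0
        return exc.code if isinstance(exc.code, int) else 1
    except Exception:
        # Any other failure (e.g. a network timeout) is logged to stderr
        # and reported as a generic non-zero exit code.
        traceback.print_exc()
        return 1
    return 0
```

A script would then end with `sys.exit(run_with_exit_code(main))`, where `main` stands for whatever function starts training. If the tasks are launched with `srun`, its `--kill-on-bad-exit` option may also be worth checking, since a non-zero exit from one task does not necessarily end the whole job step by default.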

What are the steps to reproduce the bug?

TBD

Version

main

Platform (OS and architecture)

leonardo

Relevant log output

Accompanying data

No response

Organisation

ecmwf

Projects

Status

Next Up
