bug: Tasks with thousands of related nodes get stuck in RUNNING when Prefect rejects events for exceeding PREFECT_SERVER_EVENTS_MAXIMUM_RELATED_RESOURCES (default 500) #9068

@fatih-acar

Description

Component

API Server / GraphQL

Infrahub version

1.9 (develop)

Current Behavior

Prefect events carry a list of related resources (one entry per related node passed to a flow / emitted from a flow run). Prefect's server enforces an upper bound on this list through PREFECT_SERVER_EVENTS_MAXIMUM_RELATED_RESOURCES, which defaults to 500.

When an Infrahub flow runs over thousands of related nodes (e.g. a computed-attribute pipeline touching every instance of a kind), the terminal event emitted for the flow run exceeds this cap. The Prefect server raises a validation error while persisting the event, the flow run never receives a terminal state transition, and the task is stuck in RUNNING indefinitely from Infrahub's perspective.

The failure is caused purely by the size of the related-resources list on the emitted event: the underlying work has already finished successfully.
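To make the failure condition concrete, here is a minimal sketch in plain Python that models the size check described above. It does not call Prefect. The label values (`infrahub.node.<id>`, `infrahub.related.node`) and the helper names `build_related` and `would_be_rejected` are illustrative assumptions, and the comparison assumes the server rejects a list only when it is strictly longer than the limit.

```python
# Default of PREFECT_SERVER_EVENTS_MAXIMUM_RELATED_RESOURCES, as stated in this issue.
DEFAULT_MAX_RELATED = 500


def build_related(node_ids):
    """Build one related-resource entry per node (label values are hypothetical)."""
    return [
        {
            "prefect.resource.id": f"infrahub.node.{node_id}",
            "prefect.resource.role": "infrahub.related.node",
        }
        for node_id in node_ids
    ]


def would_be_rejected(related, limit=DEFAULT_MAX_RELATED):
    """Assumed rule: an event is rejected when its related list is longer than the limit."""
    return len(related) > limit


# A flow touching 3000 nodes produces a list six times larger than the default cap.
related = build_related(range(3000))
print(len(related), would_be_rejected(related))
```

A check like this could run just before the event is emitted, so that an oversized list is logged or reduced instead of silently costing the flow run its terminal state.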

Expected Behavior

Tasks with arbitrarily large related-node sets must reach a terminal state (COMPLETED / FAILED) reliably. Options to consider:

  1. Cap or chunk related resources at emit time. Truncate the related-resources list to a safe size before the event is sent, or split into multiple events.
  2. Avoid attaching every related node to flow-run events. Move the per-node identity off events (e.g. into task logs or a dedicated relationship in the Infrahub graph) and only keep aggregate / summary related resources on the Prefect event.
  3. Raise the Prefect limit. Document and ship a higher PREFECT_SERVER_EVENTS_MAXIMUM_RELATED_RESOURCES value with the Prefect server config. This only delays the failure, so it should be paired with (1) or (2).
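Options (1) and (2) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Infrahub's or Prefect's implementation: the function names, the `infrahub.node.kind` and `infrahub.node.count` labels, and the `infrahub.related.summary` role are all hypothetical, and the code only reshapes the list without emitting any event.

```python
from collections import Counter
from typing import Iterator

Resource = dict[str, str]
DEFAULT_LIMIT = 500  # default cap cited in this issue


def cap_related(related: list[Resource], limit: int = DEFAULT_LIMIT) -> tuple[list[Resource], int]:
    """Option 1 (truncate): keep the first `limit` entries and report how many were dropped."""
    kept = related[:limit]
    return kept, len(related) - len(kept)


def chunk_related(related: list[Resource], limit: int = DEFAULT_LIMIT) -> Iterator[list[Resource]]:
    """Option 1 (split): yield batches of at most `limit` entries, one batch per event."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(related), limit):
        yield related[start:start + limit]


def summarize_related(related: list[Resource], kind_label: str = "infrahub.node.kind") -> list[Resource]:
    """Option 2: replace per-node entries with one aggregate entry per kind."""
    counts = Counter(r.get(kind_label, "unknown") for r in related)
    return [
        {
            "prefect.resource.id": f"infrahub.kind.{kind}",  # hypothetical id scheme
            "prefect.resource.role": "infrahub.related.summary",  # hypothetical role
            "infrahub.node.count": str(count),
        }
        for kind, count in sorted(counts.items())
    ]


# Example input: 1200 nodes of a single (made-up) kind.
related = [
    {"prefect.resource.id": f"infrahub.node.{i}", "infrahub.node.kind": "TestDevice"}
    for i in range(1200)
]
kept, dropped = cap_related(related)
batches = list(chunk_related(related))
summary = summarize_related(related)
```

Truncation is the simplest change but loses per-node identity on the event, so the dropped count should be recorded somewhere visible. Chunking preserves every node at the cost of several events per flow run. The summary approach keeps the event small regardless of node count, which is why it does not depend on the limit at all.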

Steps to Reproduce

  1. Load a schema with a kind that has more than 500 instances.
  2. Trigger a workflow that operates over all instances and reports them as related nodes on the resulting task (e.g. a Python transform computed attribute, see IFC-2449: Performance improvements for transform based computed attributes #9025).
  3. Observe in the UI / infrahubctl task list: the task remains in RUNNING.
  4. Inspect Prefect server logs: validation error rejecting the event because related resources exceed PREFECT_SERVER_EVENTS_MAXIMUM_RELATED_RESOURCES (500). No terminal state is recorded for the flow run.

Labels

group/backend: Issue related to the backend (API Server, Git Agent)
state/need-triage: This issue needs to be triaged
type/bug: Something isn't working as expected
