@raelga
Collaborator

@raelga raelga commented Jan 28, 2026

AROSLSRE-91

https://redhat-external.slack.com/archives/C075PHEFZKQ/p1769620814361439

This pull request updates Prometheus alerting rules and their tests to add cluster-awareness to all relevant alerts. The changes ensure that alerts are correctly grouped and fired on a per-cluster basis, improving accuracy and scalability in multi-cluster environments. Several PromQL expressions are updated to use group by (cluster) and to join service health checks with cluster membership. Corresponding test files are also updated to reflect these changes, ensuring correct alert firing and label expectations.

Prometheus Alert Rule Updates for Cluster Awareness:

  • Updated alert expressions in Prometheus rules for backend, frontend, Prometheus, and Arobit forwarder jobs to use group by (cluster) and/or unless on(cluster) to ensure alerts are evaluated and fired per cluster. This affects both the main rules and the generated Bicep templates. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
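As a rough illustration of the `group by (cluster)` / `unless on(cluster)` pattern described above, a per-cluster "service down" rule could look like the sketch below. The group name, alert name, `example-frontend` job, `for` duration, and severity are hypothetical and not taken from this PR; using `kube-state-metrics` as the cluster-membership signal is an assumption borrowed from the MiseEnvoyScrapeDown description.

```yaml
groups:
  - name: example-cluster-aware          # hypothetical group name
    rules:
      - alert: ExampleFrontendDown       # hypothetical alert name
        expr: |
          # Left side: one series per cluster that reports kube-state-metrics
          # (assumed here to act as the cluster-membership signal).
          group by (cluster) (up{job="kube-state-metrics"})
            # Drop every cluster that has at least one healthy target
            # for the (hypothetical) example-frontend job.
            unless on (cluster)
          group by (cluster) (up{job="example-frontend"} == 1)
        for: 5m
        labels:
          severity: warning
```

Because both sides are reduced to the `cluster` label, the result is one series per affected cluster, so the alert fires once per cluster and carries the `cluster` label even when the service's own `up` series is missing entirely.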

Test Updates for Cluster Labeling and Cluster-Aware Logic:

  • Modified test cases for all affected rules to include the cluster label in input series and expected alert labels, ensuring that alerts are validated for correct cluster-specific behavior. [1] [2] [3] [4] [5] [6] [7] [8] [9]
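A promtool unit test for such a rule might be sketched as follows. This assumes a hypothetical rule file `example-rules.yaml` defining an alert `ExampleFrontendDown` with `for: 5m` and `severity: warning`, whose expression yields one series per cluster; none of these names come from the PR.

```yaml
rule_files:
  - example-rules.yaml                   # hypothetical rule file

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Cluster membership signal: present and healthy for the whole test.
      - series: 'up{job="kube-state-metrics", cluster="example-svc-1"}'
        values: '1+0x10'
      # The monitored job is down in the same cluster.
      - series: 'up{job="example-frontend", cluster="example-svc-1"}'
        values: '0+0x10'
    alert_rule_test:
      - eval_time: 6m                    # after the assumed 5m "for" window
        alertname: ExampleFrontendDown
        exp_alerts:
          - exp_labels:
              severity: warning
              cluster: example-svc-1     # the cluster label must be expected
```

The point of the sketch is the `cluster` label appearing in both `input_series` and `exp_labels`: without it in the inputs the cluster-aware expression has nothing to group on, and without it in the expectations the fired alert's labels would not match.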

Improvements to Service Discovery and Alert Coverage:

  • Enhanced the "MiseEnvoyScrapeDown" alert logic to explicitly detect clusters missing the required metrics, using group by (cluster) (up{job="kube-state-metrics", cluster=~".*-svc-\d+"}) unless on(cluster) .... Test coverage is expanded for various scenarios (metric missing, metric present, metric goes down). [1] [2]

These changes collectively make alerting more robust and accurate in multi-cluster Kubernetes environments, and the updated tests ensure the new logic is thoroughly validated.

@openshift-ci openshift-ci bot requested review from geoberle and janboll January 28, 2026 20:40
@raelga changed the title from "Fix/monitoring alert issues" to "fix: Enforce cluster information on monitoring alerts" Jan 28, 2026
Contributor

@stevekuznetsov stevekuznetsov left a comment

/lgtm
/approve

@sclarkso
Collaborator

/test e2e-parallel

@openshift-ci openshift-ci bot removed the lgtm label Jan 28, 2026
@raelga
Collaborator Author

raelga commented Jan 28, 2026

@sclarkso There was a legit error parsing the -svc-\\d+ expression. It has been updated to fix the issue.

@sclarkso
Collaborator

/test e2e-parallel

@stevekuznetsov
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 29, 2026
@openshift-ci

openshift-ci bot commented Jan 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: raelga, stevekuznetsov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 08fa06e into main Jan 29, 2026
20 checks passed
@openshift-merge-bot openshift-merge-bot bot deleted the fix/monitoring-alert-issues branch January 29, 2026 02:33

4 participants