feat(monitoring): add smartctl exporter for disk health monitoring #3772
Rico Lin (ricolin) wants to merge 6 commits into main
38276de to c341d6e
c341d6e to a42e9fa
Copilot review this PR
Pull request overview
Adds Prometheus smartctl-exporter deployment on bare-metal nodes to expose disk SMART metrics, and wires it into scraping/alerting so disk health issues can be monitored and alerted on.
Changes:
- Introduces a new `smartctl_exporter` Ansible role that deploys a privileged DaemonSet in the `monitoring` namespace.
- Adds a `PodMonitor` entry to enable Prometheus scraping of the exporter.
- Adds Smartctl alert rules via a new `smartctl.libsonnet` mixin and imports it into the alert mixins set.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| roles/smartctl_exporter/tasks/main.yml | Defines the DaemonSet manifest for smartctl-exporter. |
| roles/smartctl_exporter/meta/main.yml | Adds role metadata and dependency on defaults. |
| roles/smartctl_exporter/defaults/main.yml | Introduces configurable device exclusion regex. |
| roles/smartctl_exporter/README.md | Adds a minimal role README stub. |
| roles/kube_prometheus_stack/vars/main.yml | Adds a PodMonitor configuration to scrape smartctl-exporter pods. |
| roles/kube_prometheus_stack/files/jsonnet/smartctl.libsonnet | Adds alert rules for SMART health, wear, temperature, and exporter availability. |
| roles/kube_prometheus_stack/files/jsonnet/mixins.libsonnet | Imports the new smartctl alert mixin so it gets rendered with the ruleset. |
| roles/defaults/vars/main.yml | Adds the smartctl-exporter image reference. |
| releasenotes/notes/smartctl-exporter-76c029503b875604.yaml | Documents the new disk health monitoring feature. |
| playbooks/monitoring.yml | Wires the new smartctl_exporter role into the monitoring playbook. |
Reviewed the PR and addressed the following issues (commit Alert rules (
Image pinning (
Inhibition rules (
Documentation (
Vale vocabulary (
0273559 to 311c68d
Add a new smartctl_exporter Ansible role that deploys the prometheus smartctl-exporter as a DaemonSet on bare-metal nodes. This enables monitoring of disk SMART attributes including Media_Wearout_Indicator, temperature, reallocated sectors, and overall SMART health status.

Components added:
- roles/smartctl_exporter: DaemonSet deployment (privileged, bare-metal only)
- PodMonitor for Prometheus scraping in kube_prometheus_stack
- Alert rules in smartctl.libsonnet (SmartctlDiskUnhealthy P2, SmartctlDiskWearoutCritical P3, SmartctlDiskWearoutWarning P4, SmartctlDiskTemperatureHigh P4, SmartctlDiskReallocatedSectors P4, SmartctlExporterDown P5)
- Inhibition rule: SmartctlDiskWearoutCritical suppresses SmartctlDiskWearoutWarning
- Documentation for all alerts in monitoring.rst
- Promtool tests for all alerts
- Wired into monitoring playbook

Change-Id: Id4166bba3e6efaa69dd239d466f72ed302fb4339
Co-authored-by: Copilot <[email protected]>
Signed-off-by: ricolin <[email protected]>
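For readers less familiar with inhibition, here is a minimal Python sketch of what the rule "SmartctlDiskWearoutCritical suppresses SmartctlDiskWearoutWarning" is meant to achieve. It is an illustration only, not the Alertmanager configuration from this PR: the function `apply_inhibition` and the label names `instance` and `device` are assumptions made for the example.

```python
def apply_inhibition(firing, source, target, equal):
    """Return the alerts that would still notify after one inhibition rule.

    firing: list of dicts, each with an "alertname" key plus label keys.
    A target alert is dropped when a source alert is firing with the
    same values for every label listed in `equal`.
    """
    # Label values (restricted to `equal`) of every firing source alert.
    source_keys = {
        tuple(alert.get(label) for label in equal)
        for alert in firing
        if alert["alertname"] == source
    }
    return [
        alert
        for alert in firing
        if not (
            alert["alertname"] == target
            and tuple(alert.get(label) for label in equal) in source_keys
        )
    ]


# Hypothetical firing set: sda is critical and warning, sdb is warning only.
firing = [
    {"alertname": "SmartctlDiskWearoutCritical", "instance": "node1", "device": "sda"},
    {"alertname": "SmartctlDiskWearoutWarning", "instance": "node1", "device": "sda"},
    {"alertname": "SmartctlDiskWearoutWarning", "instance": "node1", "device": "sdb"},
]

remaining = apply_inhibition(
    firing,
    source="SmartctlDiskWearoutCritical",
    target="SmartctlDiskWearoutWarning",
    equal=("instance", "device"),  # assumed label names, not taken from the PR
)
```

In this sketch the warning for `sda` is dropped because the critical alert covers the same disk, while the warning for `sdb` still notifies.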
311c68d to aadd7ed
Copilot review PR again
…essions exclusive

- Fix prometheusAlerts+:: -> prometheusAlerts+: and groups: -> groups+: to follow codebase jsonnet conventions (visible, append-style fields)
- Make SmartctlDiskWearoutWarning expression mutually exclusive with Critical: use "> 75 and <= 90" so Warning never fires when Critical is already firing
- Add test case confirming SmartctlDiskWearoutWarning does not fire at 95% wear

Co-authored-by: ricolin <[email protected]>
Agent-Logs-Url: https://github.com/vexxhost/atmosphere/sessions/71117b97-ff07-43c7-85cb-ce64ad41ca83
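The banding described in this commit can be sketched in a few lines of Python. This is a hedged illustration of the "> 75 and <= 90" logic, not the PromQL from `smartctl.libsonnet`; the function name `wearout_alerts` is hypothetical, and it assumes the input is already expressed as percent worn (0 to 100).

```python
def wearout_alerts(wear_percent):
    """Return the wearout alert names that would fire for one disk.

    Assumes `wear_percent` is percent of rated wear already consumed.
    """
    alerts = []
    # Critical band: strictly above 90%.
    if wear_percent > 90:
        alerts.append("SmartctlDiskWearoutCritical")
    # Warning band: above 75% but capped at 90%, so it never overlaps Critical.
    if wear_percent > 75 and wear_percent <= 90:
        alerts.append("SmartctlDiskWearoutWarning")
    return alerts
```

Because the two bands do not overlap, a disk at 95% wear yields only the critical alert, which mirrors the test case added in this commit.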
All previous review comments are now addressed (commit Fixed in this review pass:
Already addressed in previous pass:
Copilot please rewrite this code to use the same pattern of using Helm charts, specifically this one: https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-smartctl-exporter. Use chart-vendor: update .charts.yml and fetch the chart that way.
Use the upstream prometheus-smartctl-exporter Helm chart (v0.16.0) from prometheus-community instead of a manually crafted DaemonSet manifest.

- Add prometheus-smartctl-exporter v0.16.0 to .charts.yml and fetch via chart-vendor
- Rewrite tasks/main.yml to use kubernetes.core.helm
- Replace defaults with helm release/namespace/values vars
- Add vars/main.yml with image, bare-metal affinity, and ServiceMonitor
- Remove manual PodMonitor from kube_prometheus_stack vars (chart deploys ServiceMonitor)

Agent-Logs-Url: https://github.com/vexxhost/atmosphere/sessions/5c53ee5f-4e89-4b1e-8e43-90766273d3c9
Co-authored-by: mnaser <[email protected]>
Rewritten in commit
Copilot please review the Zuul CI failure and make sure you resolve it.
The role was missing the vexxhost.kubernetes.upload_helm_chart meta dependency that copies the chart to /usr/local/src on the target host. Without it, the helm task fails with 'path not found'.

Agent-Logs-Url: https://github.com/vexxhost/atmosphere/sessions/5b6f6a27-c08e-4e8d-8457-9996851e273b
Co-authored-by: mnaser <[email protected]>
Fixed in The |
Signed-off-by: Yaguang Tang <[email protected]>
Summary
Add a new `smartctl_exporter` Ansible role that deploys the Prometheus smartctl-exporter as a DaemonSet on bare-metal nodes. This enables monitoring of disk SMART attributes including Media_Wearout_Indicator, temperature, reallocated sectors, and overall SMART health status.
Components
- `roles/smartctl_exporter/`: DaemonSet deployment (privileged, bare-metal only)
- PodMonitor in `kube_prometheus_stack` vars for Prometheus scraping
- Alert rules in `smartctl.libsonnet`:
  - `SmartctlDiskUnhealthy` (P2): SMART health check failed
  - `SmartctlDiskWearoutCritical` (P3): wear > 90%
  - `SmartctlDiskWearoutWarning` (P4): wear > 75%
  - `SmartctlDiskTemperatureHigh` (P4): temperature > 60°C
  - `SmartctlDiskReallocatedSectors` (P4): bad sectors detected
  - `SmartctlExporterDown` (P5): exporter unreachable
- Wired into `playbooks/monitoring.yml`