feat(monitoring): add smartctl exporter for disk health monitoring #3772

Open
Rico Lin (ricolin) wants to merge 6 commits into main from feat/smartctl-exporter

Member

@ricolin Rico Lin (ricolin) commented Mar 26, 2026

Summary

Add a new smartctl_exporter Ansible role that deploys the Prometheus
smartctl-exporter as a DaemonSet on bare-metal nodes. This enables
monitoring of disk SMART attributes including Media_Wearout_Indicator,
temperature, reallocated sectors, and overall SMART health status.

Components

  • roles/smartctl_exporter/: DaemonSet deployment (privileged, bare-metal only)
  • PodMonitor: Added to kube_prometheus_stack vars for Prometheus scraping
  • Alert rules in smartctl.libsonnet:
    • SmartctlDiskUnhealthy (P2) — SMART health check failed
    • SmartctlDiskWearoutCritical (P3) — wear > 90%
    • SmartctlDiskWearoutWarning (P4) — wear > 75%
    • SmartctlDiskTemperatureHigh (P4) — temperature > 60°C
    • SmartctlDiskReallocatedSectors (P4) — bad sectors detected
    • SmartctlExporterDown (P5) — exporter unreachable
  • Wired into playbooks/monitoring.yml
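
As a rough illustration of how two of the thresholds listed above might look once rendered to a Prometheus rule file, the sketch below uses plain rule YAML rather than the jsonnet this PR uses. The metric smartctl_device_percentage_used appears later in this PR; the group name, the `for` durations, the way the priority is carried in a `severity` label, and the `job` selector are assumptions made for illustration only.

```yaml
# Sketch only: the PR defines these rules in smartctl.libsonnet, not in raw YAML.
groups:
  - name: smartctl              # hypothetical group name
    rules:
      - alert: SmartctlDiskWearoutCritical
        # Fires when the reported wear exceeds 90%.
        expr: smartctl_device_percentage_used > 90
        for: 15m                # assumed duration
        labels:
          severity: P3          # assumed label name for the priority
      - alert: SmartctlExporterDown
        # `up` is recorded by Prometheus for every scrape target;
        # the job name used here is an assumption.
        expr: up{job="smartctl-exporter"} == 0
        for: 15m                # assumed duration
        labels:
          severity: P5
```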

@ricolin
Member Author

Copilot review this PR

Contributor

Copilot AI left a comment

Pull request overview

Adds Prometheus smartctl-exporter deployment on bare-metal nodes to expose disk SMART metrics, and wires it into scraping/alerting so disk health issues can be monitored and alerted on.

Changes:

  • Introduces a new smartctl_exporter Ansible role that deploys a privileged DaemonSet in the monitoring namespace.
  • Adds a PodMonitor entry to enable Prometheus scraping of the exporter.
  • Adds Smartctl alert rules via a new smartctl.libsonnet mixin and imports it into the alert mixins set.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| roles/smartctl_exporter/tasks/main.yml | Defines the DaemonSet manifest for smartctl-exporter. |
| roles/smartctl_exporter/meta/main.yml | Adds role metadata and dependency on defaults. |
| roles/smartctl_exporter/defaults/main.yml | Introduces configurable device exclusion regex. |
| roles/smartctl_exporter/README.md | Adds a minimal role README stub. |
| roles/kube_prometheus_stack/vars/main.yml | Adds a PodMonitor configuration to scrape smartctl-exporter pods. |
| roles/kube_prometheus_stack/files/jsonnet/smartctl.libsonnet | Adds alert rules for SMART health, wear, temperature, and exporter availability. |
| roles/kube_prometheus_stack/files/jsonnet/mixins.libsonnet | Imports the new smartctl alert mixin so it gets rendered with the ruleset. |
| roles/defaults/vars/main.yml | Adds the smartctl-exporter image reference. |
| releasenotes/notes/smartctl-exporter-76c029503b875604.yaml | Documents the new disk health monitoring feature. |
| playbooks/monitoring.yml | Wires the new smartctl_exporter role into the monitoring playbook. |

Comment thread roles/kube_prometheus_stack/files/jsonnet/smartctl.libsonnet
Comment thread roles/kube_prometheus_stack/files/jsonnet/smartctl.libsonnet Outdated
Comment thread roles/kube_prometheus_stack/files/jsonnet/smartctl.libsonnet
Comment thread roles/smartctl_exporter/tasks/main.yml Outdated
Comment thread roles/defaults/vars/main.yml Outdated
Comment thread roles/kube_prometheus_stack/files/jsonnet/smartctl.libsonnet
Contributor

Copilot AI commented Mar 26, 2026

Copilot review this PR

Reviewed the PR and addressed the following issues (commit 0273559):

Alert rules (smartctl.libsonnet)

  • Fixed summary annotation format: removed template variables ({{ $labels.instance }}) so that summaries follow the <Component>: <Impact statement> convention (e.g., "Disk: SMART health check failed")
  • Added runbook_url to all P2/P3 alerts (SmartctlDiskUnhealthy, SmartctlDiskWearoutCritical) as required by project guidelines
  • Removed empty {} selector from smartctl_device_percentage_used{} expressions
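
To make the annotation conventions above concrete, here is a hedged sketch of a single rule in rendered YAML form. The summary string is taken from this comment. The metric smartctl_device_smart_status (assumed to be 1 when the SMART health check passes), the `severity` label, and the runbook URL are assumptions, not values confirmed by this PR.

```yaml
- alert: SmartctlDiskUnhealthy
  # Assumed metric: the exporter reports 1 when the SMART health check passes.
  expr: smartctl_device_smart_status == 0
  labels:
    severity: P2                # assumed label name for the priority
  annotations:
    # <Component>: <Impact statement>, with no template variables.
    summary: "Disk: SMART health check failed"
    # Placeholder URL; the real runbook location is not shown in this PR.
    runbook_url: https://example.com/runbooks/SmartctlDiskUnhealthy
```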

Image pinning (roles/defaults/vars/main.yml)

  • Added sha256 digest pin for prometheus_smartctl_exporter: v0.14.0@sha256:cfe22c36d7d2fac48ebf619707305acb65eb0fb670656eb80f356e606d782bc1

Inhibition rules (roles/kube_prometheus_stack/vars/main.yml)

  • Added inhibition rule so SmartctlDiskWearoutCritical suppresses SmartctlDiskWearoutWarning for the same instance, device, and namespace
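
A minimal sketch of what such an inhibition rule could look like in Alertmanager configuration syntax follows. The alert names and the `equal` labels come from the bullet above. Where the fragment is nested inside the kube_prometheus_stack values is not shown in this conversation, so it appears here as a bare Alertmanager fragment.

```yaml
inhibit_rules:
  - source_matchers:
      # While this alert is firing...
      - 'alertname = "SmartctlDiskWearoutCritical"'
    target_matchers:
      # ...notifications for this alert are muted...
      - 'alertname = "SmartctlDiskWearoutWarning"'
    # ...but only when both alerts carry identical values for these labels,
    # i.e. they refer to the same disk on the same node.
    equal:
      - instance
      - device
      - namespace
```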

Documentation (doc/source/admin/monitoring.rst)

  • Added full documentation for all 6 smartctl alerts in alphabetical order: SmartctlDiskReallocatedSectors, SmartctlDiskTemperatureHigh, SmartctlDiskUnhealthy, SmartctlDiskWearoutCritical, SmartctlDiskWearoutWarning, SmartctlExporterDown — each with description, likely root causes, and diagnostic/remediation steps

Vale vocabulary (.github/styles/config/vocabularies/Base/accept.txt)

  • Added Smartctl to the vocabulary so Vale linting passes cleanly on all new documentation

Add a new smartctl_exporter Ansible role that deploys the prometheus
smartctl-exporter as a DaemonSet on bare-metal nodes. This enables
monitoring of disk SMART attributes including Media_Wearout_Indicator,
temperature, reallocated sectors, and overall SMART health status.

Components added:
- roles/smartctl_exporter: DaemonSet deployment (privileged, bare-metal only)
- PodMonitor for Prometheus scraping in kube_prometheus_stack
- Alert rules in smartctl.libsonnet (SmartctlDiskUnhealthy P2,
  SmartctlDiskWearoutCritical P3, SmartctlDiskWearoutWarning P4,
  SmartctlDiskTemperatureHigh P4, SmartctlDiskReallocatedSectors P4,
  SmartctlExporterDown P5)
- Inhibition rule: SmartctlDiskWearoutCritical suppresses SmartctlDiskWearoutWarning
- Documentation for all alerts in monitoring.rst
- Promtool tests for all alerts
- Wired into monitoring playbook

Change-Id: Id4166bba3e6efaa69dd239d466f72ed302fb4339
Co-authored-by: Copilot <[email protected]>
Signed-off-by: ricolin <[email protected]>
@ricolin
Member Author

Copilot review PR again

…essions exclusive

- Fix prometheusAlerts+:: -> prometheusAlerts+: and groups: -> groups+: to
  follow codebase jsonnet conventions (visible, append-style fields)
- Make SmartctlDiskWearoutWarning expression mutually exclusive with Critical:
  use "> 75 and <= 90" so Warning never fires when Critical is already firing
- Add test case confirming SmartctlDiskWearoutWarning does not fire at 95% wear

Co-authored-by: ricolin <[email protected]>
Agent-Logs-Url: https://github.com/vexxhost/atmosphere/sessions/71117b97-ff07-43c7-85cb-ce64ad41ca83
Contributor

Copilot AI commented Mar 26, 2026

Copilot review PR again

All previous review comments are now addressed (commit 3d2c858). Here's a summary of what's in the PR:

Fixed in this review pass:

  • prometheusAlerts+:: → prometheusAlerts+: and groups: → groups+: to match the codebase jsonnet convention (visible, append-style fields like goldpinger.libsonnet)
  • SmartctlDiskWearoutWarning expression is now mutually exclusive with Critical: smartctl_device_percentage_used > 75 and smartctl_device_percentage_used <= 90 — Warning only fires in the 75–90% range; Critical fires above 90%
  • Added a test verifying SmartctlDiskWearoutWarning does not fire at 95% wear
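
For reference, the mutually exclusive Warning expression described above would look roughly like this as a rendered rule. The expression is quoted from this comment; the `severity` label is an assumption about how the P4 priority is encoded.

```yaml
- alert: SmartctlDiskWearoutWarning
  # The upper bound means the series drops out of the result once wear
  # exceeds 90%, which is where SmartctlDiskWearoutCritical starts to match.
  expr: smartctl_device_percentage_used > 75 and smartctl_device_percentage_used <= 90
  labels:
    severity: P4                # assumed label name for the priority
```

In PromQL the comparison operators bind more tightly than `and`, so no extra parentheses are needed around the two filters.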

Already addressed in previous pass:

  • sha256 image digest pinned
  • Summary format fixed (<Component>: <Impact>, no template vars)
  • runbook_url added for P2/P3 alerts
  • Alertmanager inhibition rule for wearout alert pair
  • Smartctl added to Vale vocabulary
  • Full monitoring.rst documentation for all 6 alerts (alphabetical)
  • Jinja2 args in tasks/main.yml fixed (no more {% if %} control blocks)
  • Promtool tests for all 6 alerts (positive and negative cases)
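
A negative promtool case of the kind mentioned above could be sketched as follows. The rule file name and the label values on the input series are hypothetical. Only the 95% wear value and the alert name come from this conversation.

```yaml
rule_files:
  - smartctl_rules.yml          # hypothetical name for the rendered rules
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Constant 95% wear for 30 minutes (31 samples, one per minute).
      - series: 'smartctl_device_percentage_used{instance="node1", device="nvme0"}'
        values: '95+0x30'
    alert_rule_test:
      - eval_time: 30m
        alertname: SmartctlDiskWearoutWarning
        # An empty list asserts that the alert is not firing at eval_time.
        exp_alerts: []
```

A file like this is run with `promtool test rules <file>`.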

@mnaser
Member

Copilot please rewrite this code to use the same pattern of using helm charts, specifically this one:

https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-smartctl-exporter

Use chart-vendor: update .charts.yml and fetch the chart that way.

Use the upstream prometheus-smartctl-exporter Helm chart (v0.16.0) from
prometheus-community instead of a manually crafted DaemonSet manifest.

- Add prometheus-smartctl-exporter v0.16.0 to .charts.yml and fetch via
  chart-vendor
- Rewrite tasks/main.yml to use kubernetes.core.helm
- Replace defaults with helm release/namespace/values vars
- Add vars/main.yml with image, bare-metal affinity, and ServiceMonitor
- Remove manual PodMonitor from kube_prometheus_stack vars (chart deploys
  ServiceMonitor)
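
For illustration, a .charts.yml entry for this chart might look like the sketch below. The chart name and version come from this commit message, and the repository URL is the public prometheus-community Helm repository. The exact key layout expected by chart-vendor is an assumption and should be checked against the existing entries in the file.

```yaml
# Sketch of a chart-vendor manifest entry; key layout is assumed.
charts:
  - name: prometheus-smartctl-exporter
    version: 0.16.0
    repository:
      url: https://prometheus-community.github.io/helm-charts
```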

Agent-Logs-Url: https://github.com/vexxhost/atmosphere/sessions/5c53ee5f-4e89-4b1e-8e43-90766273d3c9

Co-authored-by: mnaser <[email protected]>
Contributor

Copilot AI commented Apr 15, 2026

Copilot please rewrite this code to use the same pattern of using helm charts, specifically this one:

https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-smartctl-expor...

Rewritten in commit cde72a1. The role now uses the prometheus-smartctl-exporter Helm chart (v0.16.0) from prometheus-community:

  • Added prometheus-smartctl-exporter v0.16.0 to .charts.yml and fetched via chart-vendor
  • tasks/main.yml uses kubernetes.core.helm (same pattern as prometheus_pushgateway)
  • defaults/main.yml has helm release/namespace/chart vars with a smartctl_exporter_helm_values: {} override
  • vars/main.yml sets _smartctl_exporter_helm_values with image reference, bare-metal-only affinity, and ServiceMonitor with kube-prometheus-stack labels
  • Removed the manual PodMonitor from kube_prometheus_stack/vars/main.yml — the chart's ServiceMonitor handles scraping now
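
The task described above could be sketched roughly as follows. Only smartctl_exporter_helm_values and _smartctl_exporter_helm_values are named in this comment. The other variable names and the kubeconfig path are assumptions modelled on the description, not copied from the role.

```yaml
- name: Deploy Helm chart
  kubernetes.core.helm:
    name: "{{ smartctl_exporter_helm_release_name }}"                    # assumed variable
    chart_ref: "{{ smartctl_exporter_helm_chart_ref }}"                  # assumed variable
    release_namespace: "{{ smartctl_exporter_helm_release_namespace }}"  # assumed variable
    create_namespace: true
    kubeconfig: /etc/kubernetes/admin.conf                               # assumed path
    # Role-internal values first, with operator overrides merged on top.
    values: "{{ _smartctl_exporter_helm_values | combine(smartctl_exporter_helm_values, recursive=True) }}"
```

Merging with `combine(..., recursive=True)` lets an operator override a single nested key without restating the whole values tree.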

@mnaser
Member

Copilot please review the Zuul CI failure and make sure you resolve it.

The role was missing the vexxhost.kubernetes.upload_helm_chart meta
dependency that copies the chart to /usr/local/src on the target host.
Without it, the helm task fails with 'path not found'.

Agent-Logs-Url: https://github.com/vexxhost/atmosphere/sessions/5b6f6a27-c08e-4e8d-8457-9996851e273b

Co-authored-by: mnaser <[email protected]>
Contributor

Copilot AI commented Apr 21, 2026

Copilot please review the Zuul CI failure and make sure you resolve it.

Fixed in e0b5586. The Zuul atmosphere-molecule-aio-ovn / -openvswitch jobs failed with:

path "/usr/local/src/prometheus-smartctl-exporter" not found

The smartctl_exporter role was missing the vexxhost.kubernetes.upload_helm_chart dependency in meta/main.yml — the same pattern used by prometheus_pushgateway and local_path_provisioner to sync the chart to /usr/local/src/<chart-name> on the target host before kubernetes.core.helm runs.
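
A hedged sketch of the resulting meta/main.yml is shown below. The role name vexxhost.kubernetes.upload_helm_chart and the dependency on defaults come from this PR. The parameter names and the two smartctl_exporter_* variables passed to the role are assumptions and may differ from the actual role interface.

```yaml
# roles/smartctl_exporter/meta/main.yml (sketch)
dependencies:
  - role: defaults
  - role: vexxhost.kubernetes.upload_helm_chart
    vars:
      # Assumed parameter names: local chart source and destination on the
      # target host (under /usr/local/src), where kubernetes.core.helm reads it.
      upload_helm_chart_src: "{{ smartctl_exporter_helm_chart_path }}"
      upload_helm_chart_dest: "{{ smartctl_exporter_helm_chart_ref }}"
```

Because role dependencies run before the role's own tasks, the chart is already on the host by the time the Helm task executes.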

5 participants