
Cluster Observability Part 6: Hardware Health Dashboards and Alerts

Grafana dashboards and 19 Prometheus alert rules for ZFS, SMART, disk temperature, fan speed, and Garage health — with manufacturer-sourced thresholds.

Introduction

Part 5 got the data into Prometheus — ZFS pool state, SMART metrics from every drive, thermal readings from the RK3588 SoCs, and Garage disk stats. But raw metrics in a time-series database are not actionable on their own. Without dashboards, hardware health requires active querying to spot a problem. Without alerting rules, a degraded ZFS mirror or a drive running hot produces no notification at all.

This post closes that gap. Two dashboards — one for SMART health, one for ZFS — and 19 alert rules covering pool degradation, disk failures, NVMe wear and spare capacity, temperatures with manufacturer-sourced thresholds, fan failures, and Garage cluster health. Along the way there are a few surprises: the standard community SMART dashboard doesn't work at all with the matusnovak exporter, NVMe and SATA drives expose temperature under completely different metric names, and the ZFS dashboard loses two panels to an OpenZFS 2.1 kernel change.

There is one known limitation at the end — the SMART metrics show pod IPs instead of node hostnames in alerts. It's documented, understood, and deferred to a future cleanup post rather than solved here.

This post is part of the Observability sub-series.

Part 1: Prometheus and Node Exporter
Part 2: Grafana Dashboards
Part 3: Dashboards for Storage and Kubernetes
Part 4: Alerting with Prometheus and Alertmanager
Part 5: Hardware Health Collectors
Part 6: Hardware Health Dashboards and Alerts

🏠 This is part of the Homelab Journey series - building a production Kubernetes cluster from scratch.

This post assumes the hardware health collectors from Part 5 are running and scraping cleanly. It also assumes Alertmanager is configured and delivering email, which was set up in Part 4.

The SMART Dashboard

The natural starting point for a community dashboard is the Grafana library. For the official smartctl_exporter, dashboard ID 20204 is the standard choice. But the Bletchley cluster runs matusnovak/prometheus-smartctl — the image that actually builds for ARM64 — and that exporter uses smartprom_ as its metric prefix rather than smartctl_. Every panel in dashboard 20204 queries smartctl_* metrics. Importing it produces a dashboard full of empty panels.

The matusnovak repo itself ships a Grafana dashboard at grafana/grafana_dashboard.json. That's the right starting point. It's built against smartprom_ metrics and imports cleanly — Grafana prompts to map the DS_PROMETHEUS input to the existing Prometheus datasource and the panels load immediately.

The upstream dashboard has four panels; the extended version ends up with six. Here's what changed and why.

Filtering out the virtual devices

The first thing visible in the SMART Health panel is noise:

Grafana SMART Health table showing three sat drives on 10.244.0.154:9902, and scsi type devices with serial numbers beaf11, beaf21, beaf31 on other pod IPs, all showing health OK — The upstream dashboard with no filters applied. The scsi rows — serial numbers beaf11, beaf21, beaf31 — are virtual block devices from the TuringPi BMC and Longhorn, not physical drives.

The type column tells the story. Real drives have type="sat" (SATA) or type="nvme". The virtual devices have type="scsi" and serial numbers that are clearly synthetic. Adding type!="scsi" to the SMART Health panel query removes them:

Grafana SMART Health table showing only sat and nvme drives — two VK000960GWCFF SATA drives and four Kingston SNVS250G NVMe drives, all health OK — After adding type!="scsi" to the query. Two SATA drives on rock3 and four NVMe drives across all nodes — the real hardware.

The same type!="scsi" filter appears in every alert rule that touches smartprom_ metrics. The dashboard filter and the alert rules are consistent.

NVMe and SATA use different temperature metrics

The Temperature panel in the upstream dashboard queries smartprom_temperature_celsius_raw. After loading the dashboard, only two lines appeared — rock3's two SATA drives at around 21°C. The four NVMe drives were absent.

The reason: SATA and NVMe drives expose temperature through different SMART attributes, and the matusnovak exporter maps them to different metric names. On Bletchley, the SATA drives report under smartprom_temperature_celsius_raw and the NVMe drives report under smartprom_temperature. The upstream dashboard was built against SATA drives and only queries smartprom_temperature_celsius_raw — so the NVMe drives simply don't appear. Check which metrics your own drives produce in the Prometheus expression browser before assuming the same split applies. Adding a second query to the Temperature panel for the NVMe metric fixes it:

Grafana temperature timeseries showing six drive lines over 12 hours. Two green lines flat at 20-23°C at the bottom for SATA drives, four coloured lines for NVMe drives clustering between 45 and 59°C, with the mean, last, max and min values shown in a legend below — All six drives on one graph. The SATA drives barely move — 21°C regardless of workload. The NVMe drives idle around 47–53°C, with rock3 running warmest. The NVMe data only starts from when the second query was added — there's no historical data before that point.

The roughly 30°C gap between SATA and NVMe idle temperatures matters when setting alert thresholds: any single shared threshold leaves far less headroom for the NVMe drives than for the SATA drives.
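An alternative to maintaining two panel queries is to fold both metrics into one series name with a recording rule. The sketch below is not part of the Bletchley configuration: the group name and the disk:temperature_celsius series name are hypothetical, and it assumes the type label values shown above.

```yaml
# Sketch only: group and record names are hypothetical.
groups:
  - name: smart-temperature
    rules:
      # One series name for every physical drive, whichever metric the
      # exporter used. The type matchers keep virtual scsi devices out.
      - record: disk:temperature_celsius
        expr: >
          smartprom_temperature_celsius_raw{type="sat"}
          or
          smartprom_temperature{type="nvme"}
```

A panel or alert could then query disk:temperature_celsius once. As with the second panel query, the recorded series only has data from the moment the rule is loaded.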

Separating NVMe and SATA panels

The Error Metrics and Info Metrics panels in the upstream dashboard only show data for SATA drives. This is correct behaviour — reallocated sectors, pending sectors, and offline uncorrectable counts are SATA SMART attributes. NVMe drives use a different health model entirely, with metrics like available spare capacity and media error counts.

The solution is to split each panel by device type. The upstream Error Metrics panel becomes Error Metrics SATA, and a new Error Metrics NVMe panel is added alongside it with the NVMe-specific metrics. Same for Info Metrics. The final dashboard has six panels:

Full SMART Exporter Grafana dashboard showing six panels: SMART Health table with sat and nvme drives all OK, Error Metrics SATA and Error Metrics NVMe tables showing all zeros, Info Metrics SATA showing 21°C and 7.68 years uptime, Info Metrics NVMe showing 49-53°C temperatures and 13-14 weeks uptime, and the Temperature timeseries at the bottom — The finished dashboard. Six panels, NVMe and SATA separated throughout. The SATA drives show 7.68 years of power-on hours — nearly eight years of continuous operation. The NVMe drives show 13–14 weeks, having been installed with the cluster.

One detail in the Info Metrics SATA panel: 7.68 years. These VK000960GWCFF drives have been running continuously for nearly eight years. That context makes the SMART monitoring feel less precautionary and more like active management of hardware that has genuine history behind it.

One fix required for the error panels: the upstream dashboard had the "No value" field set to "Ignored", which caused columns where all values are zero to disappear entirely. Changing it to "As zeros" makes the zeros explicit — confirming that the metrics are being collected and reporting clean, not that the column is missing.

The extended dashboard JSON is stored in apps/monitoring/grafana/grafana_dashboard_smartctl.json. It is not auto-provisioned — Grafana loads it from its Longhorn-backed persistent volume, backed up like all other Grafana state. The file in git is a recovery reference if Grafana ever needs to be rebuilt from scratch.

The ZFS Dashboard

ZFS dashboard ID 7845 imports cleanly from the Grafana library. Its node template variable scopes the panels to a single instance, set here to rock3.vluwte.nl since ZFS metrics only exist there.

Most of the dashboard works well. The ARC section shows exactly what matters:

ZFS Grafana dashboard showing ARC Hit % panel at 100% for data and 99.9% for metadata, ARC Hits and Misses panel with data hits at 931 and misses at 0, ARC Size panel showing data at 2.74 GiB and metadata at 260 MiB, and below that ARC L2 panels all showing zero values, with a collapsed Pools section at the bottom labelled 2 panels — The ZFS dashboard after reorganisation. ARC hit rate is 100% for data — every read is served from memory. The L2ARC panels show zeros, which is correct: there's no L2ARC device on rock3. The Pools section is collapsed at the bottom.

The ARC L2 panels showing zeros is expected. L2ARC is a secondary cache on a separate fast device — rock3 doesn't have one, so the panels are accurately empty rather than broken.

The Pools section is a different story. Two panels — ZPOOL - Time and ZPOOL - Ops — show "No data":

Grafana dashboard Pools section showing two panels both with No data message: ZPOOL - Time on the left and ZPOOL - Ops on the right — OpenZFS 2.1 changed how pool I/O stats are exposed — the kstat paths these panels depend on are no longer available. Node Exporter v1.10.2 is not the problem.

These panels use node_zfs_zpool_rtime and node_zfs_zpool_reads. OpenZFS 2.1 changed how pool I/O statistics are exposed — those kstat paths are no longer available at /proc/spl/kstat/zfs/. Node Exporter v1.10.2 — well above the 1.4.0 version that added patches for this — is not the issue. The iostats kstat file exists, but the specific pool-level I/O paths these panels need are not there on this kernel version.

Rather than deleting the panels, the Pools section is moved to the bottom and collapsed. The panels still exist and are visible as "2 panels" in the collapsed header — documented rather than hidden. If a future Node Exporter update or OpenZFS version restores the metrics, the panels will start working again automatically.

Alert Rules

The alert rules live in prometheus-values.yaml under serverFiles."alerting_rules.yml" alongside the existing rules from Part 4. All 19 hardware rules are in a single hardware group — using multiple groups would work, but one group with inline comments is simpler to read and there are no evaluation interval requirements that would justify splitting.
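A rough sketch of where the group sits in the values file, assuming the rules are written as inline YAML under that key. Only the first hardware rule is shown, with its annotations left out for brevity.

```yaml
# prometheus-values.yaml (abridged sketch)
serverFiles:
  alerting_rules.yml:
    groups:
      # Groups carried over from Part 4 sit here, unchanged.
      - name: hardware
        rules:
          - alert: ZFSPoolDegraded
            expr: node_zfs_zpool_state{job="node-exporter", state="degraded"} == 1
            for: 5m
            labels:
              severity: critical
          # The remaining 18 hardware rules follow in the same list.
```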

Before applying with helm upgrade, the rules are validated:

yq '.serverFiles."alerting_rules.yml"' \
  ~/talos-cluster/bletchley/apps/monitoring/prometheus/prometheus-values.yaml \
  > /tmp/hardware-rules.yaml

promtool check rules /tmp/hardware-rules.yaml
# Checking /tmp/hardware-rules.yaml
#   SUCCESS: 37 rules found

The yq command extracts the alerting rules section — which already has the groups: wrapper — into a standalone file that promtool can validate. This catches PromQL syntax errors before they reach the cluster.
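promtool check rules only verifies syntax. promtool can also run unit tests against synthetic series, which exercises a rule's expression, for duration, and annotation templating without touching real hardware. A minimal sketch for DiskSMARTFailing, assuming the extracted file from above; the test file name, the instance value, and the drive label value are illustrative.

```yaml
# /tmp/hardware-rules-test.yaml (hypothetical file name)
rule_files:
  - /tmp/hardware-rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # A synthetic SATA drive whose overall SMART check fails throughout.
      - series: 'smartprom_smart_passed{instance="10.244.0.154:9902", type="sat", drive="/dev/sda"}'
        values: '0x10'
    alert_rule_test:
      # for: 5m has elapsed by minute 6, so the alert should be firing.
      - eval_time: 6m
        alertname: DiskSMARTFailing
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: "10.244.0.154:9902"
              type: sat
              drive: /dev/sda
            exp_annotations:
              summary: "SMART failure on 10.244.0.154:9902 drive /dev/sda"
              description: "SMART overall health check returned failing status."
```

Running promtool test rules /tmp/hardware-rules-test.yaml would report whether the alert fires with the expected labels and annotations. If the real exporter attaches more labels to the series, they would need to be added to both the input series and exp_labels.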

ZFS Pool State

- name: hardware
  rules:

    # ── ZFS Pool State ────────────────────────────────────────────────
    # node_zfs_zpool_state emits one series per state value per pool.
    # A state is active when its value == 1.
    # Degraded = one or more devices unavailable; the pool is still
    # serving data but redundancy is reduced.
    # On a two-disk mirror, degraded means one more failure loses data.
    # Faulted/suspended/unavail = pool not serving data.
    # Offline/removed = may be intentional but worth alerting on.

    - alert: ZFSPoolDegraded
      expr: node_zfs_zpool_state{job="node-exporter", state="degraded"} == 1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "ZFS pool degraded on {{ $labels.instance }}"
        description: "Pool {{ $labels.zpool }} is degraded — one disk has failed. Redundancy is gone. Replace the failed disk immediately."

The original plan had a single ZFSPoolNotOnline rule. That was wrong — a degraded mirror is still online (the pool keeps serving data) but redundancy is gone, and that's the most important state to know about immediately. The correct approach is one rule per state: six rules covering degraded, faulted, suspended, unavail, offline, and removed. The first four are critical; the last two are warning since they may be intentional.

The metric name is node_zfs_zpool_state (not node_zfs_pool_state, which appears in some documentation and older examples). Always verify against your own /metrics output. The pool label is zpool, not pool. The job="node-exporter" filter scopes naturally to rock3, since only rock3 has ZFS metrics, and gives clean hostname instance labels rather than the pod-IP-style labels that the kubernetes-service-endpoints job produces.
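The other five rules follow the same shape, changing only the state matcher, severity, and wording. A sketch of two of them, one critical and one warning; the alert names, for durations, and annotation text here are assumptions rather than the rules actually deployed.

```yaml
# Sketch: names, durations, and wording are assumptions.
- alert: ZFSPoolFaulted
  expr: node_zfs_zpool_state{job="node-exporter", state="faulted"} == 1
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "ZFS pool faulted on {{ $labels.instance }}"
    description: "Pool {{ $labels.zpool }} is faulted and is not serving data."

- alert: ZFSPoolOffline
  expr: node_zfs_zpool_state{job="node-exporter", state="offline"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ZFS pool offline on {{ $labels.instance }}"
    description: "Pool {{ $labels.zpool }} is offline. This may be intentional."
```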

SMART and NVMe

    # ── SMART — All Drives ────────────────────────────────────────────
    # smartprom_ metrics come from the matusnovak/prometheus-smartctl
    # DaemonSet (port 9902). Instance labels are pod IPs, not hostnames.
    # Virtual SCSI devices (TuringPi BMC / Longhorn) appear with
    # type="scsi" and must be excluded from all rules.

    - alert: DiskSMARTFailing
      expr: smartprom_smart_passed{type!="scsi"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "SMART failure on {{ $labels.instance }} drive {{ $labels.drive }}"
        description: "SMART overall health check returned failing status."

    # ── SMART — NVMe Specific ─────────────────────────────────────────
    # NVMe drives use different SMART attributes from SATA.
    # percentage_used and available_spare are NVMe-only metrics.
    # available_spare_threshold is manufacturer-set (Kingston: 10%).
    # Baseline: wear 20–31%, spare 100% on all four nodes.

    - alert: NVMeWearHigh
      expr: smartprom_percentage_used{type="nvme"} > 80
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "NVMe wear indicator high on {{ $labels.instance }}"
        description: "Drive {{ $labels.drive }} is at {{ $value }}% used life."

    - alert: NVMeSpareCapacityLow
      expr: smartprom_available_spare{type="nvme"} <= smartprom_available_spare_threshold{type="nvme"}
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: "NVMe spare capacity critical on {{ $labels.instance }}"
        description: "Drive {{ $labels.drive }} available spare {{ $value }}% at or below manufacturer threshold."

One caveat on DiskSMARTFailing: overall SMART status is a coarse signal. Many drives continue to report PASSED right up until they fail. The rule catches hard failures, but the attribute-level panels in the dashboard — reallocated sectors for SATA, media errors for NVMe — are worth watching alongside it. Those metrics can show deterioration long before the overall status changes.

NVMeSpareCapacityLow uses the manufacturer's own threshold directly rather than a hardcoded number. Kingston sets the danger threshold at 10%: when available spare drops to or below that, the drive itself considers the situation critical. Using smartprom_available_spare_threshold in the expression means the alert fires at exactly the threshold the manufacturer chose, regardless of what that number is for a given drive. Note that this expression relies on both metrics having identical label sets (apart from the metric name); if your exporter adds extra labels to one of them, you may need an on(...) clause to align the series.
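If the label sets do differ, the comparison can be restricted to the labels that identify a drive. A sketch of that variant, assuming instance and drive are enough to identify a drive uniquely on this exporter:

```yaml
# Sketch: only needed if the two metrics carry different label sets.
- alert: NVMeSpareCapacityLow
  expr: >
    smartprom_available_spare{type="nvme"}
    <= on (instance, drive)
    smartprom_available_spare_threshold{type="nvme"}
  for: 1h
  labels:
    severity: critical
```

With on(...), the labels named in the clause are the ones to rely on in the result, so it should list every label the annotations reference.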

Disk Temperature

Temperature alerting is where the two-metric-name problem shows up again:

    # ── SMART — Disk Temperature ───────────────────────────────────────
    # SATA and NVMe use different metric names for temperature:
    #   SATA: smartprom_temperature_celsius_raw (type="sat") — idle ~21°C
    #   NVMe: smartprom_temperature (type="nvme") — idle ~49°C
    # Warning threshold (65°C) is a conservative shared value for both.
    # Critical thresholds are manufacturer-sourced from
    # node_hwmon_temp_crit_celsius: SATA=70°C, NVMe=89°C.
    # DiskTemperatureAlarm fires immediately (for: 0m) when the NVMe
    # drive's own hardware alarm flag is set — the most authoritative
    # signal, no threshold calculation needed.

    - alert: DiskTemperatureHigh
      expr: >
        smartprom_temperature_celsius_raw{type="sat"} > 65
        or
        smartprom_temperature{type="nvme"} > 65
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Disk temperature high on {{ $labels.instance }}"
        description: "Drive {{ $labels.drive }} at {{ $value }}°C — threshold is 65°C."

    - alert: DiskTemperatureCritical
      expr: >
        smartprom_temperature_celsius_raw{type="sat"} > 70
        or
        smartprom_temperature{type="nvme"} > 89
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Disk temperature critical on {{ $labels.instance }}"
        description: "Drive {{ $labels.drive }} at {{ $value }}°C — at or above manufacturer critical threshold (SATA: 70°C, NVMe: 89°C)."

    - alert: DiskTemperatureAlarm
      expr: node_hwmon_temp_alarm{job="node-exporter", chip="nvme_nvme0"} == 1
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "NVMe temperature alarm on {{ $labels.instance }}"
        description: "NVMe drive hardware temperature alarm triggered — drive has exceeded its own critical threshold of 89.85°C."

The critical thresholds — SATA at 70°C, NVMe at 89°C — come from node_hwmon_temp_crit_celsius, read from the kernel's hwmon interface. These values are typically derived from device firmware, but may vary depending on drivers and hardware support. The SATA drives on rock3 idle at 21°C; NVMe drives idle around 49°C. Both are well below their warning thresholds at baseline.

DiskTemperatureAlarm has for: 0m — no delay. When the drive's own hardware alarm flag fires, the alert fires on the first evaluation where the condition is true, with no additional waiting period. This is different from the threshold-based rules: there is no value in waiting when the hardware itself is telling you there is a problem.
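The manufacturer-threshold idea behind NVMeSpareCapacityLow could in principle replace the hardcoded critical numbers as well. A hedged sketch, assuming node_hwmon_temp_celsius and node_hwmon_temp_crit_celsius share chip and sensor labels on these nodes; the alert name is hypothetical and this rule is not one of the 19.

```yaml
# Sketch only: compares each NVMe sensor with its own hwmon critical limit.
- alert: DiskTemperatureAboveHwmonCritical
  expr: >
    node_hwmon_temp_celsius{job="node-exporter", chip="nvme_nvme0"}
    >= on (instance, chip, sensor)
    node_hwmon_temp_crit_celsius{job="node-exporter", chip="nvme_nvme0"}
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "NVMe above its hwmon critical limit on {{ $labels.instance }}"
    description: "Sensor {{ $labels.sensor }} reads {{ $value }}°C."
```

Sensors that expose no critical limit simply have no right-hand series and never match, so they cannot fire this rule.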

Node Temperature and Fan

    # ── Node Temperature (SoC) ────────────────────────────────────────
    # RK3588 exposes 7 thermal zones (zone0–zone6): package, big cores
    # (0 and 2), little cores, center, GPU, NPU. temp0 and temp1 report
    # identical values per zone — chip=~"thermal_thermal_zone.*" covers
    # all zones while excluding nvme_nvme0 and target* SATA drivetemp
    # sensors, both of which are covered by smartprom rules above.
    # max by (instance) fires one alert per node on the hottest zone.
    # Idle baseline: 46–48°C. No hwmon critical threshold available for
    # RK3588 thermal zones — 75°C warning and 90°C critical are
    # conservative values based on RK3588 datasheet behaviour.

    - alert: NodeTemperatureHigh
      expr: >
        max by (instance) (
          node_hwmon_temp_celsius{
            job="node-exporter",
            chip=~"thermal_thermal_zone.*"
          }
        ) > 75
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node SoC temperature high on {{ $labels.instance }}"
        description: "Hottest thermal zone at {{ $value }}°C — idle baseline is 46–48°C."

    - alert: NodeTemperatureCritical
      expr: >
        max by (instance) (
          node_hwmon_temp_celsius{
            job="node-exporter",
            chip=~"thermal_thermal_zone.*"
          }
        ) > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node SoC temperature critical on {{ $labels.instance }}"
        description: "Hottest thermal zone at {{ $value }}°C — thermal throttling or shutdown risk."

    # ── Fan Speed ─────────────────────────────────────────────────────
    # All four nodes have a PWM fan (chip="platform_pwm_fan", sensor="fan1").
    # Idle RPM varies significantly by workload: rock3 ~2580 RPM,
    # rock1/2/4 ~263–321 RPM. Alert only on 0 RPM (fan stopped) to
    # avoid false fires from the wide normal idle range.

    - alert: NodeFanStopped
      expr: >
        node_hwmon_fan_rpm{
          job="node-exporter",
          chip="platform_pwm_fan"
        } == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Fan stopped on {{ $labels.instance }}"
        description: "Fan {{ $labels.sensor }} reading 0 RPM — possible fan failure."

The RK3588 exposes seven thermal zones through node_hwmon_temp_celsius: package, two big-core zones, little cores, center, GPU, and NPU. Each zone also has temp0 and temp1 sensors that report identical values. Rather than writing seven separate rules that would all fire for the same physical event, max by (instance) reduces this to one alert per node, carrying the temperature of the hottest zone.

The chip=~"thermal_thermal_zone.*" regex excludes nvme_nvme0 (covered by smartprom_temperature) and target* (SATA drivetemp, covered by smartprom_temperature_celsius_raw). No double-alerting.

Fan speed is simpler. Idle RPM varies enormously — rock3 runs at ~2580 RPM under its heavier workload while rock1/2/4 idle at 263–321 RPM. Any threshold between those values would produce constant false fires. The only unambiguously wrong value is zero. NodeFanStopped alerts only when a fan has stopped completely.

Garage

    # ── Garage ────────────────────────────────────────────────────────
    # Garage metrics use bare names (no garage_ prefix) for cluster
    # health and resync metrics. Disk space metrics use garage_ prefix.
    # Both volume="data" and volume="metadata" report identical disk
    # numbers — scope to volume="data" to avoid double-firing.
    # Current usage ~4.4% (22 GB of 500 GB ZFS quota on rock3).

    - alert: GarageClusterUnhealthy
      expr: cluster_healthy{job="garage"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Garage cluster unhealthy"
        description: "One or more Garage nodes are disconnected."

    - alert: GarageClusterUnavailable
      expr: cluster_available{job="garage"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Garage cluster unavailable"
        description: "Garage cannot serve requests — cluster quorum lost."

    - alert: GarageBlockResyncErrors
      expr: block_resync_errored_blocks{job="garage"} > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Garage block resync errors"
        description: "{{ $value }} blocks failed to resync — data integrity risk."

    - alert: GarageStorageNearFull
      expr: garage_local_disk_avail{volume="data"} / garage_local_disk_total{volume="data"} < 0.15
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Garage disk space low"
        description: "Less than 15% free on Garage data volume. Current baseline ~95.6% free (478 GB of 500 GB)."

Two things worth noting here. First, the metric names: cluster_healthy, cluster_available, and block_resync_errored_blocks have no garage_ prefix, despite being Garage metrics. The disk space metrics — garage_local_disk_avail and garage_local_disk_total — do. This reflects a mix of cluster-level and node-level metrics rather than a strict naming convention.

Second, the volume="data" filter on the disk rule. Garage exposes both volume="data" and volume="metadata" label values, both reporting identical numbers from the same underlying ZFS dataset. Without the filter, the alert would fire twice for the same condition.

Testing the Alert Pipeline

Most of these rules cannot be tested by triggering the real condition — pulling a disk to simulate ZFS degradation is not something to do casually, and inducing a SMART failure is not possible at all. What can be tested is the alert pipeline: that Prometheus evaluates the rule, Alertmanager receives the alert, and the email arrives with correctly resolved labels.

For the rules with numeric thresholds, the test is straightforward: set the threshold to a value that the current baseline readings exceed, apply with helm upgrade, wait for the for durations to elapse, and confirm the emails arrive.

All six testable rules are triggered simultaneously. Triggering them together speeds up validation and works well here since the rules are independent.

# Test thresholds — replace production values, apply, wait, restore

DiskTemperatureHigh:   > 15°C  (SATA at 21°C, NVMe at 49°C both fire)
DiskTemperatureCritical: > 16°C (1°C gap confirms both alert names are distinct)
NodeTemperatureHigh:   > 40°C  (SoC at 46–48°C fires on all four nodes)
NodeTemperatureCritical: > 41°C
NodeFanStopped:        < 5000 RPM (every fan fires; the fastest, rock3, runs at ~2580 RPM)
GarageStorageNearFull: < 0.99 free ratio (4.4% used triggers immediately)
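In the values file, each test amounts to a small change to the rule's expression. NodeFanStopped, for example, might look like this for the duration of the test (a sketch; the annotations are unchanged and omitted here):

```yaml
# TEST ONLY: restore "== 0" once the emails have been confirmed.
- alert: NodeFanStopped
  expr: >
    node_hwmon_fan_rpm{
      job="node-exporter",
      chip="platform_pwm_fan"
    } < 5000
  for: 5m
  labels:
    severity: critical
```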

After the upgrade, all six rules enter pending state:

Prometheus alerts page showing hardware group with PENDING (6) and INACTIVE (13) badges, with DiskTemperatureHigh, DiskTemperatureCritical, DiskTemperatureAlarm, NodeTemperatureHigh, NodeTemperatureCritical, NodeFanStopped and GarageStorageNearFull all showing PENDING counts in amber — Immediately after applying the test thresholds. All six rules are pending — the conditions are met but the for durations haven't elapsed yet.

The for durations are the interesting part. At around five minutes, the rules with for: 5m fire first:

Prometheus alerts page showing FIRING (3) PENDING (3) INACTIVE (13) badges. DiskTemperatureCritical, NodeTemperatureCritical and NodeFanStopped are FIRING in red, while DiskTemperatureHigh, NodeTemperatureHigh are still PENDING in amber, and GarageStorageNearFull is PENDING — Five minutes in. Rules with for: 5m are firing; rules with for: 10m are still pending. GarageStorageNearFull has a 1h for duration and will take much longer.

Ten minutes in, the longer-duration rules fire too:

Prometheus alerts page showing FIRING (5) PENDING (1) INACTIVE (13). DiskTemperatureHigh FIRING (6), DiskTemperatureCritical FIRING (6), NodeTemperatureHigh FIRING (4), NodeTemperatureCritical FIRING (4), NodeFanStopped FIRING (4), GarageStorageNearFull PENDING (1) — Ten minutes in. All five temperature and fan rules firing. GarageStorageNearFull is still pending — its 1h for duration won't elapse during this test.

The emails arrive in two distinct batches, which is a concrete demonstration of how for durations work in practice:

Email inbox showing 11 unread messages from alerts@vluwte.nl. The visible subjects include FIRING NodeTemperatureCritical for rock1, rock2, rock3, rock4, DiskTemperatureCritical for multiple pod IPs, and NodeFanStopped for rock3. An open email shows NodeTemperatureCritical for rock4 with description showing Hottest thermal zone at 49.923°C — The first batch at 10:49 — rules with for: 5m. NodeTemperatureCritical fires per node, showing the clean rock4.vluwte.nl:9100 instance label. The description correctly resolves to the actual temperature reading.

Email inbox showing a second batch of messages at 10:54. Subjects include FIRING NodeTemperatureHigh for rock1 through rock4, and FIRING DiskTemperatureHigh for multiple pod IPs. An open email shows DiskTemperatureHigh for 10.244.0.154:9902 grouped as 3 alerts with drive label showing dev/nvme0 at 51°C — The second batch at 10:54 — rules with for: 10m. The DiskTemperatureHigh email shows the pod IP instance label (10.244.0.154:9902) rather than a hostname — the known limitation of pod-discovery scraping. The drive and temperature labels resolve correctly.

The five-minute gap between batches is exactly what the for durations predict. The critical rules use for: 5m because a reading past the critical threshold needs less confirmation before it pages; the warning rules wait for: 10m so that short-lived excursions past the lower threshold don't generate noise.

After restoring the production thresholds with another helm upgrade, the rules briefly show UNKNOWN — Prometheus is honest about the fact that it hasn't evaluated the rules since the config was reloaded. A single evaluation cycle resolves this to INACTIVE.

Prometheus alerts page showing hardware group with INACTIVE (19) badge, all 19 rules listed with green inactive indicators — Back to baseline. All 19 rules inactive, no false fires at production thresholds.

Known Limitation: Pod IPs in SMART Alerts

Alert emails for SMART metrics show pod IP addresses — 10.244.0.154:9902 — rather than node hostnames. This is visible in the DiskTemperatureHigh email in the second batch. Node Exporter alerts correctly show rock4.vluwte.nl:9100; SMART alerts show a pod IP.

The reason is how each scrape job identifies its targets. The node-exporter job addresses each node by hostname, so the instance label reads rock4.vluwte.nl:9100. The smartctl scrape job uses pod discovery: it finds the DaemonSet pods by pod IP and uses that address as the instance label. The DaemonSet runs one pod per node, but Prometheus does not attach the node name to those series unless a relabelling rule tells it to.

Fixing it requires adding a relabelling rule to copy __meta_kubernetes_pod_node_name into a node label during scrape:

- source_labels: [__meta_kubernetes_pod_node_name]
  target_label: node

This is deferred to a future post that will address scrape job deduplication across the cluster. The drive label in the alert annotation identifies the disk clearly enough for a homelab — but this will be cleaned up.
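For context, a sketch of how that relabelling rule might sit inside a pod-discovery scrape job. The job name and the port-based keep rule are assumptions, not the configuration from Part 5.

```yaml
# Sketch: the job name and the port-based keep rule are assumptions.
- job_name: smartctl
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Keep only the exporter's container port (9902).
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      regex: "9902"
      action: keep
    # Copy the Kubernetes node name onto every scraped series.
    - source_labels: [__meta_kubernetes_pod_node_name]
      target_label: node
```

The instance label is unchanged by this rule, so the alert annotations would also need to switch from {{ $labels.instance }} to {{ $labels.node }} before a hostname appears in the emails.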

What's Working Now

✅ SMART dashboard — 6 panels, NVMe/SATA separated, filtered to real drives
✅ ZFS dashboard — ARC panels working at 100% hit rate; Pools section collapsed (pool I/O kstat paths unavailable on this OpenZFS version)
✅ 19 hardware alert rules loaded and inactive at production thresholds
✅ Alert pipeline confirmed end-to-end — two email batches received, labels resolved correctly
✅ promtool validation confirmed: 37 rules total, zero syntax errors
⚠️ SMART alert emails show pod IPs — deferred to scrape job deduplication post

What's Next

The hardware is now observable and alertable. The remaining gap in the observability stack is logs — Prometheus and Alertmanager handle metric-based alerting well, but there is no log aggregation in place. Pod logs are ephemeral, vanish when pods restart, and require active querying with kubectl logs or stern. Parts 7 and 8 add Loki and Promtail to fix that, and complete the observability stack.

Questions or suggestions? Leave a comment below or reach out at igor@vluwte.nl.