homelab-journey
Cluster Observability Part 6: Hardware Health Dashboards and Alerts
Grafana dashboards and 19 Prometheus alert rules for ZFS, SMART, disk temperature, fan speed, and Garage health — with manufacturer-sourced thresholds.
Mastering the three pillars of observability: logs, metrics, and traces for a transparent tech stack.
homelab-journey
Adding ZFS, SMART, and thermal collectors to the Bletchley cluster — and three ARM64 image attempts before finding one that actually works.
homelab-tools
stern tails logs from multiple Kubernetes pods at once. I used it to find a pending Traefik upgrade and reconstruct a PVC resize across five pod types.
homelab-journey
Enabling Alertmanager on the Bletchley cluster: alerting rules, SMTP delivery, end-to-end testing, and two Talos-specific surprises.
homelab-journey
Importing Longhorn, Kubernetes, and resource dashboards onto the Bletchley cluster — and fixing the scrape config that was only collecting from one of four nodes.
homelab-journey
Adding Grafana to the Bletchley cluster: Longhorn-backed storage, pre-configured Prometheus datasource, and the Node Exporter Full dashboard showing live node metrics.
homelab-journey
Installing Prometheus and node exporter on Talos Linux: namespace labelling, values files, the duplicate pod gotcha, and confirming all four nodes are being scraped.