Observability - vLuwte's Homelab Journey

Prometheus

Fixing the Alerts: New Rules, Better Groups, Less Noise

A fan compound rule, seven new Prometheus rules, a six-group restructure, one Loki rule from the backup incident, and what's still on hold.

Observability

Auditing the Alerts: What's Noise, What's Missing, What's Broken

Four months of running alerting on a homelab cluster: one false positive, one alert that was correct but pointed at a broken underlying issue, eight days of silent backup failure, and what a rule-by-rule review actually looks like.

homelab-journey

Cluster Observability Part 11: Log-Based Alerting with Loki Ruler

Loki ruler setup, four log-based alert rules from real Part 10 findings, a silent config gotcha, end-to-end test, and the complete dual alerting architecture.

homelab-journey

Cluster Observability Part 10: Hunting for Errors with LogQL

Six investigations across a Kubernetes cluster using LogQL: bootstrap artefacts, a silent 3-week backup failure, an Authelia crash sequence, and what high error volume actually means.

homelab-journey

Cluster Observability Part 9: Know Your Logs Before You Hunt Them

Before hunting errors with LogQL, you need to know your log formats. Format discovery, the Garage INFO trap, klog envelopes, cardinality constraints, and fixing missing kube-system static pod logs.

homelab-journey

Cluster Observability Part 8: Talos System Logs

Talos system logs via Vector: why loki.source.tcp doesn't exist, how Vector fills the gap, and fixing Alloy's node-local filter in the same pass.

homelab-journey

Cluster Observability Part 7: Log Aggregation with Loki and Grafana Alloy

Loki + Grafana Alloy on ARM64: three deploy attempts, five Alloy River config gotchas, and logs finally flowing across all four cluster nodes.

homelab-journey

Prometheus Storage Overhead: Duplicate Scrapers and an Oversized Default Collector

Prometheus storage was filling too fast. The cause: duplicate scrapers and a default collector that accounted for 55% of storage, both hiding in chart defaults. Fixing both cut ingestion by 60%.

homelab-journey

Cluster Observability Part 6: Hardware Health Dashboards and Alerts

Grafana dashboards and 19 Prometheus alert rules for ZFS, SMART, disk temperature, fan speed, and Garage health — with manufacturer-sourced thresholds.

homelab-journey

Cluster Observability Part 5: Hardware Health Collectors - ZFS, SMART, and Thermal

Adding ZFS, SMART, and thermal collectors to the Bletchley cluster — and three ARM64 image attempts before finding one that actually works.

homelab-tools

stern: Real-Time Log Tailing Across Your Kubernetes Cluster

stern tails logs from multiple Kubernetes pods at once. I used it to find a pending Traefik upgrade and reconstruct a PVC resize across five pod types.

homelab-journey

Cluster Observability Part 4: Alerting with Prometheus and Alertmanager

Enabling Alertmanager on the Bletchley cluster: alerting rules, SMTP delivery, end-to-end testing, and two Talos-specific surprises.