Cluster Observability Part 3: Dashboards for Storage, Kubernetes, and Network
Importing Longhorn, Kubernetes, and resource dashboards onto the Bletchley cluster — and fixing the scrape config that was only collecting from one of four nodes.
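The scrape-config fix mentioned above — going from one node to all four — typically means replacing a single static target with Kubernetes service discovery. A minimal sketch of that shape, assuming a plain Prometheus config with node-exporter on its default port (the actual post may use Helm chart values instead; the job name and port here are assumptions):

```yaml
# Hypothetical sketch: use Kubernetes node discovery instead of one
# hard-coded address, so node-exporter on every node gets scraped.
scrape_configs:
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: node                  # discover all cluster nodes
    relabel_configs:
      # Discovered node addresses point at the kubelet port (10250);
      # rewrite them to node-exporter's default port (9100).
      - source_labels: [__address__]
        regex: "(.*):10250"
        replacement: "${1}:9100"
        target_label: __address__
```

With `role: node`, new nodes are picked up automatically, so the "only one of four nodes" class of drift cannot recur.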
Installing Forgejo on Bletchley: self-hosted git, infra/apps repo structure, accidentally committing secrets, SQLite WAL, and a two-layer backup to Garage S3.
Testing the backup chain: etcd encryption with age, and a full Grafana PVC delete-and-restore to confirm Longhorn backups actually work.
Exposing Garage S3 via Traefik, why Synology Cloud Sync failed, and how rclone solved the offsite backup problem instead.
Building the local backup layer for the Bletchley cluster: ZFS mirror on rock3's SATA SSDs, Garage S3, NFS, and Longhorn recurring backups.
Adding TLS to the Bletchley cluster with cert-manager, TransIP DNS-01 challenges, Let's Encrypt staging and production issuers, and automatic HTTP→HTTPS redirects.
Adding MetalLB and Traefik to Bletchley: real IPs for LoadBalancer services, hostname-based routing, and a reader-suggested improvement that preserves source IPs from day one.
Adding Grafana to the Bletchley cluster: Longhorn-backed storage, pre-configured Prometheus datasource, and the Node Exporter Full dashboard showing live node metrics.
Installing Prometheus and node exporter on Talos Linux: namespace labelling, values files, the duplicate pod gotcha, and confirming all four nodes are being scraped.
The decisions behind the Longhorn installation: why 2 replicas on a 1Gb cluster, how version pinning protects reproducibility, and when to upgrade.
Installing Longhorn distributed storage on a Talos Linux cluster: NVMe preparation, Helm install, and why the namespace label matters before anything else.
How I discovered a missing system extension before it caused problems, and upgraded all four Bletchley cluster nodes in 15 minutes. Plus: what changes when you have running workloads.