
Reflection: 25 Posts and a Cluster I Can Stand Behind

74 days, 25 posts, one Kubernetes cluster. A look back at what got built, what's still to come, and what I learned doing it.

Introduction

On February 14, I published my first post. I had a blog, a TuringPi cluster sitting on my desk, and a vague plan to document whatever happened next. Just under eleven weeks later (74 days, to be exact), I'm hitting post 25. When I started, I could never have imagined I'd be here: a Kubernetes cluster I'm genuinely proud of, 25 posts published, and somehow more ideas than when I began.

This post is a pause. A look back at what got built, what it means, and what comes next.

🏠 This is part of the Homelab Journey series - building a production Kubernetes cluster from scratch.

Secret Management Part 1
Reflection: 25 Posts and a Cluster I Can Stand Behind (you are here)

25 Posts in 11 Weeks

When I started on February 14, I didn't set out to write 25 posts in under 11 weeks. I set out to document what I was doing so I wouldn't forget it — and so others might find it useful. The pace surprised me too.

Here's everything that got covered:

Blog Infrastructure (Posts 1–6) — before there was a cluster to write about, there needed to be a place to write. Ghost on RHEL, configuring it for technical blogging, analytics with Umami, a PostgreSQL upgrade that shouldn't have been delayed as long as it was.

Building the Cluster (Posts 7–11) — the TuringPi hardware post, two Talos Linux installation posts (the first attempt and the real one), an upgrade to add a missing extension, and then Longhorn for persistent storage with a proper deep dive into how it actually works.

Making It Observable (Posts 12–13, 20, 23) — Prometheus and Node Exporter, Grafana dashboards, and then expanding those dashboards to cover storage, Kubernetes internals, and networking. Alerting with Alertmanager came later and brought the observability stack to a meaningful place.

Networking and Access (Posts 14–15, 17, 21) — MetalLB and Traefik to give services real IPs, cert-manager for TLS everywhere with per-service certificates, and then Authelia to put authentication in front of everything that needed it.

Backups (Posts 16–18) — Garage S3, ZFS, Longhorn backup targets, offloading to a Synology with rclone, and then actually validating it all works by doing a restore. That last part matters more than people give it credit for.

Tooling (Posts 22, 24): IT-Tools deployed on the Bletchley cluster, plus k9s, because good tools make everything else easier.

Security (Post 25) — OpenBao, the first real secret management post, starting a series that I know will keep going.

The State of the Cluster

The Bletchley cluster is in a genuinely good place. That's not something I expected to be able to say 11 weeks in.

What's solid:

✅ Four RK1 nodes running Talos Linux — immutable, reproducible, and upgraded when needed
✅ Longhorn providing replicated persistent storage across nodes
✅ MetalLB + Traefik giving every service a proper address and TLS termination
✅ cert-manager issuing wildcard certificates automatically via TransIP DNS-01
✅ Authelia sitting in front of services that need authentication
✅ Prometheus, Grafana, and Alertmanager watching the cluster and telling me when something is wrong
✅ Full backup pipeline: Longhorn → Garage S3 → Synology offsite, with validated restores
✅ Forgejo for version control of cluster configuration
✅ OpenBao starting to manage secrets properly

That's a production-grade foundation. Not perfect — there are still loose ends — but solid enough to build on confidently.

What's Still to Come

Honest accounting: there are things I started that aren't finished, and things I haven't started yet that I know matter.

Security: OpenBao is in, but the secret management series has two more posts coming, and eventually all existing secrets need to migrate to External Secrets Operator. Security deserves sustained focus, and it's going to get it.

Observability — the metrics and alerting story is good. The logs story doesn't exist yet. Loki and Promtail are coming, and with them a much more complete picture of what the cluster is doing. Hardware health collectors for ZFS, SMART, and thermal data are also on the list.

Tooling — IT-Tools and k9s are a start, but there's more to document here. Tools that make cluster management better are worth their own series thread.

GitOps — this is where I want to take the cluster next, and it deserves its own focus. The goal is fully declarative: every workload, every configuration, every version defined in Git. A commit triggers a reconciliation, a new workload appears, nothing exists that isn't tracked. ArgoCD or Flux will get there. Right now the cluster is well-built but still largely hand-crafted — GitOps is what makes it reproducible by design rather than by memory.
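To make "a commit triggers a reconciliation" concrete, here is a minimal sketch of the comparison at the heart of any reconciler. It is not how ArgoCD or Flux are implemented; the workload names and versions are hypothetical, and both states are reduced to plain dictionaries for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """What has to change to make the cluster match Git."""
    create: list[str]
    update: list[str]
    delete: list[str]


def reconcile(desired: dict[str, str], actual: dict[str, str]) -> Plan:
    """Compare desired state (from Git) with actual state (in the cluster)."""
    create = sorted(name for name in desired if name not in actual)
    update = sorted(
        name for name in desired
        if name in actual and desired[name] != actual[name]
    )
    # Anything running that Git does not know about is untracked.
    delete = sorted(name for name in actual if name not in desired)
    return Plan(create, update, delete)


# Hypothetical data: workload name -> version defined in Git / running now.
desired = {"grafana": "v2", "forgejo": "v1", "loki": "v1"}
actual = {"grafana": "v1", "forgejo": "v1", "debug-pod": "v1"}

plan = reconcile(desired, actual)
print(plan)
```

The `delete` list is the part that matters most for "nothing exists that isn't tracked": a hand-crafted workload shows up there the moment it has no counterpart in Git.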

Continuous improvement: the cluster is solid, but it will never be finished, because maintenance never ends. Backups are already slipping. I recently added a new PVC that isn't in the backup scope yet. I set up an ingress for Alertmanager but not for Prometheus. These are exactly the kind of small gaps that accumulate into real problems. Tightening this up, by automating what can be automated and building the habits for what can't, is ongoing work, not a future project.
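The missing-PVC gap is the kind of thing a small check can catch. A minimal sketch, assuming the list of PVCs in the cluster and the list of PVCs covered by backups are both available as sets of "namespace/name" strings. All names below are hypothetical, and how the two sets get populated is left out on purpose.

```python
from __future__ import annotations


def uncovered_pvcs(cluster_pvcs: set[str], backed_up: set[str]) -> list[str]:
    """Return PVCs that exist in the cluster but are not in the backup scope."""
    return sorted(cluster_pvcs - backed_up)


# Hypothetical inventory, written as "namespace/name".
CLUSTER_PVCS = {
    "forgejo/forgejo-data",
    "monitoring/prometheus-data",
    "tools/it-tools-data",
}
BACKED_UP = {
    "forgejo/forgejo-data",
    "monitoring/prometheus-data",
}

missing = uncovered_pvcs(CLUSTER_PVCS, BACKED_UP)
for pvc in missing:
    print(f"not in backup scope: {pvc}")
```

Run on a schedule and wired into alerting, a check like this turns "I should remember to add it" into something the cluster reports on its own.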

The ideas list keeps growing. Every post I write surfaces two more things worth exploring. That used to feel like a list that would never get shorter. Now I think that's just the nature of the project. The cluster becomes more capable, the understanding deepens, and the interesting problems shift.

What I Actually Learned

Not about Kubernetes or Talos or Longhorn specifically. About doing this kind of project.

Write while you do it. Posts written days after the work is done are worse than posts written the same evening. The friction was highest at the start, lowest when the habit was established.

The documentation is the system of record. Three times already I've gone back to my own posts to remember exactly how I did something. That wasn't the plan, but it's become more than documentation — it's an externalised memory for the cluster. The exact configs, version choices, architecture decisions, known gotchas. Without the posts, reproducing the cluster would be significantly harder, even for me.

Good enough to publish is good enough. I didn't wait for posts to be perfect. I shipped them once they were accurate and useful. The difference matters.

The cluster taught me the posts. I didn't plan to go this deep on observability, or to make backups a three-post series. The work revealed what needed explaining, and the posts shaped what I understood.

At some point the question changed. Early posts asked: how do I install this? Later posts asked: how do I trust this in six months? That shift — from setup mindset to operations mindset — happened gradually and without announcement. Upgrade strategies, backup validation, secret lifecycle thinking, failure detection. It's a different set of concerns, and it's a sign the cluster became something real rather than something experimental.

Don't put off the boring maintenance. The PostgreSQL upgrade is the most clicked post on the blog — as of early April, 12 clicks and 279 impressions in Google Search, ahead of everything else, and the trend is still climbing as more posts go live. It was a post I almost didn't write because the problem felt embarrassing: running an EOL database version on a server I managed myself. The fix turned out to be easier than expected. The lesson is that other people are running the same thing and searching for exactly that. The unglamorous problems are often the most useful ones to document.

(Numbers above are from April 4 — I'll update these closer to publication.)

More documentation doesn't make something the better choice. Choosing an ingress controller took longer than it should have. nginx has an enormous body of documentation and examples, which made it feel like the obvious path. But a significant portion of that documentation describes configurations from years ago, patterns that predate modern ingress controllers, or setups that simply don't apply to a Talos cluster in 2026. What tipped it was finding that nginx's ingress controller was heading toward retirement, and that information turned out to be just as abundant as everything else written about it. Traefik has been on the cluster from day one and has done everything asked of it.

Tooling catches what the eye misses. The Forgejo backup job had been failing silently. No alert, no obvious symptom, nothing in the day-to-day cluster view to suggest anything was wrong. k9s surfaced it: a job in a failed state, sitting there unremarked. That's one example of a pattern that repeated throughout the project — Prometheus catching a scraping issue, observability exposing a PVC not in the backup scope, dashboards showing gaps that weren't visible any other way. The cluster is stable partly because these things were caught before they became incidents. That's not luck; that's what observability is for.
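The silently failing job is also something a script can look for. A minimal sketch, assuming the input is the parsed JSON from `kubectl get jobs -A -o json`. The sample data and job names are hypothetical; the only Kubernetes detail relied on is that a failed Job carries a condition with type "Failed" and status "True".

```python
from __future__ import annotations


def failed_jobs(job_list: dict) -> list[str]:
    """Return "namespace/name" for every Job with a Failed=True condition."""
    failed = []
    for item in job_list.get("items", []):
        meta = item.get("metadata", {})
        conditions = item.get("status", {}).get("conditions") or []
        if any(
            c.get("type") == "Failed" and c.get("status") == "True"
            for c in conditions
        ):
            failed.append(f'{meta.get("namespace", "default")}/{meta.get("name", "?")}')
    return sorted(failed)


# Hypothetical sample, trimmed to the fields the function reads.
# Real input would come from json.loads() on the kubectl output.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "forgejo", "name": "forgejo-backup-123"},
            "status": {"conditions": [{"type": "Failed", "status": "True"}]},
        },
        {
            "metadata": {"namespace": "garage", "name": "garage-sync-456"},
            "status": {"conditions": [{"type": "Complete", "status": "True"}]},
        },
        {
            # Still running: no conditions yet.
            "metadata": {"namespace": "tools", "name": "cleanup-789"},
            "status": {},
        },
    ]
}

print(failed_jobs(SAMPLE))
```

This only reports what is currently in a failed state. It does not replace an alert rule, but it shows how little is needed to stop a failure from sitting there unremarked.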

Security isn't just a YAML file; it's a protocol. The moment that made this real was accidentally committing Talos machine configs into Forgejo early in the project. It happened on one of the first commits, before the habits were established. The right call was to start the repository over rather than patch around it, because a repository with secrets in its history isn't clean no matter how many commits follow. That decision cost time, but it set the right baseline.

It's also why the OpenBao post took longer than expected. The unseal ceremony, the secrets store architecture, how keys are split and where they live — this isn't something to skim. Getting it wrong quietly is worse than getting it wrong loudly, because you won't know until something breaks in production. Installing OpenBao was the straightforward part. The time went into understanding what comes after: rotation schedules, access policies, knowing which service touches which secret, and what happens when something needs to change. The series has two more posts coming precisely because the first install is the easy bit. The harder work is the practice that follows.
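The accidental commit of machine configs is the kind of mistake a guard in front of `git commit` can catch. A minimal sketch, assuming the files still carry the default names that `talosctl gen config` writes (`controlplane.yaml`, `worker.yaml`, `talosconfig`). The function and the staged paths are hypothetical, and wiring it into a pre-commit hook is not shown.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Default output file names of `talosctl gen config`.
# Renamed copies will NOT be caught; this checks names, not contents.
BLOCKED_NAMES = {"controlplane.yaml", "worker.yaml", "talosconfig"}


def blocked_paths(staged: list[str]) -> list[str]:
    """Return staged paths whose file name matches a known secret-bearing file."""
    return [p for p in staged if PurePosixPath(p).name in BLOCKED_NAMES]


# Hypothetical list of staged files, as a hook might receive them.
staged_files = [
    "README.md",
    "talos/controlplane.yaml",
    "talos/patches/extensions.yaml",
    "talosconfig",
]

hits = blocked_paths(staged_files)
if hits:
    print("refusing to commit:", ", ".join(hits))
```

A name check is deliberately crude. It would not have replaced the decision to start the repository over, but it makes the same mistake harder to repeat.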

74 Days

February 14 to April 29. 25 posts. A cluster running Talos Linux, Longhorn, Traefik, cert-manager, Authelia, Prometheus, Grafana, Alertmanager, Garage, Forgejo, OpenBao, and more — all on four ARM64 nodes in a box that fits on a shelf.

I'm proud of what's on the page and what's running in the rack.

← Previous: Secret Management Part 1

Questions or suggestions? Leave a comment below or reach out at igor@vluwte.nl.
