
Keeping the Bletchley Cluster Current: Talos, Kubernetes, and Every Helm Chart

Upgrading Talos, Kubernetes, and thirteen Helm charts with real workloads running — the safe order, the tools, and what went wrong. The how, not the what.

Introduction

A homelab cluster that never gets updated is a liability waiting to happen. Security patches accumulate, compatibility windows close, and the longer you wait the more there is to understand before you can safely touch anything. The Bletchley cluster was built clean with pinned versions — that discipline served its purpose during the build phase, keeping the setup reproducible and the posts accurate. But months later, those versions were stale across every layer: the OS, Kubernetes itself, and a dozen Helm charts.

This post documents the first full update pass: how to know what needs updating, what the safe order is when you have real workloads, and what actually went wrong along the way. It is the natural sequel to Post 9, which covered upgrading Talos extension images when there were no workloads to worry about. This time there were: Longhorn volumes in active replication, OpenBao holding cluster secrets, Authelia protecting every ingress, and Forgejo backing the cluster's own configuration store.

The methodology matters more than the specific versions — by the time you read this, newer releases will almost certainly be available. Always check Talos releases and run nova find before starting your own upgrade pass rather than following the version numbers here. The cluster will need updating again, and the process documented here is the one that will be followed next time too.

🏠 This is part of the Homelab Journey series - building a production Kubernetes cluster from scratch.

Backup Controller v0.11.0
Keeping the Bletchley Cluster Current: Talos, Kubernetes, and Every Helm Chart (you are here)

Taking Stock Before Touching Anything

The first step is not updating anything — it is understanding what is running and whether the cluster is healthy before anything changes. Starting an upgrade on a cluster with a pod already in CrashLoopBackOff is a bad idea; any problems introduced by the upgrade will be harder to separate from pre-existing ones.

# Current Talos version on all nodes
talosctl version

# Kubernetes client and server version
kubectl version

# All Helm releases across all namespaces
helm list -A

# Node status
kubectl get nodes -o wide

# Any pods not in Running or Completed state
kubectl get pods -A | grep -v Running | grep -v Completed
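
The grep pipeline above matches anywhere in the line, so a namespace or pod name that happens to contain "Running" would hide a broken pod. A minimal sketch of a stricter filter follows, assuming the default column layout of kubectl get pods -A --no-headers (NAMESPACE, NAME, READY, STATUS, ...). The function name unhealthy_pods is invented for this example.

```shell
# Print pods whose STATUS is not Running/Completed, or that are Running
# but not fully ready (READY column such as 0/1).
# Input: output of `kubectl get pods -A --no-headers` on stdin.
unhealthy_pods() {
  awk '
    {
      split($3, ready, "/")              # READY column, e.g. "1/2"
      if ($4 == "Completed") next        # finished jobs are fine
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2 ": " $4 " (" $3 ")"
    }'
}

# Usage against a live cluster:
#   kubectl get pods -A --no-headers | unhealthy_pods
```

Empty output means every pod is either Completed or Running with all containers ready.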

This audit turned up something worth addressing before the upgrade started: the OpenBao release was in a failed state in Helm. This was not a crash: the pod was running normally, serving secrets, auto-unsealed. What had failed was a helm upgrade from two weeks earlier that tried to change an immutable StatefulSet field. Kubernetes had correctly rejected the change before anything was modified, but the Helm release record was left marked as failed. Running helm rollback openbao 4 -n openbao restored a clean Helm state without touching the running pod.

The lesson: run helm list -A before any upgrade pass. The STATUS column matters.
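
That check is easy to script. The sketch below assumes Helm 3's default table output, where columns are separated by tabs and padded with spaces; the function name non_deployed_releases is invented for this example.

```shell
# List Helm releases whose STATUS is anything other than "deployed".
# Input: output of `helm list -A -a` on stdin (assumed tab-separated:
# NAME, NAMESPACE, REVISION, UPDATED, STATUS, CHART, APP VERSION).
non_deployed_releases() {
  awk -F'\t' '
    {
      for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)   # trim padding
      if ($1 == "NAME") next                               # skip header
      if ($5 != "deployed")
        print $2 "/" $1 ": " $5 " (revision " $3 ")"
    }'
}

# Usage against a live cluster:
#   helm list -A -a | non_deployed_releases
```

The -a flag makes helm list include releases in every state, including pending ones. For any release this reports, helm history <release> -n <namespace> shows the revisions available as a rollback target.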

After the rollback, the pre-upgrade Helm inventory looked like this:

Release	Installed	Status
authelia	0.10.50	deployed
cert-manager	v1.19.4	deployed
external-secrets	2.2.0	deployed
forgejo	16.2.1	deployed
grafana	10.5.15	deployed
longhorn	1.10.2	deployed
metallb	0.15.3	deployed
nfs-server-provisioner	1.8.0	deployed
node-exporter	4.52.0	deployed
openbao	0.26.2	deployed
prometheus	28.12.0	deployed
smartctl-exporter	0.16.0	deployed
traefik	39.0.6	deployed

Garage (S3 object storage) was not in this list — it was deployed as a static StatefulSet manifest rather than a Helm chart. It needs to be checked separately.

Discovering What's Outdated

Helm does not have a built-in update checker. helm list -A shows what is installed but not whether newer versions exist. There are a few approaches:

helm repo update followed by helm search repo <chart-name> --versions works for individual charts but is tedious across thirteen releases. A better tool for the overview is Nova by Fairwinds:

nova find

Nova scans the cluster and reports every Helm release against its latest available version. For this cluster it produced a table showing eight outdated releases, four current, and two requiring attention — Grafana was flagged as deprecated (the chart moved registries), and Traefik's "latest" was 40.0.0-ea.3, an early access release.

Three things Nova taught that are worth knowing upfront:

Nova does not know which Helm repo you installed from. It reports the chart name and latest version, but not the repo command to upgrade. You still need helm repo list to build the correct helm upgrade command for each release.

Nova can report stale data. For Grafana, Nova showed 10.5.15 as the latest when the actual latest was 12.x on the new OCI registry. Always cross-check with the upstream release page before treating Nova's output as authoritative.

Nova picks up pre-release versions as latest. Traefik 40.0.0-ea.3 carries an -ea suffix marking it as early access. Nova flagged the installed release as "outdated" against it without distinguishing a pre-release from a stable release; that judgment requires a human check.
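
The pre-release check can be made mechanical. This is a small sketch that treats any hyphenated suffix after MAJOR.MINOR.PATCH as a pre-release, following the SemVer convention; the function name is_prerelease is invented here, and charts that do not follow SemVer would need their own rule.

```shell
# Succeed (exit 0) when a version string carries a pre-release suffix,
# i.e. a hyphen after the numeric MAJOR.MINOR.PATCH part.
is_prerelease() {
  v=${1#v}        # drop a leading "v"
  v=${v%%+*}      # drop build metadata such as "+build.5"
  case "$v" in
    [0-9]*.[0-9]*.[0-9]*-?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   is_prerelease 40.0.0-ea.3 && echo "hold: early access"
```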

For the Talos layer, Nova is not relevant. Talos has its own release channel at github.com/siderolabs/talos/releases. The cluster was on v1.12.4; the current stable was v1.12.6 — a patch release, no intermediate step required.

One additional discovery from the extension audit: rock3 was running a different Talos schematic from the other three nodes, because it has a ZFS extension to support the SATA SSDs that back Garage. This had been set up when Garage was deployed but not recorded in the project files. Two schematics would be needed at Image Factory — one for rock1, rock2, rock4, and a separate one for rock3 including ZFS.

One more thing: always do a final version check immediately before running the upgrade, not just at planning time. Versions keep moving. During this update pass, Garage v2.3.0 shipped while the planning was already underway — the target version at the time of the Nova scan was v2.2.1, but by the time the actual upgrade command was ready to run, v2.3.0 was already out. A quick check of the upstream release page immediately before applying saved an unnecessary second upgrade a week later. Nova tells you what was outdated at the time you ran it, not what is outdated right now.
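
A last-minute comparison between the planned target and the version currently shown on the upstream release page can be sketched like this. It assumes sort -V (version sort, as in GNU coreutils) is available; is_newer is a name invented for this example, and the latest version still has to be read from the release page by hand.

```shell
# Succeed when $2 is a strictly newer version than $1.
# Relies on `sort -V`, which orders 1.12.4 before 1.12.10.
is_newer() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$2" ]
}

# Usage:
#   planned=v2.2.1; latest=v2.3.0
#   is_newer "$planned" "$latest" && echo "target moved: $planned -> $latest"
```

Version sort does not understand pre-release semantics, so a suffixed version such as 40.0.0-ea.3 still needs the human check described earlier.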

For this first update pass, Nova as a manual audit tool was the right level of complexity — get familiar with what is running and how the update process works before adding more automation. The natural next step is Renovate Bot: a tool that watches your Helm chart versions and Kubernetes image tags in a Git repository and opens pull requests automatically when new versions are released. That shifts the model from "check when you remember" to "notified on every release, decide whether to merge." That is a post for another time, but it is the direction this process will grow toward.

The Safe Update Order

With a multi-layer system, the update order matters. The guiding principle is straightforward: update dependencies before dependents. Anything that other things depend on must be updated first.

For the Bletchley cluster that meant:

Talos first. The OS is the foundation. Everything else runs on it.

Kubernetes second. Only after all Talos nodes are on the new version. Talos manages the Kubernetes upgrade via talosctl upgrade-k8s and handles the rolling update of control plane components across all nodes automatically.

Helm charts in dependency order:

Longhorn — storage. Everything that uses persistent volumes depends on this.
cert-manager — certificates. Ingresses depend on valid TLS.
MetalLB — external IP assignment. Ingresses depend on it.
Traefik — ingress routing. Applications depend on it.
nfs-server-provisioner — storage for some workloads.
Prometheus and node-exporter — observability.
Grafana — depends on Prometheus.
OpenBao and external-secrets — secrets. Applications depend on this.
Authelia — authentication. Applications depend on it.
Forgejo — application.
Garage — application (static manifest, not Helm).

The principle for each Helm chart is: diff before applying, watch during, verify after.

# See what would change before applying it
helm diff upgrade <release> <repo/chart> \
  --version <new-version> \
  -f <values-file>.yaml \
  -n <namespace>

# Apply
helm upgrade <release> <repo/chart> \
  --version <new-version> \
  -f <values-file>.yaml \
  -n <namespace>

# Verify
kubectl get pods -n <namespace>

The helm-diff plugin is essential here. It shows the exact Kubernetes manifest changes before anything is applied, making it possible to catch breaking schema changes before they take down a workload.
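
The diff, apply, verify steps can be chained so that a failed diff blocks the upgrade. The following is a minimal sketch, assuming the helm-diff plugin is installed; upgrade_release and its argument order are invented for this example, not part of any tool.

```shell
# Diff, confirm, then upgrade one release. Stops if the diff step fails
# (for example on a values schema error), so nothing is applied blind.
# Arguments: release, chart (repo/chart), version, values file, namespace.
upgrade_release() {
  release=$1 chart=$2 version=$3 values=$4 ns=$5

  helm diff upgrade "$release" "$chart" \
    --version "$version" -f "$values" -n "$ns" || {
      echo "diff failed for $release; not upgrading" >&2
      return 1
    }

  printf 'Apply %s %s to %s? [y/N] ' "$chart" "$version" "$release"
  read -r answer
  [ "$answer" = "y" ] || { echo "skipped $release"; return 0; }

  helm upgrade "$release" "$chart" \
    --version "$version" -f "$values" -n "$ns" &&
    kubectl get pods -n "$ns"
}

# Usage:
#   upgrade_release <release> <repo/chart> <new-version> <values-file>.yaml <namespace>
```

The prompt keeps a human in the loop: the diff is printed first, and only an explicit "y" applies it.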

Upgrading Talos: Four Nodes, One at a Time

The Talos upgrade follows the same node-by-node sequence documented in Post 9, but this time it is a real version bump — 1.12.4 to 1.12.6 — rather than just an extension image update.

New schematics were built at factory.talos.dev for v1.12.6: one for rock1, rock2, rock4 (iscsi-tools + nfsd + util-linux-tools + rk1 board overlay), and a separate one for rock3 with the ZFS extension added.

The sequence for each node:

# Drain the node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# Upgrade (use the correct schematic for the node)
talosctl upgrade \
  --nodes <node>.vluwte.nl \
  --image factory.talos.dev/metal-installer/<schematic-id>:v1.12.6 \
  --wait

# Verify Talos version
talosctl version -n <node>.vluwte.nl

# Check the OpenBao pod is back
kubectl get pods -n openbao

# Uncordon
kubectl uncordon <node>

# Verify Longhorn before moving on
kubectl get volumes.longhorn.io -n longhorn-system

Control plane nodes first (rock1, rock2, rock3), worker last (rock4).
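
With two schematics in play, it is easy to upgrade rock3 with the wrong image and lose the ZFS extension. One way to guard against that is to derive the installer image from the node name. This is a sketch only: the two schematic IDs below are placeholders, not real values, and installer_image is a name invented for this example.

```shell
# Map each node to its Image Factory schematic. Both IDs are
# placeholders (hypothetical): paste the real IDs from factory.talos.dev.
STANDARD_SCHEMATIC="<standard-schematic-id>"   # rock1, rock2, rock4
ZFS_SCHEMATIC="<zfs-schematic-id>"             # rock3 (adds ZFS)

installer_image() {
  node=$1 version=$2
  case "$node" in
    rock3)             schematic=$ZFS_SCHEMATIC ;;
    rock1|rock2|rock4) schematic=$STANDARD_SCHEMATIC ;;
    *) echo "unknown node: $node" >&2; return 1 ;;
  esac
  echo "factory.talos.dev/metal-installer/${schematic}:${version}"
}

# Usage:
#   talosctl upgrade --nodes rock3.vluwte.nl \
#     --image "$(installer_image rock3 v1.12.6)" --wait
```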

A consistent pattern emerged across all four nodes: the Longhorn instance-manager pod refused eviction every time, with kubectl drain repeatedly logging "Cannot evict pod as it would violate the pod's disruption budget" before eventually succeeding. This is Pod Disruption Budget (PDB) protection, a Kubernetes mechanism that limits how many pods of a given workload can be unavailable simultaneously. Longhorn sets a PDB on its instance-manager pods so that kubectl drain cannot evict them until Longhorn is satisfied the remaining nodes can maintain volume replication. kubectl drain retries automatically; no intervention is needed. On rock1 it took three attempts before the PDB condition was met; on rock2, the second control plane node to go down, it took six. The number of retries reflects how much replication headroom Longhorn had at each point in the sequence: fewer healthy nodes means stricter conditions.
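
To see which budgets are holding a drain back while it retries, the PDB list can be filtered for entries that currently allow zero disruptions. This sketch assumes the column order kubectl get pdb -A --no-headers prints (NAMESPACE, NAME, MIN AVAILABLE, MAX UNAVAILABLE, ALLOWED DISRUPTIONS, AGE); blocking_pdbs is a name invented here.

```shell
# Print PodDisruptionBudgets that currently allow zero disruptions.
# Input: `kubectl get pdb -A --no-headers` on stdin; the fifth
# whitespace-separated field is assumed to be ALLOWED DISRUPTIONS.
blocking_pdbs() {
  awk '$5 == 0 { print $1 "/" $2 }'
}

# Usage against a live cluster, in a second terminal during a drain:
#   kubectl get pdb -A --no-headers | blocking_pdbs
```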

After each uncordon, before draining the next node, Longhorn volume health was checked explicitly:

kubectl get volumes.longhorn.io -n longhorn-system

One or two volumes briefly showed degraded status after each node came back, recovering within a minute or two as replication rebuilt. Waiting for all volumes to show healthy before moving on is the gate that makes the node-by-node approach safe. The 40GiB volume degraded twice during the pass — after rock2 and again after rock4 — and recovered both times. Skipping this check and draining the next node while volumes are still rebuilding risks taking replication below the minimum threshold.
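
The health gate can be turned into a wait loop instead of a manual re-check. This is a sketch under one assumption: that Longhorn reports each volume's health in the .status.robustness field of its Volume resource. The function names are invented for this example.

```shell
# Succeed only when every volume line on stdin reports "healthy".
# Input: "<name> <robustness>" pairs, one per line. Empty input fails,
# so a failed kubectl call is never mistaken for a healthy cluster.
all_volumes_healthy() {
  awk '$2 != "healthy" { print "not healthy: " $1 " (" $2 ")"; bad = 1 }
       END { exit ((bad || NR == 0) ? 1 : 0) }'
}

# Poll every 15 seconds until the gate passes.
wait_for_longhorn() {
  until kubectl get volumes.longhorn.io -n longhorn-system \
      -o jsonpath='{range .items[*]}{.metadata.name} {.status.robustness}{"\n"}{end}' |
      all_volumes_healthy; do
    sleep 15
  done
}

# Usage: run wait_for_longhorn after each uncordon, before the next drain.
```

Detached volumes may report a robustness other than healthy, in which case this loop never finishes; filter them out or interrupt the loop if the cluster has any.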

Rock3 was the busiest drain by far. In addition to the usual control plane workloads, it hosted garage-0 (the ZFS-backed S3 storage), nfs-server-provisioner-0, and two Longhorn UI pods. Everything evicted cleanly, including OpenBao which was rescheduled to another node mid-drain. When rock3 came back up, OpenBao auto-unsealed via the Proxmox LXC transit instance without any manual intervention — exactly as designed.

Stop. Let Things Run. Then Continue.

After all four nodes were on Talos v1.12.6, the Kubernetes upgrade was the logical next step. It was also the end of the afternoon. The decision was to stop there, let the cluster run overnight, and pick it up the next morning.

This turned out to matter.

Bletchley runs an automated backup offload job overnight — Longhorn snapshots go to Garage S3, which then syncs to a Synology NAS over the local network. The next morning that job had failed. Digging into the Garage status revealed the cause: the pod was reporting NO ROLE ASSIGNED, a sign that something had not come back up cleanly after the rock3 node reboot. The upgrade itself had appeared to go smoothly — pod running, no obvious errors — but the problem only surfaced when the backup job tried to use Garage in anger.

The investigation led to an incorrect repair attempt (more on this below), which made things worse before the actual fix was found. But the key point is that without the overnight stop, the backup failure would not have been discovered until much later — possibly days, or not until a restore was needed. The problem was only visible because the full backup cycle had time to run and fail in the normal way.

This is how upgrades should be treated, especially on the first pass through a new update process. Each step is a checkpoint, not a handoff. After Talos, wait for regular workloads to run — backup jobs, certificate renewals, monitoring scrapes. If something is wrong, it will usually surface within hours of normal operation. If everything looks good after a full cycle, continue to the next step.

Rushing from Talos to Kubernetes to Helm in a single session is possible and might work fine. But it compresses the verification window to near zero. For a production-ish homelab where the update process itself is being learned, the slow approach pays for itself the first time something goes wrong quietly.

Upgrading Kubernetes

With all four nodes on Talos v1.12.6, Kubernetes could be upgraded. Talos handles this through its own command:

# Dry run first
talosctl upgrade-k8s \
  --nodes rock1.vluwte.nl \
  --to 1.35.2 \
  --dry-run

The dry run is worth running. It shows exactly which component images will be updated on which nodes, and flags any manifest changes before applying them. In this case the only manifest change was the kube-proxy image tag — everything else was already consistent.

# Apply
talosctl upgrade-k8s \
  --nodes rock1.vluwte.nl \
  --to 1.35.2

Despite only specifying rock1 as the target node, talosctl upgrade-k8s upgrades the entire cluster — it discovers all control plane and worker nodes automatically and performs a rolling update. The output is detailed:

updating "kube-apiserver" to version "1.35.2"
 > "10.0.140.11": starting update
 > update kube-apiserver: v1.35.0 -> 1.35.2
 > "10.0.140.11": successfully updated
...

After completion, all four nodes confirmed v1.35.2:

kubectl get nodes
NAME    STATUS   ROLES           AGE   VERSION
rock1   Ready    control-plane   58d   v1.35.2
rock2   Ready    control-plane   58d   v1.35.2
rock3   Ready    control-plane   58d   v1.35.2
rock4   Ready    <none>          58d   v1.35.2

Clean and uneventful.
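
The same confirmation can be scripted for the next pass. This is a small sketch that reads the kubectl get nodes --no-headers table (NAME, STATUS, ROLES, AGE, VERSION); nodes_on_version is a name invented for this example.

```shell
# Succeed when every node on stdin is Ready and reports the expected
# kubelet version. Empty input counts as a failure.
nodes_on_version() {
  awk -v want="$1" '
    $2 != "Ready" || $5 != want { print "check " $1 ": " $2 " " $5; bad = 1 }
    END { exit ((bad || NR == 0) ? 1 : 0) }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers | nodes_on_version v1.35.2
```

A node that is still cordoned shows a status such as Ready,SchedulingDisabled and is reported as needing a check.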

The Helm Pass: Eight Charts, Two Incidents

The Helm pass followed the dependency order established above. Most charts upgraded without issue. Two did not, and both failures are worth documenting. A third problem, with Garage, is covered separately below because it belongs in the timeline where it actually happened.

The OpenBao StatefulSet problem

The helm upgrade to 0.27.2 failed immediately:

Error: UPGRADE FAILED: cannot patch "openbao" with kind StatefulSet:
StatefulSet.apps "openbao" is invalid: spec: Forbidden: updates to
statefulset spec for fields other than 'replicas', 'ordinals',
'template'...

This was the same error that had left the release in failed state two weeks earlier. The chart update attempted to change an immutable StatefulSet field — the kind of change Kubernetes does not allow in-place. The fix is to delete the StatefulSet and let Helm recreate it:

kubectl delete statefulset openbao -n openbao
# then rerun helm upgrade

The PVC survives StatefulSet deletion, so no data is lost. OpenBao came back up within 21 seconds and auto-unsealed via the Proxmox LXC transit instance. The pod was confirmed unsealed before moving on.

This is now a known pattern for OpenBao upgrades — check for this error and be prepared to delete the StatefulSet. The auto-unseal architecture means the brief downtime during pod restart is not a significant concern.

Reviewing this afterwards: next time the plan is to use kubectl delete statefulset openbao -n openbao --cascade=orphan instead of a plain delete. The --cascade=orphan flag removes the StatefulSet object (the "contract" Kubernetes uses to manage the pod) while leaving the pod itself running. Helm then creates the new StatefulSet in its place. The pod stays up while the StatefulSet is recreated, and the helm upgrade still triggers the controlled rollout as normal.
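
Since this error is now a known pattern, the upgrade can be wrapped so that it is recognised and the next step is printed rather than guessed. The sketch below deliberately does not delete anything itself. It assumes the StatefulSet has the same name as the Helm release, as it does for openbao here; the function names are invented for this example.

```shell
# Succeed when text on stdin contains the error Kubernetes returns for
# forbidden changes to immutable StatefulSet fields.
is_immutable_sts_error() {
  grep -q "updates to statefulset spec for fields other than"
}

# Run a helm upgrade; on the immutable-field error, print the orphan
# delete to run by hand instead of deleting anything automatically.
# Arguments: release, namespace, then the remaining helm upgrade args.
upgrade_or_advise() {
  release=$1 ns=$2
  shift 2
  if out=$(helm upgrade "$release" "$@" -n "$ns" 2>&1); then
    printf '%s\n' "$out"
    return 0
  fi
  printf '%s\n' "$out" >&2
  if printf '%s\n' "$out" | is_immutable_sts_error; then
    echo "next: kubectl delete statefulset $release -n $ns --cascade=orphan"
    echo "then rerun the same helm upgrade"
  fi
  return 1
}

# Usage:
#   upgrade_or_advise openbao openbao <repo/chart> --version 0.27.2 -f <values-file>.yaml
```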

The Authelia schema change

The helm-diff run for Authelia failed before showing any diff:

Error: Failed to render chart: exit status 1:
values don't meet the specifications of the schema(s) in the following chart(s):
authelia:
- at '/configMap/session/cookies/0': additional properties 'authelia_url' not allowed

The authelia_url field in the values file was a workaround for a quirk in an earlier chart version. The new chart (0.10.58) removed that workaround from its schema because it now constructs the URL correctly from domain and subdomain alone. Removing authelia_url from the values file resolved the error, and the diff then showed only the expected changes: updated image tags and a set of new rate limiting defaults enabled by the new version.

This is exactly what helm-diff is for — catching a breaking values file change before it takes the service down rather than during the upgrade itself.

The Garage Incident: What the Overnight Stop Found

Garage runs as a single-instance StatefulSet on rock3, backed by the ZFS-mounted SATA SSDs. It was not touched during the Talos upgrade — no image change, no manifest change. But rock3 was drained and rebooted as part of the Talos pass, which means Garage came back up after the node restart along with everything else.

The next morning the backup offload job had failed. The Synology sync from Garage S3 was throwing errors. Checking the Garage logs and bucket status was the first step:

kubectl exec -n garage garage-0 -- /garage bucket list

The output showed NO ROLE ASSIGNED. This was unexpected — Garage had not been upgraded, only restarted as a side effect of the node drain. Interpreting this as the pod having lost its cluster identity, garage layout assign and garage layout apply were run to reassign the node — which wrote a new layout on top of the existing one. Garage began behaving worse, not better.

The correct interpretation of NO ROLE ASSIGNED in this context is a timing issue, not data loss. Garage runs as a single instance, and when the pod starts it needs to load its node key from persistent storage before it can serve requests properly. If the pod becomes ready and receives a status request before that read completes, it reports NO ROLE ASSIGNED. That can happen when the pod starts quickly after a node reboot but the underlying ZFS storage mount takes a moment longer to be fully usable. The identity is not missing; it just has not been loaded yet. The likely sequence was an out-of-order boot: pod up, storage not quite ready, identity load incomplete.

A clean pod restart gives the sequence a second chance to happen in the right order. The layout commands had added confusion by writing conflicting state on top of valid persistent data. The actual fix was:

kubectl rollout restart statefulset/garage -n garage

After that restart, Garage picked up its persistent node key from disk, all buckets came back, and the backup offload ran successfully on the next cycle.

Two lessons here. First, the obvious one: if Garage shows NO ROLE ASSIGNED after a pod restart, do not run any layout commands. Restart the pod and wait. It will self-recover from its persistent state.
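
The first lesson can be encoded as a guard that runs before anyone reaches for a layout command. This is a sketch that matches only the literal NO ROLE ASSIGNED string seen in this incident; it assumes the garage status subcommand is where that string appears, and garage_next_step is a name invented here.

```shell
# Decide the next step from Garage status text on stdin.
# Returns non-zero when the node reports NO ROLE ASSIGNED.
garage_next_step() {
  if grep -q "NO ROLE ASSIGNED"; then
    echo "restart the pod and wait; do not run layout commands"
    return 1
  fi
  echo "role assigned; no action needed"
}

# Usage against a live cluster:
#   kubectl exec -n garage garage-0 -- /garage status | garage_next_step
```

If the message persists after a clean restart, that is the point to investigate further rather than to rewrite the layout.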

Second, and less obvious: the problem was found because the overnight stop gave the backup job time to run and fail. If the Kubernetes and Helm upgrades had continued immediately after the Talos pass, the Garage issue would still have existed — but it would have been buried under several more layers of change. Finding it in isolation, with only the Talos upgrade having happened, made the investigation much simpler. The backup failure pointed directly at storage. One layer changed, one place to look.

What Was Deferred and Why

Two items from the Nova output were not upgraded during this pass.

Traefik — Nova reported 40.0.0-ea.3 as the latest, but the -ea suffix indicates early access. Staying on 39.0.6 until a stable 40.x release is available is the right call. The ingress controller is not the place for experimental versions.

Grafana — The grafana-community Helm chart (installed from the standard Helm repo) is deprecated. The active chart has moved to an OCI registry and requires migrating the Helm release to point at the new source. Nova showed the installed version as current, but the upstream release is significantly ahead. This migration is real work — it is treated as a separate task rather than a routine chart bump.

Lessons Learned

Running a full update pass across a mature cluster taught a few things that are not obvious from documentation:

Audit before you update. The helm list -A audit caught the OpenBao failed state before the upgrade started. Starting an upgrade on a cluster with a known broken Helm release is asking for confusion.

Do a final version check immediately before upgrading, not just at planning time. Nova tells you what was outdated when you ran it. By the time you actually run the upgrade command, newer versions may have shipped. A quick upstream release page check immediately before applying is cheap insurance — it saved an extra Garage upgrade in this pass.

One step at a time, then wait. After the Talos pass, stopping overnight before continuing to Kubernetes gave the cluster time to run its normal workloads. The backup job failed. That failure pointed directly at a Garage problem with only one layer changed — one place to look. Compressing everything into a single session would have buried that signal under several more changes. For a first full update pass, slow is fast.

Nova is a useful starting point, not a definitive answer. It gives a fast overview of what is outdated, but its data can be stale and it does not distinguish stable from pre-release. Always cross-check with the upstream release page before upgrading.

The helm-diff plugin is mandatory, not optional. The Authelia schema change would have been a service outage without it. Running the diff first turned it into a two-minute values file fix.

Pod Disruption Budgets are your friend during drains. The repeated eviction refusals from Longhorn's instance-manager pods are PDB protection working as designed — Longhorn will not allow a node drain until it is confident the remaining nodes can carry the replication load. kubectl drain retries automatically. Expect anywhere from one to six retry cycles per node depending on how much replication headroom exists at that point in the sequence.

Check Longhorn volume health explicitly before draining the next node. Volumes go degraded transiently after each uncordon as replication rebuilds. Running kubectl get volumes.longhorn.io -n longhorn-system and waiting for all volumes to show healthy is the gate that makes the node-by-node approach safe. Skip it and you risk draining into an already-degraded replication state.

Auto-unseal earns its keep during upgrades. OpenBao was evicted and rescheduled multiple times across the Talos pass. Each time it came back sealed, and each time it unsealed itself via the Proxmox LXC transit instance without any intervention. An upgrade pass that required manual unsealing after each node would be significantly more tedious.

What's Running Now

After the complete pass, the cluster is running:

Talos: v1.12.6 on all four nodes
Kubernetes: v1.35.2 on all four nodes
Longhorn: 1.11.1 — all volumes healthy, all nodes schedulable
cert-manager: v1.20.2 — all certificates valid
Prometheus: 29.2.1 — scraping normally
node-exporter: 4.53.1
OpenBao: 0.27.2 — unsealed, active
external-secrets: 2.3.0 — all ExternalSecrets synced
Authelia: 0.10.58 — ForwardAuth functioning
Forgejo: 17.0.0 (Forgejo 15 LTS)
Garage: v2.3.0 — buckets accessible, ZFS pool intact
Traefik: 39.0.6 — held, waiting for stable 40.x
Grafana: 10.5.15 — held, OCI migration pending

← Previous: Backup Controller v0.11.0

Questions or suggestions? Leave a comment below or reach out at igor@vluwte.nl.