Upgrading Talos Linux Nodes: Adding a Missing Extension

How I discovered a missing system extension before it caused problems, and upgraded all four Bletchley cluster nodes in 15 minutes. Plus: what changes when you have running workloads.

Introduction

Sometimes a gap in a setup shows up at exactly the right moment. This post documents how I upgraded all four nodes of the Bletchley cluster to add a missing system extension — and what to keep in mind when doing this with running workloads.


🏠 This is part of the Homelab Journey series - building a production Kubernetes cluster from scratch.


Why I Needed to Upgrade

While preparing to install Longhorn for distributed storage, I discovered that my Talos image was missing a required system extension: util-linux-tools.

Longhorn needs this extension for fstrim — the tool it uses to reclaim unused space from volumes. Without it, Longhorn will install and appear to work, but volume trimming will fail silently. Not something to discover after running workloads for months.

My original schematic had:

  • siderolabs/iscsi-tools — for iSCSI/persistent volumes
  • siderolabs/nfsd — for future NFS exports from the SATA drives

What it was missing:

  • siderolabs/util-linux-tools — for Longhorn volume trimming via fstrim

The good news: catching this before installing Longhorn meant a clean fix. The better news: the upgrade process in Talos is remarkably straightforward.

How Talos Image Upgrades Work

This is one of the things that makes Talos genuinely different from traditional Linux distributions. There is no package manager, no apt upgrade, no SSH into the node to install something. Instead:

  1. I define my desired system state (including extensions) as a schematic in the Image Factory
  2. The factory produces a new installer image with a unique schematic ID
  3. I tell talosctl to upgrade a node to that new image
  4. The node pulls the image itself, applies it, and reboots
  5. Atomic swap: the new image is written to a passive partition while the current version keeps running — on reboot the node flips to the new partition, and if it fails to boot it flips back automatically

The laptop never touches the actual image files. The node does all the work, and the atomic swap means there is always a safe path back if something goes wrong. This is meaningfully safer than a traditional package upgrade where there is no automatic fallback.
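
The atomic swap also has a manual lever. If a node boots the new image but misbehaves, talosctl can flip it back to the previous partition on demand; the automatic fallback only covers outright boot failures. A sketch, assuming the node is reachable and the old image is still in its passive slot:

```shell
# Revert rock1 to its previous boot partition. This is a one-command
# undo for a bad upgrade: the node reboots into the old image.
talosctl -n rock1.vluwte.nl rollback
```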

Step 1: Rebuild the Image

I headed to factory.talos.dev and added siderolabs/util-linux-tools to the existing schematic. The final schematic for the Bletchley cluster looks like this:

overlay:
  image: siderolabs/sbc-rockchip
  name: turingrk1
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools
      - siderolabs/nfsd
      - siderolabs/util-linux-tools

The factory produced a new schematic ID:

9c8ed39c24b3518d9538cd50bdcb3d267da3d26623cbf97d6340406b0dab7f79

Previous schematic ID (for reference):

d7a56218964f9ec22ae62c243a60d23a76853a82dc435eec34ed1cb2b5aabfe3

Nothing to download. The factory serves the image directly to each node during the upgrade.
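
The web UI is not the only way to get the ID. The Image Factory also exposes an HTTP API: POSTing the schematic YAML to its /schematics endpoint returns the ID as JSON. A sketch, assuming the schematic above is saved locally as bletchley-schematic.yaml (a filename of my choosing):

```shell
# Register the schematic with the Image Factory and read back its ID.
# The same schematic always hashes to the same ID, so this is safe to
# re-run; the response has the shape {"id":"9c8ed39c..."}.
curl -s -X POST --data-binary @bletchley-schematic.yaml \
  https://factory.talos.dev/schematics
```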

Step 2: Upgrade Each Node

The upgrade command:

talosctl -n rock1.vluwte.nl upgrade \
  --image factory.talos.dev/metal-installer/9c8ed39c24b3518d9538cd50bdcb3d267da3d26623cbf97d6340406b0dab7f79:v1.12.4 \
  --preserve

Since Talos 1.8, the system disk is never wiped during upgrades — --preserve is now the default behaviour for talosctl upgrade. I still include it explicitly out of habit and for clarity, but it is no longer required. Worth knowing if you are reading older documentation that treats it as critical.

I upgraded one node at a time, verifying each before moving to the next.
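
Even one node at a time, the commands can be generated rather than typed. A dry-run sketch that prints the upgrade command for each node in order; removing the echo would execute them for real, which only makes sense with a health check between nodes:

```shell
# Build the image reference once, then print the per-node upgrade
# commands. This is a dry run: drop the echo to actually invoke talosctl.
IMAGE="factory.talos.dev/metal-installer/9c8ed39c24b3518d9538cd50bdcb3d267da3d26623cbf97d6340406b0dab7f79:v1.12.4"

for node in rock1 rock2 rock3 rock4; do
  echo talosctl -n "${node}.vluwte.nl" upgrade --image "${IMAGE}" --preserve
done
```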

What Happens During the Upgrade

When the upgrade command runs, talosctl watches the node and reports progress. The full sequence:

  1. Connect — talosctl establishes a connection, gets an actor ID
  2. Download — node pulls the new installer image from factory.talos.dev
  3. Verify — image checksum validated
  4. Stop services — gracefully stops all running workloads and Talos services
  5. Apply — new image written to boot partition (old partition kept as fallback)
  6. Reboot — node reboots into the new image (the terminal shows unavailable, retrying...)
  7. Boot — node comes back up, Talos starts, Kubernetes services restart
  8. Rejoin — node rejoins the cluster
  9. Healthy — talosctl confirms the node is back and healthy

The unavailable, retrying... message at step 6 is normal — the node is rebooting. talosctl keeps polling and reconnects automatically once it is back.

While waiting, I like to open a second terminal and run:

talosctl -n rock1.vluwte.nl dmesg --follow

This streams the node's kernel log as the upgrade starts. The connection drops when the node reboots, and after a short wait the command reconnects and resumes streaming in real time. Much better than staring at a hanging terminal wondering if anything is happening.

Step 3: Verify

After each node completed, I verified the extensions were present:

talosctl -n rock1.vluwte.nl get extensions

Expected output:

NODE              NAMESPACE   TYPE              ID            VERSION   NAME               VERSION
rock1.vluwte.nl   runtime     ExtensionStatus   0             1         iscsi-tools        v0.2.0
rock1.vluwte.nl   runtime     ExtensionStatus   1             1         nfsd               v1.12.4
rock1.vluwte.nl   runtime     ExtensionStatus   2             1         util-linux-tools   2.41.2
rock1.vluwte.nl   runtime     ExtensionStatus   3             1         schematic          9c8ed39c24b3518d9538cd50bdcb3d267da3d26623cbf97d6340406b0dab7f79
rock1.vluwte.nl   runtime     ExtensionStatus   modules.dep   1         modules.dep        6.18.9-talos

Three things to confirm: all three extensions are present, the schematic ID matches the new one, and the kernel version is consistent across nodes.

To confirm the node actually rebooted at the expected time:

talosctl -n rock1.vluwte.nl dmesg | grep "Booting Linux"

This returns a timestamp of the last boot — a simple sanity check that the upgrade did what was expected.

After all four nodes were done, a final cluster-wide check:

kubectl get nodes -o wide

All nodes should show Ready with a consistent Talos version and Kubernetes version.
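
The per-node spot checks can also be rolled into one pass. A sketch, assuming talosctl is already configured with credentials for all four nodes; any node that does not print a matching line is still on the old image:

```shell
# Check that every node reports the new schematic ID in its extension
# list. Grepping a unique prefix of the ID is enough here.
for node in rock1 rock2 rock3 rock4; do
  printf '%s: ' "${node}"
  talosctl -n "${node}.vluwte.nl" get extensions | grep 9c8ed39c
done
```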

If You Have Running Workloads

My cluster had no workloads yet, which made this straightforward. With running workloads there are additional steps before upgrading each node.

1. Drain the node first

Draining tells Kubernetes to gracefully reschedule all pods away from the node before it goes down:

kubectl drain rock1 --ignore-daemonsets --delete-emptydir-data

  • --ignore-daemonsets is needed because DaemonSet pods (like Longhorn managers) cannot be moved and are handled automatically
  • --delete-emptydir-data is needed if any pods use temporary emptyDir volumes — that data will be lost

2. Wait for pods to reschedule

Before running the upgrade, confirm the node has no running pods (except DaemonSets):

kubectl get pods --all-namespaces --field-selector spec.nodeName=rock1

3. Run the upgrade

Same command as above, with --preserve.

4. Uncordon the node after it comes back

After the upgrade completes and the node is verified, allow Kubernetes to schedule pods on it again:

kubectl uncordon rock1

5. Wait for pods to settle before moving to the next node

Give Kubernetes a moment to reschedule workloads back and confirm everything is healthy before draining the next node. For clusters with Longhorn storage, also check that any volume replicas that were rebuilding have completed — taking down a second node while a Longhorn rebuild is still in progress on the first is not something to risk.

The full sequence for a node upgrade with running workloads becomes:

drain → verify pods moved → upgrade → verify extensions → uncordon → verify healthy → next node
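
Stitched together, that sequence sketches out as a script. Untested against live workloads; the wait timeout and the settle delay are assumptions to tune, and for Longhorn clusters a replica-health check would belong before each new iteration:

```shell
#!/bin/sh
set -e

IMAGE="factory.talos.dev/metal-installer/9c8ed39c24b3518d9538cd50bdcb3d267da3d26623cbf97d6340406b0dab7f79:v1.12.4"

for node in rock1 rock2 rock3 rock4; do
  # Drain: move workloads off the node (also leaves it cordoned)
  kubectl drain "${node}" --ignore-daemonsets --delete-emptydir-data

  # Upgrade: talosctl watches the node through reboot and rejoin
  talosctl -n "${node}.vluwte.nl" upgrade --image "${IMAGE}" --preserve

  # Wait until Kubernetes sees the node as Ready again
  kubectl wait node "${node}" --for=condition=Ready --timeout=10m

  # Uncordon: allow scheduling again, then let pods settle
  kubectl uncordon "${node}"
  sleep 120
done
```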

I plan to document this in practice in a future post when a real Talos version upgrade is due — with running workloads, Longhorn volumes in active use, and the full procedure under realistic conditions. That will be the proper test.

Results

All four nodes upgraded cleanly in sequence. Total time was roughly 15 minutes for the cluster — about 3-4 minutes per node including download, reboot, and verification.

NAME    STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE          KERNEL-VERSION   CONTAINER-RUNTIME
rock1   Ready    control-plane   2d    v1.35.0   10.0.140.11   <none>        Talos (v1.12.4)   6.18.9-talos     containerd://2.1.6
rock2   Ready    control-plane   2d    v1.35.0   10.0.140.12   <none>        Talos (v1.12.4)   6.18.9-talos     containerd://2.1.6
rock3   Ready    control-plane   2d    v1.35.0   10.0.140.13   <none>        Talos (v1.12.4)   6.18.9-talos     containerd://2.1.6
rock4   Ready    <none>          47h   v1.35.0   10.0.140.14   <none>        Talos (v1.12.4)   6.18.9-talos     containerd://2.1.6

With the correct extensions in place across all nodes, Longhorn installation could proceed. That post — including the Talos-specific configuration that the standard Longhorn docs don't cover — was published a few days ago.


← Previous: Building the Bletchley Cluster


Questions or suggestions? Leave a comment below or reach out at igor@vluwte.nl.