Building the Bletchley Cluster: Installing Talos Linux on TuringPi 2

Complete guide to installing Talos Linux v1.12.4 on TuringPi 2 RK1 modules — VLAN configuration, HA control plane, and every command along the way.

Introduction

Three weeks ago I wrote about planning this reinstallation. Today I'm documenting the actual installation — every command, every error, every decision. The cluster is running. Here's exactly how it happened.


🏠 This is part of the Homelab Journey series - building a production Kubernetes cluster from scratch.


What I'm Building

The cluster is named bletchley — four RK1 modules in a TuringPi 2, running Talos Linux v1.12.4 on Kubernetes v1.35.0.

Node   Role           IP           Storage
rock1  Control Plane  10.0.140.11  32GB eMMC + 250GB NVMe
rock2  Control Plane  10.0.140.12  32GB eMMC + 250GB NVMe
rock3  Control Plane  10.0.140.13  32GB eMMC + 250GB NVMe + 2x 900GB SATA
rock4  Worker         10.0.140.14  32GB eMMC + 250GB NVMe

Key design decisions:

  • Nodes live on VLAN 140 (10.0.140.0/24) — isolated from the rest of my network
  • VIP 10.0.140.10 floats across control plane nodes for HA API access
  • Talos installs to eMMC, leaving NVMe and SATA free for Longhorn storage
  • DHCP reservations give fixed IPs without static configuration in Talos
  • allowSchedulingOnControlPlanes: true — small cluster, no reason to waste resources

Phase 0: Preparation

Before touching the nodes, a few things need to be in place on the network and on the machine I'll be managing the cluster from.

Network

Required:

  • VLAN 140 (optional) — I isolate the cluster nodes on a dedicated VLAN, but this isn't required. If you're running a flat network, the nodes will simply stay on your management network and you can skip the VLAN configuration in the patch files. If you do use a VLAN: the switch port connected to the TuringPi needs to be configured as a trunk port, carrying both the untagged management VLAN and tagged VLAN 140. Nodes boot untagged (picking up a temporary IP on the management network), receive their config, and then come up on VLAN 140. Without a trunk port, the node will disappear from the network after the config is applied and never come back.
  • DHCP server — required regardless of whether you use a VLAN. I run DHCP on VLAN 140 with reservations mapping each node's MAC address to a fixed IP. On a flat network the setup is the same, just without the VLAN scope. The reservations are what matter — Talos config stays simple because DHCP always hands out the same address to the same node.
  • VIP 10.0.140.10 excluded from the DHCP pool. The VIP floats between control plane nodes and must never be assigned dynamically to anything else.
  • Internet access from VLAN 140 — the nodes need to reach DNS, HTTPS (for pulling container images), and NTP on first boot.
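As an illustration of the DHCP setup — assuming a dnsmasq-based server, with placeholder MAC addresses — the reservations and pool might look like this:

```
# VLAN 140 scope: the dynamic pool starts above the reserved range, so the
# node IPs (10.0.140.11-14) and the VIP (10.0.140.10) are never handed out
dhcp-range=10.0.140.100,10.0.140.199,12h
dhcp-host=aa:bb:cc:00:00:01,10.0.140.11,rock1
dhcp-host=aa:bb:cc:00:00:02,10.0.140.12,rock2
dhcp-host=aa:bb:cc:00:00:03,10.0.140.13,rock3
dhcp-host=aa:bb:cc:00:00:04,10.0.140.14,rock4
```

The same shape works on any DHCP server: a pool that excludes the static range, plus one MAC-to-IP reservation per node.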

Optional:

  • DNS entries — not required, but I prefer connecting by name rather than IP. I created rock1-4.vluwte.nl pointing to the node IPs and bletchley.vluwte.nl pointing to the VIP. This also prepares the ground for SSL certificates later.
  • NTP server — Talos defaults to public NTP pools. I run my own NTP server (ntp.luwte.net) and configured the nodes to use it. If you don't have an internal NTP server, simply leave this out of the patch files and Talos will use the defaults.
  • SSL certificates — not needed for installation, but worth planning for. Having proper DNS entries in place now means you can add certificates later without reconfiguring the cluster. I'll cover this in a future post.
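For the optional DNS entries, the records amount to the following — shown here in BIND-style zone syntax as a sketch (adapt to whatever your DNS server uses):

```
rock1.vluwte.nl.      IN A   10.0.140.11
rock2.vluwte.nl.      IN A   10.0.140.12
rock3.vluwte.nl.      IN A   10.0.140.13
rock4.vluwte.nl.      IN A   10.0.140.14
bletchley.vluwte.nl.  IN A   10.0.140.10   ; the VIP
```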

Laptop

Install talosctl and kubectl:

# macOS
brew install siderolabs/tap/talosctl
brew install kubectl

# Verify
talosctl version --client
# Client: Tag: v1.12.4

kubectl version --client
# Client Version: v1.35.0

Both are needed — talosctl for talking to Talos nodes, kubectl for talking to Kubernetes once the cluster is up.

Create the working directory where all cluster files will live:

mkdir -p ~/talos-cluster/bletchley
cd ~/talos-cluster/bletchley

Phase 1: Building the Talos Image

Talos images aren't one-size-fits-all. The RK1 modules need a specific ARM64 image with the right hardware support and the system extensions I'll need later.

I used Talos Image Factory to build a custom image with three extensions:

  • iscsi-tools — required for Longhorn storage
  • util-linux-tools — for Longhorn volume trimming via fstrim
  • nfsd — for future NFS exports from the SATA drives

The resulting schematic ID is d7a56218964f9ec22ae62c243a60d23a76853a82dc435eec34ed1cb2b5aabfe3. I'm noting this here because it's how you reproduce this exact image — if something breaks six months from now and I need to reflash, I come back to this ID.
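The schematic behind an ID like this is itself just a small YAML document submitted to the factory. Roughly — with extension names as they appear in the Image Factory catalogue; I haven't verified this exact document hashes to the ID above:

```yaml
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools
      - siderolabs/util-linux-tools
      - siderolabs/nfsd
```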

# Download the compressed image (laptop)
curl -L -o metal-arm64.raw.xz \
  "https://factory.talos.dev/image/d7a56218964f9ec22ae62c243a60d23a76853a82dc435eec34ed1cb2b5aabfe3/v1.12.4/metal-arm64.raw.xz"

# Decompress for CLI flashing (BMC requires uncompressed)
xzcat metal-arm64.raw.xz > metal-arm64.raw

# Copy to TuringPi BMC
scp metal-arm64.raw root@turingpi:/mnt/sdcard/talos-1.12.4-with-nfsd+util-linux-tools+iscsi/

Why decompress? The TuringPi BMC CLI (tpi flash) requires a raw uncompressed image. The web UI accepts .xz but the CLI doesn't. I learned this the first time around.


Phase 2: Flashing the Nodes

With the image on the BMC's SD card, flashing is straightforward. I used the CLI for nodes 1-3 and the web UI for node 4 (just to document both methods).

CLI Flashing (Nodes 1-3)

SSH into the BMC first:

ssh root@turingpi

Then for each node:

# Power off the node (if running)
tpi power off -n 1

# Flash the image
tpi flash -l -i /mnt/sdcard/talos-1.12.4-with-nfsd+util-linux-tools+iscsi/metal-arm64.raw -n 1

# Power on
tpi power on -n 1

# Watch the UART to confirm boot
tpi uart -n 1 get

The -l flag is "local" — reads from the BMC filesystem rather than fetching over the network. Flashing takes a couple of minutes per node.
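Running the three commands per node gets repetitive across three nodes. As a sketch, the whole sequence can be generated with a loop — shown here as a dry run that prints the commands (review the output, then pipe it to `sh` on the BMC to execute):

```shell
# Print the power-off/flash/power-on sequence for nodes 1-3 (dry run)
IMG=/mnt/sdcard/talos-1.12.4-with-nfsd+util-linux-tools+iscsi/metal-arm64.raw
for n in 1 2 3; do
  printf 'tpi power off -n %s\n' "$n"
  printf 'tpi flash -l -i %s -n %s\n' "$IMG" "$n"
  printf 'tpi power on -n %s\n' "$n"
done
```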

What I saw in UART after boot:

[talos] entering maintenance mode

That's the Talos maintenance mode — the node is running, waiting for configuration. It picks up a temporary untagged IP from DHCP (mine were 192.168.0.110-113).

Web UI Flashing (Node 4)

Navigate to https://turingpi.luwte.net/ → Flash Node tab → select Node 4 → fill in the file path and optionally the SHA-256 for verification, then click Install OS:

Browser showing TuringPi Flash Node tab with Node 4 selected, metal-arm64.raw.xz and SHA-256 hash entered
Flash Node tab ready to go — node selected, image file and SHA-256 filled in, Install OS button waiting.

The web UI uploads the .xz file to the BMC first. It decompresses on the fly — no need to decompress manually like the CLI method:

TuringPi web UI showing image upload progress to Node 4 at 27.09 MB of 88.36 MB transferred
Upload in progress — the compressed image is being transferred to the BMC at 27.09 MB / 88.36 MB.

Once the upload is complete, the BMC verifies the SHA-256 checksum and writes the uncompressed image to the node's eMMC. This is the longer of the two phases:

TuringPi web UI showing CRC check and flash in progress with 1.19 GB written to Node 4
CRC verified, flashing in progress — 1.19 GB written to eMMC. The status reads "Checking CRC and flashing image to node 4..."

After flashing completes, the node is powered off. To power it back on, click Edit first — the power toggles are read-only until edit mode is active:

TuringPi web UI Nodes tab showing Node 1-3 powered on and Node 4 powered off after flashing
Node 4 is off after flashing. Click Edit, then use the toggle to power it on.

Verifying All Nodes Are Ready

After flashing all four nodes:

Node   Maintenance IP   Status
rock1  192.168.0.110    Maintenance ✓
rock2  192.168.0.111    Maintenance ✓
rock3  192.168.0.112    Maintenance ✓
rock4  192.168.0.113    Maintenance ✓

Node 3 showed a SATA handshake error during boot — it recovered immediately. It's the SATA controller initialising the two 900GB drives. Not a problem.
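Rather than watching UART per node, you can also probe each node's maintenance API from the laptop. Talos accepts unauthenticated resource reads in maintenance mode via --insecure; sketched here as a dry run that prints one probe command per temporary IP:

```shell
# Print a maintenance-mode probe per node (dry run; run the output to execute)
for ip in 192.168.0.110 192.168.0.111 192.168.0.112 192.168.0.113; do
  printf 'talosctl -n %s get disks --insecure\n' "$ip"
done
```

A node that answers is up and waiting for config; a node that times out hasn't finished booting.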


Phase 3: Generating Configuration

Generating the Base Configs

talosctl gen config bletchley https://10.0.140.10:6443 \
  --output-dir ~/talos-cluster/bletchley

This generates three files:

  • controlplane.yaml — base control plane configuration
  • worker.yaml — base worker configuration
  • talosconfig — credentials for talosctl to connect to the cluster

The URL https://10.0.140.10:6443 is the VIP — this is baked into the cluster's TLS certificates, so it's important to get this right from the start.

The Control Plane Patch

The base configs need customising for my setup. Create cp.patch.yaml:

machine:
  install:
    disk: /dev/mmcblk0
  time:
    servers:
      - ntp.luwte.net
  network:
    interfaces:
      - interface: end0
        vlans:
          - vlanId: 140
            dhcp: true
            vip:
              ip: 10.0.140.10
cluster:
  allowSchedulingOnControlPlanes: true

Why each setting:

  • disk: /dev/mmcblk0 — explicitly install to eMMC, not NVMe. Without this, Talos might pick the NVMe and I'd lose my Longhorn storage disk.
  • time.servers — my internal NTP server. Talos defaults to public NTP pools; I want time sync going to my own server.
  • interface: end0 — the RK1's ethernet interface. The boot logs showed end0, not eth0.
  • vlanId: 140 — put the node on VLAN 140 with DHCP. My switch has DHCP reservations that give each node a fixed IP.
  • vip.ip: 10.0.140.10 — the floating VIP shared across control plane nodes.
  • allowSchedulingOnControlPlanes: true — four nodes total, I want workloads running everywhere.

The Worker Patch

Create worker.patch.yaml:

machine:
  install:
    disk: /dev/mmcblk0
  time:
    servers:
      - ntp.luwte.net
  network:
    interfaces:
      - interface: end0
        vlans:
          - vlanId: 140
            dhcp: true

Same as control plane but without the VIP — workers don't participate in VIP management.

A Note on IPv6

The nodes pick up IPv6 addresses automatically (2a10:3781:4bc9:...). If your network doesn't have IPv6 routing on this subnet, you'll see NTP timeout errors like this in the boot logs:

time query error with server "2a10:3781:4bc9:0:92ec:77ff:fe13:a988": i/o timeout

This is harmless — Talos falls back to IPv4. Disabling IPv6 in the patch is an option if it bothers you, but it resolves itself.
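If you do want IPv6 off entirely, one option — a sketch I didn't apply myself — is disabling it at the kernel level via the install section of the patch. Note that extraKernelArgs only take effect on install or upgrade, not on a plain config apply:

```yaml
machine:
  install:
    extraKernelArgs:
      - ipv6.disable=1
```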


Phase 4: Applying Configuration

Configure talosctl Endpoints

The generated talosconfig has credentials but no endpoints — the nodes didn't have their final IPs when the config was generated. Set them now:

export TALOSCONFIG=~/talos-cluster/bletchley/talosconfig
talosctl config endpoint 10.0.140.11 10.0.140.12 10.0.140.13
talosctl config node 10.0.140.11

Apply to the Control Plane Nodes

Apply to each node using the maintenance mode IP (the temporary untagged address):

# rock1
talosctl apply-config --insecure \
  --nodes 192.168.0.110 \
  --config-patch @cp.patch.yaml \
  --file controlplane.yaml

# rock2
talosctl apply-config --insecure \
  --nodes 192.168.0.111 \
  --config-patch @cp.patch.yaml \
  --file controlplane.yaml

# rock3
talosctl apply-config --insecure \
  --nodes 192.168.0.112 \
  --config-patch @cp.patch.yaml \
  --file controlplane.yaml

Expected behaviour: The command sends the config and then the connection drops with a timeout or graceful_stop error. This is normal — the node received the config, removed its untagged IP, and came back up on VLAN 140. You'll never see a clean success response.

error applying new configuration: rpc error: code = Unavailable desc = closing transport due 
to: connection error ... received prior goaway: code: NO_ERROR, debug data: "graceful_stop"

That error means success.

What Happens During Boot

Watching the UART during rock1's first boot with the new config shows the network transition clearly:

[talos] removed address 192.168.0.110/24 from "end0"
[talos] created new link ... "end0.140", "kind": "vlan"
[talos] assigned address "10.0.140.11/24" ... "link": "end0.140"
[talos] setting hostname ... "hostname": "rock1", "domainname": "vluwte.nl"
[talos] created route ... "gateway": "10.0.140.1" ... "link": "end0.140"

The node drops off the untagged network, creates a VLAN subinterface, and comes back on 10.0.140.11. The DHCP reservation kicks in and it gets exactly the IP I expected.

The initial NTP hostname lookup failure (a "server misbehaving" DNS error) is also normal — it happens during the brief window when the DNS resolvers switch from the DHCP-provided defaults (1.1.1.1, 8.8.8.8) to my internal servers (10.0.0.4, 10.0.0.13). It resolves itself.

Apply Worker Config to rock4

talosctl apply-config --insecure \
  --nodes 192.168.0.113 \
  --config-patch @worker.patch.yaml \
  --file worker.yaml
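For reference, the four applies compress into one loop over ip/base/patch triples — shown as a dry run that prints the commands so you can review them before piping the output to `sh`:

```shell
# Dry run: print all four apply-config invocations.
# IPs are the temporary maintenance-mode addresses from Phase 2.
while read -r ip base patch; do
  printf 'talosctl apply-config --insecure --nodes %s --config-patch @%s --file %s\n' \
    "$ip" "$patch" "$base"
done <<'EOF'
192.168.0.110 controlplane.yaml cp.patch.yaml
192.168.0.111 controlplane.yaml cp.patch.yaml
192.168.0.112 controlplane.yaml cp.patch.yaml
192.168.0.113 worker.yaml worker.patch.yaml
EOF
```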

Phase 5: Bootstrapping etcd

With all four nodes configured and running, etcd is waiting on every node. I can see this in the UART logs:

etcd is waiting to join the cluster, if this node is the first node in the cluster, 
please run `talosctl bootstrap` against one of the following IPs:
[10.0.140.11 ...]

Apply config to all nodes before bootstrapping. Bootstrap tells one node to initialise a new etcd cluster. If other control plane nodes aren't ready yet, they'll fail to join.

Before bootstrapping I ran a health check to confirm all three control plane nodes were in the expected state:

igor@granite ~ % talosctl -n 10.0.140.11 health
discovered nodes: ["10.0.140.11" "10.0.140.12" "10.0.140.13"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: 3 errors occurred:
* 10.0.140.11: service "etcd" not in expected state "Running": current state [Preparing] Running pre state
* 10.0.140.12: service "etcd" not in expected state "Running": current state [Preparing] Running pre state
* 10.0.140.13: service "etcd" not in expected state "Running": current state [Preparing] Running pre state

This is exactly what I wanted to see. Two things to confirm before proceeding: all three nodes are discovered (10.0.140.11, 10.0.140.12, 10.0.140.13), and etcd is in Preparing state on all of them — meaning they're waiting for bootstrap, not stuck or erroring.

Then bootstrap:

igor@granite ~ % talosctl bootstrap -n 10.0.140.11
igor@granite ~ %

No output — just the prompt returning immediately. That's the expected success response. The bootstrap request was sent to rock1 and etcd will now initialise in the background.

Now watch it take effect:

igor@granite ~ % talosctl -n 10.0.140.11 health
discovered nodes: ["10.0.140.10" "10.0.140.12" "10.0.140.13"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: ...
waiting for apid to be ready: OK
waiting for all nodes memory sizes: ...
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: ...
waiting for all nodes disk sizes: OK
waiting for no diagnostics: ...
waiting for no diagnostics: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: ...
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: ...
waiting for all k8s nodes to report: can't find expected node with IPs ["10.0.140.12" ...]
waiting for all k8s nodes to report: OK
waiting for all control plane static pods to be running: ...
waiting for all control plane static pods to be running: OK
waiting for all control plane components to be ready: ...
waiting for all control plane components to be ready: can't find expected node with IPs ["10.0.140.10" "10.0.140.11" ...]
waiting for all control plane components to be ready: can't find expected node with IPs ["10.0.140.12" ...]
waiting for all control plane components to be ready: OK
waiting for all k8s nodes to report ready: ...
waiting for all k8s nodes to report ready: OK
waiting for kube-proxy to report ready: ...
waiting for kube-proxy to report ready: OK
waiting for coredns to report ready: ...
waiting for coredns to report ready: OK
waiting for all k8s nodes to report schedulable: ...
waiting for all k8s nodes to report schedulable: OK

A few things to note in this output. The discovered nodes list now shows 10.0.140.10 (the VIP) instead of 10.0.140.11 — rock1 is now reachable via the VIP, which is correct. The can't find expected node errors are transient — Kubernetes was still registering nodes at that point, and they resolved themselves within seconds. Every check ends with OK, which is what matters.


Phase 6: Verifying the Cluster

Get kubeconfig

The cluster is running but I can't talk to it with kubectl yet — for that I need a kubeconfig file. Talos generates this from the cluster itself, pulling the certificates and endpoint information that kubectl needs to authenticate and connect. I pull it directly from rock1 and store it in the same directory as the other cluster files:

talosctl kubeconfig ~/talos-cluster/bletchley/kubeconfig -n 10.0.140.11
export KUBECONFIG=~/talos-cluster/bletchley/kubeconfig

The export sets the environment variable for the current session so kubectl knows which config to use. Once this is added to ~/.zshrc (as covered in "Setting Environment Variables at Login" below), it won't need to be done manually again.

Check Nodes

kubectl get nodes
NAME    STATUS   ROLES           AGE     VERSION
rock1   Ready    control-plane   8m37s   v1.35.0
rock2   Ready    control-plane   8m31s   v1.35.0
rock3   Ready    control-plane   9m2s    v1.35.0
rock4   Ready    <none>          2m20s   v1.35.0

All four nodes Ready. Kubernetes v1.35.0.

Check System Pods

igor@granite bletchley % kubectl get pods -A
NAMESPACE     NAME                              READY   STATUS    RESTARTS         AGE
kube-system   coredns-7859998f6-68v26           1/1     Running   0                9m23s
kube-system   coredns-7859998f6-844kv           1/1     Running   0                9m23s
kube-system   kube-apiserver-rock1              1/1     Running   0                8m50s
kube-system   kube-apiserver-rock2              1/1     Running   0                8m53s
kube-system   kube-apiserver-rock3              1/1     Running   0                9m25s
kube-system   kube-controller-manager-rock1     1/1     Running   2 (9m50s ago)    8m50s
kube-system   kube-controller-manager-rock2     1/1     Running   2 (9m44s ago)    8m53s
kube-system   kube-controller-manager-rock3     1/1     Running   2 (9m41s ago)    9m25s
kube-system   kube-flannel-2dz7z                1/1     Running   0                9m1s
kube-system   kube-flannel-5l4vr                1/1     Running   0                2m44s
kube-system   kube-flannel-bhzqr                1/1     Running   0                9m26s
kube-system   kube-flannel-mg5jd                1/1     Running   0                8m55s
kube-system   kube-proxy-jk8fv                  1/1     Running   0                2m44s
kube-system   kube-proxy-ntjrq                  1/1     Running   0                8m55s
kube-system   kube-proxy-q4xqm                  1/1     Running   0                9m26s
kube-system   kube-proxy-vnzt2                  1/1     Running   0                9m1s
kube-system   kube-scheduler-rock1              1/1     Running   3 (9m33s ago)    8m50s
kube-system   kube-scheduler-rock2              1/1     Running   2 (9m44s ago)    8m53s
kube-system   kube-scheduler-rock3              1/1     Running   2 (9m41s ago)    9m25s
igor@granite bletchley %

Everything running. A quick sanity check on what's here: three kube-apiserver, kube-controller-manager and kube-scheduler pods — one per control plane node. Four kube-flannel and kube-proxy pods — one per node including the worker. Two coredns pods for DNS within the cluster.

The kube-controller-manager and kube-scheduler pods show 2-3 restarts each — this is normal during bootstrap while they wait for etcd and the API server to stabilise.

Check Node Details

kubectl get nodes tells me the nodes are Ready, but it doesn't tell me much else. kubectl describe nodes gives a much richer picture — labels, taints, resource capacity, what's running on each node, and annotations set by Talos and Flannel. I use it as a final sanity check to confirm the cluster is configured the way I intended, not just that it's running.

A few things worth noting:

Extensions are confirmed on every node:

extensions.talos.dev/iscsi-tools=v0.2.0
extensions.talos.dev/nfsd=v1.12.4
extensions.talos.dev/util-linux-tools=2.41.2

These annotations confirm the custom image from Phase 1 was applied correctly on every node. The Image Factory schematic ID is also annotated on each node — a built-in audit trail that ties back to exactly what was built and when. If I need to reflash six months from now, this tells me which schematic to use.

Resources per node:

  • 8 CPUs (7950m allocatable)
  • ~8GB RAM (~7.3GB allocatable)
  • 110 pods capacity

The difference between total and allocatable is what Talos and the Kubernetes system components reserve for themselves. 7950m out of 8000m CPUs and ~7.3GB out of ~8GB RAM is reasonable — the overhead is small.

Taints on control plane nodes:

node-role.kubernetes.io/control-plane:NoSchedule

Normally Talos applies this taint to control plane nodes, and it keeps regular workloads from being scheduled there. Because allowSchedulingOnControlPlanes: true is set in the patch, the taint is left off these nodes — the control-plane role label remains, but workloads can land on all four nodes.

Flannel and pod networking:

flannel.alpha.coreos.com/public-ip: 10.0.140.11
flannel.alpha.coreos.com/backend-type: vxlan

Flannel is the Container Network Interface (CNI) — the component responsible for pod-to-pod networking across the cluster. Without a CNI, pods on different nodes can't talk to each other. Talos installs Flannel by default as part of bootstrap.

It operates in VXLAN mode here, which means it creates an overlay network that tunnels pod traffic between nodes using UDP encapsulation on top of the existing node network. Each node gets its own subnet from the PodCIDR range (10.244.x.0/24 by default), and Flannel routes traffic between those subnets. The public-ip annotation shows which node IP Flannel is using as the tunnel endpoint — confirming it picked up the correct VLAN 140 address and not the old management IP.


Powering Down the Cluster

Talos isn't a traditional OS where you can just pull the plug — shut nodes down gracefully to avoid etcd corruption and filesystem issues.

Since allowSchedulingOnControlPlanes: true is set, workloads can run on all four nodes, so drain all of them first:

# 1. Drain all nodes (evict workloads gracefully)
kubectl drain rock4 --ignore-daemonsets --delete-emptydir-data
kubectl drain rock3 --ignore-daemonsets --delete-emptydir-data
kubectl drain rock2 --ignore-daemonsets --delete-emptydir-data
kubectl drain rock1 --ignore-daemonsets --delete-emptydir-data

--ignore-daemonsets skips DaemonSet pods like Flannel and kube-proxy that run on every node by design and can't be moved. --delete-emptydir-data allows eviction of pods using temporary local storage — that data is lost, which is fine during a shutdown.

# 2. Shut down nodes via Talos (worker first, VIP holder last)
# First, check which node currently holds the VIP:
talosctl -n rock1.vluwte.nl,rock2.vluwte.nl,rock3.vluwte.nl,rock4.vluwte.nl get addresses | grep 10.0.140.10
rock1.vluwte.nl   network     AddressStatus   end0.140/10.0.140.10/32    1    10.0.140.10/32    end0.140

The node name in the first column is the current VIP holder — shut that one down last.

talosctl shutdown -n 10.0.140.14
talosctl shutdown -n 10.0.140.13
talosctl shutdown -n 10.0.140.12
talosctl shutdown -n 10.0.140.11  # VIP holder last

# 3. Power off the TuringPi board via BMC
tpi power off -n 4
tpi power off -n 3
tpi power off -n 2
tpi power off -n 1

To bring it back up, power on in any order — Talos and etcd handle the rest automatically:

tpi power on -n 1
tpi power on -n 2
tpi power on -n 3
tpi power on -n 4

What's in the Files

All cluster access lives in two files — talosconfig for talosctl and kubeconfig for kubectl. If I lose these, or need to manage the cluster from another machine, these are the files to copy over. I'm backing them up somewhere safe; controlplane.yaml and talosconfig contain cluster secrets and should be treated like passwords.

~/talos-cluster/bletchley/
├── controlplane.yaml    # Control plane base config (contains cluster secrets)
├── worker.yaml          # Worker base config
├── talosconfig          # talosctl credentials
├── kubeconfig           # kubectl credentials
├── cp.patch.yaml        # Control plane customisation patch
└── worker.patch.yaml    # Worker customisation patch

Using the Files from Another Machine

To manage the cluster from a different machine, copy both credential files over and point the tools at them:

scp ~/talos-cluster/bletchley/talosconfig user@othermachine:~/talos-cluster/bletchley/
scp ~/talos-cluster/bletchley/kubeconfig user@othermachine:~/talos-cluster/bletchley/

Then set the environment variables and configure the endpoints:

export TALOSCONFIG=~/talos-cluster/bletchley/talosconfig
export KUBECONFIG=~/talos-cluster/bletchley/kubeconfig

talosctl config endpoint 10.0.140.11 10.0.140.12 10.0.140.13
talosctl config node 10.0.140.11

Setting Environment Variables at Login

Rather than exporting these variables every session, I add them to my shell profile so they're set automatically at login.

In ~/.zshrc or ~/.bashrc:

# Talos / Kubernetes - bletchley cluster
export TALOSCONFIG=~/talos-cluster/bletchley/talosconfig
export KUBECONFIG=~/talos-cluster/bletchley/kubeconfig

Then reload the shell:

source ~/.zshrc   # or source ~/.bashrc

Now kubectl get nodes and talosctl version work straight away in any new terminal session.


Lessons Learned

A hanging talosctl apply-config can be safely interrupted. When applying config to rock3, the command hung after the connection dropped. Rather than waiting for the timeout, I opened a second terminal and checked whether the node had come up correctly on VLAN 140:

talosctl -n rock1.vluwte.nl,rock2.vluwte.nl,rock3.vluwte.nl get members

Rock3 was listed with the right IP and hostname — the work was already done. A ^C to kill the hanging command was all that was needed. The apply itself takes only seconds; if the node looks right when you expect it to be ready, there's no need to wait on a timeout.

The graceful_stop error is success. Every time I applied a config, the connection dropped with a timeout. I kept second-guessing it. It's fine — the node is transitioning networks.

Apply all nodes before bootstrapping. I bootstrapped after applying rock1's config in an earlier attempt. The bootstrap failed because rock2 and rock3 weren't ready to join. The right approach is to wait until all control plane nodes are on VLAN 140 and showing etcd is waiting.

The NTP hostname lookup failure is transient. ntp.luwte.net failing to resolve during boot isn't a real problem — it happens in the 200ms window between DNS resolvers switching. Talos retries and succeeds once the internal DNS servers are configured.

The VIP appears in discovery listings as an additional address for rock1. The health check's discovered nodes initially showed 10.0.140.10 rather than 10.0.140.11 directly. This confused me briefly — it's just how VIP assignment works with Talos.


What's Next

The cluster is running but it has no persistent storage. All four nodes have 250GB NVMe drives sitting completely unused, and rock3 has two 900GB SATA drives. The next step is installing Longhorn to pool that storage into distributed persistent volumes.

After Longhorn: MetalLB for load balancing, then deploying the first real workloads.


Conclusion

Four RK1 modules, Talos v1.12.4, Kubernetes v1.35.0, HA control plane, running on VLAN 140. The bletchley cluster is live.

The whole process from flashing to kubectl get nodes took about four hours — including a failed bootstrap attempt and carefully documenting every step. Without documentation it would have taken less time but I'd have no idea how to reproduce it.


← Previous: Talos: First Attempt
→ Next: Upgrading Talos Linux Nodes


Questions or suggestions? Leave a comment below or reach out at igor@vluwte.nl.