Secret Management Part 4: Worst Case Recovery — Rebuilding from First Principles

Losing OpenBao's Raft data is not inherently a data loss event — if every non-regeneratable secret is correctly externalised and maintained. The rebuild, mapped precisely.

Introduction

Part 3 ended with a statement that deserves more than a sentence: losing the Proxmox OpenBao instance permanently is not inherently a data loss event — if every non-regeneratable secret is correctly externalised and maintained. That claim needs unpacking — because it contradicts what most people assume when they first think about what happens if the Transit unseal backend disappears, and because the condition attached to it matters.

This post is for anyone who read Part 3 and wanted to follow the thread further. If Part 3 gave you working auto-unseal and a documented DR runbook, that's a complete and reasonable stopping point. Part 4 goes deeper — into what you actually lose in each failure scenario, why the worst case is recoverable, and what the rebuild looks like in practice. It also updates the DR runbook with scenarios that weren't there before.

The thinking in this post came out of questioning assumptions during the writing of Part 3. Starting from "losing Proxmox means losing secrets" and working through why that's wrong turned out to be the most useful thing to document in this entire series.

🏠 This is part of the Homelab Journey series - building a production Kubernetes cluster from scratch.

Secret Management Part 3: Auto-Unseal with OpenBao on Proxmox
Secret Management Part 4: Worst Case Recovery — Rebuilding from First Principles (you are here)

This post assumes you have completed Parts 1–3. The architecture, credential inventory, and DR scenarios described here build directly on that foundation.

The assumption that needed examining

After completing the Transit auto-unseal setup, the natural question was: what's the actual worst case? The initial answer felt obvious — if the Proxmox OpenBao instance is permanently lost and no PBS backup is available, the transit key is gone, Bletchley can't unseal, and the secrets in Raft storage are unrecoverable.

That answer is technically correct about the Raft data. But it's the wrong question. The right question is: what are those secrets, where did they come from, and does losing OpenBao mean losing them?

The reframing: OpenBao is the operational layer

Every secret stored in OpenBao on Bletchley is an external credential — an API key, a password, an S3 access key, an SMTP credential. None of them were generated by OpenBao. They were all sourced from somewhere external and stored in OpenBao because that's the right place to manage them operationally.

That somewhere external is 1Password.

The requirement from Part 1 of this series was that every credential must exist in 1Password before being stored in OpenBao. This wasn't just a backup policy — it was the design. OpenBao is the operational layer: it makes secrets available to workloads efficiently, with audit trails, policies, and ESO syncing. 1Password is the source of truth: the actual credential values live there and are authoritative.

This distinction changes the blast radius calculation entirely. Losing OpenBao's Raft data is not inherently a data loss event — if every non-regeneratable secret is correctly externalised and maintained. The credentials still exist in 1Password. The rebuild path is to stand up a new OpenBao instance and re-populate. Painful. Time-consuming. But recoverable — conditional on the discipline of having kept 1Password current. Classification errors or drift between systems turn recovery back into data loss.

This is the insight that is almost never documented, because most documentation stops at "keep good backups" without asking what you actually lose if the backups fail too.

The three scenarios mapped precisely

Scenario 1 — LXC restore from backup, within token TTL

PBS restore → transit key intact → token still valid → start LXC → unseal Proxmox OpenBao with key shares → Bletchley recovers automatically via CrashLoopBackOff.

Nothing lost. No config changes needed. This is the intended recovery path.

Scenario 2 — LXC restore from backup, token expired

PBS restore → transit key intact → token expired → start LXC → unseal Proxmox OpenBao → create new token → sed into Helm values → helm upgrade → delete Bletchley pod → auto-unseals normally.

Nothing lost. Token replacement is the only extra step. The key point: the token is purely a credential — it grants permission to call the transit decrypt endpoint. It has no relationship to the transit key or the master key. An expired token means Proxmox rejects the authentication call. The transit key is untouched. Bletchley's encrypted master key in Raft is untouched. The lock and the key still fit perfectly; the pass to turn it just needs renewing.

The recovery is: create a new token with the same policy on Proxmox, substitute it into the Helm values via the sed procedure in README.md, run helm upgrade, delete the pod. Normal startup from that point.

Scenario 3 — No LXC restore possible, transit key permanently gone

The transit key is gone. Bletchley's master key is encrypted against that transit key. Bletchley cannot unseal and enters CrashLoopBackOff indefinitely. The Shamir recovery keys cannot help either: with an auto-unseal seal, recovery keys authorise privileged operations such as generating a root token, but they cannot decrypt the master key.

The Raft data is cryptographically inaccessible. The encrypted secrets in Raft storage cannot be decrypted without the master key, and the master key cannot be decrypted without the transit key.

But — as established above — this is not the same as the secrets being lost.
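The three scenarios reduce to two yes/no questions: can the LXC be restored, and is the unseal token still valid? A minimal sketch of that decision follows. The function name which_scenario is hypothetical and only illustrates the mapping; it does not inspect any real system.

```shell
#!/usr/bin/env bash
# which_scenario RESTORE_POSSIBLE TOKEN_VALID   (each argument is y or n)
# Maps the two questions onto the three scenarios described above.
which_scenario() {
  local restore=$1 token=$2
  if [ "$restore" != "y" ]; then
    # No PBS restore: the transit key is gone, so token validity is irrelevant.
    echo "Scenario 3: rebuild and re-populate"
  elif [ "$token" = "y" ]; then
    echo "Scenario 1: restore and unseal Proxmox OpenBao"
  else
    echo "Scenario 2: restore, unseal, issue a new token, helm upgrade"
  fi
}

which_scenario y n   # prints: Scenario 2: restore, unseal, issue a new token, helm upgrade
```

Note that the token question only matters once a restore is possible, which is why Scenario 3 is checked first.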

What Scenario 3 actually requires

The rebuild path for Scenario 3:

Step 0 — Mark old 1Password entries as obsolete before creating anything new

Before standing up a new instance, rename the existing 1Password entries for the lost Proxmox OpenBao to make clear they are no longer valid — for example: "OBSOLETE — Proxmox / transit instance — lost [date]". Do not delete them yet; they may still be useful if a restore turns out to be possible after all. But they must be clearly distinguished from the new entries you are about to create. Creating new entries without first marking the old ones risks grabbing the wrong key shares during recovery.

Step 1 — Stand up a new Proxmox OpenBao instance

A fresh LXC on Proxmox, same process as Part 3. New initialisation, new transit key, new key shares stored in 1Password with unambiguous labels including the date — for example: "Proxmox / transit instance — rebuilt [date]". New unseal token, also stored in 1Password.

Step 2 — Capture auto-generated secrets if Bletchley is still running

If Bletchley is still running (OpenBao still unsealed from the last restart before the Proxmox failure), the ESO-synced Kubernetes Secrets still contain the current values of auto-generated secrets that were classified as "store in 1Password." Read and capture them now — this window closes the moment OpenBao is reinitialised:

# Example — Authelia storage encryption key
kubectl get secret authelia -n authelia \
  -o go-template='{{range $k,$v := .data}}{{$k}}={{$v | base64decode}}{{"\n"}}{{end}}'

Store any captured values in 1Password before proceeding.

Step 3 — Reinitialise Bletchley OpenBao on a new PVC

Use a new Longhorn PVC rather than reinitialising on the existing one. The existing PVC contains the old Raft state encrypted against the lost transit key — reinitialising on it risks Raft confusion with stale data, unexpected failures, and debugging sessions that are entirely avoidable. A new PVC is a clean slate.

Delete the existing PVC only after the new OpenBao instance is confirmed working. Until then keep it as a forensic artefact in case a restore becomes possible.

This is the point of no return. After reinitialising on the new PVC, the old encrypted Raft data on the old PVC is the only copy — and it is inaccessible without the lost transit key.

Update the Helm values with the new Transit seal configuration and new token, run helm upgrade, delete the pod. OpenBao initialises fresh against the new Proxmox instance.

Step 4 — Re-populate all credentials

Three kinds of secret, two recovery paths:

For external credentials — go through the runbook credential inventory and re-enter each value from 1Password. Every bao kv put operation from the original migration, run again.

For regeneratable secrets — reinstall the relevant chart or run the generation step. The chart produces a new equivalent value.

For auto-generated secrets classified as "store in 1Password" — treat these the same as external credentials: read from 1Password (or from what you captured in Step 2), re-enter into OpenBao.
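For the external credentials, the re-entry can be prepared as a reviewable plan before anything is written. The sketch below is a dry run only: it reads a hypothetical inventory format (KV path, field name, 1Password secret reference per line) and prints the commands rather than running them. The inventory layout, the function name repopulate_plan, and the example paths are assumptions, not the format used in the runbook.

```shell
#!/usr/bin/env bash
# repopulate_plan: read inventory lines "KV_PATH FIELD OP_REFERENCE" on stdin
# and print one bao command per line, without executing anything.
repopulate_plan() {
  local kv_path field op_ref
  while read -r kv_path field op_ref; do
    # Skip blank lines and comment lines in the inventory.
    case "$kv_path" in ''|'#'*) continue ;; esac
    # The $(op read ...) part is printed literally; it only runs if the
    # printed command is later executed in a shell signed in to 1Password.
    printf 'bao kv put %s %s="$(op read %s)"\n' "$kv_path" "$field" "$op_ref"
  done
}

# Hypothetical inventory entry, for illustration only.
printf '%s\n' '# path field reference' \
  'secret/smtp password op://Homelab/SMTP/password' | repopulate_plan
# prints: bao kv put secret/smtp password="$(op read op://Homelab/SMTP/password)"
```

One caveat with this shape: bao kv put writes a complete new version of the secret at that path, so a path holding several fields needs all of its key=value pairs in a single command rather than one command per field.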

Step 5 — Verify ESO sync across all namespaces

kubectl get externalsecrets -A
# Expected: all 8 showing SecretSynced: True

Once ESO is syncing again, all workloads can access their secrets normally.
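With eight ExternalSecrets, reading the table by eye works, but a filter that prints only the entries that are not ready is easier to trust under pressure. This is a sketch under assumptions: the function names are hypothetical, the sample names are made up, and list_es_ready is defined but not called because it needs cluster access.

```shell
#!/usr/bin/env bash
# list_es_ready: one line per ExternalSecret, "namespace/name Ready-status".
# Requires cluster access, so it is defined here but not called.
list_es_ready() {
  kubectl get externalsecrets -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
}

# not_ready: reduce stdin to the entries whose Ready status is not "True".
# An entry with no status yet (empty second field) is also reported.
not_ready() {
  awk '$2 != "True" { print $1 }'
}

# Sample input with hypothetical names, standing in for list_es_ready output.
printf '%s\n' 'authelia/authelia True' 'cert-manager/transip False' | not_ready
# prints: cert-manager/transip
```

Against the real cluster the two would be chained as list_es_ready | not_ready, and empty output means every ExternalSecret reports Ready.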

Step 6 — Clean up

Delete the old PVC once recovery is confirmed. Delete the old "OBSOLETE" 1Password entries — they provide no value now that the instance they unlock is gone, and keeping them creates future confusion.

The actual single point of failure — and what it means for every new secret

Working through Scenario 3 makes the real single point of failure visible: it is not OpenBao, not Proxmox, not the transit key. It is 1Password — specifically, whether 1Password contains an accurate and current record for every secret that cannot be freely regenerated.

That last qualifier matters. Not all secrets need to be in 1Password. The right question is not "is this secret in OpenBao?" but "what does recovery look like for this specific secret?" That question has two possible answers, each with a different recovery path:

External credentials — 1Password is the source of truth. API keys, S3 access keys, SMTP passwords, any credential that was created in an external system and then stored in OpenBao. These cannot be regenerated without going back to the external system. Recovery: read from 1Password, re-enter into OpenBao. The Part 2 inventory documents all of these.

Regeneratable secrets — regeneration is the recovery. Random values auto-generated by a chart at install time, where a fresh random value is functionally equivalent to the original. Recovery: reinstall the chart or run the generation step again. Storing these in 1Password is unnecessary overhead — it adds maintenance burden without meaningful benefit, and creates confusion about whether the stored value is current after any reinstall or rotation.

The classification checklist

The boundary between these two categories is not always obvious. Every time a new secret enters the cluster — whether migrated, installed by a chart, or generated manually — it deserves a deliberate classification. The outcome is binary: store in 1Password, or don't. The checklist is how you get there.

Work through the questions in order. A single "store" outcome ends the checklist — no further questions needed.

1. Is there a mechanism to regenerate this secret?
If no: store in 1Password. A secret that cannot be regenerated by any means is irreplaceable — it must be preserved externally regardless of how it was created.
If yes: continue.

2. Is any stored data encrypted with this secret?
If yes: store in 1Password. A fresh value cannot decrypt data encrypted with the original. Without the original key, that data becomes permanently indecipherable — users must re-enroll, re-upload, or start fresh with whatever that data represented. The Authelia storage encryption key is an example: it protects TOTP and WebAuthn registrations in the Authelia database. Regenerating it means every user re-enrolls their 2FA device. Whether that is acceptable depends on context; the question forces you to make that decision deliberately rather than discover it during recovery.
If no: continue.

3. Is this secret shared with an external system?
If yes: store in 1Password. A webhook signing secret, an HMAC key that a third party validates against, any secret where an external system has a copy — regenerating it breaks the integration until the external system is also updated. Treat these as external credentials regardless of how they were generated.
If no: continue.

4. Was this secret produced by a one-time process that cannot be repeated?
If yes: store in 1Password. Some secrets are generated once during a bootstrap step that no longer exists — the tooling is gone, the state that produced it is gone, or the process is not documented. There is no regeneration path.
If no: continue.

5. Does anything downstream depend on this specific value?
If yes: store in 1Password. If other secrets, tokens, or configurations were derived from or seeded by this value, regenerating it without updating all derived material creates inconsistency. Map the dependencies and treat it as irreplaceable until you understand them fully.
If no: continue.

6. Is this secret rotated automatically by an external system?
If yes: store in 1Password — and establish a process to keep it current. Some external systems rotate their own credentials on a schedule — certain cloud providers, certificate authorities, or APIs that enforce key rotation policies. If the external system rotates the credential without a corresponding update in 1Password and OpenBao, the stored value becomes stale. The symptom is a workload failure when ESO next syncs the now-incorrect credential. The answer to question 6 doesn't just determine whether to store the secret — it identifies a maintenance obligation. Document the rotation schedule alongside the 1Password entry.
If no: do not store in 1Password. Recovery is regeneration — reinstall the chart or run the generation step. Document which chart or command produces it so the recovery is unambiguous.

What the checklist gives you beyond the binary answer:

The value is not just where you land — it is the thinking along the way. Answering question 2 forces you to know whether your application stores anything encrypted with this secret. Answering question 3 forces you to know which external systems depend on it. Answering question 5 forces you to map the dependencies. That understanding is the recovery plan, and it is only useful if you have it before you need it.

For the current cluster: the Authelia storage encryption key lands on "store" at question 2. The Authelia JWT secret and session encryption key pass all six questions and land on "do not store": losing them invalidates active sessions, but nothing encrypted is lost, and re-login is the recovery. External credentials (TransIP API key, S3 keys, SMTP password) never reach question 1's "if yes" path. They have no regeneration mechanism and land on "store" immediately.
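The checklist is mechanical enough to write down as a function, which can be a useful way to record the answers next to each secret. This is a minimal sketch: classify_secret is a hypothetical helper, and the answers passed in the examples are the ones given above for the current cluster.

```shell
#!/usr/bin/env bash
# classify_secret: walk the six checklist questions in order.
# Arguments are six answers (y or n), one per question:
#   $1 regeneratable?        $2 encrypts stored data?  $3 shared externally?
#   $4 one-time process?     $5 downstream dependents? $6 rotated externally?
classify_secret() {
  if [ "$#" -ne 6 ]; then
    echo "usage: classify_secret q1 q2 q3 q4 q5 q6 (each y or n)" >&2
    return 2
  fi
  # Question 1 is inverted: "n" (no regeneration mechanism) means store.
  if [ "$1" = "n" ]; then
    echo "store (question 1)"
    return 0
  fi
  shift
  local q=2 answer
  # For questions 2 to 6, the first "y" ends the checklist with "store".
  for answer in "$@"; do
    if [ "$answer" = "y" ]; then
      echo "store (question $q)"
      return 0
    fi
    q=$((q + 1))
  done
  echo "do not store"
}

classify_secret y y n n n n   # Authelia storage encryption key: store (question 2)
classify_secret y n n n n n   # Authelia JWT secret: do not store
classify_secret n n n n n n   # TransIP API key: store (question 1)
```

The function only encodes the order of the questions. The real work is still in answering them honestly for each secret.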

1Password as single point of failure — and offline access

With 1Password holding all external credentials and classified auto-generated secrets, it becomes the true single point of failure for Scenario 3 recovery. That means it deserves the same operational care as any critical dependency — including thinking through what happens when you need it most and it's hardest to reach.

The scenario worth thinking about: full site outage combined with ISP failure. Proxmox is down, Bletchley is down, internet is gone. You need the Proxmox unseal key shares to start recovery. Can you get them?

1Password's local cache. The 1Password desktop and mobile apps cache an encrypted copy of the vault locally on any previously authorised device. If you have a laptop or phone that was signed in before the outage, you can access the vault offline — no internet required. This is the most important mitigation: keep at least one authorised device available, not just a browser session.

The Emergency Kit. 1Password's Emergency Kit is a PDF containing your account email, Secret Key, and space to write your master password. It is designed to be printed and stored physically — in a safe, alongside other important documents, or near the physical infrastructure. The Emergency Kit gives you the credentials to authenticate to 1Password from any device, but authentication itself requires reaching 1Password's servers. It solves the "lost all devices" problem, not the "no internet" problem. It is still essential — print it, store it physically, treat it with the same care as the Proxmox unseal key shares themselves.

A local offline copy. 1Password supports exporting vault data to a file (1PUX or CSV). Storing a recent export in a physically secure location, separate from the cluster and not on any cluster storage, provides a true offline fallback that is independent of both internet access and 1Password's availability. Be aware that current 1Password exports are not encrypted by the app, so the file must be placed inside an encrypted container or on an encrypted volume before it is stored. Once that is done, it works without any network.

The practical recommendation: maintain at least one authorised device with a cached vault, print and physically store the Emergency Kit, and consider a periodic vault export to offline storage. The goal is that any single failure — internet down, devices lost, 1Password unavailable — still leaves a recovery path open.

The architecture is only as resilient as its least accessible component when you need it most.

Token expiry — the simpler case

Token expiry is distinct from infrastructure loss and simpler to recover from. It deserves its own treatment because the symptom (Bletchley in CrashLoopBackOff, unable to unseal) looks identical to Proxmox being down, which can lead to misdiagnosis.

If the pod has been down for longer than the token period (768 hours, exactly 32 days), the next startup attempt will fail with a token authentication error rather than a connectivity error. The transit key is untouched. The Raft data is intact. Nothing is lost.
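The arithmetic is worth having at hand when deciding which scenario applies. The sketch below works in epoch seconds and assumes the token was last renewed at the moment the pod went down; if the last renewal was earlier, the real deadline is earlier too. The function names are hypothetical.

```shell
#!/usr/bin/env bash
TOKEN_PERIOD_HOURS=768   # matches the -period=768h used for the unseal token

# token_deadline LAST_RENEWAL: latest possible expiry, in epoch seconds.
token_deadline() {
  echo $(( $1 + TOKEN_PERIOD_HOURS * 3600 ))
}

# token_expired LAST_RENEWAL NOW: true if NOW is past the deadline.
token_expired() {
  [ "$2" -gt "$(token_deadline "$1")" ]
}

echo "$(( TOKEN_PERIOD_HOURS / 24 )) days"   # prints: 32 days

# Example: pod went down at t=0, checked 33 days later.
if token_expired 0 $(( 33 * 24 * 3600 )); then
  echo "token expired: follow the token replacement procedure"
fi
```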

Diagnosis: check the Proxmox OpenBao instance directly.

pct enter 501
bao status   # confirm it's running and unsealed
# Then check if the token is still valid:
bao token lookup <token>
# If expired: the lookup fails with a 403 "bad token" error instead of returning a ttl

Recovery: create a new token with the same policy, substitute via sed, run helm upgrade, delete the pod.

# On Proxmox LXC 501
bao login   # root token or valid admin token
bao token create -policy=bletchley-unseal -period=768h -orphan
# Store new token in 1Password, then:

# On workstation
sed 's/###BLETCHLEY_UNSEAL_TOKEN###/<new-token>/' openbao-values.yaml | \
  helm upgrade openbao openbao/openbao -n openbao -f -
kubectl delete pod openbao-0 -n openbao

Normal auto-unseal resumes from the next startup.
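Because the substitution feeds straight into helm upgrade, a silent failure (an empty token, or a values file whose placeholder has already been replaced) would deploy a broken seal configuration. A small guard around the sed step is sketched below. render_values is a hypothetical helper, and it assumes the token contains neither | nor &, which would need escaping in the sed replacement.

```shell
#!/usr/bin/env bash
# render_values FILE TOKEN: print FILE with the placeholder replaced by TOKEN.
# Refuses to produce output if the token is empty or the placeholder is absent.
render_values() {
  local file=$1 token=$2
  if [ -z "$token" ]; then
    echo "refusing to render: empty token" >&2
    return 1
  fi
  if ! grep -q '###BLETCHLEY_UNSEAL_TOKEN###' "$file"; then
    echo "refusing to render: placeholder not found in $file" >&2
    return 1
  fi
  # '|' is the sed delimiter here, so a '/' in the token cannot break the expression.
  sed "s|###BLETCHLEY_UNSEAL_TOKEN###|${token}|" "$file"
}

# Local demonstration against a throwaway file with a made-up token.
demo=$(mktemp)
printf 'token: "###BLETCHLEY_UNSEAL_TOKEN###"\n' > "$demo"
render_values "$demo" "s.example"   # prints: token: "s.example"
rm -f "$demo"

# Intended use, not executed here (set -o pipefail so a refusal stops the pipeline):
#   render_values openbao-values.yaml "$NEW_TOKEN" | \
#     helm upgrade openbao openbao/openbao -n openbao -f -
```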

Key rotation and backup alignment

Key rotation on the Proxmox instance generates a new transit key version:

# On Proxmox LXC 501
bao write -f transit/keys/bletchley-unseal/rotate

OpenBao retains all previous key versions by default — a currently running Bletchley instance is unaffected, because the existing ciphertext can still be decrypted by the old version. On subsequent operations including restarts, OpenBao will begin using the new key version for encryption — the master key ciphertext in Raft is re-encrypted against the new version at the next opportunity.

This creates a critical interaction with PBS backups. After a rotation and a Bletchley restart, the master key ciphertext in Raft requires the new key version. A PBS backup taken before the rotation has only the old version. If a catastrophic failure then occurs and the restore is from the pre-rotation backup, the transit key version mismatch makes Bletchley unable to unseal — even though Proxmox is running and healthy.

The rule is simple but must be explicit: always take a PBS backup immediately after rotating the transit key. The window of unrecoverability is the gap between the rotation and the next backup. The DR runbook includes this requirement.

One more thing on rotation: the min_decryption_version setting controls which old key versions remain valid for decryption. The default keeps all versions. Raising min_decryption_version is a legitimate security practice, because retiring old versions limits the exposure window if a key is compromised. Done carelessly, though, it can lock Bletchley out: if the version that encrypted Bletchley's current master key falls below the new minimum, Bletchley cannot unseal even with a perfectly healthy Proxmox instance. Any change to min_decryption_version requires verifying first that the key version protecting Bletchley's current master key is at or above the new minimum.
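Both rules in this section come down to comparing version numbers. The sketch below expresses them as two checks. The function names and the example numbers are hypothetical, and the version recorded at the last backup is assumed to be something the operator notes down, since PBS does not expose it.

```shell
#!/usr/bin/env bash
# The live values would come from the Proxmox instance, for example:
#   bao read -field=latest_version transit/keys/bletchley-unseal
# and the minimum is changed with:
#   bao write transit/keys/bletchley-unseal/config min_decryption_version=N

# backup_covers_rotation LATEST BACKED_UP
# True when the most recent PBS backup already contains the latest key version.
backup_covers_rotation() {
  [ "$2" -ge "$1" ]
}

# min_version_is_safe IN_USE NEW_MIN
# True when the version protecting the master key remains decryptable.
min_version_is_safe() {
  [ "$1" -ge "$2" ]
}

# Illustrative values: key rotated to v3, last backup taken at v2,
# master key still wrapped with v2, proposed new minimum of 3.
latest=3; backed_up=2; in_use=2; new_min=3
backup_covers_rotation "$latest" "$backed_up" || echo "take a PBS backup of LXC 501 now"
min_version_is_safe "$in_use" "$new_min" || echo "do not raise min_decryption_version to $new_min yet"
```

With these example values both warnings print, which is the state immediately after a rotation and before the follow-up backup.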

The DR runbook

The recovery procedures live in docs/runbooks/openbao-recovery.md in the Forgejo repository. That file is the single authoritative reference. This post explains the reasoning behind it; it is not a summary to be maintained alongside it.

Part 3 documented the infrastructure restart scenarios (Bletchley restart, full site outage, LXC node migration, catastrophic Proxmox loss via PBS restore). The runbook has been updated with the credential-layer scenarios this post introduced:

Token expired — Proxmox is healthy, but Bletchley fails to unseal because the transit token has expired. Diagnosis: check the token directly on the Proxmox instance. Recovery: create a new token with the same policy, substitute via sed, run helm upgrade, delete the pod.
Token revoked or policy modified — identical symptoms to token expiry. Same recovery. Diagnose first by checking the Proxmox instance rather than assuming an infrastructure failure.
Transit key rotation — not a failure scenario but an operational procedure with a mandatory follow-up: take a PBS backup of LXC 501 immediately after every rotation. The backup must contain the new key version before Bletchley restarts.
Catastrophic Proxmox loss, no restore — Scenario 3. Stand up a new Proxmox OpenBao instance, reinitialise Bletchley's seal, re-populate from 1Password (external credentials) and chart reinstall (regeneratable secrets). The Part 2 credential inventory is the recovery checklist. If Bletchley is still running when the decision to rebuild is made, read any classified auto-generated secrets from the running ESO-synced Kubernetes Secrets first — that window closes on reinitialisation.

The runbook also now includes the classification outcome for each current cluster secret, so recovery decisions during Scenario 3 don't require reconstructing the checklist under pressure.

What's working now

This post doesn't change the cluster configuration — everything operational was completed in Part 3. What it adds is:

✅ Accurate blast radius understanding for all failure scenarios
✅ Two-category secret classification: external credentials (1Password) vs regeneratable secrets (regenerate)
✅ Classification checklist for every new secret entering the cluster
✅ Token expiry documented as a distinct, simpler recovery case
✅ Key rotation + backup alignment requirement documented and in the runbook
✅ Scenario 3 rebuild sequence documented — Part 2 credential inventory is the recovery plan
✅ DR runbook updated with credential-layer failure scenarios
⚠️ Resilience of Scenario 3 depends on classification discipline — every new secret must be assessed when it enters the cluster, not retrospectively

What's next

The secret management series has covered installation, migration, auto-unseal, and recovery. The next post moves from operations to observability — OpenBao has a UI and supports audit logging, neither of which has been formally set up. Part 5 will walk through the UI, enable an audit device, and look at what token and policy management looks like in practice once the operational layer is running.

← Previous: Secret Management Part 3

Questions or suggestions? Leave a comment below or reach out at igor@vluwte.nl.