
Backup Controller v0.11.0: Detecting What Wasn't Created by the Controller

The controller managed its own backups perfectly. It had no idea about the rest. v0.11.0 adds an audit script and two new alert conditions to close that gap.

Introduction

After publishing the backup controller post, I ran into something that had been quietly sitting in the cluster: backups that predated the controller entirely. Nine backups from the old c-bsl5cr and c-afuh2t recurring jobs, three manual snapshots from early testing — all sitting in Garage S3, all invisible to the controller. No ownership, no expiry, no alert.

The controller was doing exactly what it was designed to do. The problem was that its scope ended at its own boundary. Anything it didn't create, it didn't know about. That's a reasonable design choice, but it means stale backups can accumulate silently — the exact problem the controller was built to prevent.

This post covers the two things I added to close that gap: an audit script for one-off inspection, and a controller extension that raises alerts for untracked and expired backups during every reconcile pass. The code is at vluwte/longhorn-backup-controller — now at v0.11.0.

🏠 This is part of the Homelab Journey series - building a production Kubernetes cluster from scratch.

Cluster Observability Part 6
Backup Controller v0.11.0: Detecting What Wasn't Created by the Controller (you are here)

The problem: backups outside the controller's view

The controller classifies Longhorn RecurringJob objects as managed or unmanaged based on the backup.vluwte.nl/managed=true label it stamps on every job it creates. What it never looked at were the resulting Backup objects in longhorn-system.

Each Longhorn Backup object records which RecurringJob triggered it in spec.labels.RecurringJob. Backups from controller-managed jobs have a job name starting with backup-<namespace>-<n>. Backups from old manually created jobs have opaque auto-generated names like c-bsl5cr or c-afuh2t. Backups triggered by a manual snapshot have no RecurringJob field at all.

Running a quick kubectl get backups.longhorn.io -n longhorn-system made the picture clear:

NAME                      VOLUME                                          STATE       CREATED
backup-329783ba091b4ab0   snap-de214e4984764463                           Completed   2026-03-08T14:31:04Z
backup-1296476243ba48d3   c-bsl5cr-dfda4915-d4ef-48b0-97e4-cfd7b685d466   Completed   2026-03-10T09:10:06Z
...
backup-6a86e4940a03406d   c-lsv43w-7d35e42c-600a-497b-b200-3f768d843a6b   Completed   2026-04-10T21:10:06Z

The snap-* prefix in the VOLUME column meant a manual snapshot. The c-bsl5cr-* prefix meant an old recurring job. Both predated the controller. Both were consuming space in Garage S3 with no plan for cleanup.

The audit script

The first thing I wanted was visibility: a read-only tool to list every backup object and classify it. That became scripts/audit_backups.py.

python3 scripts/audit_backups.py
python3 scripts/audit_backups.py --unmanaged-only
python3 scripts/audit_backups.py --json

The script fetches all Backup objects from longhorn-system, fetches all controller-managed RecurringJob names (label backup.vluwte.nl/managed=true), and classifies each backup by cross-referencing spec.labels.RecurringJob against the managed job set:

| Status | Meaning |
| --- | --- |
| `managed` | Created by a controller-owned `RecurringJob` |
| `archive` | Manual backup with a future `archive-delete-after` date |
| `archive-expired` | Archive date has passed, ready for cleanup |
| `unmanaged-job` | Created by a `RecurringJob` the controller doesn't own |
| `manual` | No `RecurringJob` and no expiry label, completely untracked |

The classification is based on actual job ownership, not naming convention. A backup is managed if and only if its RecurringJob field names a job that carries backup.vluwte.nl/managed=true.
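
To make the ownership rule concrete, here is a minimal sketch of that half of the classification. It is not the code from scripts/audit_backups.py: the function name, the dict shape (as returned by kubectl get ... -o json), and the job name backup-media-1 are assumptions for illustration. The archive statuses depend on a label and are left out here.

```python
from typing import Any, Dict, Set


def ownership_status(backup: Dict[str, Any], managed_jobs: Set[str]) -> str:
    """Return 'managed', 'unmanaged-job', or 'manual' for one Backup object."""
    # spec.labels.RecurringJob is absent for backups taken from manual snapshots.
    labels = (backup.get("spec") or {}).get("labels") or {}
    job = labels.get("RecurringJob")
    if not job:
        return "manual"
    # Ownership is decided by set membership, never by the shape of the name.
    if job in managed_jobs:
        return "managed"
    return "unmanaged-job"


managed = {"backup-media-1"}  # hypothetical controller-managed job name
backups = [
    {"metadata": {"name": "b1"}, "spec": {"labels": {"RecurringJob": "backup-media-1"}}},
    {"metadata": {"name": "b2"}, "spec": {"labels": {"RecurringJob": "c-bsl5cr"}}},
    {"metadata": {"name": "b3"}, "spec": {"labels": {"RecurringJob": "backup-media-9"}}},
    {"metadata": {"name": "b4"}, "spec": {}},
]
statuses = {b["metadata"]["name"]: ownership_status(b, managed) for b in backups}
print(statuses)
```

Note that b3 follows the backup-<namespace>-<n> naming pattern but is still reported as unmanaged-job, because its job is not in the managed set.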

Acknowledging manual backups

Before extending the controller to alert on untracked backups, I needed a pattern for acknowledging intentional ones. A backup created for a one-off purpose — a pre-migration snapshot, a restore test — should be distinguishable from a backup that is simply forgotten.

The archive label pattern already existed on PVCs and RecurringJob objects. The same label applied to a Backup object's metadata.labels does the job:

kubectl label backup.longhorn.io -n longhorn-system backup-329783ba091b4ab0 \
  backup.vluwte.nl/archive-delete-after=2026-05-01

Once labelled, the backup is treated as a known archive. Before the date it is archive — acknowledged, intentional, no alert. On or after the date it is archive-expired — ready for cleanup, alert fires.

Reusing archive-delete-after keeps the label convention consistent with the rest of the controller, and no new label type has to be introduced.
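
The date rule can be sketched in a few lines. This is an illustration rather than the controller's code: the function name is hypothetical, and letting a malformed date raise ValueError is an assumption about error handling that the post does not describe.

```python
from datetime import date
from typing import Optional

ARCHIVE_LABEL = "backup.vluwte.nl/archive-delete-after"


def archive_status(metadata_labels: dict, today: date) -> Optional[str]:
    """Return 'archive', 'archive-expired', or None when the label is absent."""
    raw = metadata_labels.get(ARCHIVE_LABEL)
    if raw is None:
        return None  # not acknowledged as an archive
    delete_after = date.fromisoformat(raw)  # assumption: a malformed date raises ValueError
    # "On or after the date" counts as expired, so the comparison is >=.
    return "archive-expired" if today >= delete_after else "archive"


labels = {ARCHIVE_LABEL: "2026-05-01"}
print(archive_status(labels, date(2026, 4, 30)))
print(archive_status(labels, date(2026, 5, 1)))
```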

The controller extension

With the audit script working and the label pattern defined, the controller needed two new behaviours during each full reconcile pass:

1. Alert on expired manual archives: scan all Backup objects for backup.vluwte.nl/archive-delete-after in metadata.labels. If that date is today or earlier, raise backup_manual_archive_expired{backup, delete_after}.

2. Alert on untracked backups: for any backup not created by a controller-managed job and carrying no archive label, raise backup_untracked{backup, reason}. The reason label is either manual (no RecurringJob field) or unmanaged-job (the field names a job that is not controller-owned).

Both checks run in a single pass over all Backup objects — one API call, shared with the existing reconcile. Controller-managed backups are skipped immediately via an O(1) set lookup against managed_job_names. Only unlabelled or unmanaged backups incur any further processing.

The new metrics follow the same pattern as every other metric in the controller: one gauge entry per offending object, rebuilt on every reconcile pass, stale entries zeroed automatically.
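
The single pass and the rebuild-and-zero pattern might look roughly like the sketch below. It uses a plain dict as a stand-in for the Prometheus gauges, and the function name, key layout, and sample job names are assumptions, not the controller's actual code. It also assumes managed backups are skipped before the archive label is checked.

```python
from datetime import date

ARCHIVE_LABEL = "backup.vluwte.nl/archive-delete-after"


def rebuild_gauges(backups, managed_job_names, today, previous):
    """Single pass over all Backup objects; returns {(metric, backup, label): value}."""
    current = {}
    for backup in backups:
        name = backup["metadata"]["name"]
        job = ((backup.get("spec") or {}).get("labels") or {}).get("RecurringJob")
        if job in managed_job_names:
            continue  # O(1) set lookup: controller-managed, nothing more to do
        delete_after = (backup["metadata"].get("labels") or {}).get(ARCHIVE_LABEL)
        if delete_after is not None:
            # Acknowledged archive: only an expired date raises a series.
            if today >= date.fromisoformat(delete_after):
                current[("backup_manual_archive_expired", name, delete_after)] = 1.0
            continue
        reason = "unmanaged-job" if job else "manual"
        current[("backup_untracked", name, reason)] = 1.0
    # Series set on an earlier pass but absent now are zeroed, not left at 1.
    for key in previous:
        current.setdefault(key, 0.0)
    return current


backups = [
    {"metadata": {"name": "b1"}, "spec": {"labels": {"RecurringJob": "backup-media-1"}}},
    {"metadata": {"name": "b2"}, "spec": {"labels": {"RecurringJob": "c-bsl5cr"}}},
    {"metadata": {"name": "b3"}, "spec": {}},
]
first = rebuild_gauges(backups, {"backup-media-1"}, date(2026, 4, 11), previous={})

# Acknowledge b3 with a future date; its untracked series drops to 0 on the next pass.
backups[2]["metadata"]["labels"] = {ARCHIVE_LABEL: "2026-05-01"}
second = rebuild_gauges(backups, {"backup-media-1"}, date(2026, 4, 11), previous=first)
```

Zeroing rather than deleting a series is what lets an expression such as backup_untracked == 1 stop matching once a backup has been labelled or removed.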

Two new alert rules in the manifests:

- alert: BackupUntracked
  expr: backup_untracked == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Backup '{{ $labels.backup }}' is untracked"
    description: "Backup '{{ $labels.backup }}' was not created by the backup-controller
      (reason={{ $labels.reason }}) and has no archive-delete-after label. Add a
      'backup.vluwte.nl/archive-delete-after' label to acknowledge it, or delete it
      from Garage S3 and remove the Backup object."

- alert: ManualBackupArchiveExpired
  expr: backup_manual_archive_expired == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Backup '{{ $labels.backup }}' has passed its archive-delete-after date"
    description: "Backup '{{ $labels.backup }}' was manually created with
      archive-delete-after={{ $labels.delete_after }}. That date has passed. Delete
      the backup from Garage S3 and remove the Backup object when no longer needed."

The for: 5m clause gives newly created manual backups a grace period: BackupUntracked stays pending for five minutes, leaving time to add the acknowledgement label before Alertmanager routes anything. Worst case, a backup created just after a reconcile pass is first seen on the next pass 5 minutes later, and the alert fires 5 minutes after that. That is about 10 minutes from creation to alert, plus whatever the Prometheus scrape and rule-evaluation intervals add.
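
The worst-case arithmetic, written out as a small sketch. The function name is hypothetical, the 5-minute reconcile interval is the one implied above, and Prometheus scrape and rule-evaluation intervals are deliberately ignored.

```python
from datetime import timedelta


def worst_case_alert_delay(reconcile_interval: timedelta, for_duration: timedelta) -> timedelta:
    # A backup created right after a pass waits one full interval to be seen,
    # then the alert must stay pending for the whole `for` duration.
    return reconcile_interval + for_duration


delay = worst_case_alert_delay(timedelta(minutes=5), timedelta(minutes=5))
print(delay)  # 0:10:00
```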

What's Working Now

✅ scripts/audit_backups.py: lists all backups with ownership classification
✅ backup_untracked metric: set to 1 for unmanaged-job and manual backups
✅ backup_manual_archive_expired metric: set to 1 once archive-delete-after has passed
✅ BackupUntracked and ManualBackupArchiveExpired alert rules in both manifest variants
✅ Old pre-controller backups identified, labelled or deleted
✅ Code at v0.11.0: github.com/vluwte/longhorn-backup-controller


Questions or suggestions? Leave a comment below or reach out at igor@vluwte.nl.
