
Prometheus Behind Traefik and Authelia: Closing the Last Ingress Gap

Two files close the Prometheus ingress gap — then a latent cert-manager bug surfaces. The ESO key name mismatch that renewal tests don't catch.

Introduction

Every service on the Bletchley cluster has a hostname, a TLS certificate, and Authelia protecting it — except one. Prometheus has been accessible only via kubectl port-forward, which means opening a terminal, running a command, and leaving it running for the duration of the session. Every time. It was a gap I knew about and kept deferring because the tunnel worked well enough and there was always something more pressing.

Closing it should have been a fifteen-minute job. Two files: one Ingress manifest and one access-control rule in the Authelia values. It was, in the end, but it surfaced a latent bug introduced during the OpenBao migration that had been sitting undetected for weeks. The fix was also one line, but finding it required working through the full cert-manager certificate issuance chain. That diagnostic path is worth documenting.

🏠 This is part of the Homelab Journey series - building a production Kubernetes cluster from scratch.

Keeping the Cluster Current
Prometheus Behind Traefik and Authelia: Closing the Last Ingress Gap (you are here)

The Two-File Change

The pattern for exposing a cluster service via Traefik with TLS and Authelia is well established at this point. Alertmanager got the same treatment in its own post, and the Prometheus ingress follows that pattern exactly.

A new file at apps/monitoring/ingresses/ingress-prometheus.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus
  namespace: monitoring
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    cert-manager.io/cluster-issuer: letsencrypt-production
    traefik.ingress.kubernetes.io/router.middlewares: traefik-redirect-to-https@kubernetescrd,traefik-authelia@kubernetescrd
spec:
  ingressClassName: traefik
  rules:
    - host: prometheus.bletchley.vluwte.nl
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-server
                port:
                  number: 80
  tls:
    - hosts:
        - prometheus.bletchley.vluwte.nl
      secretName: prometheus-tls

And a two-line addition to apps/authelia/authelia-values.yaml:

- domain: "prometheus.bletchley.vluwte.nl"
  policy: one_factor
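
For orientation, here is a rough sketch of where that rule sits in the values file. The nesting under configMap.access_control and the default_policy value are assumptions about the chart layout, not something the original file is shown to contain; check them against the chart version in use.

```yaml
# Sketch of where the rule lives in apps/authelia/authelia-values.yaml.
# Assumption: the chart exposes Authelia's access_control section under configMap;
# the exact nesting depends on the chart version.
configMap:
  access_control:
    default_policy: deny            # assumption: hosts without a matching rule are denied
    rules:
      # ...existing rules for other services stay as they are...
      - domain: "prometheus.bletchley.vluwte.nl"
        policy: one_factor          # password only, no second factor
```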

No TLS secret to create manually. cert-manager reads the cert-manager.io/cluster-issuer annotation and the tls.secretName field together and provisions the certificate automatically, storing it in the named secret. Traefik reads the same secretName field to know which secret to load. The secret prometheus-tls does not exist before the Ingress is applied — cert-manager creates it.
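
To make the automatic provisioning concrete, this is roughly the Certificate resource that cert-manager derives from the Ingress. It is a sketch for illustration only and does not need to be applied by hand; fields that cert-manager fills in with defaults are omitted.

```yaml
# Approximation of the Certificate generated from the Ingress annotation and tls block.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: prometheus-tls              # named after tls.secretName
  namespace: monitoring
spec:
  secretName: prometheus-tls        # the Secret Traefik loads once it exists
  dnsNames:
    - prometheus.bletchley.vluwte.nl
  issuerRef:
    name: letsencrypt-production    # from the cert-manager.io/cluster-issuer annotation
    kind: ClusterIssuer
    group: cert-manager.io
```

Seeing it this way explains why kubectl get certificate -n monitoring later lists an entry called prometheus-tls even though no Certificate manifest exists in the repository.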

Applied with:

kubectl apply -f apps/monitoring/ingresses/ingress-prometheus.yaml
helm upgrade authelia authelia/authelia -n authelia -f apps/authelia/authelia-values.yaml

The Unexpected Detour

Traefik logs an error immediately after applying the Ingress:

ERR Error configuring TLS error="secret monitoring/prometheus-tls does not exist"
ingress=prometheus namespace=monitoring providerName=kubernetes

This is expected — Traefik starts routing as soon as the Ingress is applied, but cert-manager hasn't finished issuing the certificate yet. The error resolves itself once the secret exists. The question is how long that takes.

Checking the certificate pipeline:

kubectl get certificate,certificaterequest,order,challenge -n monitoring
NAME                            STATE     AGE
alertmanager-tls-1-1585379262   valid     27d
grafana-tls-2-1860064419        valid     49d
prometheus-tls-1-1846305707     pending   8m49s

Eight minutes and still pending. The Alertmanager and Grafana orders are valid from weeks ago. Something is wrong with this one.

kubectl describe challenge -n monitoring

Presented:   false
Processing:  true
Reason:      get credential error: key not found "privateKey" in secret 'cert-manager/transip-secret'
State:       pending

The TransIP webhook can't find the privateKey it needs to create the DNS-01 challenge TXT record.

Tracing the Root Cause

The cert-manager-webhook-transip chart uses the TransIP API to create _acme-challenge TXT records during DNS-01 validation. It reads a private key from a Kubernetes secret. The key name it expects — privateKey — is defined in the ClusterIssuer, not in external documentation:

kubectl get clusterissuer letsencrypt-production -o yaml

solvers:
- dns01:
    webhook:
      config:
        privateKeySecretRef:
          name: transip-secret
          key: privateKey

The key: privateKey field is what the webhook looks for. Checking what the secret actually contains:

kubectl get secret transip-secret -n cert-manager -o jsonpath='{.data}' | python3 -m json.tool

{
    "apiKey": "PRIVATE_KEY=="
}

The value is correct — it is the TransIP private key. The key name is wrong. The secret has apiKey where the webhook expects privateKey.

This was introduced during the OpenBao/ESO migration. The ExternalSecret in infra/certificates/externalsecret-transip.yaml was syncing the private key from OpenBao faithfully, but naming it apiKey in the resulting Kubernetes secret:

data:
  - secretKey: apiKey       # ← this is what lands in the Kubernetes secret
    remoteRef:
      key: cert-manager/transip
      property: apiKey      # ← this is the key name in OpenBao

The fix is changing secretKey — the output key name in the Kubernetes secret — from apiKey to privateKey. The remoteRef.property stays as apiKey because that is what it is called in OpenBao and that does not need to change.

data:
  - secretKey: privateKey   # ← fixed
    remoteRef:
      key: cert-manager/transip
      property: apiKey
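
For context, the fragment above belongs to an ExternalSecret that looks roughly like the sketch below. The resource name and namespace come from the commands in this post; the apiVersion, refreshInterval, store kind, and store name are assumptions for illustration and will differ per setup.

```yaml
apiVersion: external-secrets.io/v1beta1   # assumption: depends on the installed ESO release
kind: ExternalSecret
metadata:
  name: transip-secret
  namespace: cert-manager
spec:
  refreshInterval: 1h                     # assumption
  secretStoreRef:
    kind: ClusterSecretStore              # assumption
    name: openbao                         # hypothetical store name
  target:
    name: transip-secret                  # the Secret referenced by the ClusterIssuer
  data:
    - secretKey: privateKey               # key name in the Kubernetes Secret, must match the ClusterIssuer
      remoteRef:
        key: cert-manager/transip         # path in OpenBao
        property: apiKey                  # field name in OpenBao, unchanged
```

The two names that have to agree are secretKey here and key under privateKeySecretRef in the ClusterIssuer; nothing validates that pairing at apply time.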

Applied from the infra/certificates/ directory:

kubectl apply -f externalsecret-transip.yaml
externalsecret.external-secrets.io/transip-secret configured

ESO resynced the secret immediately on apply. Verifying the secret now has the correct key:

kubectl get secret transip-secret -n cert-manager -o jsonpath='{.data}' | python3 -m json.tool

{
    "privateKey": "PRIVATE_KEY=="
}

If the secret does not resync immediately, you can force a sync with:

kubectl annotate externalsecret transip-secret -n cert-manager force-sync=$(date +%s) --overwrite

The secret is now correct but the pending challenge does not retry on its own. Deleting the order forces cert-manager to create a new one:

kubectl delete order prometheus-tls-1-1846305707 -n monitoring
order.acme.cert-manager.io "prometheus-tls-1-1846305707" deleted from monitoring namespace

cert-manager creates a new order immediately. Confirming the certificate is issued:

kubectl get certificate -n monitoring

NAME               READY   SECRET             AGE
alertmanager-tls   True    alertmanager-tls   27d
grafana-tls        True    grafana-tls        49d
prometheus-tls     True    prometheus-tls     147m

Why Renewal Worked But New Issuance Didn't

The TransIP secret was migrated to OpenBao in Secret Management Part 1 and verified at the time by deleting the it-tools-tls certificate secret and watching cert-manager renew it:

kubectl delete secret it-tools-tls -n it-tools
kubectl get certificate it-tools-tls -n it-tools -w

NAME           READY
it-tools-tls   True
it-tools-tls   False   ← secret deleted, renewal triggered
it-tools-tls   False   ← DNS-01 challenge in progress
it-tools-tls   True    ← new certificate issued

That test passed. The certificate renewed successfully. The chain appeared complete.

It was not a sufficient test.

Re-issuing the it-tools-tls certificate happened not to hit the broken code path. The hostname had been validated recently, and one plausible explanation, unconfirmed here, is that the ACME server still held a valid authorization for it, so the TransIP webhook never had to present a new TXT record. A certificate for a hostname that has never been validated always triggers a fresh DNS-01 challenge that calls the TransIP webhook, which is exactly where the key name mismatch breaks things. The bug was latent after the migration, undetectable by renewal testing alone, and only surfaced when Prometheus became the first new certificate requested after OpenBao was set up.

The correct validation test is to request a certificate for a hostname that has never had one before, or to delete the ACME order entirely and force a full re-issuance. A renewal test confirms the ACME account is valid and the secret exists; it does not confirm the webhook can authenticate with the upstream DNS provider.
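
A minimal sketch of such a test is a throwaway Certificate for a hostname that has never been issued before. The resource name, secret name, and hostname below are hypothetical; the issuer is the one used throughout this post. DNS-01 validation does not require the hostname to resolve to anything.

```yaml
# Throwaway Certificate to exercise the full DNS-01 path, including the TransIP webhook.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: issuance-test                     # hypothetical
  namespace: monitoring
spec:
  secretName: issuance-test-tls           # hypothetical
  dnsNames:
    - issuance-test.bletchley.vluwte.nl   # hypothetical, must be a name never validated before
  issuerRef:
    name: letsencrypt-production          # counts against Let's Encrypt rate limits
    kind: ClusterIssuer
    group: cert-manager.io
```

Once the Certificate reports Ready, delete it together with its secret. If it stalls, kubectl describe challenge in the same namespace shows the reason.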

The Authelia Rolling Update Snag

With the certificate sorted, the Authelia Helm upgrade triggered a rolling update — a new pod created to replace the old one. The new pod stayed in ContainerCreating:

authelia-79c698cfd7-lfz75   0/1     ContainerCreating   0          2m11s

The cause:

Warning  FailedAttachVolume  attachdetach-controller
Multi-Attach error for volume "pvc-f70b6bb3-..."
Volume is already used by pod(s) authelia-76bf74c4ff-ncnnd

Authelia's Longhorn PVC is RWO (ReadWriteOnce), meaning only one node can attach it at a time. The old pod holds the volume, and the new pod cannot start until the old one releases it. Because the two pods were scheduled on different nodes, the rolling update deadlocks: the new pod waits for a volume that the old pod will not release until the new pod is ready.

The fix is to scale to zero first, clearing all pods and releasing the volume, then scale back to one:

kubectl scale deployment authelia -n authelia --replicas=0
kubectl get pods -n authelia

NAME                    READY   STATUS    RESTARTS   AGE
redis-8db4d4fc6-lgktk   1/1     Running   0          3d13h

kubectl scale deployment authelia -n authelia --replicas=1
kubectl get pods -n authelia

NAME                        READY   STATUS    RESTARTS   AGE
authelia-79c698cfd7-4vslm   1/1     Running   0          40s
redis-8db4d4fc6-lgktk       1/1     Running   0          3d13h

This is a known characteristic of RWO volumes with single-replica deployments. The brief downtime during scale-to-zero is acceptable for Authelia — it is not in the critical path for cluster operation, and the scale cycle takes under a minute.

For single-replica deployments on RWO storage, setting strategy.type: Recreate in the Helm values automates this — the old pod is terminated before the new one starts, releasing the volume cleanly. Worth applying to Authelia, Grafana, and Forgejo before the next upgrade cycle.
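
Whatever the chart calls the value, the rendered Deployment has to end up with the field shown below. This is a fragment, not a complete manifest: selector and pod template are omitted, and the Helm values key that produces it is chart-specific and not shown here.

```yaml
# Fragment of the rendered Deployment; only the fields relevant to the update strategy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: authelia
  namespace: authelia
spec:
  replicas: 1
  strategy:
    type: Recreate    # terminate the old pod first so the RWO volume detaches before the new pod starts
  # selector and template omitted
```

A quick check after the next upgrade is kubectl get deployment authelia -n authelia -o jsonpath='{.spec.strategy.type}', which should print Recreate.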

[Screenshot: the Authelia Sign In page at https://auth.bletchley.vluwte.nl with a redirect parameter pointing back to prometheus.bletchley.vluwte.nl]
Hitting prometheus.bletchley.vluwte.nl unauthenticated redirects to the Authelia login portal.

[Screenshot: the Prometheus query interface at https://prometheus.bletchley.vluwte.nl/query showing the expression input field and a "No data queried yet" message]
Prometheus accessible at its real hostname after authentication, with no port-forward required.

What's Working Now

✅ https://prometheus.bletchley.vluwte.nl accessible via Traefik with a valid Let's Encrypt certificate
✅ Authelia ForwardAuth protecting Prometheus — unauthenticated requests redirect to the login portal
✅ TransIP ExternalSecret key name corrected in code — secretKey: privateKey
✅ cert-manager → ESO → OpenBao chain validated end-to-end with a new certificate issuance
✅ No more kubectl port-forward to access Prometheus

Lessons Learned

The ClusterIssuer tells you what key name is expected — no need to read webhook documentation. kubectl get clusterissuer <name> -o yaml shows the key: field under privateKeySecretRef directly.
The full certificate pipeline in one command — kubectl get certificate,certificaterequest,order,challenge -n <namespace> shows every stage at once and makes it immediately clear which step has stalled.
kubectl describe challenge is the diagnostic endpoint — when an order is stuck pending, the challenge describe output gives the exact error. Everything above it in the chain is healthy; the challenge is where failures surface.
Renewal is not a sufficient validation test for the DNS-01 webhook — renewal reuses existing state and may succeed even when new issuance would fail. Test with a genuinely new certificate to validate the full chain.
RWO volumes and rolling updates don't mix cleanly — for single-replica deployments on RWO storage, scale to zero before upgrading rather than relying on a rolling update that will stall waiting for volume handoff. Setting strategy.type: Recreate in Helm values automates this for future upgrades.


Questions or suggestions? Leave a comment below or reach out at igor@vluwte.nl.
