Phase 7: Validation

Time estimate: ~15 minutes

Work through this checklist from top to bottom to confirm a successful recovery. Every item must pass before the homelab is considered fully restored.

7.1 Kubernetes Nodes

All 3 nodes must be Ready with Tailscale IPs:

sudo kubectl get nodes -o wide

Expected output:

NAME          STATUS   ROLES                  AGE   VERSION   INTERNAL-IP       EXTERNAL-IP
k3s-server    Ready    control-plane,master   Xm    v1.x.x    <k3s-server-ts-ip>    <none>
k3s-agent-1   Ready    <none>                 Xm    v1.x.x    <k3s-agent-1-ts-ip>    <none>
k3s-agent-2   Ready    <none>                 Xm    v1.x.x    <k3s-agent-2-ts-ip>     <none>

✅ Pass: All 3 nodes Ready, all INTERNAL-IP values start with 100.
❌ Fail - LAN IPs shown: Re-run the k3s Ansible playbook (the flannel-iface config was not applied)
❌ Fail - node not Ready: Check node logs: sudo kubectl describe node <name>
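
The pass criteria above can also be checked mechanically. The following is a minimal sketch, not part of the repository: `check_nodes` is a hypothetical helper that reads `kubectl get nodes -o wide --no-headers` on stdin and assumes the default column order shown in the expected output (STATUS in column 2, INTERNAL-IP in column 6).

```shell
# Hypothetical helper: reads `kubectl get nodes -o wide --no-headers` on stdin.
# Assumes the default column order (STATUS = column 2, INTERNAL-IP = column 6).
check_nodes() {
  awk '
    { n++ }
    $2 != "Ready"  { printf "not Ready: %s (%s)\n", $1, $2; bad = 1 }
    $6 !~ /^100\./ { printf "LAN IP shown: %s (%s)\n", $1, $6; bad = 1 }
    END {
      if (n != 3) { printf "expected 3 nodes, found %d\n", n; bad = 1 }
      exit bad + 0
    }
  '
}

# Usage: sudo kubectl get nodes -o wide --no-headers | check_nodes && echo "7.1 passed"
```

The function exits non-zero and names the offending node whenever one of the two failure cases listed above applies.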

7.2 Flux Kustomizations

All managed Kustomizations must be Ready:

flux get kustomizations -n flux-system
# Or with kubectl:
kubectl get kustomizations -n flux-system

Expected output:

NAME            READY   STATUS
flux-system     True    Applied revision: main@sha1:...
apps            True    Applied revision: main@sha1:...
adguard         True    Applied revision: main@sha1:...
authentik       True    Applied revision: main@sha1:...
...
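
With many Kustomizations it can be easier to print only the ones that are not Ready. The sketch below assumes the standard `Ready` condition on each Kustomization; `not_ready` is a hypothetical helper that reads tab-separated name/status pairs on stdin.

```shell
# Hypothetical helper: reads "<name><TAB><Ready status>" lines on stdin and
# prints every Kustomization whose Ready condition is not "True".
not_ready() {
  awk -F '\t' '$2 != "True" { print $1; bad = 1 } END { exit bad + 0 }'
}

# Usage (the jsonpath expression emits one name/status pair per line):
# kubectl get kustomizations -n flux-system \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
#   | not_ready
```

No output and a zero exit status mean every Kustomization reports Ready. A Kustomization that has no Ready condition yet is also printed.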

Investigate a failing kustomization:

flux get kustomization <name> -n flux-system --verbose
kubectl describe kustomization <name> -n flux-system
# Look at the "Message" field for error details

Force reconciliation:

flux reconcile kustomization <name> -n flux-system

7.4 Cloudflare DNS Records and Tunnel

Log in to dash.cloudflare.com and confirm the zone for example.com is active. All DNS records are managed by OpenTofu - running tofu apply recreates them. Verify these are present:

| Record | Type | Value |
| --- | --- | --- |
| `mail.example.com` | CNAME | `<tunnel-id>.cfargotunnel.com` (proxied) |
| `status.example.com` | CNAME | `<tunnel-id>.cfargotunnel.com` (proxied) |
| `resend._domainkey.example.com` | TXT | DKIM key from Resend dashboard |
| `send.example.com` | MX | `feedback-smtp.us-east-1.amazonses.com` |
| `send.example.com` | TXT | `v=spf1 include:amazonses.com ~all` |
| `_dmarc.example.com` | TXT | `v=DMARC1; p=none;` |

Verify the Cloudflare Tunnel is connected:

kubectl logs -n cloudflared deployment/cloudflared --since=5m | grep -iE "connect|registered|error"
# Should show: "Connection registered" - no errors

Verify Email Routing is enabled (inbound mail forwarding):

Cloudflare dashboard → the example.com zone → Email → Email Routing
Confirm status shows Enabled
Confirm destination admin@example.com shows Verified
If Unverified: click the address and resend the verification email

✅ Pass: Tunnel connected, DNS records present, Email Routing enabled and verified
❌ Fail - tunnel not connected: Check cloudflared-tunnel-credentials secret was patched (Phase 6.2)
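
The TXT and MX records in the table above can also be spot-checked from the command line. This is a sketch under assumptions: `dig` is installed, and `check_record` is a hypothetical helper, not something the repository provides. Proxied CNAME records are typically answered with Cloudflare edge addresses rather than the `cfargotunnel.com` target, so the sketch covers only the unproxied rows.

```shell
# Resolver command. Assumption: dig is installed; override DIG to use a stub
# or to add options such as "dig @1.1.1.1".
DIG=${DIG:-dig}

# Hypothetical helper. Usage: check_record TYPE NAME EXPECTED_SUBSTRING
check_record() {
  answer=$($DIG +short "$1" "$2")
  case $answer in
    *"$3"*) echo "ok: $1 $2" ;;
    *)      echo "MISSING: $1 $2 (got: ${answer:-no answer})"; return 1 ;;
  esac
}

# Usage, with values taken from the table above:
# check_record TXT _dmarc.example.com 'v=DMARC1'
# check_record TXT send.example.com   'include:amazonses.com'
# check_record MX  send.example.com   'feedback-smtp.us-east-1.amazonses.com'
```

A substring match is used because `dig +short` wraps TXT values in quotes and prefixes MX answers with a priority.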

7.5 MetalLB Load Balancer

MetalLB provides LoadBalancer-type IP addresses from the LAN IP pool <metallb-pool-range>:

# Check MetalLB pods are running
sudo kubectl -n metallb-system get pods

Expected:

NAME                          READY   STATUS    RESTARTS
controller-<hash>             1/1     Running   0
speaker-<hash>                1/1     Running   0
speaker-<hash>                1/1     Running   0
speaker-<hash>                1/1     Running   0

# Verify the IP address pool is configured
sudo kubectl -n metallb-system get ipaddresspool

Expected: A pool covering <metallb-pool-range> with the AUTO ASSIGN column set to true.

7.6 Cross-Node Pod Communication (Flannel Health)

Verify Flannel over Tailscale is working by testing cross-node DNS resolution:

# Launch a temporary test pod and run a DNS lookup
sudo kubectl run dnstest \
  --image=busybox:1.35 \
  --restart=Never \
  --rm \
  -it \
  -- nslookup kubernetes.default.svc.cluster.local

Expected output:

Server:    10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default.svc.cluster.local
Address 1: 10.43.0.1 kubernetes.default.svc.cluster.local

✅ Pass: DNS resolves successfully
❌ Fail - command hangs: Flannel VXLAN is broken
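
The test pod above lands on whichever node the scheduler picks, so a single run exercises only one node. The sketch below repeats the lookup from a pod pinned to each node and turns a hang into a failure after 60 seconds. It rests on assumptions: `dnstest_on_node` is a hypothetical helper, the node names are those from 7.1, and GNU coreutils `timeout` is available.

```shell
# Assumptions: node names as listed in 7.1, GNU coreutils `timeout` on PATH.
# KUBECTL must be a real executable (not a shell function) because
# `timeout` executes it directly.
KUBECTL=${KUBECTL:-sudo kubectl}

# Hypothetical helper: run the DNS lookup from a pod pinned to one node and
# give up after 60 seconds instead of hanging.
dnstest_on_node() {
  node=$1
  timeout 60 $KUBECTL run "dnstest-$node" \
    --image=busybox:1.35 --restart=Never --rm -i \
    --overrides="{\"apiVersion\":\"v1\",\"spec\":{\"nodeName\":\"$node\"}}" \
    -- nslookup kubernetes.default.svc.cluster.local
}

# Usage:
# for node in k3s-server k3s-agent-1 k3s-agent-2; do
#   dnstest_on_node "$node" || echo "DNS lookup failed or timed out on $node"
# done
```

If the timeout fires, kubectl may not get the chance to remove the pod, so delete any leftover `dnstest-<node>` pod by hand before retrying.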

If DNS hangs, investigate Flannel:

# Check Flannel is using tailscale0
sudo kubectl -n kube-system logs -l app=flannel --tail=30 | grep -E "tailscale|iface"

# On any affected node, check if flannel.1 interface exists
# (run via kubectl debug or node SSH)
ssh ubuntu@k3s-agent-1.tailnet.ts.net "ip link show flannel.1"

If flannel.1 is missing, see the Flannel over Tailscale guide for the manual recovery procedure.

7.7 Tailscale Operator

# Check the operator is running
sudo kubectl -n tailscale get pods
# All pods: Running

# Check that the operator is connected to the tailnet
sudo kubectl -n tailscale logs -l app=operator --tail=20
# Should show: "logged in" or "reconciling" - not authentication errors

# Verify a Tailscale ingress has an address assigned (e.g. Authentik)
sudo kubectl -n authentik get ingress authentik -o jsonpath='{.status.loadBalancer}'

7.8 Longhorn Storage

# Check all Longhorn pods are running
sudo kubectl -n longhorn-system get pods
# All should show Running or Completed

# Check Longhorn nodes (should show all 3 k3s nodes)
sudo kubectl -n longhorn-system get nodes.longhorn.io

Expected: All 3 nodes listed with READY=True and conditions showing healthy disk and networking status.
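
Longhorn runs many pods, so scanning the list by eye is error-prone. The sketch below prints only the pods that need attention. `unhealthy_pods` is a hypothetical helper, and it is slightly stricter than the check above: a Running pod is also reported when its READY column shows fewer ready containers than expected.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` on stdin and prints
# pods that are neither Completed nor Running with all containers ready.
unhealthy_pods() {
  awk '
    $3 == "Completed" { next }
    {
      split($2, ready, "/")   # READY column, e.g. "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) { print $1, $2, $3; bad = 1 }
    }
    END { exit bad + 0 }
  '
}

# Usage: sudo kubectl -n longhorn-system get pods --no-headers | unhealthy_pods
```

The same filter should work for the `metallb-system` and `tailscale` namespaces checked in 7.5 and 7.7.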

Access the Longhorn UI (if the Longhorn dashboard ingress is configured): it should be accessible via Tailscale at a URL defined in k3s/manifests/.

Reference: Longhorn documentation

7.9 AWS S3 Backups

# Verify the S3 bucket is accessible
aws s3 ls s3://<S3_BACKUP_BUCKET_NAME> --region us-east-1

# If the bucket has existing backups, list them to confirm they are present
# (a listing shows names, sizes and dates; it does not verify contents)
aws s3 ls s3://<S3_BACKUP_BUCKET_NAME>/ --recursive --human-readable

✅ Pass: Bucket is accessible and lists backup objects
❌ Fail - access denied: Check AWS credentials are correct
❌ Fail - bucket not found: Run tofu apply to recreate the bucket
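
To see how recent the newest backup is, the recursive listing can be sorted by its timestamp columns. This is a minimal sketch: `latest_backup` is a hypothetical helper, and the usage line assumes the bucket name has been exported as an environment variable called `S3_BACKUP_BUCKET_NAME`.

```shell
# Hypothetical helper: reads `aws s3 ls --recursive` output on stdin, where each
# line starts with "YYYY-MM-DD HH:MM:SS", and prints the newest object's line.
latest_backup() {
  LC_ALL=C sort -k1,1 -k2,2 | tail -n 1
}

# Usage (assumes S3_BACKUP_BUCKET_NAME is set in the environment):
# aws s3 ls "s3://$S3_BACKUP_BUCKET_NAME/" --recursive | latest_backup
```

An empty result means the bucket holds no objects at all.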

7.10 cert-manager

# ClusterIssuer must be Ready
kubectl get clusterissuer letsencrypt-production -o jsonpath='{.status.conditions[0].message}'
# Expected: "The ACME account was registered with the ACME server"

# Verify no certificates are failing (READY is the third column with --all-namespaces)
kubectl get certificates --all-namespaces --no-headers | awk '$3 != "True"'
# No output means all certs are issued

✅ Pass: ClusterIssuer Ready, no failed certificates
❌ Fail - ACME registration failing: Check cert-manager logs (kubectl logs -n cert-manager deployment/cert-manager)
❌ Fail - certificate not issued: Cloudflare Tunnel must be working first (Section 7.4) so HTTP-01 challenges can reach the cluster

7.11 Authentik SSO

# Pods should be running
kubectl get pods -n authentik
# Expected: authentik-server-*, authentik-worker-*, postgresql-* all Running

# Check server is healthy
kubectl logs -n authentik deployment/authentik-server --since=2m | grep -i error | tail -5
# Should be empty (no errors)

Log in to Authentik:

Open https://authentik.tailnet.ts.net
Log in as akadmin with the bootstrap password (from Phase 6.3)
Navigate to Applications → Applications - verify your configured apps are listed
Navigate to Applications → Outposts - verify the Embedded Outpost shows as healthy

If applications are missing

Authentik application/provider config is stored in its PostgreSQL database. If the CNPG cluster PVC survived, the config is intact. If the PVC was wiped, you need to manually recreate providers and applications - see docs/authentik.md for the procedure.

✅ Pass: All pods Running, UI accessible, apps and outpost present
❌ Fail - pods CrashLoopBackOff: Usually a bad secret-key - verify authentik-credentials was patched (Phase 6.3)
❌ Fail - database connection refused: CNPG cluster may need time to come up; wait 5 minutes and retry

7.12 Stalwart Email Server

# Pod should be running
kubectl get pods -n stalwart
# Expected: stalwart-* Running

# Check logs for startup errors
kubectl logs -n stalwart deployment/stalwart --since=2m | grep -iE "error|panic|failed" | head -10
# Should be empty

Log in to the admin UI:

Open https://mail.tailnet.ts.net
Log in as admin with the password from Phase 6.4
Navigate to Directory → Accounts - verify noreply@example.com exists

Send a test email through Authentik:

kubectl exec -n authentik deployment/authentik-worker -- ak test_email admin@example.com 2>&1 | \
  grep -E "email_sent|error" | tail -3
# Expected: "message": "Email to admin@example.com sent"

Check the admin@example.com inbox (or the Resend dashboard at resend.com) to confirm delivery.

✅ Pass: Pod running, admin UI accessible, test email delivered
❌ Fail - pod not starting: Check stalwart-secrets was patched (Phase 6.4); check logs for config parse errors
❌ Fail - auth rejected (535): SMTP username must be noreply (short form), not noreply@example.com
❌ Fail - email not delivered: Check Resend dashboard for bounces; verify resend-api-key is correct

7.13 Full End-to-End Test

The ultimate test: make a change to the GitHub repository and verify Flux applies it automatically.

# On your laptop (or any machine with git and kubectl)
# 1. Make a trivial change to a manifest (e.g., add a harmless annotation)
# 2. Commit and push to main
git add . && git commit -m "test: validate Flux reconcile" && git push origin main

# 3. Wait ~10 minutes for Flux to poll (or force immediately):
flux reconcile source git homelab -n flux-system
flux reconcile kustomization apps -n flux-system

# 4. Verify the change was applied
sudo kubectl get <resource> -n <namespace> -o yaml | grep <your-annotation>

✅ Pass: Change appears in the cluster within 10 minutes
❌ Fail: Check flux get sources git -n flux-system and verify the SSH deploy key is correct
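
Rather than re-running the step 4 command by hand, the check can be polled until it succeeds or the 10-minute window runs out. The sketch below is a generic retry loop. `wait_until` and `annotation_present` are hypothetical names, and the upper-case placeholders in the usage comment stand for the same values as in step 4.

```shell
# Hypothetical helper. Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND every INTERVAL_SECONDS until it succeeds; returns 1 once
# TIMEOUT_SECONDS of waiting have been used up.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@"; do
    elapsed=$((elapsed + interval_s))
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
  done
}

# Usage sketch (fill in the placeholders from step 4 before running):
# annotation_present() { sudo kubectl get RESOURCE -n NAMESPACE -o yaml | grep -q ANNOTATION; }
# wait_until 600 15 annotation_present && echo "7.13 passed"
```

The elapsed time counts only the sleeps, so the real wall-clock limit is slightly longer than the timeout when the command itself is slow.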

Recovery Complete! 🎉

If all checks above pass, the homelab has been successfully recovered.

Final checklist:

Post-Recovery Tasks

Clean up old Tailscale devices from the previous installation:
Go to login.tailscale.com/admin/machines
Delete any offline devices from the old installation
Verify Cloudflare Email Routing destination is still verified (check Cloudflare dashboard → Email Routing)

Verify backup schedule is running on the game server:

ssh ubuntu@game-server.tailnet.ts.net
sudo systemctl status minecraft-backup.timer

Document any issues encountered during recovery in the GitHub repository (create an issue or update this guide)

Common Issues

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Nodes show LAN IPs | `flannel-iface` not set | Re-run k3s Ansible playbook |
| Flux kustomizations stuck reconciling | SSH deploy key wrong | Re-create `flux-system` Git credentials |
| Tailscale devices not appearing | Tailscale auth key expired | Generate new key via OpenTofu |
| MetalLB not assigning IPs | L2Advertisement not reconciled | `flux reconcile kustomization metallb-config -n flux-system` |
| DNS test pod hangs | Flannel VXLAN broken | See Flannel over Tailscale |
| Tailscale operator auth errors | OAuth secret not applied | Complete Phase 6.1 |
| Longhorn volumes degraded | Node count changed | Allow time for replica rebalancing |
| Cloudflare Tunnel offline | Tunnel token not patched | Complete Phase 6.2 |
| Authentik CrashLoopBackOff | `secret-key` wrong/missing | Complete Phase 6.3 |
| Stalwart SMTP 535 errors | Username is full email not short name | Use `noreply` not `noreply@example.com` |
| Emails not relayed (direct MX) | Resend API key wrong or routing config missing | Check `queue.strategy.route` in configmap, check DB overrides |