Phase 7: Validation
Time estimate: ~15 minutes
Work through this checklist from top to bottom to confirm a successful recovery. Every item must pass before the homelab is considered fully restored.
7.1 Kubernetes Nodes
All 3 nodes must be Ready with Tailscale IPs:
sudo kubectl get nodes -o wide
Expected output:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP
k3s-server Ready control-plane,master Xm v1.x.x <k3s-server-ts-ip> <none>
k3s-agent-1 Ready <none> Xm v1.x.x <k3s-agent-1-ts-ip> <none>
k3s-agent-2 Ready <none> Xm v1.x.x <k3s-agent-2-ts-ip> <none>
✅ Pass: All 3 nodes Ready, all INTERNAL-IP values start with 100.
❌ Fail - LAN IPs shown: Re-run the k3s Ansible playbook (the flannel-iface config was not applied)
❌ Fail - node not Ready: Check node logs: sudo kubectl describe node <name>
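The pass criteria for this section can also be checked mechanically. The sketch below is one way to do it, assuming the node list comes from kubectl get nodes -o wide --no-headers (where INTERNAL-IP is the sixth column). check_nodes is a hypothetical helper name, not something from this repository.

```shell
# check_nodes: reads `kubectl get nodes -o wide --no-headers` output on stdin.
# Fails if the node count is not 3, if any node is not Ready, or if any
# INTERNAL-IP does not start with "100." (the criterion used in this guide).
check_nodes() {
  awk '
    { total++ }
    $2 != "Ready"  { printf "FAIL %s: status %s\n", $1, $2; bad++ }
    $6 !~ /^100\./ { printf "FAIL %s: INTERNAL-IP %s does not start with 100.\n", $1, $6; bad++ }
    END {
      if (total != 3) { printf "FAIL expected 3 nodes, found %d\n", total; bad++ }
      if (bad) exit 1
      print "PASS all 3 nodes Ready with Tailscale IPs"
    }'
}

# Usage (against the live cluster):
#   sudo kubectl get nodes -o wide --no-headers | check_nodes
```

The function exits non-zero on any failure, so it can gate a larger validation script.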
7.2 Flux Kustomizations
All managed Kustomizations must be Ready:
flux get kustomizations -n flux-system
Expected output:
NAME READY STATUS
flux-system True Applied revision: main@sha1:...
apps True Applied revision: main@sha1:...
adguard True Applied revision: main@sha1:...
authentik True Applied revision: main@sha1:...
...
Investigate a failing kustomization:
flux get kustomization <name> -n flux-system --verbose
kubectl describe kustomization <name> -n flux-system
# Look at the "Message" field for error details
Force reconciliation:
flux reconcile kustomization <name> -n flux-system
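When several Kustomizations are failing at once, it can help to generate the reconcile commands rather than type them one by one. A minimal sketch, assuming the input is one "<name> <ready>" pair per line. not_ready and reconcile_cmds are hypothetical helper names, and the commands are only printed, not executed.

```shell
# not_ready: reads "<name> <ready>" pairs on stdin and prints the names
# whose Ready value is anything other than True.
not_ready() {
  awk '$2 != "True" { print $1 }'
}

# reconcile_cmds: turns each name on stdin into a reconcile command line.
# It only prints the commands so they can be reviewed before running.
reconcile_cmds() {
  while read -r name; do
    if [ -n "$name" ]; then
      printf 'flux reconcile kustomization %s -n flux-system\n' "$name"
    fi
  done
}

# Usage (against the live cluster):
#   kubectl get kustomizations.kustomize.toolkit.fluxcd.io -n flux-system --no-headers \
#     -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status' \
#     | not_ready | reconcile_cmds
```

A Kustomization with no Ready condition yet appears as <none> in the custom column and is therefore listed as not ready.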
7.4 Cloudflare DNS Records and Tunnel
Log in to dash.cloudflare.com and confirm the zone for example.com is active. All DNS records are managed by OpenTofu, so running tofu apply recreates any that are missing. Verify these are present:
| Record | Type | Value |
|---|---|---|
| mail.example.com | CNAME | <tunnel-id>.cfargotunnel.com (proxied) |
| status.example.com | CNAME | <tunnel-id>.cfargotunnel.com (proxied) |
| resend._domainkey.example.com | TXT | DKIM key from Resend dashboard |
| send.example.com | MX | feedback-smtp.us-east-1.amazonses.com |
| send.example.com | TXT | v=spf1 include:amazonses.com ~all |
| _dmarc.example.com | TXT | v=DMARC1; p=none; |
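The unproxied MX and TXT records can also be confirmed from the command line. The sketch below assumes dig is installed and that the answers come from dig +short. expect_record is a hypothetical helper name. Proxied CNAMEs resolve to Cloudflare edge addresses in public DNS, so those are easier to confirm in the dashboard.

```shell
# expect_record LABEL EXPECTED: reads `dig +short` output on stdin and checks
# that at least one answer line contains EXPECTED as a literal substring.
expect_record() {
  label=$1
  expected=$2
  if grep -qF -- "$expected"; then
    echo "PASS $label"
  else
    echo "FAIL $label: no answer containing: $expected"
    return 1
  fi
}

# Usage (needs network access):
#   dig +short MX  send.example.com   | expect_record send-mx  'feedback-smtp.us-east-1.amazonses.com'
#   dig +short TXT send.example.com   | expect_record send-spf 'v=spf1 include:amazonses.com ~all'
#   dig +short TXT _dmarc.example.com | expect_record dmarc    'v=DMARC1; p=none;'
```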
Verify the Cloudflare Tunnel is connected:
kubectl logs -n cloudflared deployment/cloudflared --since=5m | grep -E "connect|registered|error"
# Should show: "Connection registered" - no errors
Verify Email Routing is enabled (inbound mail forwarding):
- Cloudflare dashboard → the example.com zone → Email → Email Routing
- Confirm status shows Enabled
- Confirm destination admin@example.com shows Verified
- If Unverified: click the address and resend the verification email
✅ Pass: Tunnel connected, DNS records present, Email Routing enabled and verified
❌ Fail - tunnel not connected: Check cloudflared-tunnel-credentials secret was patched (Phase 6.2)
7.5 MetalLB Load Balancer
MetalLB provides LoadBalancer-type IP addresses from the LAN IP pool <metallb-pool-range>:
Expected:
NAME READY STATUS RESTARTS
controller-<hash> 1/1 Running 0
speaker-<hash> 1/1 Running 0
speaker-<hash> 1/1 Running 0
speaker-<hash> 1/1 Running 0
Expected: A pool covering <metallb-pool-range> with status Auto Assigned.
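A sketch of how these two checks might be run, assuming MetalLB is installed in its default metallb-system namespace (an assumption, since the namespace is not stated here). check_pods is a hypothetical helper that reads a pod listing and flags anything that is not fully Running.

```shell
# check_pods: reads `kubectl get pods --no-headers` output on stdin.
# Completed pods are accepted; every other pod must be Running with all
# of its containers ready (READY column x/y where x equals y).
check_pods() {
  awk '
    { seen++ }
    $3 == "Completed" { next }
    { split($2, r, "/") }
    $3 != "Running" || r[1] != r[2] { printf "FAIL %s: %s %s\n", $1, $2, $3; bad++ }
    END {
      if (!seen) { print "FAIL no pods found"; exit 1 }
      if (bad) exit 1
      printf "PASS %d pods healthy\n", seen
    }'
}

# Usage (namespace is an assumption, adjust if yours differs):
#   sudo kubectl -n metallb-system get pods --no-headers | check_pods
#   sudo kubectl -n metallb-system get ipaddresspools.metallb.io
```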
7.6 Cross-Node Pod Communication (Flannel Health)
Verify Flannel over Tailscale is working by testing cross-node DNS resolution:
# Launch a temporary test pod and run a DNS lookup
sudo kubectl run dnstest \
--image=busybox:1.35 \
--restart=Never \
--rm \
-it \
-- nslookup kubernetes.default.svc.cluster.local
Expected output:
Server: 10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default.svc.cluster.local
Address 1: 10.43.0.1 kubernetes.default.svc.cluster.local
✅ Pass: DNS resolves successfully
❌ Fail - command hangs: Flannel VXLAN is broken
If DNS hangs, investigate Flannel:
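# Check Flannel is using tailscale0
# k3s runs Flannel inside the k3s process, so look in the service logs on the node
# (assuming default service names: k3s on the server, k3s-agent on agents)
sudo journalctl -u k3s --no-pager | grep -iE "flannel|tailscale|iface" | tail -n 30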
# On any affected node, check if flannel.1 interface exists
# (run via kubectl debug or node SSH)
ssh ubuntu@k3s-agent-1.tailnet.ts.net "ip link show flannel.1"
If flannel.1 is missing, see the Flannel over Tailscale guide
for the manual recovery procedure.
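Because a broken overlay makes the test pod hang, a non-interactive variant with a time limit can be easier to script. This is a sketch under assumptions: dns_resolved is a hypothetical helper, the 60-second limit is arbitrary, and the timeout utility from coreutils is assumed to be available.

```shell
# dns_resolved: reads nslookup output on stdin and passes only if a
# "Name:" line is followed by at least one "Address" line.
dns_resolved() {
  if awk '/^Name:/ { named = 1; next } named && /^Address/ { ok = 1 } END { exit !ok }'; then
    echo "PASS DNS resolves"
  else
    echo "FAIL no address returned for the service name"
    return 1
  fi
}

# Usage (against the live cluster):
#   sudo timeout 60 kubectl run dnstest --image=busybox:1.35 --restart=Never --rm -i \
#     -- nslookup kubernetes.default.svc.cluster.local | dns_resolved
# If the time limit is hit, the pod may be left behind; remove it with:
#   sudo kubectl delete pod dnstest
```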
7.7 Tailscale Operator
# Check the operator is running
sudo kubectl -n tailscale get pods
# All pods: Running
# Check that the operator is connected to the tailnet
sudo kubectl -n tailscale logs -l app=operator --tail=20
# Should show: "logged in" or "reconciling" - not authentication errors
# Verify a Tailscale ingress has an address assigned (e.g. Authentik)
sudo kubectl -n authentik get ingress authentik -o jsonpath='{.status.loadBalancer}'
7.8 Longhorn Storage
# Check all Longhorn pods are running
sudo kubectl -n longhorn-system get pods
# All should show Running or Completed
# Check Longhorn nodes (should show all 3 k3s nodes)
sudo kubectl -n longhorn-system get nodes.longhorn.io
Expected: All 3 nodes listed with READY=True and conditions showing healthy disk and
networking status.
Access Longhorn UI (if the Longhorn dashboard ingress is configured):
- Should be accessible via Tailscale at a URL defined in k3s/manifests/
Reference: Longhorn documentation
7.9 AWS S3 Backups
# Verify the S3 bucket is accessible
aws s3 ls s3://<S3_BACKUP_BUCKET_NAME> --region us-east-1
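# If the bucket has existing backups, list them to confirm the expected objects are present
# (a listing shows names, sizes, and timestamps only; it does not verify file contents)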
aws s3 ls s3://<S3_BACKUP_BUCKET_NAME>/ --recursive --human-readable
✅ Pass: Bucket is accessible and lists backup objects
❌ Fail - access denied: Check AWS credentials are correct
❌ Fail - bucket not found: Run tofu apply to recreate the bucket
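A quick way to see how recent the newest backup is. This is a sketch that assumes the default aws s3 ls --recursive output, where every line starts with the object's date and time. latest_backup is a hypothetical helper name.

```shell
# latest_backup: reads `aws s3 ls --recursive` output on stdin and prints
# the line with the most recent timestamp (the first two fields).
latest_backup() {
  newest=$(LC_ALL=C sort -k1,2 | tail -n 1)
  if [ -z "$newest" ]; then
    echo "FAIL no objects listed"
    return 1
  fi
  echo "newest object: $newest"
}

# Usage (needs AWS credentials):
#   aws s3 ls s3://<S3_BACKUP_BUCKET_NAME>/ --recursive --region us-east-1 | latest_backup
```

Sorting as plain text works here because the timestamps are in YYYY-MM-DD HH:MM:SS form.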
7.10 cert-manager
# ClusterIssuer must be Ready
kubectl get clusterissuer letsencrypt-production -o jsonpath='{.status.conditions[0].message}'
# Expected: "The ACME account was registered with the ACME server"
# Verify no certificates are failing (READY is the third column with --all-namespaces)
kubectl get certificates --all-namespaces --no-headers | awk '$3 != "True"'
# No output means all certs are issued
✅ Pass: ClusterIssuer Ready, no failed certificates
❌ Fail - ACME registration failing: Check cert-manager logs (kubectl logs -n cert-manager deployment/cert-manager)
❌ Fail - certificate not issued: Cloudflare Tunnel must be working first (Section 7.4) so HTTP-01 challenges can reach the cluster
7.11 Authentik SSO
# Pods should be running
kubectl get pods -n authentik
# Expected: authentik-server-*, authentik-worker-*, postgresql-* all Running
# Check server is healthy
kubectl logs -n authentik deployment/authentik-server --since=2m | grep -i error | tail -5
# Should be empty (no errors)
Log in to Authentik:
- Open https://authentik.tailnet.ts.net
- Log in as akadmin with the bootstrap password (from Phase 6.3)
- Navigate to Applications → Applications and verify your configured apps are listed
- Navigate to Applications → Outposts and verify the Embedded Outpost shows as healthy
If applications are missing
Authentik application/provider config is stored in its PostgreSQL database. If the CNPG
cluster PVC survived, the config is intact. If the PVC was wiped, you need to manually
recreate providers and applications - see docs/authentik.md for the procedure.
✅ Pass: All pods Running, UI accessible, apps and outpost present
❌ Fail - pods CrashLoopBackOff: Usually a bad secret-key - verify authentik-credentials was patched (Phase 6.3)
❌ Fail - database connection refused: CNPG cluster may need time to come up; wait 5 minutes and retry
7.12 Stalwart Email Server
# Pod should be running
kubectl get pods -n stalwart
# Expected: stalwart-* Running
# Check logs for startup errors
kubectl logs -n stalwart deployment/stalwart --since=2m | grep -iE "error|panic|failed" | head -10
# Should be empty
Log in to the admin UI:
- Open https://mail.tailnet.ts.net
- Log in as admin with the password from Phase 6.4
- Navigate to Directory → Accounts and verify noreply@example.com exists
Send a test email through Authentik:
kubectl exec -n authentik deployment/authentik-worker -- ak test_email admin@example.com 2>&1 | \
grep -E "email_sent|error" | tail -3
# Expected: "message": "Email to admin@example.com sent"
Check admin@example.com inbox (or Resend dashboard at resend.com) to confirm delivery.
✅ Pass: Pod running, admin UI accessible, test email delivered
❌ Fail - pod not starting: Check stalwart-secrets was patched (Phase 6.4); check logs for config parse errors
❌ Fail - auth rejected (535): SMTP username must be noreply (short form), not noreply@example.com
❌ Fail - email not delivered: Check Resend dashboard for bounces; verify resend-api-key is correct
7.13 Full End-to-End Test
The ultimate test: make a change to the GitHub repository and verify Flux applies it automatically.
# On your laptop (or any machine with git and kubectl)
# 1. Make a trivial change to a manifest (e.g., add a harmless annotation)
# 2. Commit and push to main
git add . && git commit -m "test: validate Flux reconcile" && git push origin main
# 3. Wait ~10 minutes for Flux to poll (or force immediately):
flux reconcile source git homelab -n flux-system
flux reconcile kustomization apps -n flux-system
# 4. Verify the change was applied
sudo kubectl get <resource> -n <namespace> -o yaml | grep <your-annotation>
✅ Pass: Change appears in the cluster within 10 minutes
❌ Fail: Check flux get sources git -n flux-system and verify the SSH deploy key is correct
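Instead of re-running step 4 by hand, the check can be polled until it succeeds or the 10-minute window expires. A minimal sketch: wait_until is a hypothetical helper name, and the 15-second interval is an arbitrary choice.

```shell
# wait_until TIMEOUT INTERVAL CMD...: re-runs CMD every INTERVAL seconds
# until it succeeds, or fails once TIMEOUT seconds have elapsed.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      echo "FAIL timed out after ${timeout_s}s"
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "PASS condition met after ~${elapsed}s"
}

# Usage (substitute the placeholders from step 4 first):
#   wait_until 600 15 sh -c \
#     'sudo kubectl get <resource> -n <namespace> -o yaml | grep -q <your-annotation>'
```

The elapsed time is approximate because it counts only the sleep intervals, not the time each check takes.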
Recovery Complete! 🎉
If all checks above pass, the homelab has been successfully recovered.
Final checklist:
- All 3 k3s nodes are Ready with Tailscale IPs
- All Flux Kustomizations are Ready
- Cloudflare Tunnel connected, public services (mail.example.com, status.example.com) accessible
- Cloudflare Email Routing enabled and destination address verified
- cert-manager ClusterIssuer Ready, TLS certificates issued
- MetalLB IP pool is configured
- Cross-node pod DNS resolution works
- Tailscale operator is running and authenticated
- Longhorn storage nodes are healthy
- Authentik UI accessible, applications and outpost configured
- Stalwart admin UI accessible at mail.tailnet.ts.net
- Test email sends successfully via Authentik → Stalwart → Resend
- S3 bucket is accessible
- Flux auto-reconciles a test commit from GitHub
Post-Recovery Tasks
- Clean up old Tailscale devices from the previous installation:
  - Go to login.tailscale.com/admin/machines
  - Delete any offline devices from the old installation
- Verify Cloudflare Email Routing destination is still verified (check Cloudflare dashboard → Email Routing)
- Verify backup schedule is running on the game server:
- Document any issues encountered during recovery in the GitHub repository (create an issue or update this guide)
Common Issues
| Symptom | Likely Cause | Fix |
|---|---|---|
| Nodes show LAN IPs | flannel-iface not set | Re-run k3s Ansible playbook |
| Flux kustomizations stuck reconciling | SSH deploy key wrong | Re-create flux-system Git credentials |
| Tailscale devices not appearing | Tailscale auth key expired | Generate new key via OpenTofu |
| MetalLB not assigning IPs | L2Advertisement not reconciled | flux reconcile kustomization metallb-config -n flux-system |
| DNS test pod hangs | Flannel VXLAN broken | See Flannel over Tailscale |
| Tailscale operator auth errors | OAuth secret not applied | Complete Phase 6.1 |
| Longhorn volumes degraded | Node count changed | Allow time for replica rebalancing |
| Cloudflare Tunnel offline | Tunnel token not patched | Complete Phase 6.2 |
| Authentik CrashLoopBackOff | secret-key wrong/missing | Complete Phase 6.3 |
| Stalwart SMTP 535 errors | Username is full email not short name | Use noreply not noreply@example.com |
| Emails not relayed (direct MX) | Resend API key wrong or routing config missing | Check queue.strategy.route in configmap, check DB overrides |