Disaster Recovery - Overview
Scope: Complete homelab rebuild from zero. Use this guide when the physical Proxmox server (chronobyte) is destroyed, unbootable, or otherwise unrecoverable. A reader with access to the Bitwarden vault and minimal technical knowledge can follow this guide end-to-end and restore the entire homelab stack.
What is this Homelab?
This homelab runs a small Kubernetes cluster on a single physical server using several layers of technology stacked on top of each other. Here is what each layer does:
graph TD
apps["Your Applications<br/><i>Jellyfin, Uptime Kuma, Authentik, etc.</i>"]
k3s["Kubernetes (k3s)<br/><i>orchestrates containers</i>"]
vms["Virtual Machines<br/><i>4 VMs created by OpenTofu</i>"]
proxmox["Proxmox VE<br/><i>hypervisor running on bare metal</i>"]
server["Physical Server - chronobyte"]
apps --> k3s
k3s --> vms
vms --> proxmox
proxmox --> server
subgraph cloud["Supporting services (external / cloud)"]
ts["Tailscale - secure private networking"]
cf["Cloudflare - public DNS records"]
s3["AWS S3 - backups and infrastructure state"]
gh["GitHub - all code and Kubernetes manifests"]
bw["Bitwarden - stores all secrets/credentials"]
end
The key insight: everything except the physical hardware is either stored in the cloud (code in GitHub, state in AWS S3, backups in S3, secrets in Bitwarden) or can be recreated automatically. This is why recovery is possible from zero.
Recovery Phases
Follow these phases in order. Each phase links to a dedicated document with full details.
| # | Phase | Time | Description |
|---|---|---|---|
| 0 | Prerequisites | 15 min | Verify Bitwarden vault and gather all credentials |
| 1 | External Services | 10 min | Confirm GitHub, AWS S3, Cloudflare, Tailscale are intact |
| 2 | Proxmox Rebuild | 30–45 min | Install Proxmox VE, configure network, create VM template |
| 3 | OpenTofu Apply | 15 min | Provision all 4 VMs, DNS records, S3 bucket, Tailscale keys |
| 4 | k3s Cluster | 20 min | Deploy Kubernetes control plane and worker nodes |
| 5 | Flux Bootstrap | 15 min | Install Flux CD and trigger GitOps reconcile |
| 6 | Secrets Restore | 10 min | Apply secrets that cannot be stored in git |
| 7 | Validation | 15 min | Verify all services are healthy |
Total estimated time: 2–3 hours (assuming Proxmox installs cleanly and VMs boot without issues)
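
As a quick sanity check on the estimate above, the per-phase times can be summed. The sketch below only transcribes the Time column of the table; the names `PHASES` and `total_minutes` are illustrative and not part of the homelab repository.

```python
# Per-phase estimates transcribed from the Recovery Phases table,
# as (phase name, minimum minutes, maximum minutes).
PHASES = [
    ("Prerequisites", 15, 15),
    ("External Services", 10, 10),
    ("Proxmox Rebuild", 30, 45),
    ("OpenTofu Apply", 15, 15),
    ("k3s Cluster", 20, 20),
    ("Flux Bootstrap", 15, 15),
    ("Secrets Restore", 10, 10),
    ("Validation", 15, 15),
]


def total_minutes(phases):
    """Return the (best-case, worst-case) total in minutes."""
    low = sum(minimum for _, minimum, _ in phases)
    high = sum(maximum for _, _, maximum in phases)
    return low, high


low, high = total_minutes(PHASES)
print(f"Sum of phase estimates: {low}-{high} min")  # 130-145 min
```

The phases add up to 130-145 minutes, so the 2–3 hour total leaves some slack for troubleshooting between phases.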
What is NOT Lost in a Hardware Failure
Because critical state lives outside the physical server:
| Data | Location | Notes |
|---|---|---|
| All Kubernetes manifests | GitHub (hexabyte8/homelab) | Everything needed to redeploy every service |
| OpenTofu infrastructure state | AWS S3 (chronobyte-homelab-tf-state) | Tracks all created VMs, DNS records, S3 buckets |
| Game server backups | AWS S3 | Versioned, AES-256 encrypted, 90-day retention |
| All credentials and secrets | Bitwarden | Single source of truth for all passwords and keys |
Technology Reference
Not sure what a technology does or how it works? These pages explain each one from scratch:
| Technology | What it does |
|---|---|
| Proxmox VE | Hypervisor - runs virtual machines on bare metal |
| OpenTofu | Infrastructure as Code - creates VMs, DNS records, S3 buckets automatically |
| Kubernetes / k3s | Container orchestration - runs and manages containers across multiple nodes |
| Flux CD & GitOps | Continuous delivery - keeps cluster state in sync with GitHub |
| Tailscale | Private networking - secure VPN mesh between all machines |
| Cloudflare | Public DNS, Tunnel, and Email Routing - maps domain names and forwards inbound mail |
| Longhorn | Distributed storage - persistent volumes for Kubernetes workloads |
| Ansible | Configuration management - automates node setup and application deployment |
| AWS S3 | Object storage - holds game server backups and OpenTofu state |
| Bitwarden | Secrets management - stores all credentials securely |
| GitHub Actions | CI/CD - automates deployment workflows |
Services Running on the Cluster
| Service | Namespace | Access | Purpose |
|---|---|---|---|
| Flux CD | flux-system | - (cluster internal) | GitOps controller |
| Authentik | authentik | https://authentik.tailnet.ts.net | SSO / Identity Provider |
| Stalwart | stalwart | https://mail.tailnet.ts.net | Email server + webmail |
| AdGuard Home | adguard | https://adguard.tailnet.ts.net | DNS-based ad blocking |
| Jellyfin | jellyfin | https://jellyfin.tailnet.ts.net | Media server |
| Uptime Kuma | uptime-kuma | https://status.example.com (public) | Uptime monitoring |
| MkDocs/Zensical | GitHub Pages | https://docs.chronobyte.net | This documentation |
| Longhorn | longhorn-system | https://longhorn.tailnet.ts.net | Storage UI |
| cert-manager | cert-manager | - (cluster internal) | Automatic TLS certificates |
| cloudflared | cloudflared | - (cluster internal) | Cloudflare Tunnel daemon |
| Traefik | kube-system | - (cluster internal) | Ingress reverse proxy |
| MetalLB | metallb-system | - (cluster internal) | LoadBalancer IP assignment |
| Tailscale Operator | tailscale | - (cluster internal) | Tailscale Ingress provisioner |
| CNPG | cnpg-system | - (cluster internal) | PostgreSQL operator |
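
During validation, the tailnet-facing services in this table can be probed from any machine connected to the tailnet. The following is a minimal sketch, not the repository's own tooling: it assumes the URLs are usable exactly as written in the table, treats any HTTP status below 500 as "up" (a login page or 401 still shows the service is answering), and the names `SERVICE_URLS`, `http_probe`, and `check_services` are hypothetical.

```python
import urllib.error
import urllib.request

# Tailnet URLs copied from the services table; reachable only from a
# machine that is connected to the tailnet.
SERVICE_URLS = {
    "Authentik": "https://authentik.tailnet.ts.net",
    "Stalwart": "https://mail.tailnet.ts.net",
    "AdGuard Home": "https://adguard.tailnet.ts.net",
    "Jellyfin": "https://jellyfin.tailnet.ts.net",
    "Longhorn": "https://longhorn.tailnet.ts.net",
}


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; a 4xx still means the service responded.
        return err.code < 500
    except OSError:
        # Covers URLError, DNS failures, timeouts, and TLS errors.
        return False


def check_services(urls, probe=http_probe):
    """Map each service name to True (reachable) or False."""
    return {name: probe(url) for name, url in urls.items()}
```

Calling `check_services(SERVICE_URLS)` returns a dictionary of service name to reachability. The `probe` parameter is injectable so the logic can be exercised without network access.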
Quick-Reference: Infrastructure Map
graph TD
subgraph phys["Physical Server: chronobyte<br/>OS: Proxmox VE | LAN: <proxmox-lan-ip> | Tailscale: chronobyte.tailnet.ts.net"]
srv["VM: k3s-server (VMID 102)<br/>LAN: <k3s-server-lan-ip> | Tailscale: <k3s-server-ts-ip><br/>Role: Kubernetes control plane"]
ag1["VM: k3s-agent-1 (VMID 101)<br/>LAN: <k3s-agent-1-lan-ip> | Tailscale: <k3s-agent-1-ts-ip><br/>Role: Kubernetes worker node"]
ag2["VM: k3s-agent-2 (VMID 103)<br/>LAN: <k3s-agent-2-lan-ip> | Tailscale: <k3s-agent-2-ts-ip><br/>Role: Kubernetes worker node"]
gs["VM: game-server (VMID 104)<br/>LAN: DHCP | Tailscale: auto-assigned<br/>Role: Minecraft game server"]
end
srv --- ag1
srv --- ag2
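
The map above can also be held as a small inventory, which is handy for confirming after Phase 3 that every expected VM exists. This is only a sketch: the VM names, VMIDs, and roles come from the map, while the `VM` class and `missing_vms` helper are hypothetical, and the list of found VMIDs is assumed to be gathered by hand (for example from the Proxmox UI or `qm list`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    vmid: int
    role: str


# VM names, VMIDs, and roles as listed in the infrastructure map.
INVENTORY = [
    VM("k3s-agent-1", 101, "Kubernetes worker node"),
    VM("k3s-server", 102, "Kubernetes control plane"),
    VM("k3s-agent-2", 103, "Kubernetes worker node"),
    VM("game-server", 104, "Minecraft game server"),
]


def missing_vms(expected, found_vmids):
    """Return the expected VMs whose VMID is absent from found_vmids."""
    found = set(found_vmids)
    return [vm for vm in expected if vm.vmid not in found]


# Example: only VMIDs 101 and 102 exist so far.
missing = missing_vms(INVENTORY, [101, 102])
print([vm.name for vm in missing])  # ['k3s-agent-2', 'game-server']
```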
Before You Start
- Read the Prerequisites page first - without the Bitwarden secrets, recovery is impossible.
- Do not skip steps - each phase depends on the previous one completing successfully.
- Use GitHub Actions where possible - the recommended path uses pre-built workflows that handle secrets automatically.
- Check the Technology Reference pages if you are unfamiliar with a tool before using it.
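
The "do not skip steps" rule amounts to running the phases strictly in order and stopping at the first failure. The sketch below illustrates that control flow with stand-in steps; `run_phases` and the `demo` list are hypothetical, and each real step would be the manual procedure or workflow described in that phase's own document.

```python
from typing import Callable, List, Tuple

Phase = Tuple[str, Callable[[], bool]]


def run_phases(phases: List[Phase]) -> List[str]:
    """Run phases in order and stop at the first one that reports failure.

    Returns the names of the phases that completed successfully.
    """
    completed = []
    for name, step in phases:
        if not step():
            print(f"Phase '{name}' failed; fix it before continuing.")
            break
        completed.append(name)
    return completed


# Stand-in steps for illustration only.
demo = [
    ("Prerequisites", lambda: True),
    ("External Services", lambda: True),
    ("Proxmox Rebuild", lambda: False),  # simulate a failed install
    ("OpenTofu Apply", lambda: True),
]
done = run_phases(demo)
print(done)  # ['Prerequisites', 'External Services']
```

Because the loop breaks on failure, "OpenTofu Apply" is never attempted in this example, mirroring the requirement that each phase depends on the previous one completing successfully.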