
March 2026. I’m packing my bags for KubeCon Amsterdam with a knot in my stomach. The homelab has been through power outages before, and getting everything back up has never been quick or clean. Three Proxmox nodes, 90+ apps in Kubernetes, and a smart home that controls a lot around the house, including the water system. A power outage while I’m away would be a nightmare. I keep telling myself it won’t happen this time. I’ve got redundancy in place now. It’ll be fine… right? Fingers crossed.
It happened again.
With great automation comes great… headache. 🤦🏻
While I was at the conference, the power went out at home. Not unusual, it happens. But this time, the outage lasted longer than the UPS could handle. The Proxmox boxes went down, and when the electricity came back, they didn’t recover on their own. They stayed down.
That single event revealed just how many single points of failure I actually had:
My wife called. No WiFi, no water. She tried to help but:
After returning from the office, she improvised: she ran an electrical cable down to the basement and bypassed the smart home system entirely, powering the water pump directly. Creative, effective, but not something anyone should have to do.
Looking at this honestly, the failures weren’t random. They were predictable:
| Root Cause | Impact |
|---|---|
| Relay switches without physical buttons | No manual override when HA is down |
| Zigbee coordinator not on UPS | HA can’t control smart devices during outages |
| Single effective DNS (Proxmox AdGuard) | Synology backup was unreliable under load |
| Existing UPS runtime too short | Nodes went down before power returned, and didn’t recover on their own |
| APs not on UPS | WiFi died immediately with power |
| Nothing labeled | Wife couldn’t identify or troubleshoot devices |
| No fallback WiFi SSID | ISP router WiFi wasn’t configured as backup |
I didn’t fix everything at once. After returning from KubeCon, I worked through these improvements over the following weeks, picking off the most impactful ones first.
Replaced the old relays with Nous D2Z (20A, Zigbee) switches. These have actual physical buttons, so even when Home Assistant is down, someone can go down to the basement and press a button to turn things on. Smart home automation should enhance manual control, not replace it entirely.
Every device in the tech room and around the house now has a label: UPS, mini PC, router, switch, NAS. If I’m not home, anyone should be able to identify what’s what.
The Sonoff Dongle Max (Zigbee coordinator) was plugged into a regular power strip. With the bigger UPS now keeping Proxmox and Home Assistant alive during outages, the Zigbee coordinator needs to stay alive too, otherwise HA has no way to talk to the smart devices. Moved it to the UPS.
The existing UPS simply didn’t have enough runtime. The outage outlasted it, the nodes went down hard, and they didn’t recover. Added a larger UPS (similar to the Anker SOLIX F2000) in front of the existing UPS to extend the total runtime. This serves two purposes:
I already had two AdGuard Home instances for redundancy: one on a Proxmox VM, one on the Synology NAS. The problem was that the Synology instance kept crashing under I/O load, making it unreliable as a backup.
Replaced the Synology instance with a dedicated Raspberry Pi running AdGuard Home. It’s lightweight, stable, and independent of both Proxmox and the NAS. True DNS redundancy now. Hopefully.
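One way to check that both resolvers really answer on their own is to query each one directly instead of trusting whatever the client happens to pick. The sketch below is only an illustration, not the setup from this post: the IP addresses are placeholders, `build_query` and `resolver_answers` are made-up helper names, and it uses nothing beyond Python's standard library to send a plain A-record query over UDP.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID, used to match the reply to our query


def build_query(hostname: str, query_id: int = QUERY_ID) -> bytes:
    """Build a minimal DNS query packet asking for an A record."""
    # Header: ID, flags (recursion desired), 1 question, no other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.strip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def resolver_answers(server_ip: str, hostname: str = "example.com",
                     timeout: float = 2.0) -> bool:
    """Return True if the resolver replies with at least one answer."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_query(hostname), (server_ip, 53))
            reply, _ = sock.recvfrom(512)
    except OSError:  # covers timeouts and unreachable hosts
        return False
    if len(reply) < 12:  # shorter than a DNS header, not a valid reply
        return False
    reply_id, flags, _qdcount, ancount = struct.unpack("!HHHH", reply[:8])
    rcode = flags & 0x000F  # 0 means "no error"
    return reply_id == QUERY_ID and rcode == 0 and ancount > 0


if __name__ == "__main__":
    # Placeholder addresses (assumptions): replace with the real resolver IPs.
    resolvers = {
        "adguard-proxmox": "192.168.1.10",
        "adguard-pi": "192.168.1.11",
    }
    for name, ip in resolvers.items():
        status = "OK" if resolver_answers(ip) else "NOT ANSWERING"
        print(f"{name} ({ip}): {status}")
```

Run from cron on a machine that depends on neither Proxmox nor the NAS, a check like this would have flagged the flaky Synology instance long before the outage did.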
Configured the ISP router’s built-in WiFi with a known SSID and password. The WiFi access points aren’t on the UPS (and probably won’t be, to save UPS runtime for critical gear), so when internal infrastructure goes down, there’s no WiFi through the APs. But since the ISP router is on the UPS, this fallback SSID works even during a power outage. It bypasses all internal DNS and routing, but at least you get internet access.
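The decision to leave the APs off the UPS comes down to simple arithmetic: runtime is roughly usable battery capacity divided by the load. Here is a rough sketch of that estimate. Every number in it is a made-up placeholder, not a measurement from my setup, and the 85% efficiency factor is an assumption; real inverter losses vary with load.

```python
def runtime_hours(capacity_wh: float, load_w: float,
                  efficiency: float = 0.85) -> float:
    """Estimate battery runtime in hours for a constant load.

    capacity_wh: rated battery capacity in watt-hours
    load_w:      total power draw in watts
    efficiency:  assumed fraction of rated capacity that reaches the load
    """
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return capacity_wh * efficiency / load_w


# Hypothetical figures for illustration only.
CAPACITY_WH = 2000.0
critical_load_w = {
    "proxmox_nodes": 90.0,
    "isp_router": 10.0,
    "zigbee_coordinator": 2.0,
}
access_points_w = 30.0

if __name__ == "__main__":
    base = sum(critical_load_w.values())
    without_aps = runtime_hours(CAPACITY_WH, base)
    with_aps = runtime_hours(CAPACITY_WH, base + access_points_w)
    print(f"Critical gear only: {without_aps:.1f} h")
    print(f"Including APs:      {with_aps:.1f} h")
```

Plugging in real wattages (a cheap inline power meter is enough) shows how many hours each extra device costs, which makes the "what goes on the UPS" call much less of a guess.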
Proxmox auto-recovery. The bigger UPS buys more time, but if the nodes do go down again, they still won’t come back up on their own. I still have to figure out why and fix it.