Etherion Tech

What a Power Outage Reveals About Your Server Room

The power went out mid-afternoon. Fourteen minutes later, all four servers were back online and the cluster was healthy. No data loss, no major incident.

But while digging through the recovery logs, I found three separate problems that had nothing to do with the outage itself - they were just sitting there, invisible, waiting for the next power event to be worse.

TL;DR: A power outage is a stress test your infrastructure didn't consent to. When everything recovers cleanly, it's easy to call it a win and move on. The more useful thing to do is treat it as an audit. Power events expose risks that stay hidden during normal operations.

What Is a Proxmox Cluster?

Proxmox is an open-source virtualization platform that runs virtual machines and containers on bare-metal servers. A cluster is a group of physical servers managed as a single unit, with shared configuration and the ability to move workloads between nodes. Clusters use a quorum system to coordinate decisions - they require a majority of nodes to agree before making changes, which prevents split-brain scenarios during partial failures.
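As a concrete check, Proxmox exposes quorum state through `pvecm status`. The parsing sketch below assumes that output format; the sample line is illustrative, not captured from a real cluster:

```shell
# Check whether the cluster is quorate based on pvecm output.
# Real usage on a node would be: pvecm status | check_quorate
check_quorate() {
  # pvecm status includes a line like "Quorate:          Yes"
  if grep -q 'Quorate:.*Yes'; then
    echo "cluster is quorate"
  else
    echo "cluster is NOT quorate - hold off on config changes"
  fi
}

# Illustrative sample of the relevant output:
printf 'Nodes:     4\nQuorate:   Yes\n' | check_quorate
```

If the cluster is not quorate, Proxmox refuses most configuration changes by design - that is the split-brain protection doing its job.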

What Actually Happened

Site power failed. All four nodes lost power simultaneously, which meant no node stayed up to maintain quorum. The cluster went down as a unit. When utility power was restored, the nodes came back up in sequence and rejoined the cluster within about ten minutes. By the fourteen-minute mark, quorum was fully restored and guests were auto-starting.
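One way to reconstruct that timeline after the fact is to pull quorum events out of the corosync journal. `journalctl -u corosync` is the real command; the filter and the sample log lines below are invented for illustration:

```shell
# Extract quorum-related events to rebuild the recovery timeline.
# Real usage: journalctl -u corosync --since "14:00" | quorum_events
quorum_events() {
  grep -Ei 'quorum|quorate'
}

# Invented sample lines in journal format:
printf '%s\n' \
  'Oct 02 14:31:02 node1 corosync[1200]: [QUORUM] Members[4]: 1 2 3 4' \
  'Oct 02 14:31:02 node1 corosync[1200]: [QUORUM] This node is within the primary component and will provide service.' \
  | quorum_events
```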

From a pure uptime perspective: successful recovery. But the logs told a more complicated story.

What the Logs Exposed

RAID Configurations That Look Safe But Aren't

Two of the four nodes were running VM storage on single-drive RAID-0 arrays - no redundancy at all. RAID-0 normally stripes data across multiple drives for performance; with a single drive it's effectively a bare disk behind the controller. Either way, one failed drive means total loss of everything on that volume. This was in a production environment where those volumes held running virtual machines.

One of those nodes also had its backup storage on another single-drive RAID-0. The backup target had the same failure profile as the data it was supposed to protect.

The drives on that node had accumulated a few reallocated sectors - a SMART indicator that the drive has moved data away from bad blocks. Not an immediate red flag, but a sign the drives are aging. On a RAID-1 mirror, a drive with reallocated sectors is a manageable situation - you replace the drive at a controlled time. On RAID-0, it means you're watching a countdown.
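A quick way to spot aging drives is to pull that attribute directly. `smartctl -A` is the real command; the awk filter and the sample attribute line below are a sketch in the format smartctl emits:

```shell
# Print the raw reallocated-sector count from smartctl attribute output.
# Real usage: smartctl -A /dev/sda | realloc_count
realloc_count() {
  awk '/Reallocated_Sector_Ct/ { print $NF }'
}

# Illustrative attribute line:
printf '  5 Reallocated_Sector_Ct   0x0033   100   100   010   Pre-fail  Always   -   12\n' | realloc_count
```

A nonzero count isn't an emergency on its own, but a count that grows between checks is a drive telling you its schedule.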

Write Cache Without Confirmed Battery Backup

Enterprise RAID controllers use a write cache to improve performance - they acknowledge a write to the OS immediately, then flush it to disk in the background. This is fast, but if power dies before the flush happens, that data is gone.

The protection against this is a Battery Backup Unit (BBU) - a small battery on the controller that keeps the cache alive long enough to complete the flush during a power loss. One of the nodes had drives configured with a write-back caching policy, but the BBU status hadn't been verified since the hardware was set up. A failed or degraded BBU with write-back enabled is worse than no cache at all, because the system behaves as if the writes are safe when they aren't.
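Verifying this takes one controller query. The exact tool depends on the hardware (`storcli /c0/bbu show` on Broadcom/LSI controllers, for example), and the status line format below is assumed for illustration:

```shell
# Decide whether write-back caching is safe based on BBU state.
# Real usage (Broadcom/LSI): storcli /c0/bbu show | bbu_ok
bbu_ok() {
  if grep -qi 'Battery State.*Optimal'; then
    echo "BBU healthy - write-back cache is protected"
  else
    echo "BBU degraded or missing - switch the cache policy to write-through"
  fi
}

# Assumed status line format:
printf 'Battery State: Optimal\n' | bbu_ok
```

Switching to write-through costs some write performance, but it means the controller only acknowledges data that has actually reached disk.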

A Remote Backup Target That Had Been Unreachable for Hours

The logs showed that two nodes had been logging errors against a remote backup destination - a Proxmox Backup Server at another site - for roughly ten hours before the outage. The storage definition was still in the configuration, but the target hadn't been reachable all day. Each failed poll waited for a full timeout before logging an error. Thousands of entries per day, per node.

Nobody knew because nobody was watching those logs. The backup jobs themselves were probably failing silently.
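This is the kind of thing a one-line check would have caught. `pvesm status` lists every storage definition with its current state; the filter below is a sketch against an illustrative sample of that output:

```shell
# List storage definitions whose status is not "active".
# Real usage on a node: pvesm status | inactive_storages
inactive_storages() {
  # Skip the header row; status is the third column.
  awk 'NR > 1 && $3 != "active" { print $1 }'
}

# Illustrative pvesm status output (the pbs-remote name is made up):
printf '%s\n' \
  'Name        Type  Status    Total  Used  Available  %' \
  'local       dir   active    100    40    60         40%' \
  'pbs-remote  pbs   inactive  0      0     0          0%' \
  | inactive_storages
```

Run on a schedule and alerted on, this turns ten hours of silent timeouts into one notification.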

Why This Happens

None of these were the result of negligence. They're the natural result of infrastructure that grows over time without a regular audit process. RAID configurations that made sense for a lab setup carry forward into production. A BBU that was healthy at setup doesn't get rechecked. A decommissioned backup target gets left in the configuration because removing it requires touching the cluster config on every node.

Normal operations don't surface any of this. The cluster performs, guests run, backups appear to complete. It takes an unplanned event to expose the gap between what you think your infrastructure does and what it actually does.

The Right Response to a Clean Recovery

When a power event results in a clean recovery, the instinct is relief. The better use of that moment is a structured post-incident review:

- Verify the actual RAID level and redundancy of every array, not the level you remember configuring.
- Pull SMART data on every drive and record reallocated sectors and other aging indicators.
- Confirm cache policy and BBU health on every controller with write-back enabled.
- Audit every storage definition in the cluster config and remove or fix unreachable targets.
- Check that backup jobs actually completed, not just that they were scheduled.

The goal is to convert the surprise audit that just happened into a documented baseline. Then schedule the next one before the next power event schedules it for you.

The Broader Principle

Redundancy that exists on paper is not the same as redundancy that works. A RAID-0 array with "backup" in the name is not a backup strategy. A write cache with an untested battery is a liability dressed up as a feature. These distinctions only matter when something fails - which is exactly when you don't have time to discover them.

The same principle applies across infrastructure: site-to-site VPN configurations that work day-to-day can hide routing assumptions that only break during a failover. Cloud migrations that close cleanly can leave stale hybrid components running for months. Regular audits, not just reactive fixes, are what keep infrastructure from surprising you.

If you want a second set of eyes on your server room before the next unplanned stress test, I can help.


About the author

Edward B. is an IT infrastructure consultant based in Tulsa, Oklahoma with 10+ years of experience in systems administration, identity and access management, and cloud migration. He holds CompTIA Security+, Network+, A+, ITIL v4, Azure Fundamentals, and Linux Essentials certifications.
