So we had this server.
As all servers are wont to do, this one had run successfully for a number of years. Everything worked perfectly until it didn’t.
It ran, to my knowledge, only Hyper-V Server on its system drive, and had a second set of drives for hosting the VM that ran Microsoft Deployment Toolkit to service our depot. Our depot was on its own physical network, sharing with production only an ISP demarc.
I had long since abandoned the depot and its trappings, thinking it someone else’s domain, thinking my time better spent on client systems, thinking that I didn’t need to know what happened in the oft-ignored part of our operation. I assumed that it was set up properly since it had been so stable for so many years. But you know the old saying:
When you make assumptions, you make an ass out of you and muptions.
The Problem.
Our monitoring system reported the two depot servers offline, both the hypervisor and its virtual. I sent our depot technician to take a look. They came back online, and he told me the server had just needed a reboot. Having divested myself of giving a damn about the depot, I barely found the energy to shrug.
Then it happened again. I again sent the technician and promptly got wrapped up in some client-facing issue. I forgot about the servers until:
They went offline a third time. I didn’t have to tell my depot tech; he was watching the same feed as I. He rummaged a bit and came back with a story of defeat and virtual disks not being found.
“The server won’t boot because the virtual disk can’t be found,” he said.
“Ok, so you mean the virtual won’t come up, but what about the physical?” I replied.
“No, that’s what I mean. It won’t get past BIOS. It’s complaining of a virtual drive not being found.”
“Sounds bogus, let’s look.”
He was not wrong; that is what the screen said. And what it meant was RAID failure. I slid the front panel off the server case and, sure enough, one of the drives had popped.
Oh, did I mention? No backups.
The Rabbit Hole.
Drives pop sometimes, ain’t no thing. We build systems to be resilient. You slap a fresh one in there and it starts re-silvering and you get on with your day. Not this time, gentle reader.
While digging through the RAID controller, I found, to my amazement, horror, and utter confusion, that whatever chucklefuck set up this server put the two system drives in a RAID 0. As I stared at the screen and at the blinking amber drive light, all that could pass my lips was a quiet “Oh my god, why?”
In this scenario, I didn’t see any way forward but through. So far, the bad drive had demonstrated that it would behave for about two hours, then throw a fit. I shut down the server and took some time to think about how to proceed. In that time, I rediscovered some of the things the virtual machine was serving.
Things like MDT, DNS, DHCP, and PXE boot, but most importantly: the lone DC for depot.local (MDT needs a domain). Oh, and it was the only machine set up to manage the hypervisor through the Hyper-V console and Server Manager.
GREAT.
Compounding the issue, the virtual was not stored on the separate set of RAID 1 disks in this server as I had assumed. It was stored on the system drive. Oh joy, oh rapture.
My new mission: Rescue that virtual.
The Struggle.
First things first. I assume I’ll only have one chance to rescue this data before this drive bites the dust for good. I plug in the VGA and keyboard. Take a deep breath.
I turn on the server.
It fails to boot into the operating system. “Come on, you little shit.” I take the drive out and put it back in. Success. We boot into the OS and I’m presented with a logon screen. Password.
The