Transmission

XMission's Company Journal

Keeping the Lights on at XMission

We thought some might appreciate a recent behind-the-scenes story about how XMission maintained uninterrupted service when utility water and then power were lost. Typically, infrastructure is something people only think about when it isn’t working. We take for granted the complex engineering and maintenance required to provide our modern world with real-time access to things like water, power, and Internet service, and we can quickly get frustrated if we lose any of those services since we rely on them so heavily.

The weekend started Friday night with some scheduled maintenance to take an older UPS offline so we could replace it with a new flywheel UPS. The maintenance went well, with our vendors handling everything flawlessly. I was back at 8 a.m. on Saturday to oversee the riggers moving the old UPS out and the new one into place. The riggers struggled to properly mount and secure the new flywheel UPS, but I worked out a solution with them, so I was around until mid-afternoon. Our data center manager, Mike, arrived mid-morning and stayed with our electricians while they wired up the new flywheel UPS. Everything went perfectly, but when our electrician and manager were leaving that evening they noticed water shooting through a crack in the road on the street just west of our offices. They promptly called the Salt Lake water department, who came out and eventually shut off the water main around 10 p.m. without first notifying us. Unfortunately, once they shut off the water main, our data center cooling towers stopped getting water. The towers hold some water, but not enough to last for more than an hour or so, depending on conditions.

When we upgraded our data center HVAC (heating, ventilation, and air conditioning) infrastructure 3 years ago, we replaced inefficient heat exchangers with very efficient water chillers, which are basically huge swamp coolers. The upgrades have brought about dramatic efficiency improvements and subsequent power savings of over 30%, but we now require a constant water supply to keep things inside sufficiently cool. Fortunately, we’re prepared for this contingency with a backup reservoir of approximately 1500 gallons of water in our basement, with pumps and piping set up to automatically supply water to the towers when the system detects utility water is unavailable. When we put this backup system online over a year ago we tested it briefly and were confident it would work when needed. On Saturday night it kicked on as expected. Unfortunately, within an hour we started to have issues with the filters for the open water loop as they began clogging up with silt from the heavily mineralized water. When the filters get clogged the system automatically starts to clean them by backflushing, which requires vast amounts of water.
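For readers who like numbers, here is a rough sketch of why backflushing matters so much for a fixed reservoir. Only the 1500 gallon figure comes from the story above; the function name and the flow rates in the usage lines are made-up placeholders for illustration, not measurements from our system.

```python
def reservoir_runtime_hours(volume_gal: float,
                            tower_draw_gpm: float,
                            backflush_draw_gpm: float = 0.0) -> float:
    """Estimate how long a fixed reservoir lasts at a constant total draw.

    volume_gal         -- usable water in the reservoir, in gallons
    tower_draw_gpm     -- water the towers consume, in gallons per minute
    backflush_draw_gpm -- extra water spent backflushing filters (hypothetical)
    """
    total_gpm = tower_draw_gpm + backflush_draw_gpm
    if total_gpm <= 0:
        raise ValueError("total draw must be positive")
    minutes = volume_gal / total_gpm
    return minutes / 60.0


# Placeholder rates: 5 gpm for the towers, 20 gpm while backflushing.
normal = reservoir_runtime_hours(1500, tower_draw_gpm=5)
flushing = reservoir_runtime_hours(1500, tower_draw_gpm=5, backflush_draw_gpm=20)
print(f"towers only: {normal:.1f} h, with backflushing: {flushing:.1f} h")
```

Whatever the real rates are, the point is the same: any extra draw from backflushing comes straight out of the hours the reservoir can cover.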

Happily, we had another backup solution available, which we quickly moved into place as soon as the water reservoir started to take a nose dive. Soon after we first deployed the cooling towers, I had found a hose bib in the building behind us to the north. After confirming with city water main plans that the State Street water main fed that building, I contacted the owner and received approval to use it in an emergency, should the 4th South main get shut off. Once we constructed the backup reservoir I assumed we wouldn’t need the hose bib in the alley, but nonetheless kept the hoses available just in case. Within a few minutes we had water refilling the towers while our HVAC technician opened up and cleaned out the filters manually to ensure the water loop would continue flowing between the roof towers and the air handlers in the server and colocation rooms of the facility.

While the cooling towers struggled, the HVAC infrastructure wasn’t effectively removing heat from the computer rooms, so we posted staff at the doors and placed large fans in the doorways and strategically around the aisles to bring in the cold winter air. It was a last-resort option, but it nicely chilled things down until we had the water loop and towers fully flushed out and 100% functional again. Once we had things running normally we chose to wait for the city water department to fix the broken water main. Unfortunately, the huge hole they dug into the street where the water surfaced did not uncover the source, so they had to call Blue Stakes back to mark a section uphill from there, which took additional hours. By 2 a.m. I asked Mike, our data center manager, to go home so he could get back on site the next morning to oversee the new flywheel UPS setup.

Assuming the city was close to repairing the water main and wanting to ensure the HVAC system continued to run properly, our technician and I stayed on site and found ourselves talking about Damascus swords and exploding boilers until 4 a.m., when we again talked with the one remaining water department employee, who told us they wouldn’t be able to complete the job until morning. We decided to take off at that point, and I asked our on-site graveyard support technician to keep a close eye on the hose and automated notifications and to call me immediately if any issues manifested.

On Sunday morning, Mike met with our flywheel technicians, who continued their work to set up the UPS and perform their diagnostics checklist. The water department fixed the broken water main and restored service. Briefly. As last weekend’s luck would have it, the bursts of high water pressure caused the old water main to burst up the street, and they shut off the water supply again. At first, we were safely on the backup hose supply, but just after 11 a.m. a staffer tasked with keeping a close eye on the hose noticed that it wasn’t running, so I was called to come back in. I live nearby and immediately started troubleshooting why this redundant water supply had mysteriously stopped. I quickly confirmed that the entire building, rather than solely the business kind enough to let us use their water, was without water pressure. Originally, the water department worker we had asked about this loss of backup water pressure said no further valves had been shut off. Later that afternoon we learned that another valve had been temporarily shut down.

While we were looking into alternative water sources, the backup feed magically came back online. Soon after we felt relief from that situation, a new catastrophic event took place: power to the data center went offline. It is hard to explain the diminished surprise we felt at that moment after everything that had already happened. In part, it was because we felt confident our backup electrical infrastructure would handle the event without issue, which it did. The UPS units provided temporary power to all computing equipment in the facility, and the automatic transfer switch (ATS) sensed the loss of utility power, turned the generators on, and then transferred load over to them seconds later when they were ready. Well-designed and maintained infrastructure saved the day.
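The hand-off described above can be sketched as a tiny state machine. This is a toy model under simplifying assumptions, not the control logic of our actual ATS: the class, method, and field names are hypothetical, and real switches add time delays and retransfer rules that are left out here.

```python
from enum import Enum


class Source(Enum):
    UTILITY = "utility"
    GENERATOR = "generator"


class TransferSwitch:
    """Toy automatic transfer switch (ATS); every name here is hypothetical."""

    def __init__(self) -> None:
        self.source = Source.UTILITY       # feed the load is connected to
        self.generator_running = False

    def step(self, utility_ok: bool, generator_ready: bool) -> Source:
        """Run one control cycle and return the feed the load is connected to."""
        if utility_ok:
            # Utility is healthy: stay on (or return to) it and stop the generators.
            self.source = Source.UTILITY
            self.generator_running = False
        else:
            # Utility lost: start the generators, but transfer only once they are ready.
            self.generator_running = True
            if generator_ready:
                self.source = Source.GENERATOR
        return self.source

    def ups_bridging(self, utility_ok: bool) -> bool:
        """True while the load sits on a dead utility feed and the UPS must carry it."""
        return self.source is Source.UTILITY and not utility_ok


ats = TransferSwitch()
ats.step(utility_ok=False, generator_ready=False)   # outage begins, generators starting
print(ats.ups_bridging(utility_ok=False))            # UPS carries the load for now
ats.step(utility_ok=False, generator_ready=True)    # generators ready, load transfers
print(ats.source)
```

The gap between the two `step` calls is the few seconds the UPS has to cover on its own.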

Supposedly Blue Stakes had marked the utility locations on snow and ice, and the markings had disappeared by the time the backhoe started hammering into 4th South that morning. Perhaps it was also due to a very fatigued crew from the water department who’d been there since the evening before. Regardless, we were now on backup water and power, but no services had suffered an interruption of any sort, so we shared a fatigued laugh. Since the water department crew had cut open 12,000-volt wiring, they had to sit tight for the rest of the day while Rocky Mountain Power (RMP) took over and did their best to restore power to as many affected customers as possible. Once RMP’s crew arrived on site I coordinated things with the foreman, and the city was able to fix not just one but two more breaks in the water main before they could restore service.

To restore power to our data center, RMP had to run massive-gauge wiring under the street from one manhole to the next for a good stretch. This took hours, so I left things in the capable hands of our data center manager Mike, with the plan that I’d show up for work on Monday and he’d take the day off to rest. We called our electrical vendor back in (he’d been around for much of the last week preparing for and then helping with the UPS upgrades) since we needed to measure the rotation of the current, which can be either clockwise or counter-clockwise. Shockingly, due to a miscommunication, we barely had enough time to check the rotation inside our ATS before utility power was restored. A crucible of electrons and metal, the ATS moves a huge bus bar when switching between utility and generator power. It sounds like a giant’s gauntlet and frightens everyone nearby, so having your arm inside moments before such an event is ever so frightening, but it properly concluded the unwelcome events of the weekend.
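As an aside on what a rotation check establishes: assuming a three-phase feed, the order in which the three phases peak is either A-B-C or A-C-B, and the restored utility feed needs to match what the equipment expects, because three-phase motors spin backwards if the order is reversed. The sketch below classifies the order from three phase angles. The function names and the tolerance are hypothetical, and which order a given meter displays as "clockwise" depends on the meter, so the code reports only the sequence.

```python
def phase_sequence(angle_a: float, angle_b: float, angle_c: float,
                   tol: float = 15.0) -> str:
    """Return "ABC" or "ACB" given three phase angles in degrees.

    In an ABC sequence, B lags A by about 120 degrees and C lags A by
    about 240 degrees; in an ACB sequence the two lags are swapped.
    """
    lag_b = (angle_a - angle_b) % 360.0   # how far B trails A
    lag_c = (angle_a - angle_c) % 360.0   # how far C trails A
    if abs(lag_b - 120.0) <= tol and abs(lag_c - 240.0) <= tol:
        return "ABC"
    if abs(lag_b - 240.0) <= tol and abs(lag_c - 120.0) <= tol:
        return "ACB"
    raise ValueError("angles do not look like a balanced three-phase set")


def rotations_match(feed_one, feed_two) -> bool:
    """True when both feeds (tuples of three angles) have the same sequence."""
    return phase_sequence(*feed_one) == phase_sequence(*feed_two)


# Same sequence, different starting angle: safe to compare as equal.
print(rotations_match((0, -120, 120), (30, 270, 150)))
# Two phases swapped on the second feed: the sequence is reversed.
print(rotations_match((0, -120, 120), (0, 120, -120)))
```

In practice an electrician reads this off a phase rotation meter rather than computing it, but the comparison being made is the same.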

All said and done though, the flywheel UPS upgrade went flawlessly and we kept the lights and power on in the data center for XMission’s Internet services.



8 thoughts on “Keeping the Lights on at XMission”

  • Alison Brown says:

    Your diligence and preparedness help to avoid a disastrous situation. Way to go!

  • So impressive. Thank you.

  • Steve Biggs says:

    Note to Grant and Mike: When the Big One rocks the Wasatch Front I’ll be camping at xmission. I know you will have water and power, but can you cook?

  • Peter says:

    Wow!!! I’m impressed. Thank you for the play by play.

    Contrast that with CenturyLink who recently lost a large chunk of their network for two days due to a single faulty card they couldn’t find. This is why I prefer smaller local companies. I don’t want to be at the mercy/unpreparedness of the big companies who don’t care about their customers.

  • Grant Sperry says:

    We can’t cook but make satisfactory PB&J sandwiches!

  • Grant Sperry says:

    Thank you Peter. We want our customers to know we diligently work to keep everything online for them.

  • RICHARD PEARSON says:

    Wow – and we, as customers, never noticed a thing. Thanks to everyone for the dedication.

  • Grant Sperry says:

    Thank you Richard. Absolute transparency and uptime are our goals with the infrastructure.