Transmission

XMission's Company Journal

XMission Outage 11/11

Hello Everyone,

XMission extends a heartfelt apology to our customers for any inconvenience caused by yesterday’s outage. We are truly sorry, and appreciate your patience.

This is a courtesy post for anyone who is not already on the XMission Announcement email list.

XMission Outage

XMission experienced a serious outage while we were performing some standard UPS maintenance today. The outage affected all services and started at approximately 2:00 p.m. on Tuesday, November 11th. Network services for many were partially restored by about 2:30 p.m. but some other services required a lot of attention and took much longer.

Details

About 40% of our data center, including our server room, suffered a power outage when a technician flipped a mislabeled breaker during some standard maintenance on one of our 3 UPS units. Although the power outage was momentary, servers and routers often respond very poorly to losing power and sometimes take extensive work to come back up. Unfortunately, such was the case today with many systems.

Seriously Affected Systems

  • An important router, which some connections and servers rely on, required extensive attention from our network administrators.
  • DNS (Domain Name Service) was sporadic for some customers for over an hour.
  • Email services were down for over 5 hours.
  • Web hosting suffered the longest outage because our NetApp storage appliance, which houses all customer files and web sites, lost multiple hard drives. As a result, we are currently restoring files to our new NetApp 2020 from our November 9th backup, which will take many more hours to complete. We recently purchased this new NetApp and were merely days away from getting it online.

Conclusion

Today’s outage was exacerbated by multiple systems responding poorly to losing power. In spite of the holiday, our systems administrators were on site within minutes and continue to work tirelessly to restore all services. In the end, we should have performed this maintenance on a day when our systems administrators were on site because problems can arise no matter how carefully you proceed.


9 thoughts on “XMission Outage 11/11”

  • I was surprised at the bad service provided by XMission in this outage. XMission is usually excellent. When I called XMission to find out what was going on, I was greeted by an automated system that directed me to leave a message for the technician. But the mailbox wouldn’t accept any more messages.

    XMission’s blog was itself offline. XMission’s chat was offline. There was no way to receive news from XMission.

    Our DNS records, hosted with XMission, disappeared from the internet, making our sites apparently unavailable to the world. I was surprised that XMission didn’t have DNS replicated anywhere else in the globe.

    My suggestions:

    1) Create means of communicating with customers. Means that are outside of XMission’s systems, in case a catastrophic event takes them down.

    2) Replicate DNS somewhere else in the globe.

    Having said that, it was the first time in 10 years that I’ve had any problems with XMission.

  • lenka says:

    My connection was gone for several hours yesterday and it’s still not working well today. It falls in and out a lot. When is it going to be fixed?

  • John W. says:

    Lenka,

    Some clients need to power cycle their modem (DSL/UTOPIA). In some instances where you have a wireless access point for your network, you may have to power cycle that as well.

    It appears that your power cycle this morning corrected your connectivity issue.

  • John W. says:

    Roberto,

    Please accept our sincerest apology for any inconvenience you experienced. I am happy to address your concerns.

    Phones – Our phone server was among the many machines affected by the outage, and experiencing the outage on a Holiday only compounded the problem. All primary servers were also unavailable for some time while we corrected the issues the power loss created. This affected all client web hosting, http://www.xmission.com, our blog, and many other servers.

    Off-site notification – XMission is working on a solution to this concern.

    DNS – Our follow-up announcement will be posted later today. It will address this as well as many other concerns. XMission DNS is replicated in another state, as well as locally.

    The good news – While this has been a difficult situation for everyone, we have learned from it and have identified additional improvements which will be implemented in the near future. You can continue to expect exceptional service and support from XMission.

    Please feel free to contact me directly with any further questions you might have. I am happy to work with you to address your concerns. My email is john@xmission.com.

  • Kory Hoopes says:

    Even though I did experience connectivity issues for a bit yesterday, I am still surprised by the transparency of Xmission. I have never had an ISP tell me why an outage happened and then apologize for it. While better communication will help calm fears in the future, I still laud Xmission as the best ISP I have ever had.

  • Mike says:

    John W. and Xmission gang:

    It’s really unfortunate that it was human error (mislabeled breaker) that caused the outage, especially on a day that the staff was supposed to have the day off. (Looking at comment #1, I’m sure this is why he couldn’t get an answer… I remember seeing the Vet’s day announcement.)

    But look at it this way: If you hadn’t shut the breaker off to do your maintenance yesterday, you might never have known it was mislabeled until perhaps a more critical time frame.

    I’m on DSL, and I obviously noticed the outage. I work from home and telecommute, so yes, it caused a disturbance in my work flow, but I found other things to work on without a connection. My e-mail was still a little funky today, so I simply used my company’s mail server, since I am usually sending company-related stuff anyway.

    Bad days happen, and with the recent announcements, you all obviously have had extensive discussions about the lessons learned. Good job! The service is already better today than it was two days ago.

    P.S. Be sure to re-label that breaker!

  • uxp says:

    I’m echoing Kory Hoopes, #5, on the transparency and honesty of XMission. This was a very bad outage and I understand the difficulty and stress involved with returning the system back to normal, yet Customer Service, as well as XMission Staff as a whole is honest enough to sit down at the end of the day and admit fault.

    There are numerous service providers that would shrug their customers off, fire the tech, and call it a day. Thank you for being sincere about the issue.

  • Sean Kirkby says:

    I, too, was surprised by the lack of support yesterday. The very first thing you should do in such a case, especially on a holiday when you are “closed”, is modify your phone greeting to indicate that there is a major outage, and that it is affecting everyone, and that there will be no way, nor need, to leave a message for support.

    It’s just communication… a very simple way to ease tensions. As it was, I (and I’m sure many others) simply felt like XMission didn’t care. I thought it was a problem with my router. I had no idea that it was a major problem. And for the brief time I was disconnected from the network, I was quite ticked off. A quick 15 second phone greeting would have assuaged that.

    That having been said, given what I now know about what happened, I am impressed that my service was restored so quickly. I realize that this isn’t everyone’s experience… but it was mine, so…

    Lastly, I should just say that I am VERY surprised to hear that colo customers lost power. O_O ??? I am not a colo customer of XMission (we are using a different company). But that report caused me to IMMEDIATELY check with our provider to see if something like this could happen to us. I would have expected colo customers to AUTOMATICALLY have redundant power. I admit that I don’t fully understand the power grid design etc., but I certainly wouldn’t have expected THAT to happen.

  • tiassa says:

    I second the previous commenter in thanking you for your honesty and explanations. I understand that things like this WILL inevitably happen, but the fact that I get the real story, rather than some corporate dumbspeak that makes any apology seem entirely trite and insincere, makes me very very happy with xmission.

    When my email went down yesterday and I saw all the red on Nagios I figured something bad had happened! My sympathies to your phone staff, from someone who has been there before.