XMission Outage 11/11/08 – Updates and Moving Ahead
Please note: As of January 1, 2017, XMission no longer sells DSL services.
We at XMission wanted to update our customers with more details and offer a proper apology now that the dust is clearing after yesterday’s outage, on Tuesday, November 11th. This was a big one and all customers experienced problems to one degree or another.
This announcement is very long, but we wanted to address the questions and concerns that have arisen and restore customer confidence.
Synopsis
While the power only went out for a moment, many systems were adversely affected by the outage and took extensive attention and time to recover. In the case of our primary storage device, we could not bring it back online even after hours of trying, so we restored files to the new NetApp we already had on site. Many systems tie into this device, which exacerbated the problem.
We are happy to report that, now that we have the new NetApp storage appliance online, we will soon be increasing the base quota for customers from 100 MB to 5 GB at no additional charge. Our web hosting customers will also see significantly increased quota in the very near future.
As of the opening of business this morning, file backups had completed and almost everything was in working order. We continue to find and address remaining issues, though. Some customers continue to experience delays sending and receiving email, but the queues are clearing.
Additional Technical Details
We didn’t have all the answers last night so here are further details regarding the outage:
- Our primary storage device (a NetApp F810 we were in the process of replacing this week with a new NetApp FAS2020) suffered the loss of two drives on one of its volumes, causing us to lose the data on the device entirely.
- We were waiting on a SnapMirror license from NetApp to copy the data over, but at least we had the new hardware on site and ready.
- Since many systems mount /home from this device over NFS, other servers also required attention to get back up and running properly.
- Web hosting was down into the night until customer files were restored to the new hardware. This was completed by morning.
- Our new NetApp FAS2020 has additional recovery options not available on the older F810.
- We plan to purchase another NetApp FAS2020 in the near future to host offsite. While we already have offsite backups, the FAS2020 is an upgrade to that system.
- While email services were down for about 5 hours for most customers yesterday, no mail should have been lost, although some customers continue to see delays sending and receiving email.
- DSL and UTOPIA customers were offline for up to an hour because our RADIUS server did not recover on its own. Some customers also needed to power cycle their modem before they could reconnect. As a rule of thumb, we highly recommend customers power cycle their gear when troubleshooting.
- Although it was a holiday, our systems administrators were on site within minutes and many worked through the night, some up to 18 hours without a break.
- We are sorry about the problems with our phone system. It was initially offline due to the outage, then we maxed out its connections, and we could not answer all of the calls because we had only a skeleton staff of phone technicians due to the holiday. We are making some changes to the existing phone system but are also in the process of replacing it by the end of this year. Some customers expressed concerns that our status messages were not very helpful. Unfortunately, we often did not know when systems would come back up, and we also needed to keep the message short due to heavy call volume.
- To clarify, the outage was due to human error while performing maintenance on one of our three UPSes, not a failure of any of our equipment. A breaker was mislabeled, which led to the mistake.
- DNS (Domain Name Service) was sporadic for up to an hour. This was mostly due to a Cisco 6509 that continued to have issues in the beginning, but we have since moved our two onsite authoritative nameservers (ns.xmission.com and ns1.xmission.com), as well as most other servers, to a new redundant connection. We should note that we do have a tertiary nameserver in California (ns2.xmission.com). If you list ns2.xmission.com as a tertiary nameserver for your domain, your domain will continue to have working nameservice even if both onsite nameservers are offline (a small script for checking which nameservers are published for your domain appears after this list).
- QMOE customers suffered a prolonged outage due to the same Cisco 6509 that caused problems with our name servers. They have since been moved to a different router with greater redundancy. We will send our QMOE customers a separate email with further details by tomorrow.
- About 25% of our colocation customers suffered a brief power outage since we have customers spread across three separate UPSes. Otherwise, aside from networking being briefly down after the initial outage, colocation services were not widely affected. In case some colocation customers are not aware, you can purchase power strips fed from different UPSes to have redundant power. If you suffered equipment loss from the power outage or would like details about redundant power, please contact your sales rep.
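For customers who would like to check which nameservers are currently published for their domain, here is a minimal sketch in Python. This is not an XMission tool: it assumes the third-party dnspython package is installed, and example.com is only a placeholder for your own domain.

    # Minimal sketch: look up the NS records published for a domain and warn
    # if the offsite tertiary nameserver (ns2.xmission.com) is not among them.
    # Assumes the third-party "dnspython" package is installed.
    import dns.resolver

    DOMAIN = "example.com"  # placeholder: replace with your own domain

    answer = dns.resolver.resolve(DOMAIN, "NS")
    nameservers = sorted(rr.target.to_text().lower() for rr in answer)

    print(f"Nameservers for {DOMAIN}:")
    for ns in nameservers:
        print(f"  {ns}")

    if "ns2.xmission.com." not in nameservers:
        print("ns2.xmission.com is not listed; adding it keeps nameservice "
              "working even if both onsite nameservers are unreachable.")

This simply asks your normal resolver for the zone's NS records; it is meant as a quick sanity check that the tertiary nameserver is in the list, not as a full delegation audit.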
Resolutions and Moving Ahead
We are very sorry for all of the problems that this outage has caused our customers, and we greatly appreciate all of the kind words and support you have given us. More than anything, we want to assure you that we are taking this matter seriously and proceeding with steps to lessen the chances that something like this happens again. These steps include:
- We will make more dutiful use of our existing NetStatus page to keep customers informed about our systems: http://stats.xmission.com/netstatus (a small example of watching this page for changes appears after this list).
- We will also announce all upcoming maintenance on the NetStatus page and email those who opt in to a list, which will soon be created for this purpose. To be added to the list, please email: support@xmission.com.
- We recognize that we need to handle communication much better in the future. We did set up an outage page with updates but realize that most customers did not know such a page existed: http://stats.xmission.com/outage/
- We also have our Nagios systems status page, which provides a very good look into our systems: http://stats.xmission.com/nagios/
- We plan to run redundant power from a second UPS up to our server room to feed essential hardware with dual power supplies. That alone would have dramatically minimized the effects of yesterday’s outage.
- While we already perform most maintenance outside of business hours, we have decided to enforce a policy that all critical systems maintenance (i.e., anything involving power, routers, or core systems) must happen outside of business hours. Some additional training is also planned regarding our electrical infrastructure.
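As a side note for customers who would rather not refresh the NetStatus page by hand, below is a small, hypothetical sketch (not an official XMission tool) that polls the page and prints a notice whenever its content changes. The five-minute interval and the idea of hashing the raw page are assumptions made purely for illustration; it uses only the Python standard library.

    # Hypothetical sketch: poll the NetStatus page and report when it changes.
    # Uses only the Python standard library; not an official XMission tool.
    import hashlib
    import time
    import urllib.request

    URL = "http://stats.xmission.com/netstatus"
    POLL_SECONDS = 300  # assumed interval: check every five minutes

    def page_fingerprint(url: str) -> str:
        """Fetch the page and return a hash of its body."""
        with urllib.request.urlopen(url, timeout=30) as response:
            return hashlib.sha256(response.read()).hexdigest()

    last = page_fingerprint(URL)
    while True:
        time.sleep(POLL_SECONDS)
        try:
            current = page_fingerprint(URL)
        except OSError as exc:  # network hiccup: report it and keep polling
            print(f"Could not reach {URL}: {exc}")
            continue
        if current != last:
            print(f"{URL} has changed; check it for new status information.")
            last = current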
Comments
Whatever happened yesterday is entirely excusable because you all do such good community action at XMission. Things happen. You have never had such a breakdown in our years of valuable association with you. Thanks for all you folks do. And I appreciate the explanation, although my limited techno-speak kept me from fully understanding. That is why you do the job you do so well: you get it, we use it, we all benefit. Get some sleep. Tomorrow is a new day.
You guys are the best! Thank you for your complete accounting of the causes and results of your outage. We missed our email but had no real loss from this accident and we hope your other loyal customers can say the same.
We wish all our service providers were as committed to transparency and accountability as you have shown yourselves to be.
Best wishes for easy upgrades in the future!
I love that you guys are always upfront with this kind of stuff. Technology fails, people make mistakes, we all know that, but a lot of places are loath to admit it. XMission’s honesty and clarity show that they really respect their customers. I value that a lot, and it makes me happy to do business with you!
Hey there! It was an earthquake that affected your power. I went to get my mail in our condo and all the mailboxes started to rattle, then the earth shook gently. Sho’ nuf, when I came back upstairs the internet was down. Here’s the link; see if it matches.
http://www.quake.utah.edu/req2webdir/recenteqs/Quakes/uu11110813.html
I realize that the technical detail you provided is above and beyond what your marketing department may think your customers can handle. For my part, I am very glad that you gave a full accounting. As a network professional myself, I appreciate your candor and all the technical details you can provide. It may seem counterintuitive, but all the technical detail demonstrates how dedicated you are to the people who are your customers.
Keep up the excellent work!
While all of us in IT do our best to anticipate and minimize serious technical problems, the real test of us is how we respond to those emergencies that circumvent our best preventative measures.
I’ve had my own 18-hour days and can appreciate the dedication and effort that went into your emergency response. As far as I’m concerned, you get a 5 out of 5 for both effort and results.
Harry Heightman
Technical Support Systems