The Datacentre is currently working on its network, which unfortunately means some of our servers are unreachable for the moment.
We have been told that they are working on the network upgrades as fast as they can and hope there will be very little downtime.
We apologise for any inconvenience caused.
11.46: The technicians are moving some of the VMs to another cluster. Once the root drives have been moved across, they can start booting up the servers again.
14.40: The Datacentre has just got back to us with a reply, and unfortunately they can't give us a definite ETA:
Hi Chae, We're looking through logs to see if we can find an ETA. I just spoke with my coworker and want to assure you that this is a top priority. ROOT is still migrating over and we're doing what we can to get your servers up and running again. I will update the ticket as soon as we have looked through the logs.
17.00: All the datacentre technicians can tell us is that they are still waiting for data to migrate.
18.00: This is now looking more and more like a hardware issue. I'm concerned about the integrity of our customers' data. The datacentre has replied with:
Your data is not lost. Chae, the only thing I can tell you right now is that I'm sorry this happened. I'm still monitoring what's going on, but I think we might have found a huge bug that had never manifested itself before today. I wish this hadn't happened; trust me, the last thing we wanted was for something like this to occur. We'll continue to monitor the VMs, though, and will let you know once everything is back up.
21.00: Customers who already have a ticket open regarding the outage: please refrain from opening new tickets. This only chokes up the Helpdesk and makes it more difficult to track who we have corresponded with and who we haven't.
06.10: Websites are starting to resolve again. If you have any issues with your sites/emails, please let us know via the Helpdesk. The Design Centre apologises for any inconvenience caused, but this was totally outwith our control. The Datacentre have been working throughout the night to resolve the problem, and as yet there is no firm explanation as to why a simple maintenance event could pull down servers. Once I get some answers I'll update this bulletin.
06.37: All customers who corresponded with us regarding the outage should have had all their tickets and emails answered; if not, again, our apologies. Any emails or tickets sent through today will be answered as usual. All tickets regarding the outage have now been closed.
07.05: From the datacentre...
We are working on a post mortem now. Please understand that this situation is not acceptable practice for us. We and you depend on the system to work, and we know how to handle unforeseen cases like this. But we failed to escalate this in a timely fashion. Once we have completed an internal review, someone will update you regarding what happened and the process changes we have made to make sure we never fail in this way again.
The datacentre has now finished its internal review, and below is a copy of its findings and resolution:
Well, the upgrade didn't go as easily as it should have, and we sincerely apologize for that.
My understanding of what happened is that the system tried to upgrade you in the current cluster but was unable to do so due to a lack of resources there. It then tried to move you to a different cluster that had the resources needed, but only a partial move was made, and the trouble began. We had never experienced this before, so it was quite unexpected.
We considered trying to move back the part that had already moved, but it was no more work to simply complete the process. At no point was your actual data in jeopardy, regardless of the confusing ticket updates. We had to manually move the remaining part over to the new cluster and make some changes to the system, and then everything was back to normal.
Finally, we have met with the entire team to establish what actually happened, and we have reinforced our process to ensure this doesn't happen again. More specifically, we are making sure that when a customer is down for even a short period of time, we escalate the issue more quickly.
Again, we are sorry for this experience and we will be treating your account with extra care going forward.
Several points have been brought up by our customers, such as why we didn't email everyone. The simple answer is that at least 90% of customers register with and use domain-specific email addresses, so with this outage/downtime, even if we had emailed, 90% of you wouldn't have received the message.
Why couldn't The Design Centre be contacted by telephone? If you go through the solutions section of the Helpdesk, you'll see that we moved away from phone support to an online third-party Helpdesk a long time ago. This meant we wouldn't spend all our time on the phone in cases like this, and our time could be better spent resolving the issue. When we asked major customers/resellers what they would like to see, the clear winner was an online Helpdesk. At first we ran our own helpdesk within our network, but we then chose an independent third-party system so customers could still reach us in the event of a total meltdown. The Helpdesk also gives us an avenue for posting network/maintenance announcements and updates on ongoing problems, as we did on Tuesday.
Again, we are finding that no matter how much redundancy is put in place, things can still go wrong, and even the most up-to-date datacentres can get caught out.