Tuesday, Apr 20th, 2010 by Troy Kelly
Today was a bad day for Purple Oranges, technology-wise. We had a major outage of our VoIP infrastructure, and then some call difficulties.
This little adventure has three players, and none of the outages are in any way related to each other – it just so happens that they all occurred on the same day. The players are NetRegistry, where we currently host our VoIP servers; Faktortel, formerly a VoIP carrier for us and now just a legacy connection because of some old DIDs that we can’t port away from them; and NetSIP, our primary voice carrier.
With ten days’ notice, NetRegistry advised that both of our servers were being relocated at the same time. As soon as we were made aware, I called to discuss the relocation – and was told there was nothing to discuss. The server relocation had been planned “for months”, and the lack of notice was simply at the discretion of NetRegistry. Boiling down the conversation: I was basically told the server relocation had to happen, and NetRegistry were not prepared to work with us to ensure uptime for our clients.
When the relocation occurred and they were unable to shut down one of our servers, technical manager Jonathan Gleeson decided to simply pull power to the server instead of contacting us. It was at this point that things started to go very pear-shaped.
Because of this (very) ungraceful shutdown of the server, we have had to spend hours rescuing it this morning. NetRegistry have our emergency contact details – but not even an attempt was made to contact us. There was not even an email to advise us that the server had to be shut down in such a fashion.
Obviously once the server was revived, it wouldn’t restart properly – causing our first issue.
Once we were revived, we found out that Faktortel were not delivering calls via our DIDs to our gateway. We traced the call traffic, and saw no attempt to deliver the calls. A ticket was lodged with Faktortel – and our infrastructure was immediately blamed. Coincidentally, shortly after our ticket was acknowledged, the calls resumed. In the interest of full disclosure – Faktortel insist that they did nothing, and that when they checked our DIDs there was no problem. I only have my experience and timeline to work from.
With our infrastructure all back online, a few outbound calls failed at around lunchtime today. As soon as we were aware of the fault, I contacted NetSIP to find the issue had been detected and already resolved. I was apologised to, I was happy – and that was the end of it for me.
It wasn’t the end of it for NetSIP, however. Less than 10 minutes after my call, I got an email from NetSIP apologising again for the outage and explaining in detail what caused the issue. It also pointed out that we have access to a backup gateway that we should have been using, which would have negated the effect of the outage in the first place.
To briefly jump back to NetRegistry – after several phone calls and emails today, I was called “difficult” by their staff (in an internal email that I was accidentally CC’d on). Another staff member was wished “good luck” in having to deal with me. It took a lot of communicating with NetRegistry just to get a decent explanation of what precipitated the urgent relocation; I still have no explanation of why power to our server was simply pulled and no contact made with us.
To review:
NetRegistry caused what I consider a major outage for us, were obstinate and rude – and still have not explained their actions. I have, however, had an email conversation with Larry Bloch (CEO NetRegistry) which has calmed me down a little – but the whole experience is exceptionally disappointing.
Faktortel immediately blamed our equipment, and then somehow our DIDs started working without us doing anything. Their management has since followed up and suggested a different way of connecting that may resolve some of the issues we experienced today, but this is after the fact, and after a lot of my pointing out issues.
NetSIP have left everybody for dead. Not only have we had fault-free service from them since they became our primary carrier, but the first fault we experienced was dealt with before we even detected it. A decent explanation was given for the fault, along with what actions have been and are being taken to avoid it in the future. An immediate solution was also put in place to mitigate the risk of any future outage.
Perhaps I just expect too much of companies. Or maybe providers like NetSIP “spoil” me, by providing such excellence in customer care that other providers just aren’t equipped to live up to.
It leaves me wondering, when there are companies providing such good care – why are the other companies still around?