Archive for March, 2006

Monday morning outage

March 27, 2006

This morning from approximately 9:30 a.m. to 11:00 a.m. EST, prwdot.org was inaccessible due to a server crash at Site5's facilities. The cause of the crash is unknown at this time; however, Site5 is looking into it.

Site5 Resolution

March 14, 2006

For now, I’m satisfied with Site5’s resolution. I believe that they will take my suggestions seriously, and hopefully in the future they will begin monitoring individual virtual host IP addresses. This should provide a much higher level of monitoring and error detection, which in turn should make Site5 a much better web host.

Also, I did receive a service credit for the outage. A whole $1.50. 🙂 Of course, seeing as I pay only $8.77 per month for the service, I think it’s fair.

Site5’s latest response

March 13, 2006

Here’s Site5’s response:

#####

Dear Peter,

You are very welcome. I thank you for your well thought out and detailed response and I would be more than happy to address your additional questions. First of all, I agree with nearly everything that you have said and all of your suggestions are very reasonable. With that being said, I would like to explain that (at this time) Site5 currently does not monitor individual customers’ IP addresses and instead monitors the server’s IPs via multiple scripts and third party solutions for maximum redundancy. I was not part of Site5’s team when this policy was developed; however, I assure you that I will personally speak with our COO (Todd Mitchell) directly about your suggestions for a viable IP monitoring system. I guarantee that steps will be taken to prevent this issue from happening again.

Unfortunately I cannot provide you with any more details at this time, but your highly valued feedback has been received and will be reviewed by Site5’s management team for further consideration. Thank you again for your continued patience and understanding with Site5 while we sort through these issues for you, continuously looking for ways to provide a higher level of service to our customers. If you have any additional questions, please feel free to simply reply to this email and I would be happy to discuss this situation further with you. Thank you and have a fantastic week!

Best regards,

Brendan Diaz
Customer Service / Retention Lead
Site5 Internet Solutions, Inc.
http://www.Site5.com

#####

It is troubling to me that Site5 does not monitor the IP addresses of individual virtual servers. Hopefully they will look into this issue and develop a better monitoring system that can take into account individual virtual server IP addresses, because it seems like a major oversight to me.
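
To make the idea concrete, here is a rough Python sketch of per-address monitoring: every virtual host IP is probed on a schedule, and an alert is raised only when an address changes state. I have no knowledge of Site5's actual tooling, so everything here is an assumption for illustration: the names `poll_once` and `monitor`, the `probe` callback, and the 15-minute interval (the figure I suggested in my email to them).

```python
import time


def poll_once(addresses, probe, previous):
    """Probe every address once and report state changes.

    `previous` maps address -> bool (True means reachable on the last poll).
    Addresses seen for the first time are assumed to have been reachable.
    Returns (current_state, went_down, recovered).
    """
    current = {addr: bool(probe(addr)) for addr in addresses}
    went_down = [a for a in addresses if previous.get(a, True) and not current[a]]
    recovered = [a for a in addresses if not previous.get(a, True) and current[a]]
    return current, went_down, recovered


def monitor(addresses, probe, interval_seconds=900, cycles=None,
            alert=print, sleep=time.sleep):
    """Poll all addresses every `interval_seconds` (900 s = 15 minutes).

    `probe` is any callable that returns True when an address answers.
    `cycles=None` means run until interrupted; a number limits the polls.
    `alert` and `sleep` are injectable so the loop can be tested offline.
    """
    state = {}
    done = 0
    while cycles is None or done < cycles:
        state, went_down, recovered = poll_once(addresses, probe, state)
        for addr in went_down:
            alert(f"DOWN: {addr}")
        for addr in recovered:
            alert(f"RECOVERED: {addr}")
        done += 1
        if cycles is None or done < cycles:
            sleep(interval_seconds)
    return state
```

The `probe` callable is deliberately left open. In practice it might simply try to open a TCP connection to port 80 on the address. The point of tracking state per address is that a single virtual host IP failing to bind shows up as its own alert, even while the server's main IP still looks healthy.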

My response to Site5

March 13, 2006

So, I did receive a response from Site5, but first I’ll post the text of my email to them:

#####

Hi Brendan,

I appreciate you taking the time to look into this. I do want to address a few items:

– Essentially, the on-duty systems administrator (David K.) reported that he needed to reboot the server to restart services after making some tweaks to lower the server’s load. When the server was rebooted there was a very odd error where the IPs didn’t bind to the server. He noticed this almost immediately and went to work repairing it. He restarted ipaliases, which is how you would normally fix this issue; however, only some IPs came online and it showed no errors whatsoever. Therefore, David believed the problem was resolved because there were no errors and most of the IP addresses did bind to the server without a problem.

This strikes me as a great learning opportunity. My experience has shown that such automatic scripts cannot always be trusted to work in a failsafe manner, and that they should each have at least one additional, separate layer of error checking. For this reason, my recommendation would be to include a standard suite of tests with system administration tasks such as this. For example, after running the ipaliases startup script, I would run a test of all of the bound IPs on the machine, preferably from another machine. This could be done via another script that could iterate through each address and attempt to establish a TCP/IP connection to a given port on the remote host. While it is true that this second automated script could also fail to detect an error, it would certainly be a lot more effective than running no error checks at all.

3) Was this issue detected by Site5’s monitoring systems?

– Yes, David K. detected and reported this issue almost immediately.

It sounds like *one* issue was detected by Site5’s monitoring systems, namely “a very odd error where the IPs didn’t bind to the server”. However, the monitoring systems did fail to detect the second error, which was that not all of the IPs were properly bound when the system came back online.

If Site5 had a monitoring system that periodically attempted to establish a TCP/IP connection to each IP address, say, every 15 minutes, then this second error would have been detected. However, the fact is that the IP address was out of commission for a period of 12 hours. This points to two possible issues: One, Site5’s TCP/IP connectivity monitoring was not working properly and failed to notice that it could not connect to the prwdot.org IP address. Two, Site5 does not have a TCP/IP connectivity monitoring solution in place.

If One is the case, then some additional development and testing need to be put into the monitoring system to ensure that it is able to detect a loss of TCP/IP connectivity. If Two is the case, then Site5 should certainly invest in a good TCP/IP monitoring solution. Ideally this would run on a set of high-availability monitoring servers, separate from the production hosting servers.

I would be interested to know what Site5’s position is as far as monitoring TCP/IP connectivity, and why it was not able to detect this extended loss of connectivity.

6) What will you be doing in the future to ensure that this type of issue does not come up again?

– This should never have happened in the first place. If the server does need to be rebooted again, the server has a comment on it from our staff explaining what happened the first time so that this specific problem can be avoided in the future. Whoever reboots the server will be responsible for making sure that the IPs properly bind to the server.

I am glad to hear this. It is definitely a matter of responsibility to double- and triple-check that a given tool has worked in the way it was intended.

7) What will you be doing in the future to ensure that this type of issue is properly monitored and responded to in a timely manner?

– Unfortunately, due to the nature of this incident, even though the problem was properly monitored and we responded very quickly, some clients experienced a prolonged outage. Besides noting what happened the first time and making sure that the staff team knows what to do if it happens again, there is not much more that we can do. On the other hand, we have recently hired two additional support staff members to help cover our late night and early morning weekend shifts which are usually stretched pretty thin. This means that all tickets and problems that happen over the weekend will be responded to faster and more thoroughly which will help prevent delayed responses to support tickets and will allow us to quickly resolve more and more issues. We are working very hard on improving Site5’s overall level of customer service and this is a huge (necessary) step in the right direction.

I would recommend having folks review my above questions and recommendations in order to implement a better monitoring solution at Site5.

Thanks very much for your help.

Peter R. Wood
prwdot.org administrator

#####
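
The post-reboot check I recommended in the email (iterate over each address and try to open a TCP connection to a given port) could look roughly like this in Python. This is only a sketch of the idea, not anything Site5 runs: the function names `can_connect` and `find_unreachable`, the default port 80, and the 3-second timeout are my own assumptions.

```python
import socket


def can_connect(address, port, timeout=3.0):
    """Return True if a TCP connection to (address, port) can be opened."""
    try:
        # create_connection raises OSError on refusal, timeout,
        # or an unreachable network.
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_unreachable(addresses, port=80, timeout=3.0):
    """Return the addresses that did not accept a TCP connection.

    An empty list means every expected address answered on `port`.
    """
    return [addr for addr in addresses if not can_connect(addr, port, timeout)]


# Example use, with placeholder addresses (assumed, not real hosts):
#
#   expected = ["192.0.2.10", "192.0.2.11"]
#   failed = find_unreachable(expected, port=80)
#   if failed:
#       print("Not answering after reboot:", ", ".join(failed))
```

Run from a second machine right after the ipaliases script finishes, a check like this would flag any address that the script silently failed to bind, rather than relying on the script's own lack of error messages.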

See the next post for their response to this email.

Response from Site5

March 13, 2006

I just received a response from Site5 regarding the recent unplanned outage. There are still a few points of contention I’d like to iron out with them, but for now I’ll post our most recent interaction:

#####

Dear Peter,

Thank you for taking the time to contact the Site5 management team with your concerns. I am happy to hear that the issue has already been resolved and I am extremely sorry for any inconveniences that this server’s recent downtime may have caused for you. As per your request I have reviewed your ticket and I will do my very best to thoroughly address all of your questions:

1) What was the cause of this issue? Specifically, what “problem” occurred with the IP bindings and why did it occur?

– Essentially, the on-duty systems administrator (David K.) reported that he needed to reboot the server to restart services after making some tweaks to lower the server’s load. When the server was rebooted there was a very odd error where the IPs didn’t bind to the server. He noticed this almost immediately and went to work repairing it. He restarted ipaliases, which is how you would normally fix this issue; however, only some IPs came online and it showed no errors whatsoever. Therefore, David believed the problem was resolved because there were no errors and most of the IP addresses did bind to the server without a problem.

2) Did this issue affect only prwdot.org, or were other customers’ domains affected?

– Well when the server was rebooted, everyone was down. But the reboot itself doesn’t take very long. When the server came back online and had the IP bind issue, I believe most clients were down for about a total of an hour.

3) Was this issue detected by Site5’s monitoring systems?

– Yes, David K. detected and reported this issue almost immediately.

4) If this issue was detected by Site5’s monitoring systems, why was it not addressed immediately?

– It certainly was, but it was an odd problem where the system showed no errors and only certain clients still experienced problems until they reported them.

5) If this issue was *not* detected by Site5’s monitoring systems, why was it not detected?

– Not applicable.

6) What will you be doing in the future to ensure that this type of issue does not come up again?

– This should never have happened in the first place. If the server does need to be rebooted again, the server has a comment on it from our staff explaining what happened the first time so that this specific problem can be avoided in the future. Whoever reboots the server will be responsible for making sure that the IPs properly bind to the server.

7) What will you be doing in the future to ensure that this type of issue is properly monitored and responded to in a timely manner?

– Unfortunately, due to the nature of this incident, even though the problem was properly monitored and we responded very quickly, some clients experienced a prolonged outage. Besides noting what happened the first time and making sure that the staff team knows what to do if it happens again, there is not much more that we can do. On the other hand, we have recently hired two additional support staff members to help cover our late night and early morning weekend shifts which are usually stretched pretty thin. This means that all tickets and problems that happen over the weekend will be responded to faster and more thoroughly which will help prevent delayed responses to support tickets and will allow us to quickly resolve more and more issues. We are working very hard on improving Site5’s overall level of customer service and this is a huge (necessary) step in the right direction.

Once again, Peter, I am extremely sorry for the downtime that this server problem has caused for you. I see that you already have a ticket open with the billing department and that is good – you should receive credit for this outage. If you have any additional questions or comments about this issue, or if you would simply like to discuss the situation further, please do let me know. Thank you kindly for your continued patience and understanding. I hope that you have a wonderful week!

Best regards,

Brendan Diaz
Customer Service / Retention Lead
Site5 Internet Solutions, Inc.

#####

I’ll keep you posted. Hopefully they will respond positively to some suggestions I’ve made regarding their monitoring practices.

Network Outage

March 13, 2006

On Sunday, March 12 at approximately 8:45 p.m., the server that hosts prwdot.org at Site5 began having issues with its networking. This made the server unavailable from outside of the network, so prwdot.org users would not have been able to send or check email, view web pages, or transfer files. We have a backup mail server outside of the prwdot.org network which catches any email sent to the prwdot.org domain during an outage. Because of this, no emails to prwdot.org should have been lost. The outage ended at approximately 8:45 a.m. on Monday, March 13. I am currently investigating the cause of this outage, as well as the reasons why the outage was not detected earlier.

I apologize for any inconvenience this outage may have caused.