My response to Site5

So, I did receive a response from Site5, but first I’ll post the text of my email to them:

#####

Hi Brendan,

I appreciate you taking the time to look into this. I do want to address a few items:

– Essentially, the on-duty systems administrator (David K.) reported that he needed to reboot the server due to restart services after making some tweaks to lower the server’s load. When the server was rebooted there was a very odd error where the IPs didn’t bind to the server. He noticed this almost immediately and went to work repairing it. He restarted ipaliases, which is how you would normally fix this issue, however only some IPs came online and it showed no errors whatsoever. Therefore, David believed the problem was resolved because there were no errors and most of the IP addresses did bind to the server without a problem.

This strikes me as a great learning opportunity. My experience has shown that automated scripts like this cannot always be trusted to work in a failsafe manner, and that each should have at least one additional, separate layer of error checking. For this reason, my recommendation would be to include a standard suite of tests with system administration tasks such as this. For example, after running the ipaliases startup script, I would test all of the bound IPs on the machine, preferably from another machine. This could be done with a second script that iterates through each address and attempts to establish a TCP/IP connection to a given port on the remote host. While it is true that this second automated script could also fail to detect an error, it would certainly be a lot more effective than running no error checks at all.
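To make the suggestion concrete, here is a minimal Python sketch of the kind of post-reboot check I have in mind. It is only an illustration: the function name `find_unreachable`, the default port of 80, and the three-second timeout are my own assumptions, not anything Site5 actually runs.

```python
import socket


def find_unreachable(addresses, port=80, timeout=3.0):
    """Return the addresses that do not accept a TCP connection on `port`.

    `addresses` is the list of IPs that are supposed to be bound to the
    server. Ideally this runs from a different machine than the one
    being checked.
    """
    unreachable = []
    for address in addresses:
        try:
            # create_connection raises OSError (refused, timed out,
            # unreachable network, ...) if the connection cannot be made.
            with socket.create_connection((address, port), timeout=timeout):
                pass
        except OSError:
            unreachable.append(address)
    return unreachable
```

After restarting ipaliases, the administrator would call `find_unreachable` with the full list of IPs assigned to the server and treat any non-empty result as a sign that the restart did not fully succeed, even if the restart itself printed no errors.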

3) Was this issue detected by Site5’s monitoring systems?

– Yes, David K. detected and reported this issue almost immediately.

It sounds like *one* issue was detected by Site5’s monitoring systems, namely “a very odd error where the IPs didn’t bind to the server”. However, the monitoring systems failed to detect the second error, which was that not all of the IPs were properly bound when the system came back online.

If Site5 had a monitoring system that periodically attempted to establish a TCP/IP connection to each IP address, say, every 15 minutes, then this second error would have been detected. However, the fact is that the IP address was out of commission for a period of 12 hours. This points to one of two possibilities: One, Site5’s TCP/IP connectivity monitoring was not working properly and failed to notice that it could not connect to the prwdot.org IP address. Two, Site5 does not have a TCP/IP connectivity monitoring solution in place.

If One is the case, then some additional development and testing need to be put into the monitoring system to ensure that it is able to detect a loss of TCP/IP connectivity. If Two is the case, then Site5 should certainly invest in a good TCP/IP monitoring solution. Ideally this would run on a set of high-availability monitoring servers, separate from the production hosting servers.
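For what it is worth, the periodic check I am describing does not need to be elaborate. The following Python sketch shows the general shape, under my own assumptions: the names `can_connect` and `monitor`, the `alert` callback, and the `rounds`, `probe`, and `sleep` parameters are all hypothetical and exist only so the loop can be exercised without waiting 15 minutes. A real deployment would presumably use an established monitoring product rather than a hand-rolled loop.

```python
import socket
import time


def can_connect(address, port, timeout=3.0):
    """Return True if a TCP connection to (address, port) succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def monitor(addresses, port, alert, interval_seconds=15 * 60, rounds=None,
            probe=can_connect, sleep=time.sleep):
    """Probe every address once per interval.

    `alert(address)` is called when an address goes from reachable to
    unreachable, so a long outage produces one alert, not one per round.
    `rounds=None` means run forever; a number limits the loop (for tests).
    Returns the set of addresses that were down after the last round.
    """
    down = set()
    completed = 0
    while rounds is None or completed < rounds:
        for address in addresses:
            if probe(address, port):
                down.discard(address)
            elif address not in down:
                down.add(address)
                alert(address)
        completed += 1
        if rounds is None or completed < rounds:
            sleep(interval_seconds)
    return down
```

With a 15-minute interval, an address that stops accepting connections would be flagged within roughly 15 minutes rather than going unnoticed for 12 hours. The `alert` callback is where paging or ticket creation would hook in.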

I would be interested to know what Site5’s position is as far as monitoring TCP/IP connectivity, and why it was not able to detect this extended loss of connectivity.

6) What will you be doing in the future to ensure that this type of issue does not come up again?

– This should never have happened in the first place. If the server does need to be rebooted again, the server had a comment on it from our staff explaining what happened the first times so that this specific problem can be avoided in the future. Whoever reboots the server will be responsible for making sure that the IPs properly binds to the server.

I am glad to hear this. It is definitely a matter of responsibility to double- and triple-check that a given tool has worked in the way it was intended.

7) What will you be doing in the future to ensure that this type of issue is properly monitored and responded to in a timely manner?

– Unfortunately, due to the nature of this incident, even though the problem was properly monitored and we responded very quickly, some clients experienced a prolonged outage. Besides noting what happened the first time and making sure that the staff team knows what to do if it happens again, there is not much more that we can do. On the other hand, we have recently hired two additional support staff members to help cover our late night and early morning weekend shifts which are usually stretched pretty thin. This means that all tickets and problems that happen over the weekend will be responded to faster and more thoroughly which will help prevent delayed responses to support tickets and will allow us to quickly resolve more and more issues. We are working very hard on improving Site5’s overall level of customer service and this is a huge (necessary) step in the right direction.

I would recommend having folks review my above questions and recommendations in order to implement a better monitoring solution at Site5.

Thanks very much for your help.

Peter R. Wood
prwdot.org administrator

#####

See the next post for their response to this email.
