net-at-hand™

The official Net-at-hand blog
all about what we are doing to make Net-at-hand a great web publishing system.

A report about the downtime

Published on 03/11/11.

Net-at-hand and all the sites running on it were down for nearly two hours last night, from 8:25 pm (CST) until about 10:15 pm. I am really sorry for the downtime. There is no excuse for being unprepared and taking so long to solve this problem. I know many of you rely on your websites for your livelihood and, more than that, use your websites as a tool for greater good. I take the responsibility of keeping your websites running smoothly very seriously and view it as a big part of my calling in this life. I let you down and I could never communicate my regret adequately.

Apologies are good and necessary, but I am sure you want to know what is being done to fix it. Keep reading for an explanation of what happened and what I am doing to keep it from happening again.

One of the websites on Net-at-hand was linked to from http://fark.com, which gets enough visitors to rank among the top 100 English-language websites in the world. Net-at-hand was overwhelmed by all the traffic trying to view that page and was unable to keep up. I knew this kind of traffic spike would happen at some point, and I tried to prepare for it; obviously, I failed to do so adequately.

So I am putting some changes into effect to minimize or eliminate this kind of disruption in the future. They are:

emergency plan update: I have specific strategies in place to deal with situations that might cause disruptions. The strategy I used last night didn’t turn things around quickly enough, so I had to come up with something new on the fly. I learned a lot from dealing with this, and I am changing my plan so that I can get everything back up very quickly if this happens again.
updating the blog comments plugin: The blog comments plugin is used extensively on the website that was getting all the traffic, and its performance is turning out to be well below par under this kind of load. I will be making changes to make it more efficient and less likely to bog the system down.
improvements for page-not-found: The page that received all the traffic referenced an image that did not exist, and this caused Net-at-hand to keep looking for the missing resource. This alone more than tripled the load on Net-at-hand and played a big part in taking it down. I’ll be working on a fix for this.
improved plugin performance: I am in the middle of updating the Net-at-hand plugin system to make it more efficient. When these updates are done, they should help speed things up overall.
accelerated timetable for server upgrade: I have been working on a plan to upgrade the server that runs Net-at-hand, and I will work to get that in place as soon as possible. The upgrade will move the server to a completely different provider, which will give me even more flexibility in how the Net-at-hand infrastructure is built.

So that is what I will be working on over the days and weeks ahead. Again, I sincerely apologize for the downtime. Let me know if there is anything I can do for you.

If you are not a techie (nerd, geek, etc.) you can stop here. If you are (or want to try to be) and you want more information then keep reading.

Further explanation of the situation for techies

The front-end web server for Net-at-hand is nginx, a world-class, high-performance web server used by many high-traffic websites. Nginx serves all static content (cached stylesheets and the images used in those stylesheets). Requests for anything else, including web pages, are handled by the application servers behind it.

When I found the web page that was getting all the traffic, I put a copy of the cached page in a directory that nginx could reach, so that nginx could bypass the application servers and serve the page directly. After doing so, I found that requests for another “page” were still going through the application servers and causing more slow-downs (this was the “page-not-found” item referenced earlier). I did the same thing for that resource, and everything returned to normal.
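To make the work-around concrete, here is a minimal Python sketch of the idea: look for a pre-rendered copy of the requested page in a static directory first, and only hand the request to the application if no copy exists. This is an illustration, not the actual Net-at-hand setup; the function names (`find_static_copy`, `serve`) and the directory layout are assumptions made for the example. In nginx itself this kind of lookup is commonly expressed with the `try_files` directive, but the configuration used that night is not shown here.

```python
from pathlib import Path
from typing import Callable, Optional


def find_static_copy(static_root: Path, url_path: str) -> Optional[Path]:
    """Return the pre-rendered file for url_path, or None if there is none.

    Hypothetical helper: static_root stands in for "a directory that the
    front-end server can reach".
    """
    root = static_root.resolve()
    relative = url_path.lstrip("/") or "index.html"
    candidate = (root / relative).resolve()

    # Refuse paths that escape the static directory (e.g. "/../secret").
    if candidate != root and root not in candidate.parents:
        return None

    return candidate if candidate.is_file() else None


def serve(static_root: Path, url_path: str,
          app: Callable[[str], bytes]) -> bytes:
    """Serve a static snapshot when one exists; otherwise call the app."""
    snapshot = find_static_copy(static_root, url_path)
    if snapshot is not None:
        # Fast path: the application servers are never touched.
        return snapshot.read_bytes()
    # Slow path: fall through to the (hypothetical) application callable.
    return app(url_path)
```

With a snapshot of the busy page saved in the static directory, every request for that page takes the fast path, while requests for all other pages still reach the application as before.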

The really great news in all this is that once the work-around was in place, everything returned to normal. The high-traffic page remained available to everyone who was coming to see it, and every other site and web page was available as usual.

This, to me, reinforces the choice of nginx as the front-end web server, and with the changes listed above to make the application servers perform better, we should do better the next time there is a sudden spike in web traffic.

As a side note, the single resource that did not exist, and that was being requested through the application servers, was probably the biggest culprit in causing the downtime last night. The caching system for web pages on Net-at-hand works well and can serve many concurrent requests (I don’t have the numbers from our last stress test here in front of me). The page-not-found page, though, isn’t cached, so it takes 6-10 times longer to serve than a cached page does.
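One common way to take the pressure off uncached page-not-found responses is to remember, for a short time, which paths are known to be missing, so that repeated requests for the same missing resource skip the expensive lookup. The Python sketch below illustrates that general technique only; it is not the fix planned for Net-at-hand, and the class name `NotFoundCache`, the `fetch` helper, and the 60-second default are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Optional


class NotFoundCache:
    """Remembers paths that recently turned out not to exist."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps path -> time at which the "missing" verdict expires.
        # A production version would also cap the size of this dict.
        self._missing: Dict[str, float] = {}

    def is_known_missing(self, path: str) -> bool:
        expiry = self._missing.get(path)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # The verdict is stale; the resource may have been added since.
            del self._missing[path]
            return False
        return True

    def remember_missing(self, path: str) -> None:
        self._missing[path] = self._clock() + self._ttl


def fetch(path: str,
          lookup: Callable[[str], Optional[bytes]],
          cache: NotFoundCache) -> Optional[bytes]:
    """Return the resource body, or None if it does not exist.

    lookup is the hypothetical slow path (the application servers).
    """
    if cache.is_known_missing(path):
        return None  # cheap answer, no lookup performed
    body = lookup(path)
    if body is None:
        cache.remember_missing(path)
    return body
```

The short expiry keeps the trade-off modest: a resource that is added later becomes visible once the entry times out, while a flood of requests for the same missing image costs only one slow lookup per expiry period.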

Let me also say that we have done some pretty major stress testing of Net-at-hand and have been satisfied with the results. However, no generated stress test is as good as actual traffic. Net-at-hand didn’t pass the real-world test that was placed on it last night, but we didn’t exactly fail either. It is not uncommon for websites that undergo sudden, temporary spikes in traffic like this to be completely unavailable until the traffic subsides. I was able to put something in place that served the page, with full images, while it was still under the heavy traffic load. With what I learned from last night, I will be able to implement a similar solution in a matter of minutes (instead of two hours).