Thursday, September 3, 2009

Why Gmail went down

"We had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers -- servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!'."

I wonder why the Google Network Operations Center (NOC) did not catch on that the routers were becoming overloaded. Perhaps the performance thresholds were not set properly, or the alerts were never reported, or they were simply disregarded (someone on the West Coast was on a coffee break and ignored their PDA alert?). More proactive monitoring seems to be in order. I noticed they didn't mention whose routers were at the center of the issue. I have a feeling Google will be working with their vendor on this. I wonder if my friends on the East Coast at the UNH InterOperability Laboratory are in deep discussion over this one.
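To make the point about thresholds concrete, here is a minimal sketch of the kind of check I have in mind. Everything in it is an assumption for illustration: the `Router` class, the router names, the capacities, and the 70% / 90% levels are made up and have nothing to do with Google's actual setup. It also assumes that when an overloaded router stops taking traffic, its load is spread evenly over the rest, which the quoted statement does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    capacity: float  # requests/sec the router can handle (hypothetical)
    load: float      # requests/sec it is currently receiving

    @property
    def utilization(self) -> float:
        return self.load / self.capacity


def check_thresholds(routers, warn=0.70, critical=0.90):
    """Return (name, level) for every router at or above a threshold."""
    alerts = []
    for r in routers:
        if r.utilization >= critical:
            alerts.append((r.name, "critical"))
        elif r.utilization >= warn:
            alerts.append((r.name, "warning"))
    return alerts


def shed_overloaded(routers, critical=0.90):
    """Take critical routers out of rotation and spread their load
    evenly over the remaining ones (an assumed redistribution policy)."""
    healthy = [r for r in routers if r.utilization < critical]
    dropped = [r for r in routers if r.utilization >= critical]
    if not healthy:
        return []
    extra = sum(r.load for r in dropped) / len(healthy)
    return [Router(r.name, r.capacity, r.load + extra) for r in healthy]


# Made-up numbers: one router is already near its limit.
routers = [
    Router("rr1", 100, 95),
    Router("rr2", 100, 75),
    Router("rr3", 100, 60),
]

print(check_thresholds(routers))
print(check_thresholds(shed_overloaded(routers)))
```

With these made-up numbers, the warning level flags a second router before it is in real trouble, and once the first router drops out the remaining two are pushed past the critical level. If the warning threshold is set too high, or nobody acts on it, the first alert anyone sees is the critical one.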
