3DOWN/Email Outage (7/23/2009)

Detailed Information for Email Outage of July 23rd, 2009


Early Thursday morning (July 23, 2009), the Storage Area Network (SAN) that provides mail storage for most of our Post Office servers went off-line.

SAN devices are, in theory, internally redundant and fault tolerant. However, they are also complex devices... and complex devices can fail in complex ways! That said, staff arrived within an hour of the initial failure and began the process of getting the SAN back up and in operation. The vendor provides a documented troubleshooting procedure, and this was followed.

After several parts were replaced, most of the SAN was available and we were able to restore service to po9, po12 and po14. However, the SAN volumes that store the mailboxes for some users on po10 and po11 would not go on-line. The SAN indicated that the data on these volumes was corrupted.

At this point we started on two different approaches for restoring service to po10 and po11. The first approach involved working with the SAN vendor to get the original volumes on-line. Although the SAN considered them corrupt, it was most likely that the data was fine and that the SAN had lost track of some critical metadata when the original electrical failure occurred. The vendor indicated that they had seen this problem before and that the data was salvageable.

The second approach involved attaching fresh data storage (from a different SAN) to po10 and po11 and restoring the mailboxes to it from our nightly backups.

As of this writing [4:00PM on 7/23/2009] we are proceeding on both approaches, hoping that the first approach will result in an operational SAN with no mail lost.

Update: 10:40PM

The first approach paid off and we had all mailboxes back in service around 8:15PM.

Slightly Technical Details (for those interested)

So we were in fact correct: the data was still present, but the storage array would not recognize it as valid. Fortunately, there was a set of maintenance commands that would permit us to, in effect, re-build the array (i.e., set it up as if we were setting up a new empty array!) without actually over-writing our data. The vendor provided us with an 18-step procedure. It came complete with the admonition to get it exactly right or the data we had would be lost [but we still had our backups]. Needless to say, this was a bit of a nail-biting procedure to follow.

The way you handle a situation like this is to use two people. One types in each command and the other verifies that it is in fact correct (compared to the printed procedure) before it is run. One command at a time.
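The two-person check can be pictured as a simple comparison of each typed command against the printed procedure. The short Python sketch below is only an illustration of that idea, not the tool or the commands we used: the function name first_mismatch and the placeholder step strings are hypothetical, and ignoring leading and trailing spaces is an assumption made for the example.

```python
from typing import List, Optional


def first_mismatch(printed_steps: List[str], typed_commands: List[str]) -> Optional[int]:
    """Return the 1-based number of the first step where the typed command
    differs from the printed procedure, or None if everything matches.

    Hypothetical helper for illustration only. Leading and trailing
    whitespace is ignored; everything else must match exactly.
    """
    for number, (expected, typed) in enumerate(zip(printed_steps, typed_commands), start=1):
        if typed.strip() != expected.strip():
            return number
    # A missing or extra command also counts as a mismatch at the
    # first step where the two lists stop lining up.
    if len(typed_commands) != len(printed_steps):
        return min(len(printed_steps), len(typed_commands)) + 1
    return None


# Placeholder strings stand in for the vendor's real commands.
printed = ["step-one-command", "step-two-command", "step-three-command"]
typed = ["step-one-command", "step-two-comand", "step-three-command"]
problem = first_mismatch(printed, typed)
if problem is not None:
    print(f"Stop: step {problem} does not match the printed procedure")
```

In practice the check happens one command at a time, before the command is run, so the verifier would stop the typist at the first mismatch rather than reviewing the whole list afterward.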

This almost worked, except that the procedure was not exactly right and the resulting rebuilt arrays were not correct! Fortunately, working jointly with the vendor, we were able to figure out the problem and re-write the procedure to compensate. Unfortunately, we needed to re-do the entire procedure... more nail-biting, etc.

This got three of the four failed arrays back on-line. This permitted us to completely restore service to people on po10. However, half of the people on po10 (not half of everyone on po10, but half of the affected people on po10) were still not back. The disk drives for that fourth array had failed to come back on-line after the electrical failure.

After more consultation with the SAN vendor, we power cycled the array cabinet (which unfortunately required us to shut down the three recovered, working arrays). This took about 15 minutes, but did bring back the off-line drives. After this we were able to rebuild the fourth array and all service was restored.
