I guess I should not call this “one of the worst crashes”; I have experienced many in the past. That is normal when you work in an Information Technology department, especially when you work with Exchange Server.
Who was it that said an Exchange Server administrator knows everything about Exchange Server? Well, I certainly do not, nor have I ever claimed to. Thanks to Microsoft, who always assists me when I am unable to resolve a situation as quickly as I should. Many situations can normally be resolved on my own, but when time is against you, thoroughly researching the error is not always possible.
My Basic Server Configuration
Let me first explain a little about my configuration, so you can follow what I am describing regarding our system failure. I have four servers in my environment. Servers A and B are used in my DAG configuration. Are you familiar with DAGs? I will briefly explain.
DAG stands for Database Availability Group. It is a built-in component of Exchange Server created especially for redundancy. Having a DAG in place is one of the best ways to minimize downtime and protect your system in the event of a failure.
Many people running a DAG configuration will say that backups are not required as long as you have redundancy in the form of a DAG. But what if corruption of your data replicated to all members of the DAG? The only way to recover would be from a backup.
Server C is used as an archive server, while Server D is just there in the event I need to test an installation. Each server is set up with two controller cards so we can separate the log files from the databases.
Drive “C” on each server holds the operating system in a RAID 1 configuration; drive “D” is a partition on the same disk. Drive “E”, in a RAID 5 configuration, is where the databases reside. This is where the failure originally took place, and Server “A” was the server that experienced the problem.
The Issue / Problem
Drive “E” on Server “A”, where the databases resided, was accessible, but we were unable to see any data. I immediately went through a series of tests, including walking physically to the server room. On arrival I noticed that only one of the drives making up partition “E” was showing a green light; the remaining drives showed no light at all.
As I mentioned earlier, drive “E” was on its own controller card in a RAID 5 configuration, to protect against a failed drive. RAID 5 requires three or more physical drives to create the configuration; my server had four drives in place. As I said, only one drive in the configuration showed a green light, and all the others were dead.
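The reason RAID 5 can survive one dead drive is XOR parity: each stripe of data is accompanied by a parity block computed across the other disks, so any single missing block can be rebuilt from the survivors. The toy Python sketch below (hypothetical block values, ignoring the striping and parity rotation a real controller performs) shows why losing any one drive is recoverable, while a locked controller takes the whole array offline at once:

```python
def parity(blocks):
    """XOR the given blocks together byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# Four drives: three hold data stripes, one holds the parity stripe.
d1 = b"\x0a\x0b\x0c"
d2 = b"\x01\x02\x03"
d3 = b"\xf0\x0f\xff"
p = parity([d1, d2, d3])  # parity block written to the fourth drive

# If any single drive fails (say d2), XOR-ing the surviving data
# blocks with the parity block reconstructs the lost data exactly.
recovered = parity([d1, d3, p])
assert recovered == d2
```

Note that this only protects against drive failures below the controller; it is exactly why a failed controller card, as in my case, makes all four drives disappear together even though the data on them may be intact.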
On further investigation of the failed server, our system logs continually showed error messages in the form of event IDs 51 and 57. These events are normally related to disk failures, and Microsoft later confirmed that these errors pointed to hardware disk failures. Microsoft does not deal with physical hardware issues, so I had to open a case with HP, who built the server hardware.
HP later advised that the controller card to which the four drives were attached had failed. We ran a series of tests along with HP but could not find any physical errors with the controller card.
At first I suggested that we reboot the server, but I hesitated because I wanted HP to check the hardware physically first. Eventually we rebooted the server, and that did resolve the error. During boot there was a message acknowledging that the controller card had locked up. This could also have been a result of the server running out of resources; the server had about 12 GB of RAM, and I eventually doubled that amount. The server is presently up and running.
I eventually rebuilt the DAG on Server A, but did not fail the databases back over because I wanted to ensure that the server was okay. My organization will continue running on Server B until I am satisfied that everything has been resolved.
This day, October 20, 2012, and a Friday evening at that, will go down in history as one of my not-so-good days as an Exchange Server Administrator.
The comforting part about this day is that everything happens for a reason. If you never have problems with Exchange Server, then you will never increase your knowledge base as an Administrator. Of course, as Administrators, we should never welcome problems.
Below is the final letter I received from Microsoft acknowledging my issue. You may notice several links to Microsoft documentation in relation to backups; this is because I asked the technician to send me more information.
It was my pleasure to work with you on case# 1121———-. As per our discussion, I will be archiving your case today as resolved. If you have any comments or questions regarding the handling of your case, please feel free to contact my manager Shivraj Chopra at 425-000-0000 Ext- 64228 or Email at:
Also, please remember that if you have any additional problems that are directly associated with your original issue, you may call back and have this case reopened at any time within the next 90 days.
Andrew, have a great day, and thank you for your continuing support of Microsoft products! I am providing you with a summary of the key points of the case for your records.
Passive database copies failed and changed state to FailedAndSuspended on DAG member “Server A”.
>Found all databases on the passive node were FailedAndSuspended except one database, Exec_VP\Server A.
>Identified that the healthy database was stored on a different drive (G:\) and all affected databases were on E:\.
>Checked the event logs and found disk-related events 51 & 57.
>Identified that the issue was with hardware.
>Fixing the hardware issue resolved the issue.
End Microsoft Letter.