The Exchange Story
Nov. 5, 2007 by ravishan
We were all very disappointed that the Exchange server had a major issue last Tuesday and that e-mail was down for the entire work day for everyone on this server. We have tried to reconstruct the events to understand what went wrong and how we can avoid something like this in the future.
After considerable discussion about the appropriate time for transferring from the old system to the new system, the October break was chosen, mainly because it gave us the Monday and Tuesday of the week of Oct 16, when the students were not here, to roll this out and correct any problems. As we were planning the rollout, several last-minute questions came up and we were successful in addressing almost all of them. We collectively determined that we had everything ready to go, including testing backup and restore of the Exchange database, individual mailbox retrieval, and other important backend tasks that we were all new to.
When we began operating the new system in production, things started out well, with the exception of an initial hiccup that required our users to reconnect to the server. Given how complex a conversion this was, we were happy with the outcome. However, a couple of days later, we had a problem with the Exchange information store. This is the database where all mailbox information is maintained.
We reported this to Microsoft and set the system to restart automatically if this were to happen again, mainly to avoid major interruptions for the clients. The problem continued, and Microsoft needed a process dump captured while the system was experiencing it, which required a half hour or so of downtime. Since the problem was unpredictable, we decided to turn on the process dump only after hours, when we could afford the half hour of downtime. It happened at 3 AM or so one day, and James had it set up to send a message to his phone when this happened. So he made sure that the process dump was captured, and we sent it to Microsoft.
In the meantime, we had a second machine prepared and ready to be deployed. On Sunday, Oct 25, we had a crash, and this time the system refused to come back up. We deployed the second computer, thinking that there was something wrong with the first machine itself. During this time, we had been in touch with Microsoft. They were stumped because there was no consistency to the crashes: each one reported problems with different DLLs.
Then the problem occurred again on Tuesday morning, and the second machine was also unable to recover. Microsoft deployed all their resources to look at the process dumps, and we were receiving mixed signals on what the problem could be. When two separate machines show the same instability, one usually suspects the backend database. But all the errors were reporting problems with the DLLs on the servers.
We decided at that point to deploy a new machine starting from scratch. So we began installing the operating system, patches, and Exchange server software, and removing all unwanted software and processes. Believe it or not, this took us over 6 hours to accomplish! While this was happening, Microsoft determined that, based on the data they had, the only plausible explanation for this problem was defective memory.
Their explanation was based on the fact that the dump showed the system accessing portions of memory that the Exchange server should not be accessing. This could be explained ONLY by defective memory. When this happens, corruption of files on the disk becomes rampant. The technical explanation made sense, though we would all question the wisdom of an operating system that does not protect its own files from being corrupted this way!
Then we went back and analyzed what we did. Before we went into production, we added the extra memory needed for production to machine 1 as well as machine 2. The memory for both came from Dell and belonged to the same batch. We immediately contacted Dell and ran the diagnostics. Predictably, the diagnostics showed no errors, and Dell feels there are no problems with the memory. We plan to press the issue with them.
Despite the fact that we had prepared two machines and several other redundancies, this caught us because we may have used defective memory from the same batch in both machines. Next time around, we should avoid using memory from the same lot…
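A check like this is easy to automate. The following is only a minimal sketch in Python, assuming we kept an inventory of which part lot went into which machine. The machine names, lot labels, and the `shared_lots` function are all made up for illustration and do not reflect our actual records.

```python
from collections import defaultdict


def shared_lots(inventory):
    """Return {(component, lot): [machines]} for every lot used in more than one machine.

    `inventory` maps a machine name to a list of (component, lot) pairs.
    """
    users = defaultdict(set)
    for machine, parts in inventory.items():
        for component, lot in parts:
            users[(component, lot)].add(machine)
    # Keep only the lots that appear in two or more machines.
    return {key: sorted(machines) for key, machines in users.items() if len(machines) > 1}


# Hypothetical inventory: two redundant servers upgraded from the same memory
# batch, plus a spare that still has its factory-installed memory.
inventory = {
    "machine1": [("memory", "LOT-A")],
    "machine2": [("memory", "LOT-A")],
    "machine3": [("memory", "FACTORY-3")],
}

for (component, lot), machines in shared_lots(inventory).items():
    print(f"{component} lot {lot} is shared by: {', '.join(machines)}")
```

Any lot that shows up here is a single point of failure hiding behind what looks like redundancy, which is exactly what bit us.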
In the meantime, we have prepared two other machines as backups, each with its original factory-installed memory and a brand new installation of the operating system.
We have also purchased additional hardware (at an unbelievable price), to which we will later move the whole Exchange operation and which we will set up with a lot more redundancy than we currently have.
