Today I receive a call from a user. She was cleaning out her mailbox to get under the 2 GB quota we are about to enforce on all users. She says that Entourage (Microsoft's Exchange mail client for Macs) suddenly started deleting all of the e-mail in her Inbox. It deleted everything from today all the way back to 9/15/10 before she closed the program.
Personally, I don't believe that these e-mails deleted themselves, but that is beside the point. Because we can't find them in the recoverable items section, it looks like we are going to have to restore her mailbox from backup.
Now, our IT organization is very wary about doing Exchange restores, because it's like pulling teeth to get upper management to approve even an hour of e-mail downtime a quarter. The CEO in particular is ridiculous with his BlackBerry. In general, the culture at my employer treats e-mail like ancient Chinese pottery or something. Don't mess with it! So we are always afraid that something bad will happen during a restore and pretty much refuse to do them. But in this case the user is missing months' worth of recent e-mail, so it looks like we should do it.
Well, there are a few problems. First of all, we do have GRT/brick-level backups enabled. This lets you restore individual items in user mailboxes, an individual's entire mailbox, or entire mail databases; without it, you have to restore the whole database and then do some finagling to get what you need out of it. We do full backups on the weekend, but differentials daily, and you can only do brick-level restores if you do full backups with incrementals throughout the week. So we can do brick-level restores from the weekend full backup, but not from the differentials in the middle of the week. It's something I've been meaning to change forever, but just never have.
In any case, I decide that we should just do a brick-level restore of the missing items in her Inbox from last Friday's backup. Trouble is, Symantec Backup Exec's GRT restore selection feature blows. I can basically view her mailbox by folder and see the e-mails by subject, but I can't see the sender or the date sent/received. There is a "modified date", but it doesn't really seem to correspond to the actual sent/received date at all. Anyway, I am able to find the last e-mail still in her Inbox from 9/15/10 by subject, and I decide to highlight and select everything after that. Great, except that the Backup Exec interface hangs. I see that it's using 50% CPU on the backup server, so I let it go. 45 minutes later, it's still hung. I kill the Backup Exec GUI.
I have never done a GRT restore on a mailbox before, so I ask my manager whether restoring her whole mailbox would create duplicate items or cause any other issues. He isn't sure, says we should avoid finding out, and advises me to do an old-fashioned Exchange restore using the Recovery Storage Group.
So I set up a Recovery Storage Group on our Exchange mailbox server, select the mailbox database her mailbox is in, redirect the restore to the Recovery Storage Group, and start the restore job. It loads the tape, processes approximately 130 kB of data, then fails with the error "failure to query the Writer process" or something like that. Great. I try again, and this time it fails before processing any data, stating "unable to connect to the resource". Great.
So I start researching this error and find that certain components need to be installed on both the Exchange server and the backup server, at the same version. We run Exchange 2007 SP1 update rollup 4, but the backup server has the Exchange 2007 update rollup 7 management tools. Also, the MAPI Client and CDO package is installed on the Exchange server, but not on the backup server. I go to download the package, but I can only find the newest version, not the version that is installed on the Exchange server. Things are not going well.
I decide the best place to start is getting the management tools on the backup server to the correct version. Since I'm not doing a GRT restore, I don't need MAPI anyway. I make the mistake of thinking that Exchange 2007 SP1 update rollup 7 is installed and that I need to downgrade to 2007 SP1 update rollup 4. Actually, 2007 update rollup 7 is installed (note the missing "SP1"). If I had known this, I could have applied 2007 SP1 and then 2007 SP1 update rollup 4 without a reboot. But no: thinking I need to downgrade rather than upgrade, I uninstall 2007 update rollup 7, and it requires a reboot. So I reboot the server.
Whenever I reboot a server, I always start pinging it to make sure it goes down and comes back up. I usually send 300 pings, which is overkill. This time, however, I notice the command finishes unusually quickly with 0% packet loss. 0% packet loss is good when a server is running, but when a server reboots some packets should be lost. I remember that if you reboot Windows Server 2003 from a Remote Desktop session, it sometimes gets screwed up. Basically Terminal Services breaks and you can't reconnect with Remote Desktop, but other than that the server stays up and running and still serves files or web pages or whatever it does. I've always been able to open a command prompt and run a remote shutdown command to get the server to restart. So I do that, but the server still doesn't restart. I resend the command a few minutes later, but it just tells me a system shutdown is in progress. However, when I check the server's System log remotely, I see no events stating that any services have stopped. Hell, the very fact that I can still see the System log remotely is not a good sign. I'd love to go to the console and see what's going on, except I'm at my house, which is 40 minutes away. And we don't have iLO set up on this server.
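The packet-loss reasoning above can be made concrete. This is a minimal sketch, assuming the results of a ping run have already been collected as a list of booleans (one per ping, True if a reply came back); the helper names `loss_percent` and `rebooted` are hypothetical, and actually sending the pings is left out because `ping` output differs between Windows and other systems.

```python
from typing import Sequence


def loss_percent(replies: Sequence[bool]) -> float:
    """Percentage of pings that got no reply (0.0 for an empty log)."""
    if not replies:
        return 0.0
    lost = sum(1 for ok in replies if not ok)
    return 100.0 * lost / len(replies)


def rebooted(replies: Sequence[bool]) -> bool:
    """True if the log shows at least one lost ping followed by a reply.

    A clean run of replies (0% loss) means the host never went down,
    and a run that ends in lost pings means it has not come back yet.
    """
    went_down = False
    for ok in replies:
        if not ok:
            went_down = True
        elif went_down:
            return True  # replies resumed after an outage
    return False


if __name__ == "__main__":
    # Hypothetical log: 5 replies, 3 timeouts while rebooting, 2 replies after.
    log = [True] * 5 + [False] * 3 + [True] * 2
    print(f"{loss_percent(log):.0f}% loss, rebooted: {rebooted(log)}")  # prints: 30% loss, rebooted: True
```

Under this check, the 300-for-300 run described above would report 0% loss and `rebooted` would return False, which is exactly the red flag that the server never actually went down.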
Meanwhile, right before I rebooted the backup server, I decided to call Symantec for help with the restore problem. The last time I needed tech support from them, I waited over an hour to speak to someone, so I figured I'd get a head start. Wouldn't you know it, it only takes five minutes this time. The tech can't even see the backup server because it's in that limbo state. But I ask him to check our Exchange server and make sure everything there is set up to allow restores. I'm hoping the backup server restarts while we are doing that, but at worst I can always call back after I get it working again. He confirms that we are set up okay on the Exchange side, but finds that the VSS writer for Exchange is in a failed state, which is why Backup Exec can't connect. He tells me they see this all the time and there is a Windows patch for it. He sends me a link to the patch, and when I check, we do not have it installed. He also says we need to upgrade the MAPI Client and CDO package on the Exchange server to match the version we downloaded for the backup server. Of course, it isn't a simple upgrade: the original copy has to be uninstalled first, which requires a reboot.
Unfortunately, to install the Windows patch and fix the VSS writer issue, we have to reboot the Exchange server. This is pretty much the most important server in our organization, and people freak out when it's not up because that means e-mail isn't working. It could be 3:30am on Christmas morning and someone would be upset that e-mail isn't working. We should really have it clustered, but we don't.
So yeah, that's my fun story, and as you can see if you've read this far, it isn't over yet. The only real bright spot so far is that the backup server did eventually restart. I guess it just hated that we had kept it running for the last eight months or so.