T-SQL Tuesday #128: Learn From Others

Introduction

This month’s T-SQL Tuesday is being hosted by Kerry Tyler. The subject is a recent mistake that was made and what was required to fix it. These don’t have to be SQL Server mistakes. In my case, I’m going to talk about a company I worked at where the infrastructure behind the database was not properly set up and our transaction logs became corrupt.

The Problem

One fateful night while I was not on call, I got a call around 3:30 AM. The computer operator in the server room said a particular system at the company was down and asked if I could take a look at it. I hopped on in a sleepy state of mind and discovered that all six of the transaction logs were corrupt for the databases on the server. But never fear, I thought: we were log shipping every 15 minutes, so worst case we fail over and lose at most 15 minutes of data. This company was a bit old school, and you had to have manager approval to take such an action. I got my manager on the phone and explained the situation: the transaction logs were corrupt on both servers.

The Fix

This was the company’s most important system. Without it they didn’t ship or make product, so my manager was very keen to have a fix for the problem. Lucky for him, we had a spare server that the system administrators had never taken offline. It was just sitting there, begging to be either used or shut down, with the right version of SQL Server already installed. So I restored all the databases to this new server and scripted out the logins and msdb jobs to it. We were up and running about 2 hours after I got approval to perform the restores to the new server.

Conclusions

Hopefully, by now in my story you are asking how both servers ended up with all six transaction logs corrupt. Well, come to find out, the two servers in question shared drives, and the system administrators had configured them so both were on the same drive. You would expect this in a cluster, but we were log shipping here. And what about RAID? Evidently the system administrators missed an alert when something critical happened to one drive. On top of that, both servers sat in the same server room, which still baffles me to this day from a DR standpoint, because we had two datacenters. So never share drives, even RAIDed ones, between two servers. When corruption hits, it will take down the whole configuration and cause an outage.

Recommendations

Plan for real redundancy, and when you start a new job, have your system administrators tell you how the hardware is configured. At this particular job, the system administrators didn’t share information, even when we got new servers in, and it made my life h****. Everything from the network, to the server hardware, to the database makes up the system, and working as a team is the only way to make sure things are configured to perform and not fail. So if a team isn’t treating this as a team sport, talk to your manager about doing something, so we can all make sure our environments are running at their peak and we don’t have an outage.

 
