T-SQL Tuesday

This month’s T-SQL Tuesday is brought to us by Jason Brimhall.  We are writing about times someone took chances with databases that the DBAs felt were not worth taking.  I am going to talk about taking chances with your backups; our #1 job as DBAs is to have backups.

Let’s start with the scenario.  Full backups from over 50 servers all land in a central location on the SAN.  Before this move, they were landing locally on each server until enough eyebrows were raised to get a central location.  So, in theory, we had made progress; at least they weren’t local.  They are spaced out so they don’t all hit at the same time.  Our setup takes a full backup every day and deletes any backup over 23 hours old, which leaves us one backup on disk.  Because of the double space required to store two backups, our director declared we had to delete yesterday’s backup before taking a new one, since in theory we back the drive up to tape and can have it brought back in from our offsite storage location.  My coworker and I protested to our manager that this was a really bad idea.  He talked to his boss, the director; the director would not budge from the stance to save 100 GB of SAN space.  This dialogue went back and forth for a few months before our boss said: “just do what the director wants.”  You edit 50 jobs to loop through and delete any *.bak files in the backup folders before the new backup runs, something along the lines of the sketch below.
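
For context, here is a minimal sketch of what each of those modified job steps amounted to, assuming SQL Server Agent jobs with xp_cmdshell enabled.  The share path, server folder, and database name are made up for illustration.

```sql
-- Hypothetical job step: delete yesterday's backup BEFORE taking the new one,
-- per the director's directive. Paths and database names are illustrative only.

-- Step 1: remove the existing *.bak files from this server's backup folder
EXEC master.dbo.xp_cmdshell 'del "\\SAN-SHARE\SQLBackups\SERVER01\*.bak"';

-- Step 2: take the new full backup; until this completes, no backup exists on disk
BACKUP DATABASE [SomeDatabase]
TO DISK = N'\\SAN-SHARE\SQLBackups\SERVER01\SomeDatabase_FULL.bak'
WITH CHECKSUM, INIT, STATS = 10;
```

The risk is easy to see in the sketch: between step 1 and the end of step 2 there is no on-disk backup at all, and if anything goes wrong in that window you are dependent on tape.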

Fast forward a year, and at 3:00 AM the computer room calls because the backups for a system are failing.  You log in and discover that the msdb database (the database that would have alerted you to corruption, and which is running over 100 jobs and handling transactional replication to a reporting server) is corrupted.  You go to the network share and, sure enough, no backups.  Your immediate boss is out of town, so you have to call the director to get someone to go to the offsite facility and pull a tape.  Thirty minutes later your director calls you back and asks how this could have happened and if the database “was really corrupted”.  You arrive at work at 8:00 AM and the SAN administrators responsible for restoring files are just getting started on recovering the file you need.  Around 11:30 AM, the three-quarters-of-a-GB file is finally ready for you to restore.  The restore takes 17.59 seconds.  Luckily, transactional replication is not broken and all the jobs critical to the business are still running.
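
For anyone curious what the recovery itself looked like, here is a hedged sketch of the two commands involved.  The file path is hypothetical, and restoring msdb requires the SQL Server Agent service to be stopped first.

```sql
-- Confirm the corruption the failing backups hinted at
DBCC CHECKDB (msdb) WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- With SQL Server Agent stopped, restore msdb over the corrupt copy.
-- The path is made up; the real file came back from the offsite tape.
RESTORE DATABASE msdb
FROM DISK = N'\\SAN-SHARE\SQLBackups\SERVER01\msdb_FULL.bak'
WITH REPLACE, STATS = 10;
```

The restore itself is the fast part; the eight and a half hours of waiting for the file is what the on-disk copy would have saved.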

The director comes over to your desk and asks what could have been done to prevent this.  You tell them the same thing you have been saying for a year: don’t delete your on-disk backup until you have another backup stored on disk.  At this point, the director says we will discuss this more when your boss gets back, because they do not want to dictate how the department should run its processes.

This was a big gamble for the company, and it could have been much worse, say if a more critical database had been corrupt or if the part of msdb that ran the business jobs had been corrupt.  This is the kind of bet where, if you took it in Vegas, you might just lose your shirt, and not a bet I would be willing to take.  In most shops this would probably have been an RGE (résumé-generating event) for the DBA.  Moral of the story: always have a backup, test your backups, and don’t cave to upper management on something so critical to the survival of the business.
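
If you want something concrete to take away from that moral, here is one hedged way to act on it: verify the new backup before anything older is removed.  Names and paths are again illustrative.

```sql
-- Take the new full backup with page checksums
BACKUP DATABASE [SomeDatabase]
TO DISK = N'\\SAN-SHARE\SQLBackups\SERVER01\SomeDatabase_FULL_new.bak'
WITH CHECKSUM, INIT, STATS = 10;

-- Verify the backup file is readable and its checksums are intact
RESTORE VERIFYONLY
FROM DISK = N'\\SAN-SHARE\SQLBackups\SERVER01\SomeDatabase_FULL_new.bak'
WITH CHECKSUM;

-- Only after this succeeds should the previous day's backup be deleted,
-- and even then a periodic full test restore is the only real proof.
```

RESTORE VERIFYONLY only proves the file is readable; an actual test restore on another server is still the real test of your backups.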
