February 12, 2010

Accidental DBA

There seems to be a lot of small companies out there who have appointed some unwilling developer as their DBA. Being a DBA comes with a lot of responsibility. For the most part, being a DBA means that you are in charge of making sure that the company's data is secure, backed up and accessible 24x7.

Accidental DBA doesn't mean "wait for things to break, then do your best to fix them." Save your sanity and do some proactive work now to keep things flowing later. You won't be sorry, and you'll be a hero to your company and/or clients.

- Check maintenance plans, make sure they're running as you expect. Check them again next month. And the month after. This is something you want to know is working.

- Restore your database backup to another server (or another database). Test it, make sure it works, make sure it has all of the information you expect it to.

- Check on index fragmentation for about 25% of your tables. If a table isn't showing issues, set it aside and check it again in 6 months. If it is in need of defragmentation, correct it, then check it again in 2 weeks.

- Check your free disk space on all drives. Do it monthly, log the results, watch the trend, plan for when you'll need more disk space. A simple Excel worksheet will do for keeping track, don't make it more elaborate than it needs to be.

Once you aquire DBA skiils, concentrate on following things.

Know your SLA - SLA stands for service level agreement. In most companies the SLA is determined before a project goes live. The SLA describes the minimum performance criteria that a provider promises to meet while delivering a service. It also outlines any remedial action and any penalties that will take effect if performance falls below the promised standard. In a nutshell this means "how long does the server have to be down before someone loses their job?". In general, all decisions regarding backups, database maintenance and etcetera should revolve around the criteria set forth in the SLA. That is why it is first in this list.

Develop a Disaster Recovery Strategy - If the SLA says that you can only be down for a maximum of an hour in the event of an emergency then you better make sure that you have a restore strategy to fulfill the requirements. There are many high availability technologies to chose from but each one has its limitations.
Clustering - A database cluster consists of two or more physical machines (nodes) which utilize a shared disk on a SAN in order to keep the data files accessible at all times. If one of the nodes fail then the drives hosting the SQL data will "failover" to the other node and the SQL services will start up on the secondary node. When the services restart on the secondary node, SQL Server will go through the auto-recovery process and all transactions that were active before the failover will be rolled back. The drawbacks to clustering are that you have a single point of failure at the SAN. Therefore in order to be safe you will probably need SAN level replication. For small companies a SAN may be too expensive to implement and having a secondary SAN for replication may be completely out of the question.

Mirroring - There are two different configurations available for mirroring. The first is high-safety mode which uses a two-phase commit. This means that the transaction has to safely be applied on both instances before a transaction is considered complete. The second option is high-performance mode which uses an asynchronous style of transactions. In high-performance mode a transaction is considered complete as soon as it is finished on the principal server. The records will eventually make it to the mirror server but there could be a lag. Mirroring is a good option if you do not have a SAN. In addition, many people use the secondary "mirror server" for reporting purposes. While transactions are being applied to the mirror server the database is is inaccessible, but you can create a database snapshot of the mirrored copy which will be available at all times. Reporting on the snapshot will take some strain off your principal server but you will have to make sure your end users do not require "real-time" data.

Log Shipping - Log shipping is similar to mirroring in the fact that transactions are replayed on another server. However, Log shipping only gives you the warm standby option whereas mirroring can do both hot and warm standby depending on the configuration. The one drawback to mirroring is that it allows a single destination database where log shipping can support multiple.
Replication - Replication is yet another technology for replaying transactions on a separate server. There are three different kinds of replication which include merge, transactional and snapshot. While replicating data the databases can be accessed on the secondary server for reporting in "real-time". The drawback to replication is that it is not going to provide you with a automatic failover. If the primary server goes down you will need to manually redirect your client applications to the secondary server. For the most part, replication is really not intended to be a disaster recovery tool. It was really meant as a way to distribute data between different servers for reporting and accessibility purposes. You can replicate an entire database, a table or only certain rows from a table. If I was planning a DR strategy I would probably stick with Mirroring or Clustering.

Pick a Recovery Model - Do you need point in time recovery up to the second? If so then you should probably use the full recovery model. Bulk-logged offers a little better performance than Full but it does come with the risk of data loss. Under the bulk logged recovery model, a damaged data file can result in your end users having to manually redo some transactions. The simple recovery model is the best for performance but it only will allow you to recover to the most recent full or differential backup. Simple recovery is ideal for development and test servers but hardly ever used for production databases.

Backup - Backup your databases to a separate location. At my company we keep a few days worth of "online" backups on the NAS (do not put backups on the same machine that you are backing up!) and then after a few days they get written to tape. Make sure that you test your backups on a regular basis. Keep track of how long the restore process works and also make sure that if you need point in time recovery that you are familiar with how to apply transaction logs. When an emergency occurs you want to complete confidence in your backups and your restore process.

2 comments:

Bill said...

Thank you for reviewing several business continuity process options.

Have you checked out the High Availability software from Stratus-Avance?

SAN-less, proactive, server downtime prevention; decreasing service interruptions to approximately an hour a year.

30 day Free Trial:
www.stratus.com

Paul said...

You make some very good points about proactive monitoring as well as the various options available for DR, which unfortunately most people don’t think about until it is too late. Hopefully this article will save people the time and headaches of having to restore from backup should they encounter a disaster.
We have found that virtualization really has helped us out in regards to DR and it is very cost effective. At our main office we are lucky enough to have SAN replication in place but as you mentioned this was not a cost effective solution in some of our remote offices. Depending on the criticality of the DB we have found the free version of Vmware’s VMServer or ESXi running on a Dell box if it’s not as critical or VMServer running on a Stratus ftServer if we need high availability to work very well.
One other technology that we have found useful was running Stratus Avance on a couple of Dell boxes utilizing split-site technology which gives us a nice local DR solution by allowing us to run a box in 1 lab and another in another local site while they are running in parallel. This way we are protected if we should lose 1 box and don’t have to worry about clustering or mirroring our DB’s.