Thursday, June 24th, 2010 - 11:37 am EDT
Tech Tip: Common Ways to Tell You Are Not Prepared to Recover from a Disaster
Today's tip comes to us from author Eric Beehler via our friends at Realtime Publishers.
Disaster recovery is somewhat of a buzzword in the IT industry, and IT professionals have all been exposed to their share of great disaster recovery ideas from business managers. These ideas are often based on the industry buzz and seem to only make more work for you with little gain overall. This is usually because the idea is not backed up with a real plan. The actual implementation of disaster recovery is usually a big chore to undertake correctly, but in the end, it is well worth the trouble.
It's important to be ready to recover your data and systems when a disaster strikes, but it is rarely a top priority in the grand scheme of IT projects when crisis has yet to strike close to home. Unless your company has decided to make disaster recovery a high-level objective, it's usually the front-line administrator who will be saddled with the responsibility of implementing some sort of plan to save the day, and you will likely be shortchanged on the training and resources needed to get the job done.
There are many ways to deal with a disaster, from having a set of cold standby machines to employing a fully redundant hot data center. In reality, as the administrator, your job doesn't change much based on the recovery scenario: your systems have to be up and available to keep your business running. You likely have some kind of plan now, but if you haven't been through the real thing, you really don't know if your plan will hold water. For Windows administrators, there are several problems that tend to expose themselves when it's time to exercise a disaster recovery plan, or worse yet, go through the real thing. Here are some common ways to tell that you are not ready for a disaster.
Plan for an Alternative Site
You are not ready for a disaster if you don't have a place to go, which requires planning for a full on-site disaster in which your site is down or inaccessible. There are several methods to address this issue if you don't have a solution today, from having an alternative site with servers waiting to be loaded up for operation to a warm site that is always ready and waiting to take traffic. These decisions are not usually made by you but by the CIO. All you can often do is consider the solution given to you and how that will impact your ability to recover. A cold site, for example, will allow you to have hardware and connectivity available, but you will need to account for operating systems (OSs), drivers, configuration differences, and data center differences. In a warm site, you have to ensure that changes to configurations and data remain synched across the two sites.
Plan for Downtime
You also have to consider whether the site solution will support the Recovery Time Objective (RTO) required by the applications and business. Simply put, the RTO is the amount of time your users will be without the functions supported by your server, which could be a Web site, a mailbox, or the ability to log on to the domain. You should have this time defined per application or function supported by your server. In a larger disaster recovery effort, these times may be defined for you, but don't be surprised if the business people you support have no idea that your server supports the functionality they require. You may need to step in with your own knowledge of how your server functions in order to get this definition correct.
There are generally accepted categories for RTO that fall into tiers, as Figure 1 shows. Use these as a guideline but feel free to create standards within your own organization to meet your needs. If you need to recover applications within 2, 4, and 8 hours, redefine the tiers so that they make sense to your business through an analysis of the business impact of downtime. Just be sure that you can apply the standards as broadly as possible across the organization.
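As a rough illustration of redefining tiers, the sketch below maps each application's required recovery time onto a tier. The tier names and the 2, 4, and 8 hour ceilings are assumptions borrowed from the example in the text, not the contents of Figure 1, and the application names are hypothetical.

```python
# Hypothetical tier ceilings in hours. The 2/4/8 values echo the example in
# the text; they are not taken from Figure 1.
RTO_TIERS = [("Tier A", 2), ("Tier B", 4), ("Tier C", 8)]


def rto_tier(required_hours):
    """Return the first tier whose ceiling covers the required recovery time."""
    for name, ceiling in RTO_TIERS:
        if required_hours <= ceiling:
            return name
    # Longer than any defined tier: review the requirement with the business.
    return "Untiered"


# Hypothetical applications and the recovery times the business asked for.
applications = {
    "web storefront": 1.5,
    "mailboxes": 4,
    "file shares": 8,
    "archive": 48,
}

tiers = {app: rto_tier(hours) for app, hours in applications.items()}
for app, tier in tiers.items():
    print(f"{app}: {tier}")
```

Keeping the tier table in one place makes it easier to apply the same standard across the organization when the business changes its mind about a ceiling.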

Plan Your Tolerance to Data Loss
You are not ready for a disaster if you don't know your tolerance for data loss. Let's start with the basic foundation of the backup. Whether you use simple tape backups or an advanced nearline solution, you have to consider that most solutions are put in place to account for day-to-day operational needs. First, the exercise you went through with RTO must be done for the Recovery Point Objective (RPO), which is the amount of data that can be lost. You have to understand what the business can afford to lose; this value is not necessarily tied to an RTO tier. Take, for example, a point of sale system. If the system is down for 5 hours, the business may be able to recover by entering the orders taken while the system was down, but data loss of 5 hours may mean millions of dollars in lost sales.

The gut reaction for your RPO on some of your systems may be that no data loss is acceptable. In other cases, 24 hours of data loss may be acceptable. The goal is to understand what can be tolerated, not what is desired. Everyone will desire no data loss, but put a realistic perspective to the real value of the data. If you define Tier A RPO as no data loss, then you have to put systems in place that allow for that reliability. This means copying transactions as they happen to a backup site, which is an expensive solution that should be used only on your critical business applications, depending on your budget. If you have Tier B systems as defined in Figure 2, you will need some sort of solution that will be separate from your nightly backups, as you cannot count on having your last nightly tape backup at your recovery site.
Considering the Loss of a Backup
You are not ready for disaster if you rely on your daily backup for a recovery scenario. You may have in your head that you can rely on the last tape backup in the event of a disaster. Whether such is the case depends on a key question: can you get your restore process to work offsite? Don't be so quick to answer this one. If you take advantage of offsite storage either through a vendor or your own in-house process, it is an excellent step, but offsite storage doesn't necessarily guarantee you can restore at your disaster recovery site within the specified RPO and RTO.
Tape drive compatibility, backup software, delivery time, drivers, and OSs are all considerations that you must address prior to saying your solution is ready. This is especially true for a third-party backup site that will provide you with "like" hardware. That equipment will not be your equipment, and even if it is, expect aspects of the infrastructure to be different, such as IP address schemes, firmware (which can be a nightmare when working with SANs), and simple access to the hardware.
You also have the issue of archive requirements and the fact that you likely rely on these tapes for your day-to-day restores. If you perform restores for file recovery and other issues, you likely want to keep those tapes close by. If you ship them away for maximum protection, it's going to cost a pretty penny in order to request tapes from your offsite storage vendor.
You also have to consider how those tapes make it to the recovery site. If you make full backups only once a week and you only do offsite storage once a week, you might only get a restore from 2 weeks prior. Why? Because even if you are lucky enough to get your tapes offsite a day or two after the full backup and you get the shipment to your disaster recovery site 4 to 8 hours after they are requested, you can almost bet that Murphy's Law will strike and you will get a bad tape somewhere in the set. Then you have to move back in the chain, and with most full backups run weekly, you might be taking your system back 2 weeks or more if Murphy continues to strike. Now the RPO you expected to meet with your existing backup plan is not being met.
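The arithmetic behind that 2-week figure can be sketched as follows. This is a simplified model under stated assumptions: it considers full backups only (no incrementals), assumes every full set goes offsite after a fixed lag, and assumes each bad tape set forces you back exactly one full-backup interval. The function name and parameters are illustrative.

```python
def worst_case_data_age_days(full_interval_days, offsite_lag_days, bad_sets=0):
    """Worst-case age of the data you can actually restore, in days.

    The worst moment for a disaster is just before the newest full backup
    reaches offsite storage. At that point the newest usable set is one
    full interval plus the offsite lag old. Every bad set pushes you back
    one more interval.
    """
    return full_interval_days + offsite_lag_days + bad_sets * full_interval_days


# Weekly fulls that go offsite two days later (hypothetical schedule).
print(worst_case_data_age_days(7, 2))              # every tape reads cleanly
print(worst_case_data_age_days(7, 2, bad_sets=1))  # one bad set in the chain
```

Comparing this number with the RPO agreed for each system shows quickly whether the existing tape rotation can meet it.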
Even if you do recover your servers with no issues, how long will it take to recover them all? Consider the queuing on the tape drives, with multiple servers waiting for those tapes to be loaded. It could take quite a long time before you even get a chance to try a restore to your server depending on the technology present at the recovery site. What can you do? Well, time to restore will be reduced if you can restore large chunks at one time. Consider putting systems with like RPO and RTO requirements in the same backup set.
Better yet, host them on a LUN or set of LUNs on your SAN or other logical storage method in your situation so that a restore can be done all at once. You might even consider booting from the SAN, which might save you from having to restore the local disk of many servers. If you have a blade server solution, this may even be baked into your infrastructure.
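To get a feel for the tape-drive queuing described above, the following sketch estimates when each server's restore would finish given a limited number of drives. It assumes restores run in the listed priority order, each restore occupies one drive for a known number of hours, and the tapes arrive after a fixed delivery delay. The server names and durations are made up for illustration.

```python
import heapq


def restore_finish_times(jobs, drives, delivery_hours=0.0):
    """Estimate when each restore completes.

    jobs: list of (server, restore_hours) in priority order.
    drives: number of tape drives available at the recovery site.
    delivery_hours: wait before the tapes arrive and any restore can start.
    Returns a dict of server -> hours after the disaster was declared.
    """
    if drives < 1:
        raise ValueError("at least one drive is required")
    # Min-heap holding the time at which each drive next becomes free.
    free_at = [float(delivery_hours)] * drives
    heapq.heapify(free_at)
    finish = {}
    for server, hours in jobs:
        start = heapq.heappop(free_at)  # earliest available drive
        end = start + hours
        finish[server] = end
        heapq.heappush(free_at, end)
    return finish


# Hypothetical restore list and durations in hours.
jobs = [("dc01", 1), ("sql01", 4), ("mail01", 3), ("file01", 6)]
for server, done in restore_finish_times(jobs, drives=2, delivery_hours=8).items():
    print(f"{server}: restored about {done:g} hours in")
```

Running the same list with a different drive count or with servers grouped into fewer, larger restore jobs gives a quick comparison against the RTO for each system.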
Using Disk-Based Backup
Let's also consider disk-based backup. This solution has become increasingly popular because of the low cost of hard disks and the ease of backup and restore. In addition, disks often take minutes to back up and restore what used to take hours. The software supported by these systems even has versioning, much more frequent backups, and nifty utilities that make life much easier on the administrator. This is usually all handled by complex backup management software such as Microsoft System Center Data Protection Manager. When using this kind of solution, consider employing these often-integrated features to support data replication of some sort, although vendors name these types of features differently.
You can even copy your live data to your recovery site using a SAN/NAS vendor's Failure Resistant Disk Solution (FRDS). You should, however, consider the fact that this kind of solution will be much more expensive than tapes because it will require duplicate equipment with data replication happening across a wide area network (WAN).
You should refer to your RTO and RPO tiers to determine whether certain servers and data sets could stand to be away from your disk replication and rely more on a tape solution. You should also consider your disaster site and understand whether it can support this kind of solution. You should treat your server restores as a form of triage. You need to know, based upon RTO and RPO, what you are going to recover first and what can wait.
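The triage described above can be sketched as a simple sort. This example assumes each system has an agreed RTO and RPO in hours, orders recovery by shortest RTO and then shortest RPO, and flags which systems could rely on tape alone given a worst-case tape age that you supply. All system names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    rto_hours: float  # how long users can be without it
    rpo_hours: float  # how much data loss is tolerable


def recovery_order(systems):
    """Most urgent first: shortest RTO, then shortest RPO, then name."""
    return sorted(systems, key=lambda s: (s.rto_hours, s.rpo_hours, s.name))


def can_rely_on_tape(system, worst_case_tape_age_hours):
    """Tape alone is enough only if the RPO tolerates the oldest likely restore."""
    return system.rpo_hours >= worst_case_tape_age_hours


systems = [
    System("file01", rto_hours=8, rpo_hours=24),
    System("pos01", rto_hours=4, rpo_hours=0),
    System("mail01", rto_hours=4, rpo_hours=1),
    System("archive01", rto_hours=72, rpo_hours=336),
]

order = [s.name for s in recovery_order(systems)]
tape_only = [s.name for s in systems if can_rely_on_tape(s, worst_case_tape_age_hours=216)]
print("Recover in this order:", order)
print("Tape is sufficient for:", tape_only)
```

Everything that does not qualify for tape alone is a candidate for replication, subject to what the budget and the disaster site can support.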
Considering Configuration
If you can't identify the full configuration of your servers, you are not ready for a disaster. Realistically, can you keep track of 300 shares on a terabyte SAN served by a load-balanced Windows cluster server? Do you know which shares go to which directories on which LUNs? You have to document configurations. This is true whether you have a basic bare metal restore plan or a full redundant data center. The luxuries of a production environment won't be at your disposal. A normal production environment allows you the opportunity to compare configurations when something goes wrong and work through a problem. A disaster affords you no such luxury.
No matter how familiar you are with your systems, you need to have everything documented that can be changed. For any applications, you should have a guide for their installation in your environment. You should have the servers documented with everything from IP addresses and patches to database connections and configuration files. If you run IIS for Web applications, you should have that configuration documented as well. Some sort of context diagram is often useful to determine how your server interacts with other systems.
Utilize configuration management systems, such as SMS, to do some of the heavy lifting for you. Create reports and keep them up to date in an alternative location, either a paper copy offsite or an electronic one. Configuration problems seem to be a killer when recovering because changes sometimes get applied without strict control. What seems like a small change can kill you in a disaster when it hasn't been documented.
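One way to catch undocumented changes is to compare what your documentation says against what the server reports. The sketch below assumes both are available as simple key/value dictionaries; how you export them from your inventory or configuration management tool is left open, and the setting names are hypothetical.

```python
MISSING = "<missing>"  # marker for a setting present on only one side


def config_drift(documented, actual):
    """Return {setting: (documented_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(documented.keys() | actual.keys()):
        doc_value = documented.get(key, MISSING)
        act_value = actual.get(key, MISSING)
        if doc_value != act_value:
            drift[key] = (doc_value, act_value)
    return drift


# Hypothetical documented and live settings for one server.
documented = {"ip": "10.0.0.15", "iis_app_pool": "OrdersPool", "sql_port": "1433"}
actual = {"ip": "10.0.0.15", "iis_app_pool": "OrdersPoolV2", "hotfix": "applied"}

for setting, (doc_value, act_value) in config_drift(documented, actual).items():
    print(f"{setting}: documented={doc_value} actual={act_value}")
```

Run on a schedule, a report like this turns a small uncontrolled change into a documentation task instead of a surprise during recovery.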
Documenting the infrastructure goes beyond your own servers, but is just as important when it's time to troubleshoot. You can bring your file server back and you can bring your application servers back, but if you don't have proper DNS or connectivity, no one will be connecting to those systems you've recovered. If you have dependencies on other systems, you need to identify them. Know what names should be in DNS, what IP addresses and subnet you are on, what systems you interact with such as database servers or other back-end services such as the DMZ or Internet access. When you tell a database administrator that your application is taking SQL errors, you should know what database server, database, port, connection type, and authentication type you are using. You should also know the user name and password being used, if there is one. Does the server break down into pieces? Does it have multiple applications or functions? Document those functions separately.
You can't think of a server as a single system if your customers don't see it as a single function. Remember that a restored infrastructure is made up of many pieces, and you should not expect those pieces to work as reliably as they do in a production environment. In fact, when you face an issue in production, it usually has a single root cause, but a disaster recovery will usually experience several major issues at the same time. You need to know where you stand in the ecosystem of your environment to understand how to identify and help fix those issues.
Identifying Single Points of Failure
If you have a single point of failure, you are not ready for a disaster. A single point of failure can ruin your nicely laid out plans. Although it is not a requirement for disaster recovery, "N + 1" redundancy, in which N components are backed up by a single spare component, is often used in disaster recovery planning. You can still run into problems using N + 1, especially at a cold site where you have not been exercising your disaster recovery equipment to ensure its health. You might consider having additional servers of a similar capacity available above the minimum number required to recover, just in case you experience a failure at your recovery site.
An optimal solution will have redundancy built in to your recovery site the way you have it outfitted at your production site. If you have a failover cluster in one location, you would do the same in the recovery site, even though you could technically get by with a single server, assuming that server functions as expected. You should also consider the interdependencies of your infrastructure, such as the network, when you think of this issue. Single switches, routers, domain controllers, and sources of power can also be points of failure.
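A quick inventory check can surface the obvious single points of failure at the recovery site. This sketch assumes you keep a simple mapping of each role to the units that can fill it; the role and unit names are hypothetical, and the minimum of two units per role is an assumption you may want to raise.

```python
def single_points_of_failure(inventory, minimum=2):
    """Return the roles that have fewer than `minimum` distinct units."""
    return sorted(
        role for role, units in inventory.items() if len(set(units)) < minimum
    )


# Hypothetical recovery-site inventory: role -> units able to fill it.
inventory = {
    "domain controller": ["dc01"],
    "core switch": ["sw01", "sw02"],
    "file cluster node": ["fs01", "fs02"],
    "power feed": ["feed-a"],
    "environment expert": ["pat"],
}

for role in single_points_of_failure(inventory):
    print("Single point of failure:", role)
```

Treating people as a role in the same inventory is deliberate, since a lone expert is as much a single point of failure as a lone switch.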
Single points of failure don't stop at the system level. You might have that one guy or gal who knows everything about your environment. When you're at this person's desk and something goes wrong with the system or a specific application, he or she always has the answer. This is a good person to have, but when it comes down to it, you can't rely on a single person. When a disaster strikes, the go-to person may not be available during the recovery phase, yourself included. When everyone looks around and throws up their hands because such and such is down, what do you do? You wish you could go back in time and document that ingrained knowledge. This is also true for day-to-day operations, but especially necessary when everything is going wrong because of a disaster. You don't need the person who knows it all; you need full documentation of the knowledge that person possesses. Your go-to should really be your documentation.
Integrating Disaster Recovery into Daily Life
If you don't integrate disaster recovery into your daily operations, you are not ready for a disaster. Organizations that plan for disaster recovery as a single project with a start and an end will fail. Don't let the hard work go to waste. When you put these plans in motion, get all that documentation done, have recovery solutions in place, and continue to update your documentation and test your systems. If you don't test your disaster recovery process regularly, how do you know it will work? If you don't update your documentation day-to-day when changes are made, your documentation is outdated and may even be detrimental to your recovery efforts. Don't let apathy or a disconnected change management process get you in the end. Not only does integration help your readiness, it also reduces the dedicated time needed to get disaster recovery ready. Find a way to make what you use in disaster recovery a part of daily life.
Eric Beehler has been working in the IT industry since the mid-90s and has been playing with computer technology well before that. From Help desk technician to solutions provider, he has been involved at many layers of enterprise solutions from the desktop to the network to the server and the SAN. He currently has certifications from CompTIA (A+, N+, Server+), and Microsoft (MCITP: Enterprise Support Technician and Consumer Support Technician, MCTS: Windows Vista Configuration, MCDBA SQL Server 2000, MCSE+I Windows NT 4.0, MCSE Windows 2000, and MCSE Windows 2003). He also holds a Master’s degree in Business Administration from the University of Colorado at Colorado Springs. His experience includes more than nine years with Hewlett-Packard’s Managed Services division, working with Fortune 500 companies to deliver network and server solutions and, most recently, IT experience in the insurance industry working on highly available solutions and disaster recovery. He has co-authored books, including MCITP: Microsoft Windows Vista Desktop Support Enterprise Study Guide (Sybex/Wiley Publishing), authored several white papers, and co-hosts the "CS Techcast" podcast aimed at IT professionals. He provides consulting and training through Consortio Services, LLC.
For additional information about Disaster Recovery and High Availability topics, be sure to check out Marathon's Resource Center, which has an extensive library of white papers, webinars, and eBooks available for download.