SharePoint 2013 Disaster Recovery: Creating, Testing and Maintaining the DR Plan, Part 3

This is the third of four articles that cover creating, testing and maintaining a SharePoint DR plan. The articles are extracted from a chapter of the book Microsoft SharePoint 2013 Disaster Recovery Guide.

 

Creating an Effective Disaster Recovery Plan

Now that you have an inventory of each of the components of your SharePoint environment and have identified threats to the key components of your environment, the next step is to develop your SharePoint DR plan.

One thing you should never do is create your SharePoint DR plan in a vacuum. This means you should not develop your SharePoint DR plan without input and feedback from other key stakeholders whether they are IT stakeholders or business stakeholders.

Your SharePoint DR plan should be part of a larger business continuity plan (BCP) which is typically driven by business stakeholders. The BCP will identify what sites and components within the SharePoint environment are most critical and what the acceptable levels of downtime are for these items. The BCP should also contain the plan for communicating any downtime to the end users.

Identifying Key Stakeholders

The first step in creating an effective SharePoint DR plan is to define the overall scope of the plan and to define the key components that must be restored in the event of a disaster. Having a well-defined scope and knowing the key components will allow for a more productive process of developing your SharePoint DR plan. You begin by working with the key stakeholders within your organization that would be affected if your SharePoint environment were to become unavailable as a result of a disaster. In addition to the individuals responsible for administering and maintaining SharePoint itself, other key stakeholders will play a key role in developing the SharePoint DR plan.

IT

From an IT perspective, the key stakeholders are typically individuals representing the three key components of the physical architecture described earlier: Servers, Database and Network. In addition, there may be stakeholders from Messaging and Development, depending upon the configuration of your SharePoint environment and whether any custom-developed code runs in it.

Servers

The server support team will typically consist of individuals who are responsible for installing and maintaining the servers in your SharePoint environment from an operating system and hardware perspective. This holds true for both physical and virtual servers; however, if you have virtual servers in your SharePoint environment, other individuals may be responsible for installing and maintaining the virtual hosts.

Database

The database support team will typically consist of DBAs who are responsible for installing, configuring, and maintaining the SQL Server databases in your SharePoint environment. The DBAs are often responsible for monitoring the health of the SQL Servers as well as creating and maintaining database level backup schedules.

Network

The network support team consists of individuals responsible for the connectivity between the systems and services that make up your SharePoint environment. Included in this group are the individuals responsible for configuring and maintaining DNS, hardware load balancers and associated virtual IP (VIP) addresses.

Messaging

SharePoint supports both incoming and outgoing e-mail. If your SharePoint environment uses either or both, include a stakeholder from the messaging support team to make sure all matters regarding SharePoint and messaging are covered.

Development

The features and functionality of SharePoint can be enhanced and extended through custom developed features, solutions and apps. If your SharePoint environment makes use of custom developed code then your stakeholders should include representation from the development support team.

Business

From a business perspective, the number of key stakeholders can range from just one or two individuals in a small company to a significant number of individuals in a very large company. Regardless of the number of key business stakeholders, they will play the biggest role in defining the scope of your SharePoint DR plan. The key business stakeholders will identify the individual SharePoint sites and services that will need to be available and how soon they will need to be available in the event of a disaster. The results of this Business Impact Analysis (BIA) will be the foundation upon which your SharePoint DR plan will be developed.

The above examples of key stakeholders are typical of large and very large organizations. In a small or medium-sized organization, several of these roles will often be filled by the same person, or may not be filled at all. For example, in a small organization there may not be a DBA; the SQL Servers may have been set up by an outside consulting organization that is no longer engaged with your organization.

Whether or not your organization has all of these key stakeholders, this section provides examples of the types of individuals and roles that will play a part in developing your SharePoint DR plan.

Developing the Plan

The BIA will have a direct impact on the RTO and RPO goals of your SharePoint DR plan as it will define the recovery targets of the individual components of your SharePoint environment.

Defining Recovery Targets

Recovery targets are the key pieces of functionality and data, identified in the Business Impact Analysis (BIA), that need to be restored in the event of a disaster. They relate to the components of your SharePoint environment identified earlier in this article. You will need to work with the key stakeholders to establish the recovery targets for each component of your SharePoint environment in order to build your SharePoint DR plan.

In some cases, such as when you have a large SharePoint farm, your recovery targets may represent only a subset of the total functionality of your SharePoint farm. This will certainly be the case if the RTO is very aggressive and a full recovery would take more time than the RTO allows.
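One way to reason about which recovery targets fit inside an aggressive RTO is to walk the components in business-priority order and stop when the next restore would exceed the RTO. The following is a minimal sketch, assuming components are restored one after another and that the BIA has assigned each a priority and an estimated restore time. The function name, the tuple layout and the sample figures are hypothetical.

```python
from typing import List, Tuple

# (name, priority, estimated restore hours); a lower priority number is more critical.
Component = Tuple[str, int, float]


def components_within_rto(components: List[Component], rto_hours: float) -> List[str]:
    """Return the components that can be restored, in priority order, within the RTO.

    Assumes sequential restores. Stops at the first component that does not fit,
    so a lower-priority item is never restored ahead of a higher-priority one.
    """
    selected = []
    elapsed = 0.0
    for name, _priority, restore_hours in sorted(components, key=lambda c: c[1]):
        if elapsed + restore_hours > rto_hours:
            break
        elapsed += restore_hours
        selected.append(name)
    return selected


# Hypothetical inventory; the values are placeholders, not recommendations.
farm = [
    ("Team sites", 2, 3.0),
    ("Intranet portal", 1, 2.0),
    ("Archive site collection", 3, 6.0),
]
recovery_targets = components_within_rto(farm, rto_hours=5.0)
```

With these sample numbers only the intranet portal and the team sites fit within the five-hour RTO; the archive would be restored afterwards, outside the RTO window.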

Understanding Costs

It is important to remember that each decision made during the development of your SharePoint DR plan will have an associated cost. You must understand the cost of downtime in order to understand the cost impact of how you handle a SharePoint disaster. If your SharePoint farm contains mission-critical or business-critical data or applications, then the cost of downtime should be considered high. This means that the cost of developing a SharePoint DR plan, including investments in additional hardware, can be justified by the cost of the downtime it helps you avoid.
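As a rough illustration of this trade-off, the sketch below compares the expected annual cost of downtime with and without a DR investment. All names and figures are hypothetical placeholders rather than values from the book; substitute the numbers from your own BIA.

```python
def annual_downtime_cost(hourly_downtime_cost: float, expected_outage_hours: float) -> float:
    """Expected yearly cost of downtime: cost per hour times expected outage hours per year."""
    return hourly_downtime_cost * expected_outage_hours


def net_benefit_of_investment(
    hourly_downtime_cost: float,
    outage_hours_without_dr: float,
    outage_hours_with_dr: float,
    annual_dr_investment: float,
) -> float:
    """Downtime cost avoided per year minus the yearly cost of the DR investment.

    A positive result suggests the investment is offset by the downtime it avoids.
    """
    avoided = annual_downtime_cost(hourly_downtime_cost, outage_hours_without_dr) \
        - annual_downtime_cost(hourly_downtime_cost, outage_hours_with_dr)
    return avoided - annual_dr_investment


# Hypothetical example: 10,000 per hour of downtime, 24 hours of expected outage
# per year reduced to 4 hours by spending 150,000 per year on DR.
benefit = net_benefit_of_investment(10_000, 24, 4, 150_000)
```

In this made-up case the investment avoids 200,000 of downtime cost for 150,000 of spend, leaving a net benefit of 50,000 per year.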

If you have a demanding RPO (that is, you can tolerate very little data loss), you might need additional space for more frequent backups. This might mean an additional investment in storage space, third-party backup/restore software, or a SQL Server configured with failover clustering, database mirroring or AlwaysOn Availability Groups.

AlwaysOn Availability Groups is a feature of SQL Server 2012. It is a high-availability and disaster-recovery solution that provides an enterprise-level alternative to database mirroring that maximizes the availability of a set of user databases for an enterprise.

The following diagram shows a SharePoint farm configured for high availability. It illustrates how complex a SharePoint farm can become when you build for high availability; the number of servers and components involved shows just how costly this kind of solution can be.


 

Planning for high availability (HA) requires redundancy in your physical architecture such as failover SQL clusters and redundant application servers. It is important to know that HA could have some significant costs associated with it depending upon which components of your physical architecture you will build for HA.

The following diagram shows a SharePoint farm with a redundant farm in a secondary datacenter. The redundant farm is physically, and potentially geographically, separated from your primary farm. The advantage of this kind of redundancy comes into play in a disaster in which you lose your primary datacenter. For example, if a natural disaster such as a hurricane or flood takes out your primary datacenter while the physically and geographically separate secondary datacenter is unaffected, the secondary becomes your primary datacenter until the original is back up and running.


Virtualization

Virtualization provides a workable, cost-effective option for a recovery solution. You can use virtualization technology such as Hyper-V or VMware as an on-premises solution, or services such as Windows Azure or Amazon Web Services (AWS) as a hosted solution, to provide the necessary infrastructure for recovery.

You can create virtual images of the production servers and ship these images to a secondary datacenter. When using this virtual standby solution, you must make sure that the virtual images are created often enough to provide the level of farm configuration and content freshness required to meet your recovery targets and your RTO and RPO goals.

A Virtual Standby Solution maintains up-to-date standby virtual machines for fast, push-button disaster recovery. The bootable virtual machine is an exact clone of the production server as of the last snapshot or backup.
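Because the standby is only as fresh as the last snapshot, the snapshot interval bounds how much data can be lost. The sketch below checks an interval against an RPO and computes the loss window for a given failure. It is a simplification under stated assumptions: the function names are hypothetical, and the time needed to create and ship an image to the secondary datacenter is ignored.

```python
from datetime import datetime, timedelta


def data_loss_window(last_snapshot: datetime, failure_time: datetime) -> timedelta:
    """Changes made between the last snapshot and the failure are lost on failover."""
    if failure_time < last_snapshot:
        raise ValueError("failure_time must not be earlier than last_snapshot")
    return failure_time - last_snapshot


def snapshot_schedule_meets_rpo(snapshot_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, the failure happens just before the next snapshot,
    so the potential loss is about one full interval."""
    return snapshot_interval <= rpo


# Hypothetical schedule: snapshots every 15 minutes against a 30-minute RPO.
meets_rpo = snapshot_schedule_meets_rpo(timedelta(minutes=15), timedelta(minutes=30))
loss = data_loss_window(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 20))
```

If shipping an image takes a significant amount of time, add that transfer time to the interval before comparing it with the RPO.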

Service Level Agreements

A Service Level Agreement (SLA) is a written agreement that specifies the requirements for server or application uptime and the penalties for not meeting those requirements. Two of the most specific and important components within an SLA are RPO and RTO. Both components are extremely important in developing your SharePoint DR plan.

The RPO is measured backward from the moment of actual failure. It can be set in seconds, minutes, hours, or days, but must correspond to the amount of data loss that can be tolerated.

The RTO is typically based on lost revenue or productivity measured in seconds, minutes, hours, or days and corresponds to the measurable uptime (99.99%, 99.999% etc.) within an SLA.

The following table shows a very basic sample SharePoint SLA.

SERVICE ITEM                      SERVICE COMMITMENT
Availability                      99.9%
Recovery Time Objective (RTO)     < 5 Hours
Recovery Point Objective (RPO)    30 Minute Data Loss Window

Although a SharePoint SLA contains many more service items, the above sample shows the availability components only.
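To put the availability figure in context, an availability percentage can be converted into an allowed amount of downtime. The sketch below does that arithmetic; the function name and the choice of a 365-day year are assumptions made for illustration. At 99.9% the allowance is roughly 8.76 hours per year, so a single outage that takes the full 5-hour RTO would use more than half of it.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in the period at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


# Availability levels mentioned in this article.
for level in (99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(level)
    print(f"{level}% availability allows about {hours * 60:.1f} minutes of downtime per year")
```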

Planning for Recovery

Now that you have set the RPO and RTO for your SharePoint environment, established the recovery targets, and understood the costs associated with these recovery targets, it’s time to begin planning for recovery. Recovery is defined as the steps that must be taken in order to get the SharePoint environment back to an acceptable level of functionality as defined in the BIA.

Be sure your plan for recovery includes a communications plan. It will be important to keep the key stakeholders as well as end users up to date on the recovery process especially if there are mission critical applications that have been affected by the disaster.

Recovery Resources

To begin planning for recovery, start by identifying the resources, such as people, hardware and software, that will be needed to start the recovery process.

People

Your SharePoint DR plan should include a list of key individuals and stakeholders who will be part of the recovery process. For each person, the list should include the following:

  • Name
  • Department
  • Role
  • Primary Phone Number
  • Backup Phone Number
  • E-Mail Address
  • Recovery Responsibilities
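If you keep the contact list in a structured form, it becomes easy to check for gaps before a disaster occurs. The following is one possible sketch; the class name, field names and sample values are hypothetical and simply mirror the items listed above.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class RecoveryContact:
    """One entry in the DR plan's contact list."""
    name: str
    department: str
    role: str
    primary_phone: str
    backup_phone: str
    email: str
    recovery_responsibilities: List[str] = field(default_factory=list)


def missing_fields(contact: RecoveryContact) -> List[str]:
    """Return the names of any empty fields so they can be filled in ahead of time."""
    return [f.name for f in fields(contact) if not getattr(contact, f.name)]


# Hypothetical entry with no backup phone number recorded.
dba = RecoveryContact(
    name="A. Example",
    department="IT",
    role="DBA",
    primary_phone="555-0100",
    backup_phone="",
    email="dba@example.com",
    recovery_responsibilities=["Restore content databases"],
)
gaps = missing_fields(dba)
```

Running a check like this as part of regular plan maintenance helps catch stale or incomplete entries.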

Hardware

Once you have established what additional hardware will be needed for recovery, you should begin the process of acquiring the hardware so it is on hand as soon as possible. Whether the hardware is dedicated or shared, make sure it is clearly identified as hardware associated with the SharePoint recovery process.

Software

If any additional software is required for a secondary datacenter or failover farm, then you must acquire and maintain a sufficient number of licenses to support the secondary datacenter or failover farm in accordance with the software vendor’s licensing policy.

You also need to maintain a copy of each Service Pack, patch, hotfix, and Cumulative Update (CU) installed on your farm. Hotfixes are sometimes superseded by later hotfixes, or even retracted, and can no longer be downloaded in their original form. Maintaining copies ensures that, in a DR scenario, you can return to the exact patch level you had before the disaster.

Dependent Services

SharePoint depends on a number of services that may not be covered by the SharePoint DR plan. Services such as SQL Server, Active Directory (AD), DNS, and SMTP might have their own individual DR plans. It is important to make sure the RTO and RPO values for these services are in line with those of the SharePoint environment. If they are not, you must look at what needs to be done to bring them in line, even if it means adjusting the RTO and RPO of the SharePoint environment or increasing the budget set aside for the SharePoint DR plan.
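A simple comparison can flag dependent services whose targets are out of line. The sketch below interprets "in line" as meaning that a dependency's RTO and RPO are no longer than SharePoint's, since SharePoint cannot be recovered before the services it relies on. That interpretation, the function name and the sample values are all assumptions for illustration.

```python
from datetime import timedelta
from typing import Dict, List, Tuple


def misaligned_services(
    sharepoint_rto: timedelta,
    sharepoint_rpo: timedelta,
    dependencies: Dict[str, Tuple[timedelta, timedelta]],
) -> List[str]:
    """List dependencies whose (RTO, RPO) is longer than the SharePoint target."""
    problems = []
    for service, (rto, rpo) in dependencies.items():
        if rto > sharepoint_rto:
            problems.append(f"{service}: RTO {rto} exceeds SharePoint RTO {sharepoint_rto}")
        if rpo > sharepoint_rpo:
            problems.append(f"{service}: RPO {rpo} exceeds SharePoint RPO {sharepoint_rpo}")
    return problems


# Hypothetical targets taken from each service's own DR plan.
dependency_targets = {
    "SQL Server": (timedelta(hours=4), timedelta(minutes=30)),
    "Active Directory": (timedelta(hours=8), timedelta(hours=1)),
}
issues = misaligned_services(timedelta(hours=5), timedelta(minutes=30), dependency_targets)
```

Each reported issue is a prompt for a conversation with that service's owner, or for revisiting the SharePoint RTO, RPO or budget.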

Establishing and Documenting your Recovery Procedures

The next step in developing your SharePoint DR plan is to establish and document the procedures required for recovery. It is important to document these procedures as clearly and concisely as possible, with the understanding that the individual or individuals executing them may not have been a part of developing the SharePoint DR plan. They are also likely to be under a great deal of pressure during the execution of the plan, so the more clearly the procedures are written, the better the chance of success within the expected timelines.

It is important to socialize your SharePoint DR plan so that those who have a stake in the plan, or will take part in testing or executing it, are aware of it and know where to find the latest copy, as your SharePoint DR plan will be an ever-evolving document.

Define Success Criteria

How do you know your recovery is a success without defining success criteria? The criteria for determining whether your recovery was a success should be clearly defined in your SharePoint DR plan. Typically, success criteria are derived from the recovery targets established during the development of the plan. For example, if the recovery target for the corporate Intranet is one business day and, when executing or testing your SharePoint DR plan, you are able to have the corporate Intranet up and running in one business day or less, the recovery is considered a success.

Success criteria can vary for the different applications and sites that are part of your SharePoint farm. As you define your recovery targets when developing your DR plan, you will identify the various components of your SharePoint farm, including individual applications and sites, and what will define a successful recovery for each in the event of a disaster.
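Success criteria expressed as recovery-time targets can be checked mechanically after a test or a real recovery. The following minimal sketch assumes each component has a single time-based target and treats one business day as 8 working hours; the function name and sample values are hypothetical.

```python
from datetime import timedelta
from typing import Dict


def evaluate_recovery(
    targets: Dict[str, timedelta],
    actuals: Dict[str, timedelta],
) -> Dict[str, bool]:
    """A component passes only if it was recovered and within its target time."""
    return {
        component: component in actuals and actuals[component] <= target
        for component, target in targets.items()
    }


# Hypothetical targets and the measured results of a DR test.
targets = {
    "Corporate Intranet": timedelta(hours=8),  # one business day, assumed to be 8 hours
    "Team sites": timedelta(hours=24),
}
measured = {"Corporate Intranet": timedelta(hours=6)}  # team sites were not recovered
results = evaluate_recovery(targets, measured)
```

Here the corporate Intranet meets its target, while the team sites fail because no recovery time was recorded for them.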

Reviewing the Plan

Once you have completed your SharePoint DR plan, it is important that the plan is thoroughly reviewed for accuracy and clarity. This review should be completed by a third party or, if that’s not possible, by a qualified person or group that was not involved with creating the plan.

You should never consider your SharePoint DR plan complete until it has been checked and verified by parties that were not involved with creating the plan.

For more information, please refer to "Plan for high availability and disaster recovery for SharePoint 2013" at http://technet.microsoft.com/en-us/library/cc263031.aspx.