Disaster Recovery

Disaster Recovery enables you to minimize downtime for virtual servers if the Compute Resource hosting them experiences a disaster scenario (for example, hardware failure, data center outage, etc.). This is done by restoring the latest backups of the affected virtual servers to a different, healthy Compute Resource.

In this topic, you'll learn how to do the following:

  • Offer Disaster Recovery to users.
  • Configure Disaster Recovery for Compute Resources.
  • Perform Disaster Recovery.
  • Clean up once a Disaster Recovery process has finished.

Overview

When there is an outage and the users' virtual servers are down, time is of the essence. Prolonged downtime can result in violated SLAs, lost money, and damaged reputation. Disaster Recovery enables you to bring the affected virtual servers back online as soon as possible, and with minimum effort.

Here is what Disaster Recovery does:

  1. Recreates a virtual server hosted on an offline Compute Resource (called the source Compute Resource) on a different, healthy and compatible Compute Resource (called the destination Compute Resource). The virtual server's configuration is exactly the same, including the same IP and MAC addresses it had on the source Compute Resource.

  2. Restores the latest backup of the virtual server being recovered, including both data and configuration, to the destination Compute Resource.

Compared to restoring virtual servers from backup by hand, Disaster Recovery has the following benefits:

  • Any number of virtual servers can be recovered with a few clicks. Once Disaster Recovery is enabled and configured, the process itself is very straightforward.

  • Disaster Recovery allows restoring virtual servers' backups to Compute Resources other than those the virtual servers were originally hosted on.

  • Disaster Recovery can be offered to users as an incentive or a paid add-on, improving your value proposition.

Note

Before advertising Disaster Recovery to your users, we strongly recommend that you read this documentation topic in its entirety and familiarize yourself with the currently existing limitations. You must also configure Disaster Recovery for your Compute Resources (you will learn how to do so later in this topic).

Known Issues and Limitations

At the moment, the following limitations exist:

  • A virtual server can only be recovered if it has at least one valid backup.

  • Any changes to the virtual server's configuration and data made since the creation of the latest backup are lost.

  • Performing Disaster Recovery requires at least one compatible Compute Resource.

  • Virtual servers can only be recovered to a Compute Resource with a matching virtualization type. KVM virtual machines can only be recovered to Compute Resources that have the KVM virtualization type. VZ containers can only be recovered to Compute Resources that have the VZ virtualization type.

  • Virtual servers from Compute Resources imported from SolusVM 1 can only be recovered to other Compute Resources imported from SolusVM 1.

  • Disaster Recovery does not start automatically in case of an outage. It must be performed manually by the administrator.

  • Recovering virtual servers that use VPC networks is not yet supported. Support will be added in one of the upcoming updates.

  • To enable users to add Disaster Recovery when ordering virtual servers via WHMCS, you need to update the solusvm2vps module to version 1.0.55 or later.

Offering Disaster Recovery to Users

You can offer Disaster Recovery for free. You can also include it as a bonus feature in your more expensive plans, or use it as an upsell opportunity, charging extra. If you would like to make Disaster Recovery a paid feature, you need to include it in your plans.

To include Disaster Recovery in a plan:

  1. Go to Compute Resources > Plans.

  2. Click Add Plan to create a new plan, or click the corresponding button to edit an existing plan.

  3. Select the "Offer backups" checkbox (this is necessary for Disaster Recovery to function), and then select the "Offer disaster recovery feature" checkbox.

  4. (Optional) To make Disaster Recovery a paid feature, specify what percentage of the base virtual server price to charge by entering it in the "Disaster recovery price in %" field. You can enter a value between 0 and 100 percent. By default, the value is zero, making Disaster Recovery free.

  5. Once you are done, click Save.
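
As a rough illustration of how a percentage-based price works, the sketch below computes the extra charge from a base virtual server price and the "Disaster recovery price in %" value. The function name and the rounding to two decimal places are assumptions made for this example; this topic does not state how SolusVM rounds or bills the resulting amount.

```python
from decimal import Decimal


def disaster_recovery_surcharge(base_price: Decimal, percent: Decimal) -> Decimal:
    """Return the extra charge for Disaster Recovery (hypothetical helper).

    base_price: base price of the virtual server.
    percent: the "Disaster recovery price in %" value, between 0 and 100.
    """
    if not (Decimal("0") <= percent <= Decimal("100")):
        raise ValueError("percent must be between 0 and 100")
    # Rounding to two decimal places is an assumption, not documented behavior.
    return (base_price * percent / Decimal("100")).quantize(Decimal("0.01"))


# A server with a base price of 20.00 and a 15% Disaster Recovery price.
print(disaster_recovery_surcharge(Decimal("20.00"), Decimal("15")))  # 3.00
```

With the default value of zero, the surcharge is 0.00, which corresponds to offering Disaster Recovery for free.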

Note

If you are selling virtual servers via WHMCS, learn how to make Disaster Recovery available to users in WHMCS.

Note

Disaster Recovery is possible even for virtual servers that do not have Disaster Recovery enabled. The only requirements for recovering a virtual server are that it has at least one valid backup, and that there is at least one healthy and compatible destination Compute Resource.

Warning

Enabling backups and Disaster Recovery for a plan does not mean that Disaster Recovery is now possible. You must also configure Disaster Recovery for your Compute Resources (you will learn how to do so later in this topic). This must be done in advance to make Disaster Recovery possible.

Configuring Disaster Recovery for a Compute Resource

For the virtual servers hosted on a Compute Resource to be recoverable, the following requirements must be met:

  • There is at least one healthy and compatible destination Compute Resource.

  • At least one such destination Compute Resource has enough available resources (RAM, VCPU, and free disk space) to host the virtual servers being recovered.

  • Every virtual server you want to be able to recover has at least one backup. The more recent the backup, the greater the chances of the virtual server being recovered in a state as close as possible to that at the moment of the outage.

If even one of these requirements is not met, Disaster Recovery may be impossible. We will show you how to verify that these requirements are met further in this section.

Verifying the Availability of Destination Compute Resources

For the virtual servers hosted on any given Compute Resource to be recoverable, there has to be at least one other healthy Compute Resource the virtual servers can be restored on. Moreover, the destination Compute Resource has to be compatible; we will explore what that means further in this section. Thus, at any given moment, some Compute Resources in your cluster may be ready for Disaster Recovery, and others may not be. To learn which Compute Resources are ready for Disaster Recovery, and what prevents the others from being ready, visit the Disaster Recovery > Resilience Status page.

On this screenshot, the virtual servers on the "Alpha" Compute Resource can be recovered, which is denoted by the icon in the "Can be recovered?" column. This means that there is at least one other healthy, compatible Compute Resource. The "Bravo" Compute Resource cannot be recovered, which is denoted by the icon.

Troubleshooting Issues Preventing the Availability of Destination Compute Resources

To learn what prevents a Compute Resource from being ready for Disaster Recovery, and also what virtual servers hosted on that Compute Resource can be recovered, click its name.

Here, you can see the list of all other Compute Resources in your cluster, both compatible and incompatible. The symbol shown in the "Overall" column indicates whether the corresponding Compute Resource is compatible (denoted by the icon) or incompatible (denoted by the icon). As long as at least one other Compute Resource is compatible, Disaster Recovery is possible for the selected source Compute Resource.

A Compute Resource is compatible as long as it meets all the necessary criteria. If one or more criteria are not met, the Compute Resource is not compatible. The criteria are separated into a number of categories to make it easier to identify the issues preventing any given Compute Resource from being compatible with the selected source Compute Resource. Some of those issues can be fixed to make an incompatible Compute Resource compatible. Some can make a Compute Resource fundamentally incompatible with the selected source Compute Resource.

Note

You can click any icon to learn more about the specific issue(s) that prevent a Compute Resource from being compatible with the selected source Compute Resource. For example, if the icon is shown in the "IP blocks" column, you can click it to learn what IP blocks are missing from the corresponding Compute Resource.

| Column | Issue | Resolution |
| --- | --- | --- |
| Status | The destination Compute Resource is not running or its SolusVM Agent is not reachable. | Make sure that the destination Compute Resource is online and available over the network. Also, make sure that the SolusVM Agent on the destination Compute Resource is running and can be accessed from the management node. |
| IP blocks | The destination Compute Resource is missing one or more IP blocks present on the source Compute Resource. | Add the missing IP blocks to the destination Compute Resource. |
| Virtualization | The source Compute Resource hosts one or more VZ virtual servers, but the destination Compute Resource only supports KVM virtual servers, or vice versa. | Select a different destination Compute Resource with the matching virtualization type. |
| Storage | One or more types of storage present on the source Compute Resource are missing from the destination Compute Resource. | Add the missing types of storage to the destination Compute Resource. |
| Source | The source Compute Resource was imported from SolusVM 1. | Select a different destination Compute Resource, one that was also imported from SolusVM 1. |
| Architecture | The architecture of the source and destination Compute Resources does not match (for example, x86-64 and ARM). | Select a different destination Compute Resource. |
| Capabilities | Some capabilities of the `libvirt` daemon present on the source Compute Resource are missing from the destination Compute Resource. | Make sure that the destination Compute Resource is online and available over the network. Change the `libvirt` configuration on the destination Compute Resource. |

To be able to use Disaster Recovery, make sure that there is at least one healthy and compatible destination Compute Resource for each Compute Resource you want to be able to recover.
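
The compatibility criteria described in this section can be read as a set of independent checks, one per column. The sketch below models them in Python purely for illustration. The `ComputeResource` class, its field names, and both functions are hypothetical and are not part of SolusVM; the actual checks performed by the product may be more detailed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeResource:
    # All field names are hypothetical; they mirror the columns described above.
    name: str
    online: bool = True
    agent_reachable: bool = True
    ip_blocks: frozenset = frozenset()
    virtualization: str = "kvm"
    storage_types: frozenset = frozenset()
    imported_from_solusvm1: bool = False
    architecture: str = "x86-64"
    libvirt_capabilities: frozenset = frozenset()


def compatibility_issues(source, destination):
    """Return {column: description} for every criterion the destination fails."""
    issues = {}
    if not (destination.online and destination.agent_reachable):
        issues["Status"] = "destination is not running or its agent is unreachable"
    missing = source.ip_blocks - destination.ip_blocks
    if missing:
        issues["IP blocks"] = "missing IP blocks: " + ", ".join(sorted(missing))
    if source.virtualization != destination.virtualization:
        issues["Virtualization"] = "virtualization types do not match"
    missing = source.storage_types - destination.storage_types
    if missing:
        issues["Storage"] = "missing storage types: " + ", ".join(sorted(missing))
    # Only the case named above is modeled: an imported source needs an imported destination.
    if source.imported_from_solusvm1 and not destination.imported_from_solusvm1:
        issues["Source"] = "destination was not imported from SolusVM 1"
    if source.architecture != destination.architecture:
        issues["Architecture"] = "architectures do not match"
    missing = source.libvirt_capabilities - destination.libvirt_capabilities
    if missing:
        issues["Capabilities"] = "missing libvirt capabilities: " + ", ".join(sorted(missing))
    return issues


def has_compatible_destination(source, candidates):
    """True if at least one other Compute Resource passes every check."""
    return any(not compatibility_issues(source, c) for c in candidates if c is not source)


alpha = ComputeResource("Alpha", ip_blocks=frozenset({"block-a"}))
bravo = ComputeResource("Bravo")  # lacks "block-a"
print(compatibility_issues(alpha, bravo))  # {'IP blocks': 'missing IP blocks: block-a'}
```

In this model, a destination is compatible only when the returned dictionary is empty, which matches the rule that failing even one criterion makes a Compute Resource incompatible.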

Verifying the Availability of Virtual Server Backups

Just because a Compute Resource is shown as ready for Disaster Recovery does not mean that all virtual servers hosted on it can be recovered. A virtual server can only be recovered if it has at least one backup.

The current state of backup coverage for a Compute Resource can be found on the Disaster Recovery > Resilience Status page.

Here, you can see how many virtual servers hosted on the selected source Compute Resource have backups, and are thus eligible for Disaster Recovery. This information is shown separately for virtual servers that have Disaster Recovery enabled, and for those that do not.

  • The "Status" column shows the icon if every virtual server in the category has a backup, and the icon otherwise.

  • The "Available" column shows the number of virtual servers that have at least one backup.

  • The "Expired" column shows the number of virtual servers that have at least one backup, but no backups created within the last seven days.

  • The "Total" column shows the total number of virtual servers.

  • The "Without backups" column shows the number of virtual servers that do not have any backups.
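
To make the meaning of these columns concrete, the sketch below derives the same counters from a mapping of virtual server names to the time of their newest backup. The function name and the input format are assumptions made for this example; only the column definitions and the seven-day window come from the list above.

```python
from datetime import datetime, timedelta


def backup_coverage(latest_backups, now, max_age=timedelta(days=7)):
    """Summarize backup coverage for one category of virtual servers.

    latest_backups: hypothetical mapping of server name -> datetime of the
    newest backup, or None if the server has no backups at all.
    """
    total = len(latest_backups)
    newest = [t for t in latest_backups.values() if t is not None]
    available = len(newest)
    # "Expired": the server has a backup, but none created within max_age.
    expired = sum(1 for t in newest if now - t > max_age)
    without = total - available
    return {
        "Status": without == 0,  # True if every server has a backup
        "Available": available,
        "Expired": expired,
        "Total": total,
        "Without backups": without,
    }


now = datetime(2024, 6, 15, 12, 0)
print(backup_coverage(
    {
        "web-1": datetime(2024, 6, 14, 3, 0),  # recent backup
        "db-1": datetime(2024, 6, 1, 3, 0),    # older than seven days
        "test-1": None,                        # no backups
    },
    now,
))
```

Note that, under these definitions, an "Expired" virtual server is still counted as "Available", because it has at least one backup and can therefore be recovered.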

Verifying the Availability of Resources

Even if a Compute Resource is shown as ready for Disaster Recovery and every virtual server hosted on it has a backup, that does not mean that all virtual servers hosted on it can be recovered. A virtual server can only be recovered if at least one compatible destination Compute Resource has enough resources (RAM, VCPU, and free disk space) to accommodate the virtual servers being recovered.

The current resource availability can be found on the Disaster Recovery > Resilience Status page. The information is shown separately for virtual servers that have Disaster Recovery enabled, virtual servers that do not have Disaster Recovery enabled, but have at least one backup (and thus can also be recovered), and all virtual servers combined.

On this screenshot, you can see that Disaster Recovery is not possible because the compatible destination Compute Resources do not have the necessary amount of free disk space between them.
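
The resource check can be thought of as comparing the combined requirements of the virtual servers being recovered with what a destination Compute Resource has free. The following is a minimal sketch of that arithmetic; the `Resources` class, its units, and the function names are assumptions chosen for illustration, and SolusVM may account for resources differently (for example, with overcommitment).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    # Hypothetical units: MiB of RAM, number of virtual CPUs, GiB of disk.
    ram_mib: int
    vcpu: int
    disk_gib: int


def total_required(servers):
    """Sum the requirements of all virtual servers selected for recovery."""
    return Resources(
        ram_mib=sum(s.ram_mib for s in servers),
        vcpu=sum(s.vcpu for s in servers),
        disk_gib=sum(s.disk_gib for s in servers),
    )


def shortfall(required, available):
    """Return {resource: missing amount} for every resource that is insufficient."""
    gaps = {}
    for name in ("ram_mib", "vcpu", "disk_gib"):
        missing = getattr(required, name) - getattr(available, name)
        if missing > 0:
            gaps[name] = missing
    return gaps


servers = [Resources(2048, 2, 40), Resources(4096, 4, 80)]
destination_free = Resources(ram_mib=16384, vcpu=8, disk_gib=100)
print(shortfall(total_required(servers), destination_free))  # {'disk_gib': 20}
```

An empty result means the destination can accommodate all selected virtual servers; in the example, recovery would be blocked by 20 GiB of missing disk space.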

Configuring the Maximum Number of Concurrent Restores

When a Disaster Recovery process is started, the virtual servers selected for recovery will begin being recreated and restored from backups, one by one. You can speed up the recovery process by allowing the recovery of multiple virtual servers at once. This is done by setting the desired "concurrent restores" value for both the destination Compute Resource and the backup node the backups are stored on.

To configure the maximum number of concurrent restores for a destination Compute Resource and the backup node:

  1. Go to Compute Resources, locate the desired destination Compute Resource, and then click the button.

  2. Under "Concurrent Backups", set the "Restore" value to the desired maximum number of virtual servers that can be restored concurrently, and then click Save.

  3. Go to Backups > Backup Nodes, locate the backup node storing the virtual server backups for the source Compute Resource, and then click the button.

  4. Under "Concurrent Backups", set the "Restore" value to the desired maximum number of virtual servers that can be restored concurrently, and then click Save.

The maximum number of concurrent restores should now be equal to the number you entered. You can see the currently configured "Max. concurrent restores" value for every destination Compute Resource by going to the Disaster Recovery > Resilience Status page and clicking the name of a Compute Resource.

Here, the currently configured "Max. concurrent restores" value is two. If the value being shown is not the one you expected to see, repeat the steps above, keeping in mind that the "Max. concurrent restores" value is the lower of the two values configured for the destination Compute Resource and the backup node.
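
Put differently, the effective limit is the minimum of the two "Restore" settings. A one-line sketch, with a hypothetical function name:

```python
def effective_max_concurrent_restores(destination_limit: int, backup_node_limit: int) -> int:
    """Return the effective limit: the lower of the two configured "Restore" values."""
    return min(destination_limit, backup_node_limit)


# Destination Compute Resource allows 4, backup node allows 2.
print(effective_max_concurrent_restores(4, 2))  # 2
```

This is why raising the value on only one side may have no visible effect: both the destination Compute Resource and the backup node need to allow the desired number of concurrent restores.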

Performing Disaster Recovery

Compute Resources that are not running or whose SolusVM Agent is not reachable, likely due to some outage or malfunction, are eligible for Disaster Recovery. You can see the list of Compute Resources eligible for Disaster Recovery on the Disaster Recovery page.

Note

"Eligible for Disaster Recovery" does not mean that any or all virtual servers hosted on the affected Compute Resource can be recovered. For successful recovery, a compatible destination Compute Resource must exist, and the hosted virtual servers must have backups.

You can also make a healthy Compute Resource appear on that page in two ways:

  • (Recommended) By shutting the Compute Resource down.

  • (Not recommended) By either stopping the SolusVM Agent on the Compute Resource or filtering the port it listens on (6767 by default).

Note

The latter option is fine when, for example, you only want to test the feature. However, if you actually recover one or more virtual servers this way, restarting the SolusVM Agent afterwards will not make SolusVM register that the virtual servers are now hosted on a different Compute Resource. This is likely to result in network conflicts, because the virtual servers will be assigned the same IP addresses on both the source and the destination Compute Resources. To avoid this, stop the source Compute Resource, and then start it again.

If one or more Compute Resources are shown on the Disaster Recovery page, and it is not because you manually shut them down or stopped their SolusVM Agent, there may have been some outage or malfunction. Before beginning disaster recovery, we recommend trying to access the affected Compute Resource(s) via SSH, and if successful, verifying that the SolusVM Agent is running and is reachable from the management node.
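
One quick way to check reachability from the management node is to attempt a connection to the port the SolusVM Agent listens on. The sketch below uses only the Python standard library. It assumes that the agent accepts TCP connections on port 6767 (the default mentioned above); the function name is hypothetical, and a successful connection only shows that the port is open, not that the agent is fully functional.

```python
import socket


def agent_port_reachable(host: str, port: int = 6767, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Assumption: the SolusVM Agent listens on TCP; 6767 is the documented default port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, calling `agent_port_reachable("203.0.113.10")` from the management node (the address here is a placeholder) returns `False` if the Compute Resource is down, the agent is stopped, or the port is filtered.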

If you know for a fact that there has been an outage or failure, or if you are unable to access the affected Compute Resource(s) via SSH, we recommend that you start a Disaster Recovery process.

Starting a Disaster Recovery Process

To start a Disaster Recovery process:

  1. Go to Disaster Recovery.

  2. Click the name of the Compute Resource you want to recover.

  3. Select the checkboxes corresponding to the virtual server(s) you want to recover. You can also click Disaster recovery enabled to select all virtual servers with Disaster Recovery enabled that have at least one backup, or Backups to select all virtual servers with at least one backup.

Note

Remember that a virtual server without Disaster Recovery enabled can still be recovered as long as it has at least one backup.

  4. Click Recover selected. The "Recover Compute Resources" window will open.

  5. If desired, set the "Destination Compute Resource selection method" toggle to "Manual", and then select the desired destination Compute Resource from the menu.

  6. Under "Available resources", verify that the selected destination Compute Resource has enough RAM, VCPU, and free disk space available to accommodate all virtual servers being recovered.

  7. If desired, select the "Notify server owner(s) via email" checkbox to have SolusVM send an email to the owners of the virtual servers being recovered. You can use the default email template, edit it, or write your own and paste it into the "Notification text" window.

  8. Click Start disaster recovery, and then click Start to begin recovering virtual servers.

Note

You will not be able to click Start disaster recovery unless there is a sufficient amount of resources on the destination Compute Resource.

The Disaster Recovery process has started. You can monitor its progress on the Disaster Recovery > Activities page.

If the Disaster Recovery process fails for any reason, we recommend that you troubleshoot the root cause. To learn about the specific error(s) that caused the failure, click the button next to the "Failed" status.

Once you have removed the root cause, click Retry to start a new Disaster Recovery process.

During the Disaster Recovery process, keep in mind the following:

  • The maximum number of virtual servers being recovered at the same time cannot exceed the "Max. concurrent restores" value. The greater the number of virtual servers being recovered and the smaller the "Max. concurrent restores" value, the longer the recovery process is going to take.

  • The virtual servers being recovered will remain unavailable until the recovery process has finished.

  • Once the Disaster Recovery process has finished, the recovered virtual server(s) will automatically become available. They will have the same IP address(es) they were assigned on the source Compute Resource.
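
To get a feel for how the "Max. concurrent restores" value affects the duration, you can estimate the minimum number of restore "rounds" needed. This is a simplified model that assumes all restores take about the same time; in practice, restore times vary with backup size, so treat the result as a lower bound rather than a prediction. The function name is hypothetical.

```python
import math


def minimum_restore_rounds(server_count: int, max_concurrent_restores: int) -> int:
    """Return the smallest number of rounds needed to restore all servers."""
    if server_count < 0 or max_concurrent_restores < 1:
        raise ValueError("server_count must be >= 0 and max_concurrent_restores >= 1")
    return math.ceil(server_count / max_concurrent_restores)


# Ten virtual servers with different "Max. concurrent restores" values.
print(minimum_restore_rounds(10, 1))  # 10
print(minimum_restore_rounds(10, 2))  # 5
print(minimum_restore_rounds(10, 4))  # 3
```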

Cleaning up After Disaster Recovery

Once the Disaster Recovery process has finished, a small amount of cleanup remains. On the Disaster Recovery > Activities page, you can view the original configuration of every recovered virtual server and remove its data from the source Compute Resource.

To view and remove the virtual server's original data:

  1. Go to the Disaster Recovery > Activities page.

  2. Locate and click the desired Disaster Recovery process.

  3. Select one or more virtual servers, and then click View original data.

If the source Compute Resource has been brought back online and is reachable from the management node, make sure to remove the virtual servers' data from that Compute Resource to avoid potential issues and conflicts.

  4. Select one or more virtual servers, and then click Delete original data.