High Availability and Disaster Recovery

This article outlines the approach Flowgear uses to support disaster recovery and the steps you should take to ensure continuity in the event of an outage.

Data Redundancy and VM Availability

Flowgear data is stored in a dedicated Azure Storage account and Azure Key Vault for each subscription. The data for these resources is replicated within the primary region as well as to a secondary region. In the event of a failure in the primary region, a failover is performed to the secondary region, where the data is accessible in read-only mode.

Flowgear workload is processed on one or more Virtual Machines (VMs) dedicated to each subscription.

Subscriptions on our Standard plan and higher include two or more VMs. VMs are placed in separate fault and update domains to minimize the chance of multiple VMs failing simultaneously.

During normal operation, workload is routed to the least-loaded VM. The health of each VM in the cluster is monitored, and requests are no longer routed to a VM that is found to have stopped processing workload. In most cases, a request that has not been serviced due to a non-responsive VM will be seamlessly routed to another VM.

This mechanism ensures the service remains operable during an event such as a software or hardware failure.
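
The routing rule described above can be illustrated with a short Python sketch. This is not Flowgear's implementation: the VM class, its load and healthy fields, and the pick_vm function are hypothetical names used only to show the idea of skipping unhealthy VMs and choosing the least-loaded one.

```python
from dataclasses import dataclass


@dataclass
class VM:
    """Hypothetical view of a cluster VM as seen by the router."""
    name: str
    load: float          # assumed load metric; lower means less busy
    healthy: bool = True  # assumed result of the health monitoring


def pick_vm(vms):
    """Return the least-loaded VM among those that are still healthy."""
    candidates = [vm for vm in vms if vm.healthy]
    if not candidates:
        raise RuntimeError("no healthy VM available")
    return min(candidates, key=lambda vm: vm.load)
```

In this sketch, a VM that is marked unhealthy is never selected, so a request that would have gone to it is handled by one of the remaining VMs instead.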

Disaster Recovery

Flowgear Professional and Enterprise plans include one or more Virtual Machines (VMs) in a secondary region that process workload if the VMs in the primary region are unable to.

Under the Professional plan, the VMs in the secondary region are provisioned but kept deallocated until a failover is required (we refer to this as cold DR).

Under the Enterprise plan, the VMs in the secondary region are provisioned and running (we refer to this as hot DR).

We use Azure Traffic Manager (ATM) to probe the primary and secondary regions periodically. If ATM determines that the primary region is not responsive, a DNS-based failover will be performed. This means that the tenant hostname will begin to resolve to the secondary region rather than the primary.

To understand how this works, consider a tenant named testcorp.

The highly available hostname for this tenant will be testcorp.flowgear.net.

Regional hostnames are also created:

  • p-testcorp.flowgear.net is the primary region
  • s-testcorp.flowgear.net is the secondary region

When the primary region is online, testcorp.flowgear.net resolves to p-testcorp.flowgear.net. If the primary region is not responsive, testcorp.flowgear.net will begin to resolve to s-testcorp.flowgear.net instead. When the primary region recovers, testcorp.flowgear.net will resolve to p-testcorp.flowgear.net again.
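
One way to see which region the highly available hostname currently points to is to compare its resolved addresses with those of the two regional hostnames. The Python sketch below does this, assuming the hostnames follow the testcorp pattern described above. The function names resolve_ips and active_region are illustrative and not part of any Flowgear tooling, and comparing IP addresses is an assumption about how the mapping can be observed from outside.

```python
import socket


def resolve_ips(hostname, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses a hostname resolves to (empty on failure)."""
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        # Covers socket.gaierror, e.g. when a hostname cannot be resolved.
        return set()
    return set(addresses)


def active_region(tenant, domain="flowgear.net", resolver=socket.gethostbyname_ex):
    """Report whether the highly available hostname matches primary or secondary."""
    ha_ips = resolve_ips(f"{tenant}.{domain}", resolver)
    if ha_ips & resolve_ips(f"p-{tenant}.{domain}", resolver):
        return "primary"
    if ha_ips & resolve_ips(f"s-{tenant}.{domain}", resolver):
        return "secondary"
    return "unknown"
```

Calling active_region("testcorp") performs live DNS lookups. Because DNS answers are cached for a period of time, the result may lag slightly behind an actual failover.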

Preparedness for Disaster Recovery

Take these actions to ensure that your tenant continues to operate smoothly in the event of a failover. Note that these steps only apply to Professional and Enterprise plan subscriptions.

Connect to the Secondary Tenant

Check that you are able to sign in to the secondary tenant by prefixing s- to the hostname. For example, if your hostname is testcorp, specify s-testcorp.

Review DropPoint Endpoints

For Enterprise plans, you can choose between having DropPoint connections fail over to the secondary region (default option) or having them connect to both the primary and secondary regions continuously.

Having the DropPoint connect to both regions continuously is preferable because it will be immediately able to service workload from the secondary region. To configure this, specify the primary and secondary endpoints in the DropPoint app rather than the highly available endpoint.

For example, instead of:

testcorp.flowgear.net

Specify:

p-testcorp.flowgear.net
s-testcorp.flowgear.net

Once you have made this change, click Update & Restart, then sign in to the secondary tenant in the Console and confirm that the DropPoints are showing online there.

We do not recommend explicitly specifying primary and secondary endpoints for tenants running on a Professional plan since the secondary VMs will not be running until a failover takes place. Although the DropPoint will continuously attempt to connect to the secondary, it uses an exponential back-off, so in the event of a failover it may take up to 30 minutes for it to connect to the secondary region. By contrast, when only the highly available endpoint is specified, the DropPoint should be able to connect to the secondary within a few minutes of the failover occurring.
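
To see why an exponential back-off can delay reconnection, consider the Python sketch below. The starting delay, growth factor and cap are hypothetical values chosen only for illustration; the DropPoint's actual retry settings are not documented here. The backoff_delays function is likewise an illustrative name.

```python
def backoff_delays(first_delay, factor, max_delay, attempts):
    """Return the wait (in seconds) before each retry under a capped exponential back-off."""
    delays = []
    delay = first_delay
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay = min(delay * factor, max_delay)
    return delays


# Hypothetical settings: start at 1 second, double each time, cap at 30 minutes.
schedule = backoff_delays(first_delay=1, factor=2, max_delay=1800, attempts=13)
```

With these assumed settings, the wait between attempts grows until it reaches the cap. If a failover happens just after a failed attempt at that point, the DropPoint would not try the secondary again until the full wait has elapsed.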

Test Connections from the Secondary Tenant

The secondary tenant is located in a different Azure region to the primary and therefore has a different outbound IP address.

You can confirm the IP address for a tenant in the Console under ⚙ → Cluster. In the Cluster screen, click Tools → Get Request Info. Note the Outbound IP Address.

Run a Connection Test on all Connections that do not use a DropPoint to be sure they are not dependent on a whitelisted source IP.

The Outbound IP Address is only stable for Enterprise plans because a dedicated NAT gateway is provisioned for this plan. For all other plans, the IP address may change at any time and should not be relied on. We recommend using a DropPoint to avoid reliance on an IP address.

Test HTTP Invokes to the Secondary Tenant

Because the secondary tenant is located in a different Azure region to the primary, it also has a different (inbound) IP address.

If you have Workflows that are published as REST APIs, the hostnames that are allocated to their Site will begin to resolve to the secondary tenant during a failover.

Test that HTTP invokes can be successfully serviced on the secondary tenant by using one of the approaches below.

Test method 1: override DNS resolution on local server

On the server that invokes the Flowgear Workflow via API call, edit the HOSTS file (Windows) or /etc/hosts file (Linux) so that the hostname resolves to the secondary region.

  1. Determine the IP address of the secondary region by pinging it (ping s-testcorp.flowgear.net). We'll use 172.190.61.2 as an example of the returned IP address.

  2. Confirm the hostname that is being used to invoke Workflows (Top Nav → Edit Site Settings, Click for the Environment you intend to invoke and confirm the value in Hostname). We'll use an example hostname of testcorp-api.flowgear.net.

  3. Create a temporary hosts file entry to cause the hostname to resolve to the secondary region. On Windows, open %windir%\system32\drivers\etc\hosts. On Linux, open /etc/hosts. Create an entry like this:

    172.190.61.2 testcorp-api.flowgear.net
    
  4. When you invoke the API from the local server, the request should route to the secondary tenant.

  5. Don't forget to remove the entry from your hosts file when you have finished testing.
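
The steps above can be supported with a small Python sketch that builds the hosts file line and checks whether the override is in effect. The function names hosts_entry and override_active are illustrative, and the hostnames are the example values used in this article. The sketch assumes that socket.gethostbyname consults the local hosts file, which is typically the case because it uses the operating system's resolver.

```python
import socket


def hosts_entry(secondary_host, api_host, resolver=socket.gethostbyname):
    """Build the hosts file line that points api_host at the secondary region."""
    secondary_ip = resolver(secondary_host)
    return f"{secondary_ip} {api_host}"


def override_active(secondary_host, api_host, resolver=socket.gethostbyname):
    """Return True when api_host resolves to the same address as the secondary region."""
    return resolver(api_host) == resolver(secondary_host)
```

For example, hosts_entry("s-testcorp.flowgear.net", "testcorp-api.flowgear.net") returns the line to add in step 3, and override_active with the same arguments can be run before step 4 to confirm the entry has taken effect on that server.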

Test method 2: change the hostname in the request URL and add a Host header

Adjust the code that invokes the Flowgear Workflow via API so that it uses the secondary endpoint in the URL but specifies the correct hostname in the Host request header.

  1. Confirm the hostname that is being used to invoke Workflows (Top Nav → Edit Site Settings, Click for the Environment you intend to invoke and confirm the value in Hostname). We'll use an example hostname of testcorp-api.flowgear.net.

  2. Change your code so that instead of calling https://testcorp-api.flowgear.net/someapicall, it uses the secondary tenant name (e.g. https://s-testcorp.flowgear.net/someapicall).

  3. Change your code so it also explicitly adds a custom Host header (e.g. Host: testcorp-api.flowgear.net).

The domain provided in the URL is used to route the request to the secondary tenant while the Host header is used by Flowgear to determine which Site and Environment is being invoked when the request reaches the secondary tenant.

For example, instead of:

GET https://testcorp-api.flowgear.net/someapicall
Authorization: Bearer yourapikey

Use:

GET https://s-testcorp.flowgear.net/someapicall
Host: testcorp-api.flowgear.net
Authorization: Bearer yourapikey
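
As a minimal sketch of the adjusted request in Python, the code below uses the standard library's urllib.request to swap the domain in the URL and keep the original hostname in the Host header. The function names build_secondary_request and invoke_secondary are illustrative, and the URL, hostnames and API key are the placeholder values from the example above.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def build_secondary_request(url, secondary_host, api_key):
    """Return a GET request aimed at secondary_host that keeps the original Host header."""
    parts = urlsplit(url)
    original_host = parts.netloc
    # Replace only the domain so the path and query string are preserved.
    secondary_url = urlunsplit(parts._replace(netloc=secondary_host))
    return urllib.request.Request(
        secondary_url,
        headers={
            "Host": original_host,
            "Authorization": f"Bearer {api_key}",
        },
        method="GET",
    )


def invoke_secondary(url, secondary_host, api_key, timeout=30):
    """Send the request to the secondary tenant and return the HTTP status code."""
    request = build_secondary_request(url, secondary_host, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling invoke_secondary("https://testcorp-api.flowgear.net/someapicall", "s-testcorp.flowgear.net", "yourapikey") sends the request over the network. Your production code only needs the hostname and header change, so the same idea applies in whichever HTTP client it already uses.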

Remember that you can also automate testing of secondary tenant API Invokes by building a Workflow that uses the Web Request 2 Node to invoke the Workflow, adjusting the domain and Host header as described above.