A myth about high availability and data recovery, and a plan for it

Article | June 30, 2023
By Duncan Bradley

A series of recent customer conversations has made clear to me that there is a misperception about investments in high availability and data replication.

Many organizations seem to believe these investments will save them in the event of a cyberattack resulting in a logical data corruption event.  

They won’t.

It doesn’t matter if your critical data is held on premises, in the cloud, or on SaaS systems. In most cyberattacks that result in logical data corruption events, the corruption will be replicated across your high availability and data replication solutions.

The replication corrupts all online copies of your data. Servers across the estate end up encrypted because highly available, replicated solutions act as a vector that accelerates the propagation of the attack. This scenario drives the need to recover server images and data from backup.

Adding immutability to backups is better, but it’s often not enough to avoid a significant outage. Most backup solutions aren’t built to enable recovery of hundreds of servers or terabytes of data within hours.

I meet with many organizations that haven’t attended to this reality. Many don’t realize they only have capacity to restore their environments over the course of weeks—not hours. Their current backup approaches require that after they clean an infection, they will have to rebuild many of their servers before they can begin recovering data. And they’re ill prepared to do so.

From what I’m seeing, now is the time for many organizations to re-assess their data and server recovery capabilities to ensure the capabilities meet business needs.

Disaster recovery scenarios have changed

Over the past 20 years, most organizations’ infrastructure direction has been toward consolidation, with significant investments in clustering and storage replication for critical systems to meet increasingly aggressive availability SLAs.

While this approach may protect them from fire, flood, and network outage events, disaster recovery scenarios have changed drastically over the past few years, and most organizations have not adapted to the threat of ransomware.

To recover from most ransomware events, organizations need their servers and data to be recoverable from backup. Doing so often requires them to restore a version of a server image from many weeks prior, before the attack was seeded.

To do this, they need a backup protection solution with attributes that position them to recover from the most common ransomware events:

  • Air gap: The backup data has physical or logical separation from the production data to keep it safe in the event of a cyber breach.
  • Immutability and retention lock: Once written, the backup data cannot be changed or expired by a cyberattack.
  • Ransomware detection and anomaly scans: The backup data is checked regularly to improve safety from known compromises.
  • Mass recovery: Recover the organization’s critical systems within business impact tolerance.
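The “anomaly scan” attribute above can be illustrated with a toy sketch. One common heuristic is to flag a backup job whose changed-data volume deviates sharply from its recent baseline, since mass encryption by ransomware typically rewrites far more blocks than an ordinary day. This is a minimal illustration of the idea, not a description of any product’s detection engine; the function name and numbers are hypothetical:

```python
# Toy anomaly check: flag a backup job whose changed-data volume is a
# statistical outlier against recent, known-good incremental sizes.
from statistics import mean, stdev

def is_anomalous(history_gb, todays_gb, z_threshold=3.0):
    """Return True if today's incremental backup size is an outlier.

    history_gb: daily incremental sizes (GB) for recent, known-good days.
    todays_gb:  size of the incremental backup being checked.
    """
    mu = mean(history_gb)
    sigma = stdev(history_gb)
    if sigma == 0:  # flat history: any deviation is suspicious
        return todays_gb != mu
    return abs(todays_gb - mu) / sigma > z_threshold

# A week of ordinary incrementals (~2% daily change), then a spike.
normal_days = [41, 38, 44, 40, 39, 42, 37]
print(is_anomalous(normal_days, 43))    # ordinary day → False
print(is_anomalous(normal_days, 1800))  # near-full rewrite → True
```

Real solutions combine several signals (entropy, file-type churn, known indicators of compromise), but even this simple change-rate check shows why scanning backup data regularly matters.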

While some customers are embarking on a journey to add additional protection and increased recoverability, too many others still are exposed. More worryingly, they do not know they are exposed.

Case in point: specs from a recent RFP

Earlier this year, I received an RFP for a new cyber backup protection solution. The requestor wanted to replace their current backup solution with a cyber-protected one.

Requirements were to provide their critical systems with:

  • 15-min recovery point objective (RPO), in other words virtually zero data loss
  • 30-min recovery time objective (RTO), or virtually instant recovery. Yet they stated that their current backup schedule is a single weekly full backup with daily incremental backups.

While these requirements may be technically achievable, meeting them would require major changes to the supporting compute, storage, and network infrastructure, as well as a total redesign of backup policies.
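A back-of-envelope calculation shows why. With a weekly full plus daily incrementals, a restore must replay the last full backup and every incremental taken since. Using illustrative numbers of my own (not figures from the RFP), even an optimistic sustained restore throughput puts the recovery time in hours, not minutes:

```python
# Back-of-envelope restore-time estimate (illustrative numbers only).
# With a weekly full + daily incrementals, a restore replays the last
# full backup plus every incremental taken since.

def restore_hours(full_tb, daily_change_rate, days_since_full,
                  restore_gbps):
    """Estimate hours to restore one full backup + N incrementals.

    restore_gbps: sustained restore throughput in gigabytes/second,
                  aggregate across the backup environment.
    """
    incrementals_tb = full_tb * daily_change_rate * days_since_full
    total_gb = (full_tb + incrementals_tb) * 1000
    return total_gb / restore_gbps / 3600

# 50 TB estate, 2% daily change, 6 days since the weekly full,
# 2 GB/s sustained restore throughput (optimistic).
hours = restore_hours(50, 0.02, 6, 2.0)
print(f"{hours:.1f} hours")  # → 7.8 hours, far from a 30-minute RTO
```

Tighten any assumption (larger estate, slower effective throughput, restore-job overhead) and the gap to a 30-minute RTO only widens.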

I’ve seen numerous data recoverability advisories for customers showing server and data recovery timelines of more than 10 days for their minimum viable company infrastructure.

What stops them from realizing this, especially when their backup and restore service shows all-green SLAs? Well…

For the last 20 years, their nightly backups completed within SLAs, and monthly restore tests showed the data was backed up consistently. But most daily backups are incremental and capture only a small percentage of server data.

To recover all server data, it will often take days—and mass recovery of multiple servers puts huge contention onto the backup environment and networks, making it even slower.
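The contention effect can be sketched with the same kind of rough arithmetic. When many servers restore at once, they share the backup environment’s finite aggregate throughput, so total recovery time grows roughly linearly with server count; the numbers below are illustrative assumptions, not measurements:

```python
# Illustrative sketch of mass-recovery contention: all restoring servers
# share one finite restore pipe, so total time scales with server count.

def mass_restore_hours(servers, gb_per_server, shared_gbps):
    """Hours to restore all servers over one shared restore pipe."""
    total_gb = servers * gb_per_server
    return total_gb / shared_gbps / 3600

# One server restoring alone vs. 300 servers in a mass-recovery event,
# all over the same 2 GB/s aggregate throughput (illustrative numbers).
print(f"{mass_restore_hours(1, 500, 2.0):.2f} h")    # → 0.07 h
print(f"{mass_restore_hours(300, 500, 2.0):.1f} h")  # → 20.8 h
```

In practice the picture is worse than linear: under heavy contention the effective throughput itself degrades, which is how “restore one server in minutes” becomes “restore the estate in days.”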

How to improve data recovery capabilities

I mentioned this challenge has come up in recent customer conversations.

Here’s what I’ve told them: you need to understand your organization’s capability to recover data and servers in the event of a mass corruption event, so you can make an informed business decision about whether it’s a risk you need or want to mitigate.

If you’re looking to improve your recovery capabilities, first consider a holistic approach over simply launching an RFP:

  • Identify your critical server and data volumes and which systems store them. It’s also important to understand how much data loss the business can tolerate.
  • Pinpoint which systems need to be recovered first, and in what business tolerance timelines. Can simple changes be made to business processes or applications to lessen the criticality of underlying systems, thus reducing the volume of systems that have critical business impact tolerances and require expensive recovery techniques?
  • Don’t forget to consider your data in cloud and SaaS solutions. These are just as vulnerable!

The more data you need to protect, the faster you need to be able to recover it. Your organization’s tolerance for data loss will have a massive effect on the cost of implementing a sufficient server and data recovery solution.

Duncan Bradley is Director of Customer Engagement and Country Practice Leader for Kyndryl’s UKI Cyber Resiliency Practice.