When (if ever) is High Availability “good enough”? Do I always need disaster recovery or business continuity plans?

July 24, 2020 |

When businesses discuss how to minimize or mitigate risk with IT folks, the topic can be a challenging one – but these conversations are even more critical in today’s climate. One area of confusion that sometimes comes up is the difference between high availability, disaster recovery, and business continuity solutions. These are actually three related but distinct ideas, each with its own focus.

High Availability (HA) – a per-application environment that is “available” (operating) almost all the time (typically defined as “five nines”, 99.999% or better) and that relies on automation to avoid user impact

Disaster Recovery (DR) – focuses on getting IT infrastructure and processes back up and running after a catastrophic failure

Business Continuity Plan (BCP) – focuses on planning for adverse conditions: how a business responds, how it returns to normal, and what the new normal is.
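To make the “five nines” target concrete, here is a quick back-of-the-envelope calculation (a sketch, not from the original post) showing how little downtime each availability level actually permits per year:

```python
# Allowed downtime per year for a given availability target.
# "Five nines" (99.999%) leaves only about five minutes of outage annually.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the maximum minutes of downtime per year at this availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {allowed_downtime_minutes(nines):.1f} min/year")
```

At 99% you can be down for more than three days a year; at 99.999% you get barely five minutes – which is why automation, not manual intervention, is the defining trait of HA.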

Target use case        Scope              Failover timing   Automation      Loss of data      User experience
High availability      Application        Critical          Required        Not permitted     No impact (seamless)
Disaster Recovery      IT Infrastructure  Important         Not typically   Maybe permitted   Impacted
Business Continuity    Business           Important         Not typically   Maybe permitted   Impacted

Key words to remember with a high availability solution are “automated” and “seamless”. In other words, when there is a failure in a high-availability system, the recovery is automatic and users should not notice. Well-designed HA systems avoid single points of failure.

High Availability (HA) notes

So, the big question is: “When (if ever) is an application’s HA implementation good enough?” I would say when it’s a well-designed, geographically redundant system, such as the one in the example diagram.

Most of the challenges with implementing high availability in a corporate or enterprise environment relate to the product vendor or application design – especially when dealing with newer or more niche products. It’s not a stretch to imagine that few programmers or product managers in small organizations have had to support, or even understand, the rigorous requirements of a 24×7 mission-critical application. That lack of experience and institutional knowledge can (in my experience) lead to architectural decisions that are incompatible with enterprise high availability solutions and lack the robust response we typically require. For example, I’ve worked with more than one vendor solution that runs on a single server and won’t run on a second server unless the first is down. That rules out high availability automatically. And since a key tenet of HA is lack of user impact, it means that if (when) such a system fails, there WILL be user impact, even if it’s easily mitigated.

So, if you are evaluating a smaller or niche vendor application and you have a requirement for HA or DR, I would focus a lot of energy on this topic, because HA support in such products is almost always second-rate.

A robust HA system is created from the ground (code) up. The best systems isolate discrete components from each other, then interconnect them so the processing load can balance across redundant instances. When roles are separate and connections are stateless (or resilient), failover is seamless (or near seamless) whenever a host or component becomes unavailable.
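The stateless-failover idea above can be sketched in a few lines. This is an illustrative example only: the replica host names and the `fetch()` stub are hypothetical, standing in for a real HTTP client and real infrastructure.

```python
# A client treats a pool of redundant, interchangeable replicas as one
# service. Because requests are stateless, any healthy replica can serve
# them, so a failed host is invisible to the caller.
import random

REPLICAS = ["app-east.example.com", "app-west.example.com"]  # hypothetical hosts

def fetch(host: str, path: str) -> str:
    """Stand-in for a real HTTP call; one region is simulated as down."""
    if host == "app-east.example.com":
        raise ConnectionError(f"{host} unavailable")
    return f"200 OK from {host}"

def resilient_get(path: str) -> str:
    """Try replicas in random order; fall through to the next on failure."""
    last_error = None
    for host in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return fetch(host, path)
        except ConnectionError as err:
            last_error = err  # this replica is down; try the next one
    raise RuntimeError("all replicas unavailable") from last_error

print(resilient_get("/status"))  # served by whichever replica is healthy
```

Note that the failover logic lives entirely in the request path – no operator action, no user-visible error – which is exactly the “automated and seamless” property that defines HA.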

Taking this a step further, geographic load balancing can provide an extra layer of safety that, in some cases, avoids the need to execute a disaster plan at all – this is what I alluded to earlier. The key requirement is that the geographically diverse infrastructure be far enough apart that no single event could plausibly impact all locations. That’s easy to reason about for floods or fires, but harder for threats like grid power outages – the 2003 blackout in the northeast USA reportedly affected 8 states and 45 million people.

The best apps are (among other things) highly resilient – high availability by nature – which could be summed up, from a user’s perspective, as “it just works”.

Disaster Recovery (DR) notes

A typical failure requiring the exercise of a DR plan in America today would be a datacenter losing power or network connectivity – but natural disasters such as floods, tornadoes, and hurricanes are the historical examples. Today there are even more nefarious circumstances that could trigger a DR event, such as malicious hacking or ransomware that encrypts user or server data.

When a DR “event” occurs, the impact has already happened: high availability has hit its limit and the application(s) or servers are down or unavailable. The terms most often discussed around such an event are “RTO” and “RPO” – the recovery time objective and recovery point objective, respectively. In other words: what time are we coming back online, and from which point in the day will our restored data come?
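A small worked example makes the two terms concrete. The timestamps below are hypothetical, chosen only to illustrate how the data-loss window (RPO) and the downtime window (RTO) are measured:

```python
# RPO vs. RTO with illustrative (made-up) timestamps.
from datetime import datetime

last_good_backup = datetime(2020, 7, 24, 2, 0)    # nightly backup completed
failure_time     = datetime(2020, 7, 24, 9, 30)   # disaster strikes
restored_time    = datetime(2020, 7, 24, 13, 30)  # service back online

# RPO question: which point of the day does our restored data come from?
data_lost = failure_time - last_good_backup   # up to 7.5 hours of changes gone

# RTO question: what time are we coming back online?
outage = restored_time - failure_time         # 4 hours of downtime

print(f"Data loss window (actual RPO): {data_lost}")
print(f"Downtime (actual RTO): {outage}")
```

If the business had declared an RPO of one hour, a nightly backup schedule like this one would already violate it – which is why the objectives must drive the backup and replication design, not the other way around.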

Putting it all together

Hopefully it’s obvious that a good business continuity plan can include both HA and DR, and needs to cover both natural and man-made events. In most scenarios a well-architected and engineered product will never need to execute a DR plan, because its HA components seamlessly provide continued function across geographic regions. When the DR plan is required, though, a lack of planning – and TESTING – will seriously hamper your recovery. I’ve heard it said that your backups are no good; only the restores matter – so don’t skip this step.

For more ideas around business continuity, read our recent post, The New Business Continuity in the Age of Pandemics.

Systems Flow Can Help

We help organizations both plan and execute. We identify needs, strengths and gaps, define solutions, and reduce risk. We can help you maintain your calm even when the waters rise – through practical, effective application of best practices in enterprise architecture, vision and strategy.

Systems Flow uses an organic strategic approach that focuses on analysis, architecture, and leadership; flexes methodologies to fit your need; applies best-practice tools and methods; and blends strategic business thinking with practical application of technology.

