Application Resiliency in the Hybrid Cloud: A Primer

Mission-critical applications in the cloud need to adapt and keep functioning in the face of disruptions and unforeseen events.

By Dipti Parmar May 19, 2022

Over four in five IT leaders in small and large organizations across the world agree that a hybrid, multicloud environment is the ideal operational model to keep the IT infrastructure running smoothly and help mitigate challenges, as per the Nutanix Enterprise Cloud Index report.

Further, application mobility, interoperability and control are top of mind for these organizations. The way mission-critical applications are developed and deployed is key to improving overall hybrid cloud performance and achieving the operational and strategic goals of the business.

However, organizations making the transition to a hybrid IT infrastructure are in various stages of their journey. Moving to the hybrid cloud involves significant challenges in interconnectivity, integration, and data protection and management compared to traditional IT models. Many organizations are struggling to adapt legacy processes, resources and capabilities to hybrid, multicloud environments and post-pandemic workplace realities such as remote work.

A study by 451 Research that explored the transition of over 1,000 organizations to the hybrid cloud found that most are seriously lacking in business continuity planning, cloud management strategy, and data and application resiliency.

About 24% of companies in the study said they’ve moved their key workloads and applications to the public cloud, around 20% have their complete IT environment on-premises, and the rest are leveraging a mix of on- and off-premises cloud or hosted resources with various levels of interoperability. Those that are taking a more ad-hoc approach are more likely to experience additional operational overhead due to the duplication of efforts in running separate environments.

The study also found that resiliency strategies employed by most organizations are immature at best – more than a third of the respondents reported a significant application outage in their hybrid cloud environment in the past 12 months.

There is a serious gap between how prepared organizations believe they are and the actual impact of outages on the business. When traditional on-prem systems and applications are migrated to the cloud without adequate testing, the rate of failure and the number of availability issues shoot up.

Building resiliency into business-critical applications ensures that IT teams are able to quickly identify and resolve typical issues that plague application performance in the cloud, while continuing to drive business outcomes.

What is application resiliency and why is it important?

High availability of applications and platforms is one of the much-touted business benefits of the cloud. When an application successfully leverages fundamental capabilities of the cloud or hyperconverged infrastructure – such as automated and continuous monitoring, clustering, load balancing, and automatic failover – to deliver uninterrupted services at all times and under all conditions, it fulfills the promise of the cloud.

Application resiliency is the ability of the application to maintain a minimum viable or acceptable level of service in the face of various disruptions and challenges to usual or optimal operating conditions.

In the early days of business computing, IT admins struggled to ensure stability of the systems because of frequent server crashes. In monolithic environments, both hardware and software components like servers and databases were expected to fail. Resiliency was achieved only through primary/secondary configuration structures, depth and redundancy. Availability and system uptime were achieved by distributing business-critical applications across servers in different locations and balancing the load among them. Further, servers, workstations, terminals, OSes, databases – all were rebooted at periodic intervals as a means of ensuring maximum availability.

So while “reliability” requires a system to function as expected, “resilience” builds on the expectation that things will go wrong. The application is structured and tested to adapt to and correct unexpected or “wrong” events. Along the way, it goes up from stability to availability to reliability to resilience.

Resiliency is not just about avoiding failure. It also involves accepting failure and building and automating next steps that allow the application to respond to the event and return to a fully functioning or optimal state as quickly as possible.

A fully resilient application can adapt to unforeseen events that disrupt the IT environment and automatically initiate fault recovery or graceful degradation processes as defined. It continues to function normally (or as close to normally as possible) despite the failure of multiple or core components of the whole system.
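
As a minimal sketch of graceful degradation, assume a hypothetical recommendation feature backed by a remote service. When the service fails, the app serves a reduced static response instead of an error. The names `with_fallback`, `DEFAULT_RECOMMENDATIONS`, `failing_service` and `healthy_service` are illustrative, not taken from any particular product.

```python
from typing import Callable, List

# Hypothetical static fallback shown when the live service is unavailable.
DEFAULT_RECOMMENDATIONS: List[str] = ["bestsellers", "new-arrivals"]


def with_fallback(primary: Callable[[], List[str]], fallback: List[str]) -> List[str]:
    """Return the primary result, or a reduced fallback if the primary fails."""
    try:
        return primary()
    except Exception:
        # Degrade gracefully: serve a reduced but usable response instead of failing.
        return list(fallback)


def failing_service() -> List[str]:
    # Stand-in for a dependency that is currently down.
    raise ConnectionError("recommendation service unreachable")


def healthy_service() -> List[str]:
    # Stand-in for the same dependency when it is working normally.
    return ["personalized-1", "personalized-2"]
```

Calling `with_fallback(failing_service, DEFAULT_RECOMMENDATIONS)` returns the static list, while the healthy service's own result passes through unchanged. Users see a less tailored page rather than an outage.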

The extent of application resiliency in the cloud and its relative importance to business continuity depend on various goals, requirements, and constraints that are influenced by the type of workload, the role of the users, and the scale and technical capabilities of the organization.

There are three different kinds of drivers that motivate an IT organization to build resilient apps:

Business drivers:

Cost savings on IT infrastructure, deployment and operations
Best user experience and minimal app downtime
Meeting user demands at times of peak and extended usage
Maximum QoS and availability
Retaining user trust
Flexibility to adapt to changing market demands

Development drivers:

Maximizing time spent on adding new features
Reducing time spent on troubleshooting
Following latest industry practices and trends in development

Operations drivers:

Optimal resource consumption
Reducing the frequency and impact of disruptions and failures
Ability to recover quickly from failures
Increasing automation

All said and done, resilient applications serve to improve the availability of the system, which is the primary indicator of the health of the IT deployment.

Factors that affect application resiliency

Application resiliency mandates a well-thought-out hybrid cloud strategy and planning at all levels of the architecture. It influences and is influenced by how the IT infrastructure and network are laid out and how the data and storage systems are designed.

“Access to shared infrastructure, data and application resources in the cloud play a critical role in helping organizations navigate disruptions,” said Rick Villars, Group VP, Worldwide Research at IDC.

“In the coming years, enterprises’ ability to govern a growing portfolio of cloud services will be the foundation for introducing greater automation into business and IT processes while also becoming more digitally resilient.”

There are a few constraints that limit the ability of the app to scale and deliver high performance. Developers, product designers and system architects must take care to minimize and not to introduce or worsen these constraints:

Hardware and software dependencies
Dependencies on other apps
Licensing restrictions
Lack of skills in development teams
Organizational resistance to change

Apart from these, there are challenges in planning for application resiliency that are specific to cloud environments. While the strategies used to build resilience could be similar to those used for traditional data centers, the implementations differ quite a bit.

Cloud systems favor scaling “out” to a larger number of nodes as compared to scaling “up” to a bigger, more powerful node in traditional IT architecture. This means that developers can code in a graceful degradation of the application in case of a node failure. They can avoid large service buys and provision resources by adding capacity in smaller units. In an on-premises private cloud deployment, VMs and load balancers might provide enough support.
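
One way to picture scale-out with tolerance for node failure is a round-robin pool that skips nodes marked unhealthy. This is only a sketch: the `NodePool` class and the node names are hypothetical, and a real load balancer would also run health checks and bring recovered nodes back into rotation.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class NodePool:
    """Round-robin selection over a pool of small nodes, skipping unhealthy ones."""

    def __init__(self, nodes: List[str]) -> None:
        self.health: Dict[str, bool] = {node: True for node in nodes}
        self._ring: Iterator[str] = cycle(nodes)
        self._size = len(nodes)

    def mark_down(self, node: str) -> None:
        # Record that a node has failed so it stops receiving requests.
        self.health[node] = False

    def next_node(self) -> str:
        # Inspect each node at most once per call.
        for _ in range(self._size):
            node = next(self._ring)
            if self.health[node]:
                return node
        raise RuntimeError("no healthy nodes available")
```

With three nodes and one marked down, requests keep flowing to the remaining two, so the app loses capacity rather than availability.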

However, in an infrastructure that spans multiple geographic regions, things are complicated by requirements such as DNS session management, request routing, and persistent storage. Specific implementation and vendor support varies greatly in such scenarios.

Further, cloud-native applications operate over a distributed architecture unlike traditional monolithic apps, where everything runs together as a single process. Distributing the presentation layer and executing the business logic across multiple pieces of hardware needs detailed planning around state management, load balancing and latency correction.

In a cloud architecture, each of the microservices and cloud-based backing services that an application uses executes in a separate process and communicates via network-based calls. There are multiple challenges to application resiliency that can arise in this scenario:

Hardware failure
Network latency – the time taken for service requests to travel to the receiver and back
Transient faults – momentary loss of network connectivity
Crashed host processes
Blockage by a synchronous operation that has been running for a long time
Overloaded microservices that don’t respond
In-flight orchestrator operations such as moving a service from one node to another

While the underlying cloud platform has built-in protection to detect and mitigate many of these issues, applications must be designed to automatically and dynamically handle these events.
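
Transient faults in particular are commonly handled by retrying the call with exponential backoff. The sketch below assumes the failing operation raises Python's built-in `ConnectionError` or `TimeoutError`. The function name `retry_with_backoff` and its default parameters are illustrative choices, not a prescribed configuration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry an operation on transient network errors, backing off exponentially."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            # The base delay doubles on each attempt; random jitter spreads out
            # retries from many clients so they do not hit the service together.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
```

The `sleep` parameter is injectable so the waiting behavior can be tested without real delays. Retries suit momentary faults; for a service that stays overloaded, capping attempts as done here avoids adding to the load.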

How to Build Application Resiliency in the Cloud

Cloud infrastructure and application development are two fields that are perennially evolving. There are a ton of best practices that help companies build and optimize resilient applications in the cloud:

Leverage Infrastructure as Code (IaC)

IaC allows admins to configure the infrastructure and handle its provisioning in similar ways to application code. The configuration and provisioning logic is stored in a code repository with source control, enabling versioning, discovery and auditing. This lets architects, admins and developers set up CI/CD pipelines with automatic testing and deployment of changes.

Automated infrastructure provisioning minimizes human error in configuration and provides consistent environments that can be replicated, increasing app resiliency in the process.
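
The declarative idea behind IaC can be illustrated with a toy planner that compares a desired state kept in source control against the actual state and lists the changes needed. This is not how any specific IaC tool works. The `DESIRED` structure, resource names and `plan` function are assumptions made for the example.

```python
from typing import Dict, List, Tuple

State = Dict[str, Dict[str, int]]

# Hypothetical desired state, as it might be kept under source control.
DESIRED: State = {
    "web": {"replicas": 3},
    "worker": {"replicas": 2},
}


def plan(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Compute the (action, resource) steps that move `actual` to `desired`."""
    steps: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != desired[name]:
            steps.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            steps.append(("delete", name))
    return steps
```

When the actual state already matches the desired state, the plan is empty. That is what makes repeated runs safe and environments reproducible.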

Use physically distributed resources

Spreading cloud workloads across geographic regions can introduce latency and availability problems of its own. The payoff is redundancy: deploying the app to multiple regions within a cloud enables it to withstand service disruptions in any one region.
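
A simple form of regional redundancy is client-side failover: try the preferred region first and move to the next one if it is unreachable. The sketch below assumes each region is represented by a callable that raises `ConnectionError` when the region is down. The region names and handler functions are hypothetical.

```python
from typing import Callable, Dict, List


def call_with_regional_failover(
    regions: List[str],
    handlers: Dict[str, Callable[[str], str]],
    request: str,
) -> str:
    """Try each region in preference order and return the first successful response."""
    errors: List[str] = []
    for region in regions:
        try:
            return handlers[region](request)
        except ConnectionError as exc:
            # Remember why this region failed, then fall through to the next one.
            errors.append(f"{region}: {exc}")
    raise RuntimeError("all regions failed: " + "; ".join(errors))


def region_down(request: str) -> str:
    # Stand-in for a region suffering a service disruption.
    raise ConnectionError("region unavailable")


def region_up(request: str) -> str:
    # Stand-in for a healthy region.
    return f"handled {request}"
```

In practice this routing is often done by DNS or a global load balancer rather than in application code, but the ordering-and-fallback logic is the same.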

Monitor the whole environment

Understanding the behavior of the application and how it interacts with various other components of the hybrid cloud helps identify the metrics that are most critical to monitoring its performance. Tracking metrics at all levels – infrastructure, app and service – increases the chances of unearthing potential issues before they cause a disruption or outage, and of diagnosing and resolving their cause.

Infrastructure-level metrics include CPU load, I/O rate, memory usage and so on. They indicate if the hardware on which the app runs is overloaded or functioning as expected. App-level metrics provide information like the time taken to execute a query or perform a sequence of service calls. These metrics capture a snapshot of the conditions in which the app runs and can reveal issues in the workflow. Developers have full control over defining and monitoring these metrics. Service-level metrics help watch out for latency and errors in the interactions between various services and components that the app uses.
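
An app-level metric such as query execution time can be captured with a small timing decorator. This sketch stores measurements in memory purely for illustration. The `LATENCIES` store, the `timed` decorator and the `run_query` stand-in are assumptions, and a real deployment would export the values to whatever monitoring backend is in use.

```python
import time
from collections import defaultdict
from functools import wraps
from typing import Callable, DefaultDict, List

# In-memory store for illustration only.
LATENCIES: DefaultDict[str, List[float]] = defaultdict(list)


def timed(metric_name: str) -> Callable:
    """Record the wall-clock duration of each call under `metric_name`."""
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even when the call raises.
                LATENCIES[metric_name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("run_query_seconds")
def run_query(n: int) -> int:
    # Stand-in for a database query.
    return sum(range(n))
```

Each call to `run_query` appends one duration to `LATENCIES["run_query_seconds"]`, which can then be aggregated into averages or percentiles.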

Finally, end-to-end or “black box” monitoring provides a holistic picture of the app’s health by analyzing its externally visible behavior just as users perceive it. The speed and ease with which users are able to perform core tasks and actions within defined thresholds reflect the availability of the application.
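
One hedged way to turn black-box probes into an availability figure is to count the share of probes that both succeed and finish within a latency threshold. The `ProbeResult` type and the threshold value are assumptions for this example. How an organization defines "available" is a policy choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    succeeded: bool
    latency_seconds: float


def availability(results: List[ProbeResult], latency_threshold: float) -> float:
    """Fraction of probes that succeeded within the latency threshold."""
    if not results:
        raise ValueError("no probe results")
    good = sum(
        1 for r in results
        if r.succeeded and r.latency_seconds <= latency_threshold
    )
    return good / len(results)
```

Note that a probe that succeeds but takes too long counts against availability here, which matches how users experience a slow app.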

Consider managed services

Another practice that improves overall system availability in the hybrid cloud is to use managed services for certain parts of the application stack. This saves in-house admins the trouble of installing, operating, managing and supporting services or platforms that are central to the smooth functioning of the workload.

This could be something in which they don’t have enough expertise or something that is available off-the-shelf as a service, like a MySQL database. The availability of this resource is then guaranteed by the managed service provider or cloud service vendor and the onus of managing data replication and backups falls on them.

Test, Deploy, Scale

The rising popularity of cloud-native applications and adoption of Infrastructure as a Service (IaaS) in the enterprise have set new benchmarks and best practices for application resiliency. More and more workloads are shifting to the hybrid cloud. Thin provisioning and auto-scaling are enabling rapid deployment of new applications. Newer and better technologies are simplifying active/active setups, secondary and tertiary disaster recovery environments in the cloud, and multi-region load balancing.

As the divide between applications, platforms and infrastructure continues to blur, designing apps for resiliency will require in-depth planning and collaboration between IT and operations.

“The reality of enterprise IT now is that it’s a multicloud world,” said Steve McDowell, Sr. Analyst at Moor Insights & Strategy. “Workloads can live anywhere and often do, and they ping pong back and forth.”

He also brought into focus the technology-driven advantages that organizations have today.

“What the software-defined world has done is enable companies to start and scale very quickly. It allows you to deploy new applications rapidly and enables businesses to leverage technology to provide more tailored services for customers.”

Dipti Parmar is a marketing consultant and contributing writer to Nutanix. She’s a columnist for major tech and business publications such as IDG’s CIO.com, Adobe’s CMO.com, Entrepreneur Mag, and Inc. Follow Dipti on Twitter @dipTparmar or connect with her on LinkedIn for little specks of gold-dust-insights.
