The Sacred Scale-out Principle and the Art of Building Distributed Systems


The age of monolithic scale-up systems is coming to an end. As applications span multiple hosts and multiple datacenters, the underlying infrastructure needs to be redesigned. Building large-scale distributed systems right is a notoriously tough challenge, the pursuit of which keeps every engineer at Nutanix excited about coming to work every day. The key challenge in building distributed systems is to follow the Sacred Scale-out Principle:

“Resources needed in any node should be proportional to the size of that node and not to the size of the cluster.”

It implies that as we add more nodes to a cluster, the services running on the existing nodes should not see an increase of burden on their resources. Responsibility, therefore, should inherently be distributed and the new nodes added to a cluster should proportionately contribute to all the existing services in the cluster.

In other words, there cannot be a master node that keeps track of all data or all metadata in any cluster. As the cluster grows, such a node would hit its scale-up limits. No single node knows everything, but as an aggregate the cluster knows everything, like Eywa in Avatar’s Pandora.

This notion of giving up control on an individual basis in favor of distributed control — where any given node owns only a subset of the data — is the essence of scale-out design. Once you crack this problem for every (yes, every) aspect of your software, you have a linear scale-out technology that has predictable behavior irrespective of scale.
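One common way to give each node only a subset of the data is a consistent-hash ring. The sketch below is an illustration of that general technique, not a description of how NDFS partitions data. The names `HashRing`, `owner`, `keys_per_node`, and the `vnodes` parameter are hypothetical and chosen only for this example.

```python
import bisect
import hashlib
from collections import Counter


def _hash(value: str) -> int:
    """Map a string to a stable 64-bit integer position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Toy consistent-hash ring: each node owns only a slice of the key space."""

    def __init__(self, vnodes: int = 128):
        self.vnodes = vnodes   # virtual points per node, to even out the split
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name

    def add_node(self, node: str) -> None:
        """Place the node's virtual points on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def owner(self, key: str) -> str:
        """Return the node at the first ring point clockwise from the key."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


def keys_per_node(ring: HashRing, keys) -> Counter:
    """Count how many of the given keys each node owns."""
    return Counter(ring.owner(k) for k in keys)
```

If you build a ring with four nodes, record the owner of each key, and then call `add_node` for a fifth, every key that changes owner moves to the new node, and no existing node ends up owning more keys than before. That is the principle in miniature: growth takes load off each existing node rather than adding to it.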

In late 2009, we embarked on a journey to re-invent storage. We built the Nutanix Distributed File System (NDFS) as the first hyperconverged storage system that would adhere to the Sacred Scale-out Principle. The Nutanix cluster is built using a micro-services architecture, where each micro-service spans all the nodes in the cluster. For example, data, metadata, statistics, analytics, MapReduce and alert services all have agents on every node. Since no one node owns all the data for a service, we needed to add the notion of “consensus” to make the nodes “agree” and always give a consistent answer. Starting with the eventually consistent platform of Cassandra, we added an implementation of distributed Paxos on top of it to serve as the foundational layer for all our services. A special service based on ZooKeeper kept track of the state of the cluster with a constant overhead (a few megabytes), sufficient even for large clusters with thousands of nodes. Every aspect of the system needed to be designed from the ground up for scale. Even when Prism shows a bunch of graphs for analytics, the processing for each graph is done using real-time MapReduce, and hence the GUI layer’s responsiveness is maintained irrespective of the cluster’s size.
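To make the map-reduce point concrete, here is a minimal sketch of the general pattern, assuming each node holds its own metric samples. The post does not describe how Prism implements this, so `Summary`, `map_local`, `merge`, and `cluster_summary` are hypothetical names for illustration only. Each node reduces its samples to a fixed-size summary, and only those summaries are combined.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Summary:
    """Fixed-size partial aggregate; its size does not grow with sample count."""
    count: int = 0
    total: float = 0.0
    peak: float = float("-inf")

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def map_local(samples) -> Summary:
    """Map step: runs on the node that holds the samples."""
    samples = list(samples)
    if not samples:
        return Summary()
    return Summary(len(samples), sum(samples), max(samples))


def merge(a: Summary, b: Summary) -> Summary:
    """Reduce step: order-independent, up to floating-point rounding."""
    return Summary(a.count + b.count, a.total + b.total, max(a.peak, b.peak))


def cluster_summary(per_node_samples) -> Summary:
    """Combine one partial summary per node into a cluster-wide figure."""
    partials = (map_local(s) for s in per_node_samples.values())
    return reduce(merge, partials, Summary())
```

For example, `cluster_summary({"node-0": [10.0, 20.0], "node-1": [30.0]})` yields a summary with `count` 3, `total` 60.0, and `peak` 30.0. Because `merge` only ever sees fixed-size summaries, the work done at the layer that draws the graph depends on the number of summaries, not on the volume of raw samples.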

As soon as we thought we had solved the pain points in the datacenter, we realized there was another mountain to climb! The true potential of a web-scale storage system can only be achieved when it is accompanied by a truly web-scale computing fabric. Otherwise, it’s like a Ferrari stuck behind a 1918 Buick!

VM management in the datacenter needed an overhaul. The virtual computing platform that enjoys web-scale storage should itself scale out linearly, have the ability to self-heal, and always be available. Just like AWS, the private cloud for the enterprise needs to be “always on”, and be able to handle any bursty workload.

It is no longer acceptable to have a VM provisioning layer that goes down during an upgrade, or one that needs to be reprovisioned as the number of deployed hosts increases. Management and analytics are two sides of the same coin, and coordinating between separate tools and technologies is a Band-Aid that won’t last very long in the agile future of the datacenter. As we looked into the world of vCenter, vCops, OpenStack, System Center, etc., we observed that the issues that plagued dual-controller storage architectures were also plaguing the virtualization management fabric in the enterprise. We needed to take a fresh look at virtualization, and a project called “Acropolis” was born.

Rinse and repeat, with hard work! After re-leveraging Cassandra, Paxos, MapReduce, and ZooKeeper, we arrived at a cluster-wide virtualization platform. We could spin up hundreds of VMs in a matter of seconds, thanks to the years of religious work on the underlying scale-out metadata layers that we borrowed from the storage stack. Going forward, combining management with analytics meaningfully is going to be the holy grail in modern web-scale computing. We have built a layer for real-time analytics on top of the VM orchestration layer, which now also extends to managing containers. This opens up some very exciting avenues for informed provisioning and self-managing systems of the future. We are proud to announce the beginnings of this next chapter in our journey.