10 Design Considerations While Building a Storage System for Virtualization

Virtualization is quickly taking over data centers. Gone are the days when IT admins worried about managing operating systems running directly on physical server hardware. The manageability and cumulative performance advantages of virtualization have led to a growing trend where consumer operating systems like Microsoft Windows are run within virtual machines. These virtual machines are managed by a hypervisor (such as VMware's vSphere) that mediates access to the physical hardware in the server node. Clusters of such server nodes are being put together to host several hundred to even thousands of virtual machines. Such clusters afford high availability and load balancing by permitting migration of virtual machines between server nodes.

Just like the rest of the physical hardware, the hypervisor also virtualizes the underlying storage for the virtual machines. Thus, the hypervisor may present a virtualized SCSI disk to the guest operating system running inside each virtual machine. The data written to this virtualized disk cannot simply be mapped to an underlying physical disk, because it needs to remain accessible after the virtual machine is migrated to another server node (e.g., upon a hardware failure). A sophisticated storage subsystem is therefore needed, one that keeps the data accessible despite the movement of virtual machines across server nodes.

The Nutanix Complete Cluster offers such a sophisticated storage subsystem that was designed specifically for virtualization workloads. This storage subsystem can be accessed by the hypervisor through industry standard iSCSI/NFS protocols. This blog talks about 10 key considerations that went into the design of this storage subsystem.

[Figure: Elegance of Nutanix]

1. Converged and distributed: Hardware trends in the past ten years indicate that disk capacities and speeds are growing at a much faster pace than network speeds. A cost-effective solution needs to be converged to leverage these trends – i.e., the storage needs to be placed close to the computation that accesses that storage and not across an expensive network fabric. The Nutanix offering epitomizes this by building a distributed storage subsystem using the local disks in the server nodes themselves. This is in sharp contrast to the single-headed SAN/NAS solutions that require expensive networking to deliver the high performance required by server clusters running virtual machines.

[Figure: Legacy Design]
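To make the data-locality point concrete, here is a minimal sketch (in Python, with invented names such as Replica and read_extent, none of which are Nutanix APIs) of how a converged read path might prefer a replica sitting on the local node and fall back to the network only when no local copy exists:

```python
# Locality-aware read path in a converged cluster (illustrative only).
# Replica, read_extent, and the read_* callbacks are hypothetical names.

class Replica:
    def __init__(self, node_id, disk_id):
        self.node_id = node_id
        self.disk_id = disk_id

def read_extent(extent_id, replicas, local_node_id, read_local, read_remote):
    """Prefer a replica on the local node; fall back to a remote one."""
    local = [r for r in replicas if r.node_id == local_node_id]
    if local:
        # Served from a disk in the same node: no traffic crosses the fabric.
        return read_local(extent_id, local[0].disk_id)
    # No local copy (e.g., the VM was just migrated): read over the network;
    # a background task could later re-localize the extent to this node.
    return read_remote(extent_id, replicas[0])
```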

2. Incremental scalability: As compute/storage needs grow, it should be possible to grow the system incrementally rather than requiring a complete hardware refresh, as is typical with centralized SAN/NAS solutions. The Nutanix Complete Cluster is designed to be incrementally scalable, with no single point of bottleneck. Near-linear scalability has been demonstrated on a 50-node cluster, and the design itself places no upper bound on cluster size.

3. Performance: A storage system that treats performance as an afterthought opens itself up to one or more expensive architectural redesigns. The Nutanix Complete Cluster was designed to deliver high performance from the very outset. It combines traditional wisdom in distributed system design with new techniques: a pipelined architecture, asynchronous request handling, extensive caching, and judicious use of Fusion-io ioMemory to hold frequently accessed data as well as metadata. The design specifically caters to virtualization workloads. For example, the NFS server implementation in the Nutanix Complete Cluster was designed to deliver high data IOPS (both random and sequential) rather than high namespace IOPS (which is what outdated benchmarks like SpecFS primarily measure). This is especially well suited to virtualization, since the bulk of the IO requests from guest VMs are converted into NFS read/write requests by the hypervisor when it accesses the underlying storage subsystem through the NFS protocol.

[Figure: Nutanix Direct Data Path]
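A rough illustration of the pipelined, asynchronous style mentioned in point 3, using Python's asyncio purely for exposition; the stage names and request shapes are made up for the example and are not the actual Nutanix components:

```python
import asyncio

async def stage(inbox, outbox, work):
    # Each stage runs independently; a slow stage only backs up its own queue.
    while True:
        req = await inbox.get()
        result = await work(req)              # e.g., admission check, cache lookup
        if outbox is not None:
            await outbox.put(result)
        inbox.task_done()

async def main():
    q1, q2 = asyncio.Queue(), asyncio.Queue()

    async def admit(req):                     # stage 1: validate / admit
        return req

    async def lookup(req):                    # stage 2: metadata / cache lookup
        return ("hit", req)

    tasks = [asyncio.create_task(stage(q1, q2, admit)),
             asyncio.create_task(stage(q2, None, lookup))]

    for i in range(4):                        # issue a few requests asynchronously
        await q1.put({"op": "read", "id": i})
    await q1.join()
    await q2.join()
    for t in tasks:
        t.cancel()

asyncio.run(main())
```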

4. Random IO: With potentially hundreds of virtual machines simultaneously issuing IO requests, the data access patterns appear random by the time they reach the underlying storage system. In contrast to traditional storage subsystem designs, Nutanix was designed to deliver high random IO performance from the very start. It uses techniques such as a distributed operation log to absorb random writes, careful placement of metadata indexes on high-performance SSDs for quick lookups, and extensive use of caching and deduplication to absorb boot/login storms. Recently, a 40-node Nutanix cluster successfully ran VMware's RAWC benchmark with a record-breaking 3000 virtual machines. More details on this VDI reference architecture can be found at http://bit.ly/yN9S01.
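As a hedged sketch of the operation-log idea from point 4, the snippet below shows how random writes can be absorbed by a sequential, append-only log and later drained to their final locations in a more orderly fashion; OpLog and drain_to_extent_store are illustrative names, not the real implementation:

```python
class OpLog:
    """Absorbs random writes by turning them into sequential appends."""

    def __init__(self, drain_to_extent_store):
        self.log = []                      # append-only buffer on fast media
        self.drain = drain_to_extent_store

    def write(self, vdisk, offset, data):
        # A random write becomes a sequential append; the caller can be
        # acknowledged as soon as the append is durable.
        self.log.append((vdisk, offset, data))

    def drain_once(self):
        # Background step: apply buffered writes in (vdisk, offset) order so
        # scattered writes land on the extent store roughly sequentially.
        for vdisk, offset, data in sorted(self.log, key=lambda w: (w[0], w[1])):
            self.drain(vdisk, offset, data)
        self.log.clear()
```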

5. Fine-grained tiering: Gone are the days when the predominant form of persistent storage was magnetic disks with similar performance characteristics. Today, data can be stored on a wide variety of media (e.g., SSDs and SAS/SATA drives), each offering different capacity and performance at a given price point. The storage subsystem in Nutanix recognizes these as separate tiers of storage and places data on them based on its temperature. Thus, hot data is placed on the faster SSDs while colder data might be placed on the slower SATA drives. As the temperature of data changes, the Nutanix Complete Cluster supports waterfalling of data between tiers. To avoid polluting the SSDs with cold data, data is divided into fine-grained units of a few megabytes that form the basis of data placement and migration. Such fine-grained management of data across tiers also enables Nutanix to adapt quickly to changing workloads.

[Figure: Information Lifecycle Management]
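The snippet below is a simplified sketch of temperature-based tiering over few-megabyte units; the tier names, thresholds, and waterfall policy are examples chosen for illustration rather than the actual heuristics used in the product:

```python
import time

TIERS = ["ssd", "sas", "sata"]               # fastest to slowest

class Extent:
    """A few-megabyte unit of data, the granularity of placement decisions."""

    def __init__(self, extent_id):
        self.id = extent_id
        self.tier = "ssd"                    # new writes land on the fast tier
        self.last_access = time.time()

    def touch(self):                         # called on every read/write
        self.last_access = time.time()

def waterfall(extents, cold_after_s=3600):
    """Demote extents that have gone cold; promote ones that are hot again."""
    now = time.time()
    for e in extents:
        idx = TIERS.index(e.tier)
        cold = (now - e.last_access) > cold_after_s
        if cold and idx < len(TIERS) - 1:
            e.tier = TIERS[idx + 1]          # e.g., ssd -> sas -> sata
        elif not cold and idx > 0:
            e.tier = TIERS[idx - 1]          # hot data flows back up
```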

6. Consistency model: The Nutanix Complete Cluster can manage petabytes of data written by guest VMs. Just like other storage subsystems, it maintains metadata to enable the quick location of any data. Since losing data or returning stale data is not an acceptable option, a strict consistency model is supported. While relational database abstractions such as transactions can be used to implement strict consistency, this approach is known to be unscalable and slow. On the other hand, typical NoSQL approaches that maintain structured information as a set of key/value pairs are known to be highly performant, but typically afford only eventual consistency. The Nutanix Complete Cluster adopts a novel two-fold approach for delivering high performance while supporting strict consistency. First, the metadata is kept in a NoSQL key/value store that was enhanced with the Paxos algorithm to provide strict consistency for updates to any given key's value. Second, all metadata operations involving multiple keys are carefully sequenced so that the overall metadata tree remains consistent at all times. This approach provides the best of both worlds, delivering high performance while supporting strict consistency.
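To illustrate the sequencing idea in point 6, here is a toy sketch in which the key/value store is assumed to provide strictly consistent single-key updates (Paxos in the real system, a simple compare-and-swap here), and a multi-key change creates the new leaf record before linking it from its parent, so readers never follow a dangling reference; all names are hypothetical:

```python
class KVStore:
    """Stand-in for a Paxos-backed store: single-key updates are atomic."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def compare_and_swap(self, key, expected, new):
        if self._data.get(key) != expected:
            return False                       # someone else updated this key
        self._data[key] = new
        return True

def add_extent(meta, vdisk_key, extent_key, extent_record):
    # 1. Create the new leaf record first; it is harmless while unreachable.
    meta.compare_and_swap(extent_key, None, extent_record)
    # 2. Only then link it from the parent, atomically on that single key, so
    #    a reader either sees the old tree or the fully formed new one.
    old = meta.get(vdisk_key)
    return meta.compare_and_swap(vdisk_key, old, (old or []) + [extent_key])
```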

7. Congestion management: Every major function in the Nutanix Complete Cluster is handled by a different component. A key aspect of the design is that flow/congestion control is built into each of these components. Without proper congestion management, a distributed system can grind to a halt by entering situations where useful work can no longer be done. As an example, the component that manages writes to a disk might become clogged with requests. A remote sender may then time out its outstanding requests to the congested component and re-send them, causing further congestion. To avoid such situations, every component in the Nutanix Complete Cluster exerts appropriate flow control to ensure it accepts only as many requests as it can reasonably execute. In addition, stale or low-priority requests are quickly dropped when congestion is detected.
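A minimal sketch of the admission-control behavior described in point 7: a component accepts only a bounded number of outstanding requests and discards ones that have gone stale (their senders will have timed out and retried). The class name and limits are illustrative:

```python
import time
from collections import deque

class AdmissionQueue:
    def __init__(self, max_outstanding=64, max_age_s=2.0):
        self.max_outstanding = max_outstanding
        self.max_age_s = max_age_s
        self.queue = deque()

    def offer(self, request):
        if len(self.queue) >= self.max_outstanding:
            return False                       # push back; the caller must retry
        self.queue.append((time.time(), request))
        return True

    def next(self):
        while self.queue:
            enqueued_at, request = self.queue.popleft()
            if time.time() - enqueued_at > self.max_age_s:
                continue                       # stale: the sender has moved on
            return request
        return None
```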

8. Designed for high availability: A highly available storage subsystem does not have the luxury of going offline when a few of its components fail. These components might be either software or hardware. The storage subsystem in the Nutanix Complete Cluster was designed for fault tolerance. There is no single point of failure, and any component can fail and stay down for extended periods of time. Thus, any disk, node, network card, etc. may fail without affecting availability. All data is both replicated and checksummed to protect against faults. The number of replicas kept for the data is configurable, thus permitting simultaneous failure of one or more components without sacrificing availability.

[Figure: Anatomy of a Write IO; 10,000 ft. view]
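The following sketch illustrates replicated, checksummed writes with a configurable replication factor; the disk object and its put/get methods are assumptions made for the example, not a real API:

```python
import hashlib
import random

def write_unit(data, disks, rf=2):
    """Write `data` to `rf` distinct disks, storing a checksum alongside it."""
    checksum = hashlib.sha1(data).hexdigest()
    targets = random.sample(disks, rf)         # rf independent failure domains
    for disk in targets:
        disk.put(data, checksum)               # assumed disk API, for illustration
    return targets, checksum

def read_unit(targets, checksum):
    """Return data from the first replica that still passes its checksum."""
    for disk in targets:
        data, stored = disk.get()
        if stored == checksum and hashlib.sha1(data).hexdigest() == checksum:
            return data
    raise IOError("no replica passed checksum verification")
```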

9. Replication fan-out: Distributed storage subsystems are often designed by mirroring one disk onto another. With disk capacities running into terabytes, this implies that the failure of one disk would require reading all the data from the other, healthy disk in order to restore replication. Not only does this create a hot-spot in the system by making one disk the bottleneck while others sit idle, it also increases the chance of data loss because the intense workload on the healthy disk might cause it to fail as well. The Nutanix Complete Cluster avoids this by replicating each unit of data (comprising a few megabytes) on a disk to a random disk elsewhere in the cluster. On a disk failure, the corresponding replicas, spread across many disks, can be read to restore replication, and the restored second copy can likewise be placed on any disk in the cluster. Thus, recovering from a failed disk utilizes all of the cluster's resources and avoids the formation of any hot-spots.
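A small sketch of the fan-out idea in point 9: each few-megabyte unit chooses its replica disk independently, so rebuilding a failed disk reads from, and writes to, many disks at once instead of hammering a single mirror partner. All names here are invented for illustration:

```python
import random
from collections import namedtuple

Unit = namedtuple("Unit", ["id", "home_disk"])   # a few-megabyte data unit

def place_replicas(units, disks):
    """Each unit's replica goes to a random disk other than its home disk."""
    return {u: random.choice([d for d in disks if d != u.home_disk])
            for u in units}

def rebuild(failed_disk, placement, disks):
    """On a disk failure, fan rebuild traffic out across the whole cluster."""
    work = []
    for unit, replica_disk in placement.items():
        if unit.home_disk == failed_disk:
            # Read the surviving copy and place a fresh one on yet another disk.
            new_home = random.choice(
                [d for d in disks if d not in (failed_disk, replica_disk)])
            work.append((replica_disk, new_home, unit))
    return work   # many distinct (source, destination) pairs => no hot-spot
```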

10. Continuous healing: Nutanix's highly available storage subsystem cannot freeze to run a data consistency check (akin to the fsck found in Unix filesystems). The distributed nature of the system, coupled with the petabytes of data it can potentially manage, implies that faults will happen sooner or later, for example due to failed components. To discover and recover from such problems, the Nutanix Complete Cluster continuously heals itself by running a MapReduce over its metadata and taking corrective measures based on the issues found. For example, if a data unit is found to be under-replicated due to a failed component, re-replication is kicked off for that unit. The MapReduce computation runs as a low-priority background job so as not to affect the performance of higher-priority IO requests coming from the guest VMs. MapReduce, which is predominant in Big Data analytics today, lends the Nutanix Complete Cluster the scalability to manage large amounts of data while affording high availability at the same time.
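As a rough sketch of the healing scan in point 10 (not the actual implementation), the map step below emits the live replica count for each data unit recorded in metadata, and the reduce step flags units that have fallen below the configured replication factor so a low-priority background task can re-replicate them:

```python
from collections import defaultdict

def map_phase(metadata_rows, is_disk_healthy):
    # Each row is (unit_id, [disks holding a replica]); emit live replica counts.
    for unit_id, replica_disks in metadata_rows:
        yield unit_id, sum(1 for d in replica_disks if is_disk_healthy(d))

def reduce_phase(mapped, rf=2):
    # Aggregate per unit and flag anything below the replication factor.
    counts = defaultdict(int)
    for unit_id, live in mapped:
        counts[unit_id] += live
    return [unit_id for unit_id, live in counts.items() if live < rf]

# under_replicated = reduce_phase(map_phase(rows, is_disk_healthy), rf=2)
# Each flagged unit is handed to a low-priority background re-replication task.
```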

To summarize, the Nutanix Complete Cluster bridges the gap between computation and storage by converging the two in a compact rackable unit, one or more of which can be stacked together to build a powerful virtualization appliance. The new demands imposed by virtualization workloads required an architecture built from the ground up specifically to meet those requirements. The yardsticks of availability, performance, and scalability indicate that the Nutanix Complete Cluster is delivering on its promise and is stretching the horizons of what was earlier possible in the realm of virtualization. Despite everything that has been delivered so far, there are a lot more exciting things in the pipeline. So stay tuned.