The Nutanix Solution

The Nutanix Virtual Computing Platform is a web-scale converged infrastructure solution that consolidates the compute (server) tier and the storage tier into a single, integrated appliance.

Nutanix uses the same web-scale principles and technologies that power the IT environment at innovative web companies and cloud providers such as Google, Facebook, and Amazon. Nutanix makes web-scale accessible to mainstream enterprises and government agencies without requiring an overhaul of their IT environments.

The Nutanix solution is radically simple compared to traditional datacenter infrastructures.

  • Rapid time to value: deployment in under an hour
  • No disruption to ongoing operations
  • Predictable, linear scaling
  • Powerful off-the-shelf, non-proprietary hardware
  • Lower cost and complexity of storage
  • Advanced, enterprise-class storage capabilities


The modular building-block design allows your organization to start with small deployments and grow incrementally into very large cluster installations. With one appliance, you can move from a small operation to handling large enterprise deployments including server virtualization, virtualized business applications, virtual desktop initiatives, test and development environments, big data (e.g. Splunk, Hadoop) projects, and more.

The Nutanix Virtual Computing Platform integrates high-performance server resources with enterprise-class storage in a cost-effective 2U appliance. It eliminates the need for network-based storage architecture, such as a storage area network (SAN) or network-attached storage (NAS). The scalability and performance that the world’s largest, most efficient datacenters enjoy are now available to all enterprises and government agencies.

Nutanix Distributed Filesystem

The Nutanix Distributed Filesystem (NDFS) is at the core of the Nutanix Virtual Computing Platform. It manages all metadata and data and enables all core features. NDFS is the software-driven architecture that connects the storage, compute resources, controller VMs, and the hypervisor. It also provides full Information Lifecycle Management (ILM), including localizing data to the optimal node.

Data availability

NDFS was designed from the ground up to be extremely fault-resilient. It ensures data availability in the event of a node, controller, or disk failure. NDFS uses a replication factor (RF) that keeps redundant copies of all data. Writes to the platform are logged in the high-performance SSD tier, and are replicated to another node before the write is committed and acknowledged to the hypervisor. If a failure occurs, NDFS automatically rebuilds data copies to maintain the highest level of availability.

The platform is self-healing. Leveraging distributed MapReduce jobs, it proactively scrubs data to resolve disk or data errors. If a controller VM on a node fails, all I/O requests are automatically forwarded to another controller VM until the local controller becomes available again. This Nutanix auto-pathing technology is completely transparent to the hypervisor, and guest VMs continue to run normally. In the case of a node failure, an HA event is automatically triggered and VMs fail over to other hosts within the cluster. Nutanix ILM localizes I/O operations by migrating data to the virtual machine’s local controller VM. Simultaneously, data is re-replicated to maintain RF and overall availability.

NDFS provides built-in converged backup and disaster recovery (DR). The converged-backup capabilities leverage array-side snapshots and clones, which are performed using sub-block-level change tracking at the VM and file level. The snapshots and clones are instantaneous, and dynamic disk/thin provisioning maintains very low overhead. These capabilities also support hypervisor array-offload capabilities, such as the VMware vStorage APIs for Array Integration (VAAI).

Snapshots can be configured on a standard schedule to align with RPOs and RTOs, and can be replicated to remote sites using array-side replication. This replication is configurable at the VM level, and only the sub-block-level changes are shipped to the remote replication site.

Smart metadata

Metadata is distributed among all nodes in the cluster in order to eliminate any single point of failure and to allow scalability that increases linearly with cluster growth. The metadata is partitioned using a consistent hashing scheme to minimize the redistribution of keys during cluster-sizing modifications.

The system enforces strong consistency using Paxos, a distributed consensus algorithm. Quorum-based leader election eliminates potential “split-brain” scenarios (e.g., network partitions), ensuring strict consistency of data.

Data efficiency

A core design principle of the Nutanix platform is data localization. It keeps data proximate to the VM and allows write I/O operations to be localized on that same node. If a VM migrates to another host, for example via DRS or vMotion (VMware) or via PRO and live migration (Hyper-V), the data automatically follows the VM so it maintains the highest performance. After a certain number of read requests made by a VM to a controller that resides on another node, Nutanix ILM transparently moves the remote data to the local controller. The read I/O is then served locally instead of traversing the network.

Nutanix incorporates data tiering, which leverages multiple tiers of storage and optimally places data on the storage tier that provides the best performance. The architecture was built to support local disks attached to the controller VM (SSD, HDD) as well as remote (NAS) and cloud-based storage targets. The Nutanix system continuously monitors data-access patterns to determine whether access is random, sequential, or a mixed workload. Random I/O workloads are maintained in the SSD tier to minimize seek times. Sequential workloads are automatically placed on HDD to improve endurance.

The most frequently accessed data (i.e., hot data) resides on the highest performance tier (SSD tier). That tier is not just a cache – it is a truly persistent tier for both read and write operations as well as QoS-controlled data. Cold data sits on hard disk drives, the highest-capacity and most economical tier.

The Elastic Deduplication Engine is a software-driven, massively scalable and intelligent data reduction technology. Nutanix Deduplication performs inline deduplication in RAM and flash tiers, and will perform background deduplication in the storage tier (hard disks) to maximize efficiency. Unlike traditional deduplication technologies, which focus only on the storage tier, Nutanix Elastic Deduplication Engine spans memory, flash and disk resources simultaneously in a natively converged platform.

NDFS array-side compression capabilities work in combination with Nutanix ILM. For sequential workloads, data is compressed during the write operation using in-line compression. For batch workloads, post-process compression adds significant value as data is compressed once it becomes idle and ILM has moved it down to the highest capacity tier (HDD). All compression configurations are carried out at a container level, and operate at a granular VM and file level. Decompression is done at the sub-block level to ensure precise granularity. The operations are monitored by the ILM process, which proactively moves frequently accessed, decompressed data up to a higher performance data tier.

Learn more: read the Nutanix tech note on reliability.

Next-Generation Datacenter Platform


Converged platform

The Nutanix Virtual Computing Platform converges compute and storage into a single system, eliminating the need for traditional storage arrays. The Nutanix 2U appliance contains two to four independent nodes, each optimized for high-performance compute, memory, and storage. Each node runs an industry-standard hypervisor, and a Nutanix controller VM. The controller VM handles all data I/O operations for the local hypervisor. 

All storage is directly mounted into the controller VM using a device pass-through mechanism. Storage resources are then exposed to the hypervisor through traditional interfaces, such as NFS or iSCSI. As new Nutanix nodes are added to the cluster, the number of controller VMs scales 1:1 to provide linear performance. Storage capacity from all nodes is aggregated into a global storage pool, which is accessible by all Nutanix controllers and hosts in the cluster. Containers are then defined from the storage pool, creating a logical datastore. Containers represent the main access point for the hypervisor, and are accessed using traditional interfaces.

The Nutanix platform uses industry-standard hardware. It does not rely on custom FPGAs, ASICs, RAID controllers, or disk drives. As a software-defined solution, Nutanix maintains the control logic in the software, and enables new features through simple software upgrades. NDFS is extensible. The Nutanix platform does not require a shared backplane for communication. Instead, it leverages standard 10GbE for all communications between nodes and controllers, as well as for VM traffic.

Scale-out architecture

The Nutanix platform is based on the same architectural precepts that enable the world’s largest datacenters to scale. Google, Facebook, and Amazon all use a similar design. The Nutanix Distributed File System (NDFS) scales to thousands of nodes and maintains performance and availability as your system grows. Modular, converged building blocks (nodes) allow datacenter managers to start small and scale seamlessly to support future growth.

The Nutanix n-way controller model scales the number of storage controllers with the number of nodes. This design eliminates the performance bottlenecks common with traditional dual-controller storage arrays. Each Nutanix node that is added to a cluster uses its local controller VM as its gateway to NDFS and as its primary I/O point. Nutanix takes a big-data approach with a distributed MapReduce framework to manage cluster-wide operations. Nutanix distributes tasks and operations for self-healing and the redistribution of data for high availability. 

IT can mix various Nutanix node types, whether they are compute-heavy or storage-heavy. So, your team can construct an infrastructure with the right balance for a particular environment or workload. Once they are powered on, new Nutanix nodes are automatically discovered using the Linux Avahi protocol and IPv6 link-local addresses. They are then added through the dynamic add-node process with zero downtime. Cluster metadata is distributed to new nodes as they are added, and storage resources are added to the cluster’s storage pool. This process extends the container’s capacity transparently. VMs are provisioned on the new hosts and cluster-balancing features such as DRS or performance and resource optimization (PRO) move VMs to the new hosts.

Learn more: read the Nutanix tech note on scalability.

Architectural Design

Converged Platform

The Nutanix solution is a converged storage and compute solution that leverages local components to create a distributed platform for virtualization, also known as a virtual computing platform.

The solution is a bundled hardware and software appliance which houses two nodes (6000 series) or four nodes (1000/2000/3000/3050 series) in a 2U footprint. Each node runs an industry-standard hypervisor (currently ESXi, KVM, or Hyper-V) and the Nutanix Controller VM (CVM). The Nutanix CVM runs the Nutanix software and serves all of the I/O operations for the hypervisor and all VMs running on that host. Depending on the hypervisor’s capabilities, the SCSI controller, which manages the SSD and HDD devices, is passed directly to the CVM leveraging Intel VT-d.

Below is an example of what a typical node logically looks like:

Together, a group of Nutanix nodes forms a distributed platform called the Nutanix Distributed Filesystem (NDFS). NDFS appears to the hypervisor like any centralized storage array; however, all of the I/Os are handled locally to provide the highest performance. More detail on how these nodes form a distributed system can be found below.

Below is an example of how these Nutanix nodes form NDFS:


Cluster Components

The Nutanix platform is composed of the following high-level components:

Medusa

  • Key Role: Distributed metadata store
  • Description: Medusa stores and manages all of the cluster metadata in a distributed, ring-like manner based upon a heavily modified Apache Cassandra. The Paxos algorithm is utilized to enforce strict consistency. This service runs on every node in the cluster.

Zeus

  • Key Role: Cluster configuration manager
  • Description: Zeus stores all of the cluster configuration, including hosts, IPs, state, etc., and is based upon Apache Zookeeper. This service runs on three nodes in the cluster, one of which is elected as a leader. The leader receives all requests and forwards them to the peers. If the leader fails to respond, a new leader is automatically elected.

Stargate

  • Key Role: Data I/O manager
  • Description: Stargate is responsible for all data management and I/O operations and is the main interface from the hypervisor (via NFS, iSCSI or SMB). This service runs on every node in the cluster in order to serve localized I/O.

Curator

  • Key Role: MapReduce cluster management and cleanup
  • Description: Curator is responsible for managing and distributing tasks throughout the cluster, including disk balancing, proactive scrubbing, and many more items. Curator runs on every node and is controlled by an elected Curator Master, who is responsible for task and job delegation.

Prism

  • Key Role: UI and API
  • Description: Prism is the management gateway for components and administrators to configure and monitor the Nutanix cluster. This includes Ncli, the HTML5 UI, and the REST API. Prism runs on every node in the cluster and uses an elected leader, like all components in the cluster.

Data Structure

The Nutanix Distributed Filesystem is composed of the following high-level structs:

Storage Pool

  • Key Role: Group of physical devices
  • Description: A storage pool is a group of physical storage devices including PCIe SSD, SSD, and HDD devices for the cluster. The storage pool can span multiple Nutanix nodes and is expanded as the cluster scales. In most configurations only a single storage pool is leveraged.

Container

  • Key Role: Group of VMs/files
  • Description: A container is a logical segmentation of the storage pool and contains a group of VMs or files (vDisks). Some configuration options (e.g., RF) are set at the container level but applied at the individual VM/file level. Containers typically have a 1-to-1 mapping with a datastore (in the case of NFS/SMB).

vDisk

  • Key Role: vDisk
  • Description: A vDisk is any file over 512KB on NDFS including .vmdks and VM hard disks. vDisks are composed of extents which are grouped and stored on disk as an extent group.

Below we show how these map between NDFS and the hypervisor:

Extent

  • Key Role: Logically contiguous data
  • Description: An extent is a 1MB piece of logically contiguous data which consists of n contiguous blocks (the number varies depending on guest OS block size). Extents are written/read/modified on a sub-extent basis (aka slice) for granularity and efficiency. An extent’s slice may be trimmed when moving into the cache depending on the amount of data being read/cached.

Extent Group

  • Key Role: Physically contiguous stored data
  • Description: An extent group is a 4MB piece of physically contiguous stored data. This data is stored as a file on the storage device owned by the CVM. Extents are dynamically distributed among extent groups to provide data striping across nodes/disks to improve performance.
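
To make these structs concrete, below is a minimal sketch (illustrative Python, not Nutanix code) that carves a vDisk’s address space into 1MB extents and packs them into 4MB extent groups. In reality extents are distributed among extent groups dynamically for striping, so the simple sequential packing here is an assumption for the example:

  EXTENT_SIZE = 1 * 1024 * 1024          # 1MB of logically contiguous data
  EXTENT_GROUP_SIZE = 4 * 1024 * 1024    # 4MB of physically contiguous stored data

  def map_vdisk(vdisk_size_bytes):
      """Return extent groups, each holding up to 4 extent indexes (sketch only)."""
      num_extents = (vdisk_size_bytes + EXTENT_SIZE - 1) // EXTENT_SIZE
      per_group = EXTENT_GROUP_SIZE // EXTENT_SIZE   # 4 extents per extent group
      return [list(range(i, min(i + per_group, num_extents)))
              for i in range(0, num_extents, per_group)]

  # A 10MB vDisk -> 10 extents -> 3 extent groups: [0-3], [4-7], [8-9]
  print(map_vdisk(10 * 1024 * 1024))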

Below we show how these structs relate between the various filesystems:

Here is another graphical representation of how these units are logically related:


I/O Path Overview

The Nutanix I/O path is composed of the following high-level components:

OpLog

  • Key Role: Persistent write buffer
  • Description: The OpLog is similar to a filesystem journal and is built to handle bursty writes, coalesce them, and then sequentially drain the data to the extent store. Upon a write, the OpLog is synchronously replicated to another CVM’s OpLog before the write is acknowledged, for data availability purposes. All CVM OpLogs partake in the replication and are dynamically chosen based upon load. The OpLog is stored on the SSD tier of the CVM to provide extremely fast write I/O performance, especially for random I/O workloads. For sequential workloads the OpLog is bypassed and the writes go directly to the extent store. If data is currently sitting in the OpLog and has not been drained, all read requests for it will be fulfilled directly from the OpLog until it has been drained, at which point the data is served by the extent store/content cache. For containers where fingerprinting (aka dedupe) has been enabled, all write I/Os will be fingerprinted using a hashing scheme, allowing them to be deduped based upon fingerprint in the content cache.
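
As a rough illustration of the write path just described (a simplified Python sketch, not the actual Stargate implementation; the RF value and the “least-loaded peer” heuristic are assumptions for the example), random writes land in the OpLog and are synchronously replicated before acknowledgement, while sequential writes bypass the OpLog:

  RF = 2   # assumed replication factor for this example

  def handle_write(io, local_cvm, peer_cvms):
      """Sketch of the write path; io is a dict like {"sequential": bool, "data": bytes}."""
      if io["sequential"]:
          # Sequential writes bypass the OpLog and go directly to the extent store.
          local_cvm["extent_store"].append(io)
      else:
          # Random writes land in the local OpLog (SSD tier) ...
          local_cvm["oplog"].append(io)
          # ... and are synchronously replicated to RF-1 peer OpLogs, chosen
          # dynamically based upon load (here: simply the least-loaded peers).
          for peer in sorted(peer_cvms, key=lambda c: len(c["oplog"]))[:RF - 1]:
              peer["oplog"].append(io)
      return "ack"   # only acknowledged to the hypervisor once all copies exist

  cvms = [{"oplog": [], "extent_store": []} for _ in range(4)]
  handle_write({"sequential": False, "data": b"x"}, cvms[0], cvms[1:])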

Extent Store

  • Key Role: Persistent data storage
  • Description: The Extent Store is the persistent bulk storage of NDFS; it spans SSD and HDD and is extensible to facilitate additional devices/tiers. Data entering the extent store has either been A) drained from the OpLog or B) written directly because it is sequential in nature and bypassed the OpLog. Nutanix ILM will determine tier placement dynamically based upon I/O patterns and will move data between tiers.

Content Cache

  • Key Role: Dynamic read cache
  • Description: The Content Cache (aka “Elastic Dedupe Engine”) is a deduplicated read cache which spans both the CVM’s memory and SSD. Upon a read request for data not in the cache (or for a particular fingerprint), the data will be placed into the single-touch pool of the content cache, which sits completely in memory, where it remains under LRU until it is evicted from the cache. Any subsequent read request will “move” the data (no data is actually moved, just cache metadata) into the multi-touch pool, which consists of both memory and SSD. From here there are two LRU cycles: one for the in-memory portion, from which eviction moves the data to the SSD section of the multi-touch pool, where a new LRU counter is assigned. Any read request for data in the multi-touch pool will cause the data to go to the peak of the multi-touch pool, where it is given a new LRU counter. Fingerprinting is configured at the container level and can be configured via the UI. By default, fingerprinting is disabled.
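
The two-pool behavior can be sketched roughly as follows (illustrative Python only; the capacities are made up, and the multi-touch pool’s memory and SSD sections are collapsed into one structure for brevity):

  from collections import OrderedDict

  class ContentCacheSketch:
      """Two-pool LRU: single-touch (memory) and multi-touch (memory + SSD, modeled as one)."""
      def __init__(self, single_cap=4, multi_cap=8):   # capacities are arbitrary here
          self.single_cap, self.multi_cap = single_cap, multi_cap
          self.single, self.multi = OrderedDict(), OrderedDict()

      def read(self, fingerprint, fetch_from_extent_store):
          if fingerprint in self.multi:
              self.multi.move_to_end(fingerprint)          # refresh LRU position
              return self.multi[fingerprint]
          if fingerprint in self.single:
              # Second touch: promote to the multi-touch pool (metadata move only).
              self.multi[fingerprint] = self.single.pop(fingerprint)
              self._evict(self.multi, self.multi_cap)
              return self.multi[fingerprint]
          # Miss: fetch the data and place it in the single-touch pool.
          self.single[fingerprint] = fetch_from_extent_store(fingerprint)
          self._evict(self.single, self.single_cap)
          return self.single[fingerprint]

      @staticmethod
      def _evict(pool, cap):
          while len(pool) > cap:
              pool.popitem(last=False)   # evict the least recently used entry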

Here we show a high-level overview of the Content Cache:

Extent Cache

  • Key Role: In-memory read cache
  • Description: The Extent Cache is an in-memory read cache that sits completely in the CVM’s memory. It stores non-fingerprinted extents for containers where fingerprinting and dedupe are disabled. As of version 3.5 this is separate from the Content Cache; however, the two will be merged in a subsequent release.

How It Works

Data Protection

The Nutanix platform currently uses a replication factor (RF) to ensure data redundancy and availability in the case of a node or disk failure. As explained in the Architectural Design section, the OpLog acts as a staging area to absorb incoming writes onto a low-latency SSD tier. Upon being written to the local OpLog, the data is synchronously replicated to the OpLogs of one or two other Nutanix CVMs (dependent on RF) before being acknowledged (Ack) as a successful write to the host. This ensures that the data exists in at least two independent locations and is fault tolerant. All nodes participate in OpLog replication to eliminate any “hot nodes” and to ensure linear performance at scale.

Data is then asynchronously drained to the extent store, where the RF is implicitly maintained. In the case of a node or disk failure, the data is then re-replicated among all nodes in the cluster to maintain the RF.
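
A rough sketch of the re-replication idea (illustrative Python, not Curator/Stargate code; the trivial placement policy is an assumption for the example) is to scan for data that has fallen below RF after a failure and rebuild copies on healthy nodes:

  RF = 2   # assumed replication factor

  def rereplicate(replica_map, healthy_nodes):
      """replica_map: {extent_group_id: set of node ids holding a copy}."""
      for egroup, holders in replica_map.items():
          holders = holders & healthy_nodes              # drop copies on failed nodes
          while len(holders) < RF and (healthy_nodes - holders):
              # Place a new copy on a healthy node that does not already hold one
              # (a trivial placement policy, purely for illustration).
              holders.add(min(healthy_nodes - holders))
          replica_map[egroup] = holders
      return replica_map

  # Node 3 fails: its copies are rebuilt on the surviving nodes.
  print(rereplicate({"eg1": {1, 3}, "eg2": {2, 3}}, healthy_nodes={1, 2, 4}))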

Below we show an example of what this logically looks like:



Data Locality

Being a converged (compute+storage) platform, I/O and data locality are key to cluster and VM performance with Nutanix. As explained above in the I/O path, all read/write I/Os are served by the local Controller VM (CVM), which runs on each hypervisor adjacent to normal VMs. A VM’s data is served locally from the CVM and sits on local disks under the CVM’s control. When a VM is moved from one hypervisor node to another (or during an HA event), the newly migrated VM’s data will be served by the now local CVM. When reading old data (stored on the now remote node/CVM), the I/O will be forwarded by the local CVM to the remote CVM. All write I/Os will occur locally right away. NDFS will detect that the I/Os are occurring from a different node and will migrate the data locally in the background, allowing all read I/Os to then be served locally. The data will only be migrated on a read so as not to flood the network.
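
A simplified sketch of this migrate-on-read behavior follows (illustrative Python; the read-count trigger below is an assumed placeholder, as the real heuristic is internal to ILM):

  REMOTE_READ_THRESHOLD = 3   # illustrative trigger; the real heuristic is internal to ILM

  def read_extent(extent_id, vm_node, location_map, remote_reads):
      """location_map: {extent_id: node currently holding the data}."""
      owner = location_map[extent_id]
      if owner == vm_node:
          return "served locally"
      # Remote read: the local CVM forwards the I/O to the remote CVM, and after
      # enough remote reads ILM migrates the extent so future reads are local.
      remote_reads[extent_id] = remote_reads.get(extent_id, 0) + 1
      if remote_reads[extent_id] >= REMOTE_READ_THRESHOLD:
          location_map[extent_id] = vm_node
          remote_reads[extent_id] = 0
      return f"forwarded to node {owner}"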

Below we show an example of how data will “follow” the VM as it moves between hypervisor nodes:


Scalable Metadata

Metadata is at the core of any intelligent system and is even more critical for any filesystem or storage array. For NDFS there are a few key requirements critical to its success: it has to be right 100% of the time (aka “strictly consistent”), it has to be scalable, and it has to perform at massive scale. As mentioned in the architecture section above, NDFS utilizes a “ring-like” structure as a key-value store which stores essential metadata as well as other platform data (e.g., stats). In order to ensure metadata availability and redundancy, an RF is utilized among an odd number of nodes (e.g., 3, 5, etc.).

Upon a metadata write or update, the row is written to a node in the ring and then replicated to n peers (where n is dependent on cluster size). A majority of nodes must agree before anything is committed, which is enforced using the Paxos algorithm. This ensures strict consistency for all data and metadata stored as part of the platform.
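
The majority rule can be expressed in a couple of lines (a sketch of the quorum arithmetic only, not the actual Paxos implementation):

  def is_committed(acks_received, replica_count):
      # A metadata update commits only when a strict majority of replicas acknowledge it.
      return acks_received > replica_count // 2

  print(is_committed(2, 3))   # True: 2 of 3 is a majority
  print(is_committed(2, 5))   # False: 2 of 5 is not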

Below we show an example of a metadata insert/update for a 4 node cluster:

Performance at scale is another important requirement for NDFS metadata. Contrary to traditional dual-controller or “master” models, each Nutanix node is responsible for a subset of the overall platform’s metadata. This eliminates the traditional bottlenecks by allowing metadata to be served and manipulated by all nodes in the cluster. A consistent hashing scheme is utilized to minimize the redistribution of keys during cluster size modifications (aka “add/remove node”).

When the cluster scales (eg. from 4 to 8 nodes), the nodes are inserted throughout the ring between nodes for “block awareness” and reliability.
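
A minimal consistent-hashing sketch (illustrative Python, not the Medusa/Cassandra implementation; MD5 is used purely for demonstration) shows why adding a node remaps only a fraction of the keys:

  import bisect, hashlib

  def _point(value):
      # Hash a node name or key onto the ring (MD5 used purely for illustration).
      return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

  class Ring:
      def __init__(self, nodes):
          self.ring = sorted((_point(n), n) for n in nodes)

      def owner(self, key):
          points = [p for p, _ in self.ring]
          return self.ring[bisect.bisect(points, _point(key)) % len(self.ring)][1]

      def add_node(self, node):
          bisect.insort(self.ring, (_point(node), node))

  ring = Ring(["node1", "node2", "node3", "node4"])
  before = {k: ring.owner(k) for k in range(1000)}
  ring.add_node("node5")
  moved = sum(1 for k in range(1000) if ring.owner(k) != before[k])
  print(f"{moved} of 1000 keys remapped after adding a node")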

Below we show an example of the metadata “ring” and how it scales:


Shadow Clones

The Nutanix Distributed Filesystem has a feature called ‘Shadow Clones’ which allows for distributed caching of particular vDisks or VM data in a ‘multi-reader’ scenario. A great example is a VDI deployment, in which many ‘linked clones’ forward read requests to a central master or ‘base VM’. In the case of VMware View this is called the replica disk and is read by all linked clones; in XenDesktop it is called the MCS Master VM. This also works in any other multi-reader scenario (e.g., deployment servers, repositories, etc.).

Data or I/O locality is critical for the highest possible VM performance and is a key struct of NDFS. With Shadow Clones, NDFS will monitor vDisk access trends, similar to what it does for data locality. However, if requests are occurring from more than two remote CVMs (as well as the local CVM), and all of the requests are read I/O, the vDisk will be marked as immutable. Once the disk has been marked as immutable, the vDisk can then be cached locally by each CVM making read requests to it (aka Shadow Clones of the base vDisk).

This will allow VMs on each node to read the base VM’s vDisk locally. In the case of VDI, this means the replica disk can be cached by each node and all read requests for the base will be served locally. NOTE: The data will only be migrated on a read so as not to flood the network and to allow for efficient cache utilization. In the case where the base VM is modified, the Shadow Clones will be dropped and the process will start over. Shadow Clones are disabled by default (as of 3.5) and can be enabled/disabled using the following NCLI command: ncli cluster edit-params enable-shadow-clones=true
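
The trigger condition can be sketched as follows (illustrative Python only; the exact detection heuristic inside NDFS is more involved than this):

  def should_shadow_clone(access_log, owner_cvm):
      """access_log: list of (cvm_id, op) tuples observed for one vDisk."""
      if any(op == "write" for _, op in access_log):
          return False            # any write keeps the vDisk mutable
      remote_readers = {cvm for cvm, op in access_log
                        if op == "read" and cvm != owner_cvm}
      # Read-only access from more than two remote CVMs marks the vDisk immutable,
      # after which each CVM can cache (shadow-clone) it locally.
      return len(remote_readers) > 2

  log = [("cvm1", "read"), ("cvm2", "read"), ("cvm3", "read"), ("cvm4", "read")]
  print(should_shadow_clone(log, owner_cvm="cvm1"))   # True: 3 remote readers, no writes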

Below we show an example of how Shadow Clones work and allow for distributed caching:


Elastic Dedupe Engine

The Elastic Dedupe Engine is a software-based feature of NDFS which allows for data deduplication in both the capacity (HDD) and performance (SSD/memory) tiers. Sequential streams of data are fingerprinted during ingest using a SHA-1 hash at a 16K granularity. The fingerprint is computed only on data ingest and is then stored persistently as part of the written block’s metadata. Contrary to traditional approaches, which utilize background scans requiring the data to be re-read, Nutanix performs the fingerprinting in-line on ingest. For duplicate data in the capacity tier, the data does not need to be scanned or re-read; the duplicate copies can simply be removed.

NOTE: Initially a 4K granularity was used for fingerprinting, however after testing 16K offered the best blend of deduplication with reduced metadata overhead.  When deduped data is pulled into the cache this is done at 4K.

Below we show an example of how the Elastic Dedupe Engine scales and handles local VM I/O requests:


Fingerprinting is done during data ingest of data with an I/O size of 64K or greater. Intel acceleration is leveraged for the SHA-1 computation which accounts for very minimal CPU overhead. In cases where fingerprinting is not done during ingest (eg. smaller I/O sizes), fingerprinting can be done as a background process.
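
A rough sketch of the ingest fingerprinting just described (illustrative Python using the 16K and 64K values from the text; hardware SHA-1 acceleration is of course not modeled here):

  import hashlib

  CHUNK = 16 * 1024     # 16K fingerprint granularity
  MIN_IO = 64 * 1024    # smaller I/Os are fingerprinted later by a background process

  def fingerprint(write_buffer):
      if len(write_buffer) < MIN_IO:
          return None   # deferred to background fingerprinting
      return [hashlib.sha1(write_buffer[i:i + CHUNK]).hexdigest()
              for i in range(0, len(write_buffer), CHUNK)]

  # Identical buffers produce identical fingerprints, so duplicates can be
  # removed in the capacity tier and served once from the content cache.
  print(fingerprint(b"a" * MIN_IO) == fingerprint(b"a" * MIN_IO))   # True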

The Elastic Deduplication Engine spans both the capacity tier (HDD) and the performance tier (SSD/memory). As duplicate data is determined, based upon multiple copies of the same fingerprints, a background process will remove the duplicate data using the NDFS MapReduce framework (Curator). For data that is being read, the data will be pulled into the NDFS Content Cache, which is a multi-tier/pool cache. Any subsequent requests for data having the same fingerprint will be pulled directly from the cache. To learn more about the Content Cache and pool structure, please refer to the ‘Content Cache’ sub-section in the I/O Path Overview.

Below we show an example of how the Elastic Dedupe Engine interacts with the NDFS I/O path:


Networking and I/O

The Nutanix platform does not leverage any backplane for inter-node communication and relies only on a standard 10GbE network. All storage I/O for VMs running on a Nutanix node is handled by the hypervisor on a dedicated private network. The I/O request will be handled by the hypervisor, which will then forward the request to the private IP on the local CVM. The CVM will then perform the remote replication with other Nutanix nodes using its external IP over the public 10GbE network. In most cases, read requests are served completely locally and never touch the 10GbE network.

This means that the only traffic touching the public 10GbE network will be NDFS remote replication traffic and VM network I/O. There will, however, be cases where the CVM will forward requests to other CVMs in the cluster, such as when a CVM is down or data is remote. Also, cluster-wide tasks such as disk balancing will temporarily generate I/O on the 10GbE network.

Below we show an example of how the VM’s I/O path interacts with the private and public 10GbE network:


CVM Autopathing

Reliability and resiliency are a key, if not the most important, piece of NDFS. Being a distributed system, NDFS is built to handle component, service, and CVM failures. In this section I’ll cover how CVM “failures” are handled (I’ll cover how we handle component failures in a future update). A CVM “failure” could include a user powering down the CVM, a CVM rolling upgrade, or any event which might bring down the CVM. NDFS has a feature called autopathing: when a local CVM becomes unavailable, the I/Os are transparently served by another CVM.

The hypervisor and CVM communicate using a private 192.168.5.0 network on a dedicated vSwitch (more on this above). This means that all storage I/Os happen on the internal IP address of the CVM (192.168.5.2). The external IP address of the CVM is used for remote replication and for CVM communication.

Below we show an example of what this looks like:

In the event of a local CVM failure, the 192.168.5.2 address previously hosted by the local CVM becomes unavailable. NDFS will automatically detect this outage and will redirect these I/Os to another CVM in the cluster over 10GbE. The re-routing is done transparently to the hypervisor and VMs running on the host. This means that even if a CVM is powered down, the VMs will still continue to be able to perform I/Os to NDFS. NDFS is also self-healing, meaning it will detect that the CVM has been powered off and will automatically reboot or power on the local CVM. Once the local CVM is back up and available, traffic will then seamlessly be transferred back and served by the local CVM.
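
At a high level, the autopathing decision can be sketched like this (illustrative Python only; the real mechanism is implemented at the network/route level, and the external IPs below are placeholder assumptions):

  INTERNAL_STORAGE_IP = "192.168.5.2"   # the address the hypervisor always talks to

  def resolve_storage_target(local_cvm_healthy, remote_cvm_external_ips):
      if local_cvm_healthy:
          return INTERNAL_STORAGE_IP                 # served by the local CVM as usual
      # Local CVM is down: redirect the I/O path to a healthy remote CVM over the
      # 10GbE network; the hypervisor and guest VMs are unaware of the change.
      return remote_cvm_external_ips[0]

  print(resolve_storage_target(False, ["10.0.0.12", "10.0.0.13"]))   # 10.0.0.12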

Below we show a graphical representation of how this looks for a failed CVM:


Disk Balancing

NDFS is designed to be a very dynamic platform which can react to various workloads as well as allow heterogeneous node types (compute-heavy 3050s, storage-heavy 60X0s, etc.) to be mixed in a single cluster. Ensuring uniform distribution of data is an important item when mixing nodes with larger storage capacities.

NDFS has a native feature called disk balancing which is used to ensure uniform distribution of data throughout the cluster. Disk balancing works on a node’s utilization of its local storage capacity and is integrated with NDFS ILM. Its goal is to keep utilization uniform among nodes once the utilization has breached a certain threshold.

Below we show an example of a mixed cluster (3050 + 6050) in an “unbalanced” state:

Disk balancing leverages the NDFS Curator framework and is run as a scheduled process as well as when a threshold has been breached (e.g., local node capacity utilization > n%). If the data is not balanced, Curator will determine which data needs to be moved and will distribute the tasks to nodes in the cluster.

In the case where the node types are homogeneous (e.g., 3050), utilization should be fairly uniform. However, if certain VMs running on a node are writing much more data than others, a skew can develop in the per-node capacity utilization. In this case disk balancing would run and move the coldest data on that node to other nodes in the cluster.

In the case where the node types are heterogeneous (eg. 3050 + 6020/50/70), or where a node may be used in a “storage only” mode (not running any VMs), there will likely be a requirement to move data.
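
A rough sketch of the balancing idea (illustrative Python, not Curator code; the threshold and the “move the coldest data to the least-utilized node” policy are simplifications assumed for the example):

  THRESHOLD = 0.80   # illustrative utilization trigger, not the actual value

  def balance(nodes):
      """nodes: {name: {"capacity": bytes, "egroups": [(last_access, size), ...]}}"""
      def util(node):
          return sum(size for _, size in node["egroups"]) / node["capacity"]
      for node in nodes.values():
          node["egroups"].sort()                       # coldest (oldest access) first
          while util(node) > THRESHOLD:
              target = min(nodes.values(), key=util)   # least-utilized node
              if target is node:
                  break
              target["egroups"].append(node["egroups"].pop(0))
      return {name: round(util(n), 2) for name, n in nodes.items()}

  cluster = {"node3050": {"capacity": 100, "egroups": [(t, 10) for t in range(9)]},
             "node6050": {"capacity": 400, "egroups": [(t, 10) for t in range(4)]}}
  print(balance(cluster))   # cold data drains from the full 3050 toward the 6050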

Below we show an example of the mixed cluster in a “balanced” state after disk balancing has been run:

In some scenarios customers might run some nodes in a “storage only” state, where only the CVM runs on a node whose primary purpose is bulk storage capacity.

Below we show an example of how a storage only node would look in a mixed cluster with disk balancing moving data to it from the active VM nodes:


Software-Defined Controller Architecture

As mentioned above (likely numerous times), the Nutanix platform is a software-based solution which ships as a bundled software + hardware appliance. The controller VM is where the vast majority of the Nutanix software and logic sits, and it was designed from the beginning to be an extensible and pluggable architecture.

A key benefit of being software-defined, and of not relying upon any hardware offloads or constructs, is extensibility. As with any product life cycle, there will always be advancements and new features introduced. By not relying on any custom ASIC/FPGA or hardware capabilities, Nutanix can develop and deploy these new features through a simple software update. This means that a new feature (say, deduplication) can be deployed by upgrading the current version of the Nutanix software. This also allows newer-generation features to be deployed on legacy hardware models.

For example, say you’re running a workload on an older version of Nutanix software on a prior-generation hardware platform (e.g., 2400). The running software version doesn’t provide deduplication capabilities, which your workload could benefit from greatly. To get these features, you perform a rolling upgrade of the Nutanix software version while the workload is running, and voilà, you now have deduplication. It’s really that easy.

Similar to features, the ability to create new “adapters” or interfaces into NDFS is another key capability. When the product first shipped it solely supported iSCSI for I/O from the hypervisor; this has now grown to include NFS and SMB. In the future there is the ability to create new adapters for various workloads and hypervisors (HDFS, etc.). And again, all deployed via a software update.

This is contrary to most legacy infrastructure, where a hardware upgrade or software purchase was normally required to get the “latest and greatest” features. With Nutanix it’s different: since all features are deployed in software, they can run on any hardware platform and any hypervisor, and be deployed through simple software upgrades.

Below we show a logical representation of what this software-defined controller framework looks like:


Storage Tiering and Prioritization

The Disk Balancing section above talked about how storage capacity was pooled among all nodes in a Nutanix cluster and that ILM would be used to keep hot data local. A similar concept applies to disk tiering in which the cluster’s SSD and HDD tiers are cluster wide and NDFS ILM is responsible for triggering data movement events.

A local node’s SSD tier is always the highest-priority tier for all I/O generated by VMs running on that node; however, all of the cluster’s SSD resources are made available to all nodes within the cluster. The SSD tier always offers the highest performance and is a very important resource to manage for hybrid arrays.

The tier prioritization can be classified at a high-level by the following:

Specific types of resources (e.g., SSD, HDD, etc.) are pooled together and form a cluster-wide storage tier. This means that any node within the cluster can leverage the full tier capacity, regardless of whether it is local or not.

Below we show a high level example of how this pooled tiering looks:

A common question is: what happens when a local node’s SSD becomes full? As mentioned in the Disk Balancing section, a key concept is trying to keep uniform utilization of devices within disk tiers. In the case where a local node’s SSD utilization is high, disk balancing will kick in to move the coldest data on the local SSDs to the other SSDs throughout the cluster. This will free up space on the local SSD to allow the local node to write to SSD locally instead of going over the network. A key point to mention is that all CVMs and SSDs are used for this remote I/O to eliminate any potential bottlenecks and mitigate some of the hit of performing I/O over the network.

The other case is when the overall tier utilization breaches a specific threshold [curator_tier_usage_ilm_threshold_percent (Default=75)], at which point NDFS ILM will kick in and, as part of a Curator job, will down-migrate data from the SSD tier to the HDD tier. This will bring utilization within the threshold mentioned above or free up space by the following amount [curator_tier_free_up_percent_by_ilm (Default=15)], whichever is greater. The data for down-migration is chosen using last access time.

In the case where the SSD tier utilization is 95%, 20% of the data in the SSD tier will be moved to the HDD tier (95% –> 75%). However, if the utilization was 80% only 15% of the data would be moved to the HDD tier using the minimum tier free up amount.
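
The arithmetic behind those two thresholds can be shown directly (a worked sketch in Python using the default values quoted above):

  TIER_USAGE_THRESHOLD = 75   # curator_tier_usage_ilm_threshold_percent (default)
  MIN_FREE_UP = 15            # curator_tier_free_up_percent_by_ilm (default)

  def percent_to_down_migrate(ssd_utilization_percent):
      over_threshold = ssd_utilization_percent - TIER_USAGE_THRESHOLD
      return max(over_threshold, MIN_FREE_UP)   # whichever amount is greater

  print(percent_to_down_migrate(95))   # 20 -> brings 95% back down to 75%
  print(percent_to_down_migrate(80))   # 15 -> the minimum free-up amount applies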

NDFS ILM will constantly monitor the I/O patterns and (down/up)-migrate data as necessary as well as bring the hottest data local regardless of tier.


Availability Domains

Availability Domains, aka node/block/rack awareness, is a key struct for distributed systems to abide by when determining component and data placement. NDFS is currently node and block aware; however, this will increase to rack awareness as cluster sizes grow. Nutanix refers to a “block” as the chassis which contains either one, two, or four server “nodes”. NOTE: 3 blocks must be utilized for block awareness to be activated.

For example, a 3450 would be a block which holds 4 nodes. The reason for distributing roles or data across blocks is to ensure that if a block fails or needs maintenance, the system can continue to run without interruption. NOTE: Within a block, the redundant PSUs and fans are the only shared components.

Awareness can be broken into a few key focus areas:

  • Data (The VM data)
  • Metadata (Cassandra)
  • Configuration Data (Zookeeper)

Data

With NDFS, data replicas will be written to other blocks in the cluster to ensure that in the case of a block failure or planned downtime the data remains available. This is true for both RF2 and RF3 scenarios.

An easy comparison would be “node awareness” where a replica would need to be replicated to another node which will provide protection in the case of a node failure.  Block awareness further enhances this by providing data availability assurances in the case of block outages.
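
A minimal sketch of block-aware placement (illustrative Python; the real placement logic considers far more than this, and the round-robin pick within a block is an assumption for the example):

  def place_replicas(data_id, blocks, rf=2):
      """blocks: {block_id: [node_ids]}; returns one node per distinct block."""
      placement = []
      for nodes in blocks.values():
          if len(placement) == rf:
              break
          # Pick a node within this block (round-robin on the data id for spread).
          placement.append(nodes[data_id % len(nodes)])
      return placement

  blocks = {"block1": [1, 2, 3, 4], "block2": [5, 6, 7, 8], "block3": [9, 10, 11, 12]}
  print(place_replicas(7, blocks))   # [4, 8]: the two copies land on different blocks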

Below we show how the replica placement would work in a 3 block deployment:


In the case of a block failure, block awareness will be maintained and the replicas will be re-replicated to other blocks within the cluster:


Metadata

Nutanix leverages a heavily modified Cassandra platform to store metadata and other essential information.  Cassandra leverages a ring-like structure and replicates to n number of peers within the ring to ensure data consistency and availability.

Below we show an example of the Cassandra ring for a 12 node cluster:


Cassandra peer replication iterates through nodes in a clockwise manner throughout the ring.  With block awareness the peers are distributed among the blocks to ensure no two peers are on the same block.

Below we show an example node layout translating the ring above into the block based layout:


With this block-aware nature, in the event of a block failure there will still be at least two copies of the data (with metadata RF3; in larger clusters RF5 can be leveraged).

Configuration Data

Nutanix leverages Zookeeper to store essential configuration data for the cluster.  This role is also distributed in a block aware manner to ensure availability in the case of a block failure.

Below we show an example layout showing 3 Zookeeper nodes distributed in a block aware manner:


In the event of a block outage, meaning one of the Zookeeper nodes will be gone, the Zookeeper role would be transferred to another node in the cluster as shown below:
