Flash – Performance at what cost?
Flash is everywhere these days, and everyone likes to throw out impressive performance numbers. If someone told you that flash is faster than a hard drive, I am sure the thought that pops into your head would be, “yeah… thanks for the info”. There is nothing shocking about the statement; the real question is what you do with this resource. Most vendors are using roughly the same NAND with slight variations. How does the implementation take full advantage of the most expensive part of the solution?
One Giant Pool of Flash
Nutanix is a converged, distributed storage system made up of many nodes. Each node supplies local, direct-attached flash and hard drives. The local virtual storage controller running on each node takes all of these resources and puts them into one storage pool.
While workloads running on Nutanix write locally for performance, they are not limited to the local flash devices. Nutanix doesn’t create silos within its architecture; it uses controlled orchestration to take advantage of all of the components. The Nutanix Virtual Computing Platform is aware of the flash capacity on all of the nodes, and when a local device reaches capacity it allows workloads to go over the network to maintain performance. Nutanix customers don’t have to pay a premium for larger flash devices, and they have peace of mind if the working set spills over the available local flash.
This is made possible by the implementation of Apache Cassandra combined with Paxos for the metadata layer. The metadata layer allows Nutanix to become polymorphic in nature, easily adapting as nodes with different sizes of SSD are added to the cluster at any time. While the diagram above only shows 3 nodes, there is no architectural limit to the number of nodes acting as one device. And while the diagram is busy and might seem complicated, Nutanix customers don’t have to worry about any of this. Sophisticated under the covers, yes. But uncompromisingly simple for the user.
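To make the local-first, spill-over behaviour concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not Nutanix code: the `Node` class, the `place_write` function, and the "pick the remote node with the most free flash" rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical model of one node's flash tier (illustration only)."""
    name: str
    flash_capacity_gb: float
    flash_used_gb: float = 0.0

    @property
    def flash_free_gb(self) -> float:
        return self.flash_capacity_gb - self.flash_used_gb


def place_write(local: Node, cluster: List[Node], size_gb: float) -> Node:
    """Prefer the local node's flash; otherwise spill to a remote node.

    The remote choice (most free flash) is an assumed policy for this sketch.
    """
    if local.flash_free_gb >= size_gb:
        target = local
    else:
        candidates = [
            n for n in cluster
            if n is not local and n.flash_free_gb >= size_gb
        ]
        if not candidates:
            raise RuntimeError("no flash capacity left in the pool")
        target = max(candidates, key=lambda n: n.flash_free_gb)
    target.flash_used_gb += size_gb
    return target


# Usage: node A is nearly full, so a 10 GB write spills to another node.
a = Node("A", flash_capacity_gb=100, flash_used_gb=95)
b = Node("B", flash_capacity_gb=200, flash_used_gb=50)
c = Node("C", flash_capacity_gb=400, flash_used_gb=100)
cluster = [a, b, c]
spilled_to = place_write(a, cluster, 10)
stayed_on = place_write(a, cluster, 5)
```

Because nodes in the sketch carry their own capacity, a cluster with mixed SSD sizes needs no special handling, which is the point the metadata layer discussion makes.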
Another benefit of having access to all of the flash resources shows up in the event of an SSD failure. If you have a locked-in configuration where you can only write to the same devices, then you’re in trouble when an SSD or node goes offline. The storage doesn’t have anywhere to perform the synchronous write. Performance drops back to the speed of spinning disk, which ends up lighting up the support lines like the Fourth of July.
Multiple SSD drives
Because we are able to write anywhere in the cluster, we also have the unique ability to use multiple SSDs on each node without creating a silo out of the hard drives. If you’re tiering hot data or caching, eventually some of your data will hit the hard drives. If you split your hard drives amongst the available SSDs, you’ll end up limiting the speed at which you can read from them. Splitting up hard drives also has a direct impact on recovery times if you have to perform a rebuild. You could create 3, 4, maybe 5 copies of the data to get around the problem, but then capacity will become the next issue on your hands.
The only way to get around splitting up your hard drives is then to buy a very large SSD or PCIe device. Economics will always favour the dominant drive type on the market, so it may become cost prohibitive to select the larger drive.
Most of the all-flash array vendors have some “Secret Sauce” that has supposedly unlocked the key to longevity, but history shows the real benefit is not as large as they would have us believe. Years ago the SATA tweakers told a similar tale about placing data on the outer cylinders because it was supposed to be faster. Remember Pillar and their disk geometry arguments? These are all short-lived optimizations. All modern SSDs have a native garbage collector, do write wear leveling, etc. in their proprietary FTL (flash translation layer), so in reality a lot of the work has already been done.
Nutanix controls the hardware, and therefore can make executive decisions on how to use SSDs. Nutanix under-formats the SSDs to the manufacturer’s standards where needed; this provides good steady-state performance under constant load because more free blocks are available to choose from. It’s more beneficial to under-format than to wait for garbage collection to kick in and free some blocks. Intel’s default recommendation is to down-format their SSDs by 15% to deliver good steady-state performance, and Nutanix makes sure this happens every time.
Even Dell can’t guarantee the exact SSD that will end up in their systems. They do guarantee a range of performance, but not knowing the actual drive you’ll get makes it hard to down-format to the right level. It boils down to just another configuration step to add to the already heaping pile for most system admins.
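As a rough illustration of what under-formatting costs in capacity, the arithmetic can be sketched as follows. The function name `under_format` and the 400 GB drive size are made up for the example, and the sketch assumes the 15% figure means "leave 15% of the raw capacity unformatted"; the right reserve for a given drive depends on its manufacturer's guidance.

```python
from typing import Tuple


def under_format(raw_gb: float, reserve_percent: int) -> Tuple[float, float]:
    """Return (formatted_gb, spare_gb) for a drive that is under-formatted.

    reserve_percent is the share of raw capacity left unformatted so the
    drive's controller always has free blocks to write into.
    """
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in the range [0, 100)")
    spare_gb = raw_gb * reserve_percent / 100
    return raw_gb - spare_gb, spare_gb


# Usage: a hypothetical 400 GB SSD with a 15% reserve.
formatted_gb, spare_gb = under_format(400, 15)
print(f"formatted: {formatted_gb:.0f} GB, spare: {spare_gb:.0f} GB")
```

The trade is a fixed slice of capacity up front in exchange for not depending on garbage collection to free blocks under sustained load.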
Nutanix also keeps track of all of the writes that take place on the SSD and will notify you through the PRISM UI before a problem occurs, such as a drive approaching its write limit. Nutanix can also sense the write pattern and send sequential workloads directly to the hard drive tier to improve endurance. Not all workloads benefit from flash, so to get the best of both worlds, the performance tier saves space and will last longer than it would if data were sent blindly to SSD. Nutanix uses memory for cache, which can take advantage of our inline dedupe to make the best use of the resources. To avoid any concern, and as a testament to our architecture, even the smallest Nutanix support contract includes all parts, including the SSDs.
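One simple way to picture write-pattern sensing is a router that counts how many consecutive writes were contiguous and diverts long sequential runs away from flash. The sketch below is a generic illustration under that assumption; the `WriteRouter` class, its threshold, and the "ssd"/"hdd" labels are hypothetical and do not describe how Nutanix actually detects sequential I/O.

```python
from typing import Optional


class WriteRouter:
    """Toy sequential-write detector for a single write stream."""

    def __init__(self, sequential_threshold: int = 4) -> None:
        # Number of back-to-back contiguous writes before a run is
        # treated as sequential (an assumed tuning knob).
        self.sequential_threshold = sequential_threshold
        self._next_offset: Optional[int] = None
        self._run_length = 0

    def route(self, offset: int, length: int) -> str:
        """Return "hdd" for writes in a long sequential run, else "ssd"."""
        if offset == self._next_offset:
            self._run_length += 1
        else:
            self._run_length = 1  # a jump in offset starts a new run
        self._next_offset = offset + length
        if self._run_length >= self.sequential_threshold:
            return "hdd"
        return "ssd"


# Usage: three contiguous 4 KiB writes, then a random jump.
router = WriteRouter(sequential_threshold=3)
tiers = [
    router.route(0, 4096),
    router.route(4096, 4096),
    router.route(8192, 4096),
    router.route(1_000_000, 4096),
]
```

Hard drives handle long sequential streams well, so diverting them costs little performance while sparing the SSD's write cycles for the random I/O that benefits most from flash.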
For those really paranoid about integrity (as we are!), we do a checksum for every piece of data in the cluster, even the metadata itself. A checksum is computed on the write and is included in the metadata record. When a read happens, the system computes a new checksum and compares it with the checksum stored in metadata. If the two checksums match, the data is deemed to be consistent and valid. If the checksums fail to match, the extent is marked as ‘corrupt’ and a replica from a remote node is fetched for all subsequent requests. A replication task is then initiated to restore the data to the desired replication factor and remediate any corrupt extents. This is vitally important if you’re going to talk about petabytes of storage, as data can go bad on disk; this is referred to as bit rot.
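The write-then-verify flow described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual implementation: the post does not say which checksum algorithm is used, so CRC32 from the standard `zlib` module stands in for it, plain dictionaries stand in for the metadata store and the replicas, and the function names are invented for the example. The sketch also repairs corrupt copies inline rather than through a background replication task, and it does not checksum the metadata itself.

```python
import zlib
from typing import Dict, List

Metadata = Dict[str, int]      # extent id -> checksum recorded at write time
Replica = Dict[str, bytes]     # extent id -> data held by one node


def write_extent(metadata: Metadata, replicas: List[Replica],
                 extent_id: str, data: bytes) -> None:
    """Store the data on every replica and record its checksum."""
    metadata[extent_id] = zlib.crc32(data)
    for replica in replicas:
        replica[extent_id] = data


def read_extent(metadata: Metadata, replicas: List[Replica],
                extent_id: str) -> bytes:
    """Read from replicas in order (local first), verifying each copy.

    Copies that fail verification are treated as corrupt and are
    overwritten once a good copy is found.
    """
    expected = metadata[extent_id]
    corrupt: List[Replica] = []
    for replica in replicas:
        data = replica.get(extent_id)
        if data is not None and zlib.crc32(data) == expected:
            for bad in corrupt:
                bad[extent_id] = data  # remediate the corrupt copies
            return data
        corrupt.append(replica)
    raise IOError(f"no valid replica found for extent {extent_id!r}")


# Usage: simulate bit rot on the local copy, then read.
meta: Metadata = {}
local: Replica = {}
remote: Replica = {}
write_extent(meta, [local, remote], "extent-1", b"important data")
local["extent-1"] = b"importent data"  # silent corruption on disk
recovered = read_extent(meta, [local, remote], "extent-1")
```

After the read, the local copy has been restored from the remote replica, so the extent is back at its intended replication factor.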
Apache Cassandra has over 4,000 engineers as contributors, and Nutanix has engineers who have contributed code at Yahoo, Google and Facebook that helped to form the patent around this web-scale technology. This technology can really only be implemented as a VM. Improvements in the next major release are a direct result of using Cassandra to tag hardware automatically. The software will become more lifelike and spread roles not only over compute but over all individual hardware components, like SSDs. This granularity of control really is only possible with a scale-out solution like Cassandra. If another solution outstrips Cassandra, the code has been written in an extensible way so it can be changed in a Darwinian fashion.
As flash evolves to non-volatile memory and the next wave after that, Nutanix will be in a position to use these resources most cost-effectively. They may show up as a direct replacement or as another tier within Nutanix. The point here is that the groundwork has been laid and the future is bright on the performance tier for Nutanix.