Big Data and Virtualization: Where the Twain Shall Meet


By Dheeraj Pandey
Big Data Analytics

Big Data and Virtualization are the two biggest revolutions underway in enterprise data centers. In the coming 3 to 5 years, they will have unquestionably changed the face of data centers as we know them today. Interestingly, the two phenomena have been unfolding in parallel, with little to no intersection or impact on one another. That is about to change, as the hitherto independent “ripples” intersect, overlap, and build into a massive wave that will shake enterprise IT even further.

This essay explores how the frontiers of these two ripples are about to meet, irreversibly changing each other’s fate along the way. Over the next few years, they will become one massive ripple that can no longer be traced back to two disparate epicenters of change. That is how interesting, yet predictable, this saga will be.

The Great Public-Private Divide: Amazon chargebacks keep them honest

The biggest player in Big Data, Hadoop, is 100% virtualized when run in public clouds. Amazon’s Elastic MapReduce (EMR) runs inside virtual machines, as does every EC2/S3 customer that runs Hadoop jobs. True elasticity of Big Data jobs, in which machines can be added or removed programmatically, can only be attained by leveraging OS virtualization. True economies of scale, where multiple jobs run simultaneously yet stay sandboxed from each other and from errant runaway jobs, can only be attained by leveraging the “isolation” of a virtual machine.
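To illustrate what "added or removed programmatically" can mean in practice, here is a minimal Python sketch of a sizing rule for an elastic cluster. It is not Amazon's or Hadoop's actual logic; the function name `target_workers`, its parameters, and the default bounds are all hypothetical choices made for illustration.

```python
import math


def target_workers(pending_tasks: int, slots_per_worker: int,
                   min_workers: int = 1, max_workers: int = 100) -> int:
    """Return how many worker VMs the cluster should have right now.

    All names and default bounds are illustrative assumptions.
    """
    if slots_per_worker <= 0:
        raise ValueError("slots_per_worker must be positive")
    if pending_tasks < 0:
        raise ValueError("pending_tasks must not be negative")
    # Enough workers to give every pending task a slot, rounded up.
    needed = math.ceil(pending_tasks / slots_per_worker)
    # Clamp to the configured floor and ceiling.
    return max(min_workers, min(max_workers, needed))
```

A controller could call a function like this periodically and start or stop virtual machines to close the gap between the current and the target size, which is only practical when machines are cheap to create and destroy.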

Private cloud Hadoop, on the other hand, is running amok with very little regard for hardware utilization and overall capital spend. Developers and scientists, who’ve never had to budget capex and opex before, are ordering rack-mounted servers like there is no tomorrow. These servers are running at 10-20% utilization for lack of virtualization. IT is a mute spectator because it doesn’t understand the Big Data phenomenon well enough to bring best practices to massive hardware clusters. And now, DevOps wants to pass the messy maintenance and monitoring of Hadoop clusters to IT, as they get impatient carrying pagers, replacing failed disks and network cards, and monitoring long-running data-crunching jobs.

So why this divide? Why are data scientists so much more careful about optimizing hardware and emphasizing elasticity when running in public clouds than when running internally? Well, the answer is simple: Amazon chargebacks are for real, while IT’s are artificial. When real money changes hands, people automate, they share, they optimize, and they focus on system utilization. That time has come for private cloud environments. Private clouds will only thrive if they bring the same efficiencies as their public counterparts. Otherwise, CIOs will see their teams shrink, as the efficiency divide between the two clouds grows.
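A rough back-of-the-envelope sketch in Python shows why utilization matters so much once chargebacks are real. The hourly server cost below is a made-up figure and `cost_per_useful_hour` is a hypothetical helper; only the 10-20% utilization range comes from the discussion above.

```python
def cost_per_useful_hour(hourly_cost: float, utilization: float) -> float:
    """Effective cost of one hour of work a server actually performs.

    hourly_cost: amortized capex plus opex per server-hour (assumed figure).
    utilization: fraction of each hour spent on useful work, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_cost / utilization


HOURLY_COST = 1.00  # hypothetical amortized cost per server-hour

for utilization in (0.10, 0.20, 0.60):
    effective = cost_per_useful_hour(HOURLY_COST, utilization)
    print(f"{utilization:.0%} utilized: {effective:.2f} per useful hour")
```

Under these assumptions, a server that sits idle most of the day costs several times more per unit of useful work than the same server kept busy, which is exactly the gap a real chargeback makes visible.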

Revenge of The Statistician: Can VDI keep them relevant?

The world has come to (synonymously) associate big data with Hadoop and data scientists with a new breed of Java-centric developers. In the enterprise though, data analytics will continue to be driven by a potent combination of this new Hadoop developer and the traditional statistician who uses SAS, R, OLAP cubes, and other tools that were built for the “desktop-centric modeler”. While statistical laws of sampling make it possible for these data scientists to develop models using their favorite desktop tool, Big Data is pushing the envelope on this conventional paradigm of data science and computing. The statistician’s laptop is neither secure enough nor elastic enough to handle the demands of larger samples, iterative/agile modeling, and capturing outliers for fraud analytics.

The erstwhile data scientist’s role is not diminishing in any way, Hadoop notwithstanding. Their business domain experience is here to last; their Windows-based tools are here to coexist with Hadoop. The skill sets will blend over time with subject matter experts on either side. Interestingly, the biggest enemy of the statistician is not Hadoop but their physical desktop with its loose security and meager hardware resources. A desktop in the cloud, leveraging the strengths of cloud computing and virtualization, could be their biggest friend in the coming years. With VDI, IT plugs numerous security holes, as data never leaves the datacenter. With VDI, IT provides a truly elastic computing model for scientists, as their data processing demands grow and shrink.

Big data, bigger data: Whither Storage Virtualization?

While OS virtualization is beginning to come up quite often in the context of big data, storage virtualization is dangerously missing from the conversations. Amazon EC2 users continue to scratch the surface of storage tiering by shuttling massive amounts of data back and forth between the inexpensive (yet reliable!) S3 and the more expensive (and yet unreliable!) EC2 and EBS storage. But that manual process is ugly and kludgy, to say the least. In private clouds, the situation is even worse. There is literally no talk of virtualizing Hadoop’s storage onto various tiers of the enterprise storage stack. Non-Hadoop MPP platforms have an even more telling story. None of EMC Greenplum, HP Vertica, or Teradata Aster has a distributed file system comparable to Hadoop’s HDFS, one that lets individual nodes spill over or fail over into each other and keeps user queries running when disks fail or fill up.

There is so much simplicity to be had in the big data world by separating the storage logic from the data analytics logic. There is so much value to be had by transparently leveraging fast SSDs for NoSQL or random IO workloads; or transparently moving data between the more expensive server-attached tiers and the less expensive network-attached storage (archive) tiers. The big data stack is sorely missing storage virtualization, and things are beginning to hurt already. Ask any Amazon EC2-based company that shuttles data between S3 and their EC2 tier every morning and night when they spin up and down their Hadoop clusters!
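As a minimal sketch of what "transparent" tiering could look like, the Python below picks a storage tier for a block of data from a few access statistics, so that the analytics code never has to move data by hand. The class `BlockStats`, the function `choose_tier`, the tier names, and every threshold are hypothetical and chosen only for illustration; they do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    reads_per_day: float       # how often the block is read
    random_io_fraction: float  # share of reads that are random, 0.0 to 1.0
    days_since_access: int     # age of the most recent access


def choose_tier(stats: BlockStats) -> str:
    """Pick a tier name for a block. Thresholds are illustrative assumptions."""
    # Cold data goes to the cheaper network-attached archive tier.
    if stats.days_since_access > 30:
        return "archive"
    # Hot, random-IO-heavy data (for example NoSQL lookups) goes to SSD.
    if stats.reads_per_day >= 100 and stats.random_io_fraction >= 0.5:
        return "ssd"
    # Everything else stays on server-attached disk.
    return "local-disk"
```

A background process in the storage layer could re-evaluate blocks with a rule like this and migrate them between tiers, leaving the analytics logic unaware that the data ever moved.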

Orchestration and Cloud Directors: Whither Big Data?

You’ve come this far reading this essay. I hope you and I are beginning to agree on how virtualization and cloud computing help big data in the enterprise. But what about the other way around? Is virtualization missing a big Big Data component? If you look at the embedded SQL databases underneath the enterprise management servers for virtualization, you will know what I mean. Most control plane products had SQL developers who understood how to write monitoring and system management applications with SQL as the cornerstone for persistence and querying. That is hurting us real bad today, what with fine-grained statistics for performance monitoring and chargebacks, and some very agile provisioning decisions that businesses have to make in today’s dynamic datacenter environment. The embedded SQL database in a management server has become the veritable tail that is wagging the orchestration dog. Virtualization administrators are having to painfully learn how to manage and debug unscalable SQL database environments, and make suboptimal decisions on stats gathering and chargebacks. Most of them don’t even know that their virtualization environments are slow or unreliable because of that pesky transaction log in that hidden 800-pound gorilla embedded in their vCenter server!

Orchestration, fine-grained chargeback, and performance debugging in multi-tenant cloud environments will continue to suffer as long as management products take a narrow SQL (and strictly transactional/ACID) view of persistence. Big data and NoSQL, with their minds free of SQL and ACID clutter, are the real need of the hour for large-scale cloud environments.

Big Virtualization’s Biggest Nemesis: Storage

We all know by now that virtualization over-promised and under-delivered vis-à-vis costs, complexity, and performance. And storage was the culprit. Network storage was built for physical environments, and is inarguably out-of-place and out-of-time in today’s highly dynamic virtualization environments. Solid-state drives make storage networking and spindle-based storage controllers look archaic. Fetching data that is sitting multiple hops away in a separate storage aisle is making virtual machines crawl. VDI projects are running into rough weather because the legacy storage underneath cannot predictably scale from the pilot proof-of-concept to real production scale.

So are there any lessons from the big data world for “big virtualization”? Turns out, scaling and simplifying large virtual environments share a solution that is at the heart of big data: collapsing the wall between compute and storage, and bringing storage as close to compute as possible. Now that is an epiphany moment! Storage networking and its massive complexity are the bane of virtualization and multi-tenant cloud environments. We need dramatically simpler architectures that beautifully scale out with time. We need a virtualized data center to resemble a Hadoop cluster, or a Google cluster, or a Facebook shared-nothing cluster powered completely by a FusionIO tier. We need enterprise data centers to look no different from those of the big cloud companies. We need convergence, period.
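The idea of bringing compute to the data can be sketched in a few lines of Python. This is a simplified illustration in the spirit of locality-aware scheduling, not Hadoop's actual scheduler; the function `place_task` and its arguments are hypothetical.

```python
from typing import Dict, List, Optional


def place_task(replica_nodes: List[str],
               free_slots: Dict[str, int]) -> Optional[str]:
    """Choose a node to run a task that reads one block of data.

    replica_nodes: nodes that already store a copy of the block.
    free_slots: free task slots per node (assumed to be known).
    Returns a node name, or None if the cluster has no free slot.
    """
    # First choice: a node that holds the data, so reads stay local.
    for node in replica_nodes:
        if free_slots.get(node, 0) > 0:
            return node
    # Fallback: the remote node with the most free slots; the data
    # must then cross the network.
    candidates = [n for n, slots in free_slots.items() if slots > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda n: free_slots[n])
```

The same preference order applies to a virtual machine and its disk: when the data lives on the host that runs the workload, the storage network drops out of the common path.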

The Tale of The Two Ripples

While big data and virtualization originated in isolation, they have more in common than appears on the surface. The two ripples will seamlessly meld into one…