Data Virtualization Promises No-Fuss Path to Information Unity

Technology has evolved to deliver on the vision of having a single view of all organization-wide data.

By Paul Gillin

June 18, 2020

Many IT departments have virtualized their servers – and perhaps their storage, networks and desktops, too. Now, what about virtualizing their data?

Enterprises striving for the analytics-driven decision-making at the foundation of digital transformation are increasingly doing just that. Gartner, for example, projects that by late 2020, 35% of enterprise organizations will have implemented data virtualization as an alternative to data integration.

Data virtualization isn’t a new concept, but it has acquired new relevance in the big data era as an alternative to arduous data integration processes, according to IT services company NTT Data. Organizations can use it to harmonize data scattered across the business and even across external web and social media sites, NTT Data says, without the infrastructure overhead and labor costs of creating expensive data warehouses or sprawling data lakes.

The problem has been that most enterprises have built, acquired or otherwise come to own dozens or even hundreds of information silos over the years, ranging from spreadsheets to operational databases. Each has its own structural framework, or schema, although some have no structure at all. 

“Data silos are a serious business problem…because they prevent the collaboration necessary to ensure competitiveness,” wrote Forbes Technology Council member Walter Scott. “Companies need an operational data layer that is core to business processes and supports data sharing.”

No More Copies

Data virtualization creates a single logical view of multiple data sources without requiring the organization to “replicate data and try to homogenize it into a single source,” explained Mike Wronski, director of product marketing at Nutanix. He said that reduces the workload on the IT organization. Virtualization can also significantly reduce the need for extract/transform/load (ETL), a laborious process that requires the attention of expensive data scientists.
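In concrete terms, the idea looks something like the sketch below: a logical view that answers each request by delegating to adapters over the underlying sources, so nothing is copied into a central store. The class names, file paths and query are illustrative only, not any vendor’s product.

```python
# Minimal sketch of a "virtual view": callers see one logical dataset, while
# each adapter reads its source on demand. Nothing is replicated or
# homogenized into a single copy. All names here are illustrative.
import csv
import sqlite3
from typing import Dict, Iterator


class SqliteSource:
    """Adapter over an operational database; rows stay in SQLite."""

    def __init__(self, path: str, query: str) -> None:
        self.path, self.query = path, query

    def rows(self) -> Iterator[Dict]:
        con = sqlite3.connect(self.path)
        con.row_factory = sqlite3.Row
        try:
            for record in con.execute(self.query):
                yield dict(record)
        finally:
            con.close()


class CsvSource:
    """Adapter over a flat file; rows are streamed, never bulk-loaded."""

    def __init__(self, path: str) -> None:
        self.path = path

    def rows(self) -> Iterator[Dict]:
        with open(self.path, newline="") as handle:
            yield from csv.DictReader(handle)


class VirtualView:
    """Single logical view: one iterator spanning every registered source."""

    def __init__(self, *sources) -> None:
        self.sources = sources

    def rows(self) -> Iterator[Dict]:
        for source in self.sources:
            yield from source.rows()


# Hypothetical usage; the paths and query are placeholders.
# customers = VirtualView(
#     SqliteSource("crm.db", "SELECT id, email FROM customers"),
#     CsvSource("legacy_export.csv"),
# )
# for row in customers.rows():
#     ...  # consume unified records; no copy, no homogenization step
```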

There’s room to virtualize at nearly every level of a company because no one ever has only one database.

Matthew Baird, chief technology officer at data virtualization company AtScale

These are among the reasons the data virtualization market is flourishing; Stratistics MRC expects it to reach $8.36 billion by 2026, growing at a 19.5% annual clip during that time. Growth has also been fueled in part by some disillusionment with Hadoop, the distributed processing framework widely credited with igniting the big data craze a decade ago.

That framework made it possible to cost-effectively combine data from many different sources, including unstructured ones like email and Twitter conversations, into a single repository or data lake for analysis, explained Wronski. However, doing so required organizations to copy large amounts of data into new repositories that were hard to structure, update and govern, he said.

Over time, many projects were undermined by rising processing and storage costs as well as paralyzing complexity, prompting some observers to label the repositories “data swamps.”  

Complexity a Big Motivator

That complexity has become a principal driver of data virtualization. “You don’t have just three or four databases,” said Matthew Baird, chief technology officer at data virtualization company AtScale. “Today, you may have 40 [databases],” including relational, text, graph, search and key-value stores, both on-premises and in the cloud.

Data virtualization emerged years ago principally to federate queries and cache results for performance purposes, Baird said, but it was still up to engineers to specify the underlying data structures and sources. That manual approach is impractical today in the midst of what McKinsey estimates is a shortage of up to 190,000 data scientists and engineers.

Today’s technology spiders across networks to discover data at the source, uses machine learning to interpret query results and optimizes the schema accordingly. “It’s an autonomous process that understands enough of the underlying infrastructure to do what data engineers would do,” Baird said. “You tell us what you have, and we figure out the best way to use it.”

Data virtualization enables queries to span many data sources at once while presenting them to the user as a single, unified resource. The data itself never moves, which pays off in reduced complexity, fewer errors and savings on servers, storage and bandwidth, said Baird.

Performance Dividend

While it might seem that adding an abstraction layer would exact a performance penalty, experts say that isn’t necessarily true. In the same way that virtual machines can perform better than bare-metal hardware, data virtualization architectures can improve response times by managing data and queries more efficiently.

IBM’s Queryplex, for example, processes queries in parallel across underlying source data and consolidates the results.

“Instead of funneling all the data through one node, it takes advantage of the computational mesh to do queries and analytics,” said Daniel Hernandez, vice president of IBM analytics, in an interview with SiliconAngle. “It distributes that workload.”
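The pattern Hernandez describes can be sketched generically: push each piece of the work out to the source that holds the data, run the pieces in parallel, and consolidate only the small partial results. The example below illustrates that fan-out-and-merge idea with stand-in lists; it is not IBM’s implementation.

```python
# Generic fan-out/merge sketch (not IBM's code): each source computes its
# partial answer locally, and only those partials travel back to be merged.
from concurrent.futures import ThreadPoolExecutor
from typing import List


def count_large_orders(source: List[float]) -> int:
    """Stand-in for pushing an aggregate down to one data source.

    In a real system this would run SQL (or the source's native query)
    at the source; here a "source" is just a list of order amounts.
    """
    return sum(1 for amount in source if amount > 100)


def federated_count(sources: List[List[float]]) -> int:
    """Run the per-source work in parallel, then consolidate the partials."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        partials = list(pool.map(count_large_orders, sources))
    return sum(partials)


if __name__ == "__main__":
    mesh = [
        [250, 80, 120],        # e.g. an on-prem relational store
        [40, 500],             # e.g. a cloud key-value store
        [130, 99, 101, 300],   # e.g. a regional replica
    ]
    print(federated_count(mesh))  # -> 6; raw rows never leave their source
```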

Data virtualization can also cut resource requirements by making the time-consuming ETL process dynamic. Instead of reformatting and integrating data in bulk before loading it into a data store, a process that can take weeks, the software takes a surgical approach to data movement.

ETL has done more than become dynamic, according to Wronski. “It has transformed into a ‘new process’ that's able to understand the data source and only move data based on needs. The old method moved ALL the data as part of ETL.”
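A rough illustration of that difference, using a made-up orders table: the old pattern copies every row into a staging area before transforming it, while the surgical pattern pushes the filter down to the source and moves only the rows the analysis actually needs.

```python
# Illustrative contrast (hypothetical table and column names): bulk ETL
# copies everything up front, while the selective approach pushes the
# filter to the source and moves only the rows the question requires.
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, day TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2020-06-01", 40.0), (2, "2020-06-15", 75.0), (3, "2020-06-16", 12.5)],
)

# Old pattern: copy ALL rows into a staging area, then transform and load.
staged = source.execute("SELECT * FROM orders").fetchall()

# Selective pattern: only rows relevant to the question cross the wire.
recent = source.execute(
    "SELECT id, amount FROM orders WHERE day >= ?", ("2020-06-15",)
).fetchall()

print(len(staged), len(recent))  # 3 rows copied vs. 2 rows moved on demand
```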

Alluxio takes that selective approach. The company’s virtual distributed file system uses a global namespace along with intelligent caching and in-memory metadata to logically integrate data at the application level rather than at the storage level. Data movement is automated so that only the data that’s absolutely needed is transformed.

“ETL becomes ELT,” said Dipti Borkar, who was Alluxio’s vice president of product management and marketing between November 2018 and February 2020.
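One way to picture “ETL becomes ELT”: raw records land untouched, and the transform runs later, at read time, only on the records a consumer asks for. The toy payload format below is invented purely for illustration.

```python
# Toy ELT sketch: load raw payloads as-is, transform only on read.
# The payload format and table are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_events (payload TEXT)")  # "L": load untouched
con.executemany(
    "INSERT INTO raw_events VALUES (?)",
    [("user=alice;amount=12",), ("user=bob;amount=90",)],
)


def transform(payload: str) -> dict:
    """Transform-on-read: parse a record only when a query asks for it."""
    return dict(pair.split("=") for pair in payload.split(";"))


# "T" happens at query time, and only for the rows this consumer touches.
records = (transform(p) for (p,) in con.execute("SELECT payload FROM raw_events"))
big_spenders = [r for r in records if float(r["amount"]) > 50]
print(big_spenders)  # -> [{'user': 'bob', 'amount': '90'}]
```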

Is data virtualization an all-or-nothing proposition?

“That’s the million-dollar question,” stated Baird. “There’s room to virtualize at nearly every level of a company because no one ever has only one database.”

The bigger payoff, however, comes from giving everyone who needs access to data “a single gateway, a single catalog, a single way to authenticate and apply policy,” he said.

That has the added benefit of providing a unified view of how data is being used across the enterprise – information that organizations can use to allocate their storage and data resources more efficiently.

“Gateways that understand enterprise-wide needs have enormous value,” Baird said. “You know which users in which locations are querying which data and driving outcomes.”

Paul Gillin is a contributing writer. He is the former editor-in-chief of Computerworld and founding editor of TechTarget. He’s the author of five books about social media and online communities. Find him on Twitter @pgillin.

© 2020 Nutanix, Inc. All rights reserved.