Technology

The Shift From Building Smarter AI Models to Running Them

Inference, the process of actually using AI, is becoming the operational core of enterprise strategy. Industry experts explain why this is changing everything.

April 1, 2026

Attention and resources are moving beyond building large language models (LLMs) to actually putting that intelligence to work in the real world. That process, known as AI inference, is how trained models become useful products, and it has quietly become the thing that major tech players are chasing. 

Morgan Stanley calls inference a "new and potentially much larger phase" than anything the AI industry has seen before, one that could dramatically alter the competitive landscape.

While vendors are building the chips, servers and other components powering inference, the real advantage may lie with organizations that can bring the entire AI stack together and run it reliably and efficiently at scale.

From Lab to Load

Understanding why inference has become all about infrastructure requires some context. Since the deep learning boom of the early 2010s, and especially since ChatGPT thrust generative AI into the mainstream in late 2022, the story of AI has mostly focused on pouring oceans of data through increasingly large neural networks to train increasingly capable models. Inference was, relatively speaking, an afterthought.

But as AI systems move from experimentation to production, the work of running models is beginning to overshadow the work of training them. Indeed, Deloitte projects inference will account for roughly two-thirds of AI compute workloads by 2026, up from about one-third in 2023. The Futurum Group, meanwhile, told The Forecast that spending on inference-focused servers will eclipse spending on training-focused servers for the first time this year.

Not surprisingly, investment activity is surging. AI inference startup Baseten recently raised $300 million in funding, including $150 million from NVIDIA, a major player in inference technology. Intel signed a multiyear AI inference agreement with SambaNova, an AI chip startup. And Nutanix forged a $250 million strategic partnership with chipmaker AMD to deliver an open AI infrastructure platform for enterprise inference and agentic workloads.

“Inference is where the rubber really hits the road,” said Brendan Burke, semiconductor research director at The Futurum Group. “We’ve created all these tools. Now we need to better understand how to deploy them in companies, in physical technology, and things like that.”

Why Running Inference Is Harder Than It Looks

Still, turning that surge of investment into working production systems won’t be easy. Many AI models perform well in controlled benchmarks, but real-world inference introduces variables that those tests rarely capture.

“The biggest challenges are really not just latency, because everyone knows latency is a thing,” David Kanter, founder of MLCommons, the organization behind MLPerf, the industry’s most widely cited AI benchmarking standard, told The Forecast. “But it’s a combination of latency, interactivity, and throughput when you’re serving under load.”

That distinction helps explain why rigorous benchmarking has become increasingly important for enterprise buyers. MLPerf, for example, has become a reference point for evaluating AI systems before deployment. In one case, the National Renewable Energy Laboratory used it both as a selection criterion and a contractual acceptance standard for a major supercomputer procurement, Kanter said.

Even then, inference workloads are notoriously difficult to benchmark because their demand patterns are highly unpredictable. A developer requesting a single-line code completion and a user generating a thousand-line document may place vastly different computational demands on the same system at the same time.

Storage adds another layer of complexity. Generative AI systems rely heavily on a key-value (KV) cache, which stores intermediate computations so models can reuse prior context during token generation (the process by which models produce output one fragment of text at a time), and that cache can quickly expand into massive memory and storage demands.
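The scale of those demands is easy to underestimate. As a rough sketch (with hypothetical model dimensions, not figures from the article), the KV cache grows linearly with both model depth and context length:

```python
# Back-of-the-envelope KV cache sizing. Per generated or cached token, a
# transformer stores one key and one value vector in every layer:
#   bytes_per_token = 2 (K and V) * n_layers * n_kv_heads * head_dim * dtype_bytes

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Memory the KV cache needs for a single sequence, in bytes (fp16 default)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# A hypothetical 70B-class model: 80 layers, 8 KV heads, head_dim 128,
# serving a 32,000-token context in fp16.
per_seq = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=32_000)
print(f"{per_seq / 2**30:.1f} GiB per 32k-token sequence")  # ~9.8 GiB
```

Multiply that per-sequence figure by hundreds of concurrent users and the cache alone can dwarf the memory needed for the model weights, which is what pushes the problem into storage territory.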

“Two years ago, I was talking to the folks who do MLPerf Storage (benchmarks), and I said, ‘focus on training. I don’t think there’s an inference play’,” Kanter said. “And boy was I wrong.”

The Cost Equation

Those operational challenges quickly translate into economics. Inference isn’t a one-time event like training; it runs continuously, every time a user submits a prompt. Kanter explained that while prompts and tokens are generally associated with generative AI, inference is broader.

“A recommendation or image classification is not done via prompts or tokens,” he said. Instead, those workloads run on pixel arrays and feature maps. While LLMs “read” by turning text into specific numeric tokens, image classification “sees” by first turning an image into a grid of RGB pixel values, which the model then processes into feature maps.
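The contrast Kanter describes can be sketched with toy values (both the token IDs and the pixels below are made up for illustration):

```python
# Two very different input shapes feeding the same broad category of work,
# inference. An LLM consumes a 1-D sequence of integer token IDs; an image
# classifier consumes a 3-D grid of RGB pixel values.

# Text -> token IDs (output of a hypothetical tokenizer)
prompt_tokens = [1523, 318, 262, 3280]          # shape: (sequence_length,)

# Image -> pixel array: height x width x 3 color channels, values 0-255
tiny_image = [[[128, 64, 255] for _ in range(4)] for _ in range(4)]  # 4x4 RGB grid

print(len(prompt_tokens))                                           # 4 tokens
print(len(tiny_image), len(tiny_image[0]), len(tiny_image[0][0]))   # 4 4 3
```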

Performance and cost, therefore, move together at inference scale. Throughput, measured in tokens per second, determines the unit economics of running any generative AI application. When throughput improves, the cost per request falls; as costs fall, more use cases become economically viable.

NVIDIA’s own analysis of four leading inference providers (Baseten, DeepInfra, Fireworks AI, and Together AI) found 4x to 10x reductions in cost per token using its Blackwell GPU platform with open-source models.

“Performance is what drives down the cost of inference,” Dion Harris, senior director of HPC and AI hyperscaler solutions at NVIDIA, told VentureBeat. “What we’re seeing in inference is that throughput literally translates into real dollar value.”

Model selection adds another dimension to the cost equation.

Frank Nagle, a research scientist at the MIT Initiative on the Digital Economy and chief economist at the Linux Foundation, told The Forecast that enterprises routinely overpay because they default to large proprietary models when less costly open-source alternatives would serve them equally well.

His research found that open-source models from Meta, DeepSeek, Mistral, and others cost about 87% less than proprietary closed models from OpenAI, Anthropic, and Google, which currently account for nearly 80% of all AI tokens processed on OpenRouter, the leading AI inference platform. Open models already achieve roughly 90% of the performance of closed ones at launch, and that gap narrows quickly.

Nagle's research suggests optimal reallocation from closed to open models could cut global AI inference spending by more than 70%, saving the global AI economy approximately $25 billion annually. But he acknowledged that most organizations stick with what they know, which right now happens to be proprietary models.

“There's lots of money being floated around for this stuff, and so people don't have to be super cost-conscious,” Nagle said. “I think that's something that will evolve over time.”

A Full-Stack Race

Those economic pressures are now shaping the evolution of the entire inference ecosystem. What looks like a single technology shift is actually a competition unfolding across the entire AI stack.

The inference landscape is not a single market. It is a vertical stack, with vendors staking out positions at every layer, from silicon and systems to storage, orchestration, and cloud platforms.

“AI in general is a full system, full-stack problem,” Kanter said, explaining that advances depend not just on better models but on improvements across the entire computing stack. That includes the silicon itself, the semiconductor processes used to manufacture chips, and the memory and storage systems that feed data to them. It also includes packaging technologies that connect those components together so they operate more efficiently as a system. 

Crucially, it also encompasses many layers of software that optimize, orchestrate and manage the data on one side and the hardware on the other. That has real implications for how enterprises buy. Point solutions that optimize one layer while ignoring the others tend to create new bottlenecks rather than solving them. The organizations pulling ahead are those treating inference as an end-to-end infrastructure challenge, not a series of isolated procurement decisions.

The Enterprise Decision: Where Does It Run?

Once organizations move from experimentation to production, a practical question quickly emerges: where should inference actually run? That decision shapes how CIOs balance performance, cost, data control, and operational complexity.

The options span on-premises hardware, cloud or infrastructure-as-a-service (IaaS) environments, and managed inference endpoints provided by third-party vendors.

In practice, most enterprises are landing on hybrid strategies, sizing on-premises infrastructure for predictable demand while using cloud capacity to absorb spikes.

As inference workloads grow, many organizations are discovering that deployment decisions quickly become infrastructure decisions. Running AI reliably at scale requires platforms that can manage models, GPUs, and data consistently across environments.

“We are in the early days of AI inferencing in the enterprise. What you will see is people will (eventually) turn to infrastructure teams to stand up the AI infrastructure they need so they don’t have to worry about it,” predicted Nutanix President and CEO Rajiv Ramaswami during last year’s RAISE Summit.

Those infrastructure teams must then decide where inference should run, navigating a mix of technical, regulatory, and operational constraints. In healthcare and financial services, data residency rules can rule out the cloud entirely. Other workloads push inference closer to the edge. Automakers, for example, cannot route vehicle safety decisions through distant cloud data centers.

Even when organizations settle on where inference should run, the operational details can still create surprises. Two issues surface repeatedly.

The first is GPU cold-start latency, where spinning up a new inference endpoint can take so long that it frustrates users during demand spikes, as large models must first be loaded into memory before they can begin generating responses.
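A crude lower bound on that cold-start delay is just weight-loading time, model size divided by storage bandwidth (the figures below are assumptions for illustration):

```python
# Rough cold-start model: before an endpoint can serve its first token,
# every byte of model weights must travel from storage into GPU memory.
# Real spin-up adds container start, runtime init, and compilation on top.

def cold_start_seconds(model_bytes, bandwidth_bytes_per_s):
    """Lower bound on endpoint spin-up: weight-loading time alone."""
    return model_bytes / bandwidth_bytes_per_s

# 140 GB of weights over a 2 GB/s storage link: 70 seconds of dead air,
# which is why pre-warmed or pooled endpoints matter during demand spikes.
print(cold_start_seconds(140e9, 2e9), "seconds")  # 70.0 seconds
```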

The second is GPU utilization. In many environments, performance depends on how efficiently existing GPUs are shared across workloads. Partitioning and scheduling techniques often unlock additional usable compute by allowing multiple workloads to run on the same hardware simultaneously.

The Inference Economy Ahead

None of those challenges are insurmountable. Vendors across the stack are racing to address cost, latency, and operational complexity at the platform level, and the infrastructure is maturing quickly. But technology alone won't be enough. The value organizations ultimately extract from inference depends as much on how they restructure their workflows as on what they buy.

Nagle draws an analogy to the Industrial Revolution, noting that the steam engine delivered only incremental gains when it replaced water wheels. The transformative leap came when factories were entirely redesigned around the new capability.

"It won't be until the companies are fully reorienting their people, their processes, and their workflows that we will start to see the huge leaps in productivity that we all hope will emerge from a technology of this importance," he said.

Kanter had an even simpler way of thinking about it, comparing inference to salt, which is rarely the star of a recipe yet often the secret ingredient that chefs use to make everything taste better.

"We're going to see inference sprinkled all over the place," he said. "It'll be all around us."

David Rand is a business and technology reporter whose work has appeared in major publications around the world. He specializes in spotting and digging into what’s coming next, and helping executives in organizations of all sizes know what to do about it.

© 2026 Nutanix, Inc. All rights reserved. For additional information and important legal disclaimers, please go here.
