When choosing the best artificial intelligence capabilities to meet their needs, IT decision makers often turn to a neutral referee: the nonprofit consortium MLCommons. Its MLPerf benchmarks have become the gold standard for evaluating the speed and efficiency of AI systems. Rarely does a CIO make a substantial investment in AI capabilities without first consulting an MLPerf scorecard to compare the performance of machine learning (ML) hardware, software, and cloud services.
The peer-reviewed, reproducible MLPerf results help organizations cut through vendor marketing to identify the IT infrastructure that best suits their specific AI workload needs. In April, the scorekeeper overhauled its original framework to encompass all manner of IT systems, whether purchased, rented, or accessed through the cloud. Going forward, MLPerf will measure performance via API endpoints, the standard way generative AI is delivered.
“If it has an API, MLPerf can measure it,” said David Kanter in an April presentation at NVIDIA's GTC conference in San Jose, Calif.
Kanter, cofounder and head of MLPerf at MLCommons, unveiled the overhauled benchmarking suite to address a problem that barely existed when MLPerf was created: how to fairly measure the performance of the Gen AI services that most customers will never own outright.
In an interview with The Forecast, Kanter explained how the new framework was developed in nine months with more than 35 industry partners, including Advanced Micro Devices, Google, Intel, NVIDIA, and Nutanix.
“It’s a shift towards an API-centric architecture,” Kanter said. “Gen AI performance is not just a hardware thing. It’s about software. It’s about capabilities.”
“We wanted to build a benchmark that can talk to hardware, infrastructure as a service, and endpoints because at the end of the day, customers are choosing between a combination of all three.”
Results will be published on a rolling basis rather than every six months. Results once confined to spreadsheets now appear in interactive visualizations. And for the first time, the benchmark will measure managed endpoints in the cloud alongside on-premises hardware.
“It’s the most substantial overhaul of the benchmark to date, and the first built for an era in which AI is rented as often as it is owned,” he told The Forecast.
“It’s specifically designed to improve the velocity so we can get quicker measurements of performance and to be a lot more understandable for the much broader community. We measure performance in a way that is really tightly aligned with what customers experience.”
In his April presentation, Kanter explained that AI is typically consumed via an API, so it’s important to measure everything from managed endpoints to owned hardware.
“This is a much wider space that we need to operate across than ever before, and we need to do so at a much more rapid pace,” he told the GTC conference audience.
The changes land at a moment when enterprise buyers are being asked to make large, fast decisions about infrastructure they only partly understand. Demand for AI is coming from every corner of the business, from customer-facing products to internal workflows, and the bill is landing in IT.
IDC predicts that by 2027, agent use among Global 2000 companies will jump tenfold and token loads a thousandfold – a level of growth the firm says will make agent orchestration and optimization essential IT responsibilities. Meanwhile, Gartner projects that spending on inference-focused applications, in which trained AI models generate responses to user queries, will more than double to $20.6 billion this year. That would account for 55% of all spending on AI-optimized cloud infrastructure.
Those buyers aren’t from Google or Meta, which have their own evaluation teams. Most are CIOs at banks, hospitals, and factories that need defensible numbers to justify purchase decisions, explained Kanter. He said benchmarks have become the price of admission for vendors selling into those environments.
He recalled a recent conversation with an executive at a large American bank.
"Not only do I know about MLPerf," Kanter said the executive told him, "but no vendor steps in my door without having MLPerf numbers."
The redesign responds to a reality that the older MLPerf framework was never built to capture. Gen AI performance is what Kanter described as “fiendishly complex,” a multidimensional problem involving time-to-first-token, total throughput, variable input and output lengths, and unpredictable user behavior. A developer asking for a single line of code and a user requesting a thousand-line document can place radically different demands on a system simultaneously.
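How those dimensions get measured is easiest to see in code. The Python sketch below is purely illustrative, not the MLPerf harness: assuming a hypothetical OpenAI-compatible streaming endpoint (the URL, model name, and chunk-counting shortcut are all placeholders), it records two of the metrics Kanter names, time-to-first-token and overall throughput, for a single request.

```python
import json
import time

import requests

# Hypothetical OpenAI-compatible endpoint; the URL, model name, and
# API details are placeholders, not part of the actual MLPerf tooling.
ENDPOINT = "https://example.com/v1/chat/completions"

def measure_request(prompt: str) -> dict:
    """Time one streaming request, capturing time-to-first-token (TTFT)
    and a rough tokens-per-second figure for the whole response."""
    start = time.perf_counter()
    first_token_at = None
    tokens = 0

    resp = requests.post(
        ENDPOINT,
        json={
            "model": "example-model",
            "stream": True,
            "messages": [{"role": "user", "content": prompt}],
        },
        stream=True,
        timeout=120,
    )
    for line in resp.iter_lines():
        # Server-sent events arrive as lines of the form "data: {...}".
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"].get("content")
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            tokens += 1  # counting stream chunks as a rough token proxy

    elapsed = time.perf_counter() - start
    return {
        "ttft_s": (first_token_at - start) if first_token_at else None,
        "tokens_per_s": tokens / elapsed if elapsed else 0.0,
    }
```

Even this toy version shows why the problem is multidimensional: a long prompt inflates TTFT while a long response inflates total runtime, and neither number alone describes the system.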
"In the real world, you're always going to have outliers, long tail queries and variable arrival patterns," Kanter said. "Yes, most of your queries might be short, but you know there's always that one person who's going to drop in the dictionary and ask, 'Gee, which word is the most common?'"
To capture that behavior, MLCommons is adopting what it calls Pareto curves, a set of measurements taken at different utilization levels that show how a system behaves under a light, versus heavy, load. Submitters must provide at least seven measured points per curve, and the published results will display only the measured data rather than smooth interpolations.
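Gathering such a curve might look like the sketch below, which reuses the measure_request helper from the earlier example. The module name, concurrency levels, and prompt are all assumptions, with concurrency standing in for utilization, and the seven load levels mirror the seven-point minimum.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor

# Reuses measure_request() from the previous sketch; "measure" is a
# hypothetical module name, and this is illustration, not MLCommons tooling.
from measure import measure_request

PROMPT = "Summarize the benefits of open, peer-reviewed benchmarks."
LOAD_LEVELS = [1, 2, 4, 8, 16, 32, 64]  # seven measured points, the minimum

def point_at(concurrency: int) -> tuple[float, float]:
    """Fire `concurrency` simultaneous requests and return one measured
    (aggregate throughput, median TTFT) point for the curve."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: measure_request(PROMPT),
                                range(concurrency)))
    throughput = sum(r["tokens_per_s"] for r in results)
    ttfts = [r["ttft_s"] for r in results if r["ttft_s"] is not None]
    return throughput, statistics.median(ttfts)

# Report only the measured points, with no smoothed interpolation between
# them, mirroring the rule the new benchmark imposes on submitters.
for level in LOAD_LEVELS:
    tput, ttft = point_at(level)
    print(f"load={level:>3}  throughput={tput:8.1f} tok/s  ttft_p50={ttft:.3f}s")
```

Plotting throughput against latency across those points traces out the trade-off curve a buyer would actually experience as load rises.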
Publishing only measured points is a design choice intended to keep performance from being misrepresented by 20% to 30%, Kanter explained.
“The rolling submission cadence is a more consequential shift,” he said.
The old MLPerf schedule, with results released twice a year, was calibrated to the pace of hardware innovation. Software now moves faster.
Steve McDowell, chief analyst at NAND Research, highlighted the magnitude of that gap in his analysis of MLPerf Inference v6.0, the last round to use the old format. NVIDIA's six-month-old GB300 systems posted a 2.77-times gain in per-GPU throughput on a reasoning benchmark compared with their debut scores, driven almost entirely by software optimization.
The value of a GPU platform, McDowell wrote in a recent blog, "is not fixed at deployment" but can be enhanced through ongoing software tuning.
A rolling benchmark, Kanter explained, would let customers see those gains within weeks rather than waiting for the next scheduled round.
The architectural shift also acknowledges how enterprises actually deploy AI. Kanter said conversations with IT leaders have convinced him that the dominant pattern is not cloud or on-premises but a blend of both, often alongside managed endpoints from specialized vendors. Regulatory requirements and operational needs, he explained, frequently push organizations toward a hybrid strategy that balances on-premises infrastructure, IaaS, and managed endpoints, giving them compliance-driven control while reducing technical complexity.
"Oftentimes, it’s ‘How do I combine on-prem managed infrastructure and infrastructure as a service to deliver what I ultimately need at the right price point?’"
“These are all the decisions that people are going to be making as they go forward, so we want to help guide them with trusted information on the performance and the efficiency that they can expect. It’s all done by the industry collaboratively in a fair, open, transparent, and robust process.”
Kanter acknowledged that many questions remain. Total cost of ownership, which depends on variables such as where a data center is located and how much local power costs, has proved difficult to standardize, and an audience member pressed him on it during the question period.
He said MLCommons is seeking industry input on how to represent it rigorously.
Official results for the new MLPerf Inference: Endpoints benchmark will appear at endpoints.mlcommons.org by mid-year, Kanter said.
The redesigned benchmark, he added, is meant to put the tools of serious AI evaluation in reach of the buyers who now need them most.
“Ultimately, it's about making the right decision for deploying AI,” Kanter said. “To do that, you need to be able to trust the decision-making inputs. Benchmarks are a key part of that.”
David Rand is a business and technology reporter whose work has appeared in major publications around the world. He specializes in spotting and digging into what’s coming next, and helping executives in organizations of all sizes know what to do about it.