When choosing the best artificial intelligence capabilities to meet their needs, IT decision makers often turn to a neutral referee: the nonprofit consortium MLCommons. Its MLPerf benchmarks have become the gold standard for evaluating the speed and efficiency of AI systems. Rarely does a CIO make a substantial investment in AI capabilities without first consulting an MLPerf scorecard to compare the performance of machine learning (ML) hardware, software, and cloud services.
The peer-reviewed, reproducible MLPerf results help organizations cut through vendor marketing to identify the IT infrastructure that best suits their specific AI workload needs. In April, the scorekeeper overhauled its original framework to encompass all manner of IT systems, whether purchased, rented, or accessed through the cloud. Going forward, MLPerf will measure performance via API endpoints, the standard way generative AI is delivered.
“If it has an API, MLPerf can measure it,” said David Kanter in an April presentation at NVIDIA's GTC conference in San Jose, Calif.
Kanter, cofounder and head of MLPerf at MLCommons, unveiled the overhauled benchmarking suite to address a problem that barely existed when MLPerf was created: how to fairly measure the performance of the Gen AI services that most customers will never own outright.
In an interview with The Forecast, Kanter explained how the new framework was developed in nine months with more than 35 industry partners, including Advanced Micro Devices, Google, Intel, NVIDIA, and Nutanix.
“It’s a shift towards an API-centric architecture,” Kanter said. “Gen AI performance is not just a hardware thing. It’s about software. It’s about capabilities.”
“We wanted to build a benchmark that can talk to hardware, infrastructure as a service, and endpoints because at the end of the day, customers are choosing between a combination of all three.”
Results will be published on a rolling basis rather than every six months. Results once confined to spreadsheets now appear in interactive visualizations. And for the first time, the benchmark will measure managed endpoints in the cloud alongside on-premises hardware.
“It’s the most substantial overhaul of the benchmark to date, and the first built for an era in which AI is rented as often as it is owned,” he told The Forecast.
“It’s specifically designed to improve the velocity so we can get quicker measurements of performance and to be a lot more understandable for the much broader community. We measure performance in a way that is really tightly aligned with what customers experience.”
In his April presentation, Kanter explained that AI is typically consumed via an API, so it’s important to measure everything from managed endpoints to owned hardware.
“This is a much wider space that we need to operate across than ever before, and we need to do so at a much more rapid pace,” he told the GTC conference audience.
The changes land at a moment when enterprise buyers are being asked to make large, fast decisions about infrastructure they only partly understand. Demand for AI is coming from every corner of the business, from customer-facing products to internal workflows, and the bill is landing in IT.
IDC predicts that by 2027, agent use among Global 2000 companies will jump tenfold and token loads a thousandfold – a level of growth the firm says will make agent orchestration and optimization essential IT responsibilities. Meanwhile, Gartner projects that spending on inference-focused applications, in which trained AI models generate responses to user queries, will more than double to $20.6 billion this year. That would account for 55% of all spending on AI-optimized cloud infrastructure.
Those buyers aren’t from Google or Meta, which have their own evaluation teams. Most are CIOs at banks, hospitals, and factories that need defensible numbers to justify purchase decisions, explained Kanter. He said benchmarks have become the price of admission for vendors selling into those environments.
He recalled a recent conversation with an executive at a large American bank.
"Not only do I know about MLPerf," Kanter said the executive told him, "but no vendor steps in my door without having MLPerf numbers."
The redesign responds to a reality that the older MLPerf framework was never built to capture. Gen AI performance is what Kanter described as “fiendishly complex,” a multidimensional problem involving time-to-first-token, total throughput, variable input and output lengths, and unpredictable user behavior. A developer asking for a single line of code and a user requesting a thousand-line document can place radically different demands on a system simultaneously.
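How those dimensions get measured is easiest to see in code. The Python sketch below is purely illustrative, not the MLPerf harness: assuming a hypothetical OpenAI-compatible streaming endpoint (the URL, model name, and chunk-counting shortcut are all placeholders), it records two of the metrics Kanter names, time-to-first-token and overall throughput, for a single request.

```python
import json
import time

import requests

# Hypothetical OpenAI-compatible endpoint; the URL, model name, and
# API details are placeholders, not part of the actual MLPerf tooling.
ENDPOINT = "https://example.com/v1/chat/completions"

def measure_request(prompt: str) -> dict:
    """Time one streaming request, capturing time-to-first-token (TTFT)
    and a rough tokens-per-second figure for the whole response."""
    start = time.perf_counter()
    first_token_at = None
    tokens = 0

    resp = requests.post(
        ENDPOINT,
        json={
            "model": "example-model",
            "stream": True,
            "messages": [{"role": "user", "content": prompt}],
        },
        stream=True,
        timeout=120,
    )
    for line in resp.iter_lines():
        # Server-sent events arrive as lines of the form "data: {...}".
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"].get("content")
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            tokens += 1  # counting stream chunks as a rough token proxy

    elapsed = time.perf_counter() - start
    return {
        "ttft_s": (first_token_at - start) if first_token_at else None,
        "tokens_per_s": tokens / elapsed if elapsed else 0.0,
    }
```

Even this toy version shows why the problem is multidimensional: a long prompt inflates TTFT while a long response inflates total runtime, and neither number alone describes the system.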
"In the real world, you're always going to have outliers, long tail queries and variable arrival patterns," Kanter said. "Yes, most of your queries might be short, but you know there's always that one person who's going to drop in the dictionary and ask, 'Gee, which word is the most common?'"
To capture that behavior, MLCommons is adopting what it calls Pareto curves, a set of measurements taken at different utilization levels that show how a system behaves under a light, versus heavy, load. Submitters must provide at least seven measured points per curve, and the published results will display only the measured data rather than smooth interpolations.
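Gathering such a curve might look like the sketch below, which reuses the measure_request helper from the earlier example. The module name, concurrency levels, and prompt are all assumptions, with concurrency standing in for utilization, and the seven load levels mirror the seven-point minimum.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor

# Reuses measure_request() from the previous sketch; "measure" is a
# hypothetical module name, and this is illustration, not MLCommons tooling.
from measure import measure_request

PROMPT = "Summarize the benefits of open, peer-reviewed benchmarks."
LOAD_LEVELS = [1, 2, 4, 8, 16, 32, 64]  # seven measured points, the minimum

def point_at(concurrency: int) -> tuple[float, float]:
    """Fire `concurrency` simultaneous requests and return one measured
    (aggregate throughput, median TTFT) point for the curve."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: measure_request(PROMPT),
                                range(concurrency)))
    throughput = sum(r["tokens_per_s"] for r in results)
    ttfts = [r["ttft_s"] for r in results if r["ttft_s"] is not None]
    return throughput, statistics.median(ttfts)

# Report only the measured points, with no smoothed interpolation between
# them, mirroring the rule the new benchmark imposes on submitters.
for level in LOAD_LEVELS:
    tput, ttft = point_at(level)
    print(f"load={level:>3}  throughput={tput:8.1f} tok/s  ttft_p50={ttft:.3f}s")
```

Plotting throughput against latency across those points traces out the trade-off curve a buyer would actually experience as load rises.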
Publishing only measured points is a design choice intended to keep performance from being misrepresented by 20% to 30%, Kanter explained.
“The rolling submission cadence is a more consequential shift,” he said.
The old MLPerf schedule, with results released twice a year, was calibrated to the pace of hardware innovation. Software now moves faster.
Steve McDowell, chief analyst at NAND Research, highlighted the magnitude of that gap in his analysis of MLPerf Inference v6.0, the last round to use the old format. NVIDIA's six-month-old GB300 systems posted a 2.77-times gain in per-GPU throughput on a reasoning benchmark compared with their debut scores, driven almost entirely by software optimization.
The value of a GPU platform, McDowell wrote in a recent blog, "is not fixed at deployment" but can be enhanced through ongoing software tuning.
A rolling benchmark, Kanter explained, would let customers see those gains within weeks rather than waiting for the next scheduled round.
The architectural shift also acknowledges how enterprises actually deploy AI. Kanter said conversations with IT leaders have convinced him that the dominant pattern is not cloud or on-premises but a blend of both, often alongside managed endpoints from specialized vendors. Regulatory requirements and operational needs, he explained, frequently push organizations toward a hybrid strategy that balances on-premises infrastructure, IaaS, and managed endpoints, giving them compliance-driven control while reducing technical complexity.
"Oftentimes, it’s ‘How do I combine on-prem managed infrastructure and infrastructure as a service to deliver what I ultimately need at the right price point?’"
“These are all the decisions that people are going to be making as they go forward, so we want to help guide them with trusted information on the performance and the efficiency that they can expect. It’s all done by the industry collaboratively in a fair, open, transparent, and robust process.”
Kanter acknowledged that many questions remain. Total cost of ownership, which depends on variables such as where a data center is located and how much local power costs, has proved difficult to standardize, and an audience member pressed him on it during the question period.
He said MLCommons is seeking industry input on how to represent it rigorously.
Official results for the new MLPerf Inference: Endpoints benchmark will appear at endpoints.mlcommons.org by mid-year, Kanter said.
The redesigned benchmark, he added, is meant to put the tools of serious AI evaluation in reach of the buyers who now need them most.
“Ultimately, it's about making the right decision for deploying AI,” Kanter said. “To do that, you need to be able to trust the decision-making inputs. Benchmarks are a key part of that.”
David Rand is a business and technology reporter whose work has appeared in major publications around the world. He specializes in spotting and digging into what’s coming next, and helping executives in organizations of all sizes know what to do about it.