As governments increasingly push for transparency in AI systems, companies face a new and uncomfortable question: Can they prove where their model data comes from, how it is used, or what that data is helping to create?
Most can’t.
That’s because the traditional data-tracking tools that served them fairly well before ChatGPT burst onto the stage in late 2022 can’t handle the massive volumes of data being consumed every day by today’s insatiable language models.
The result: Without better visibility and the ability to prove their data use is on the up-and-up, organizations could find themselves in hot water with regulators, lawyers, and customers. Reliable, transparent data systems built on sound governance are essential to unlocking AI’s full potential, Samantha Gloede, global head of risk services and global trusted AI leader for KPMG International, told The Forecast.
“Enterprises should begin treating AI model data with the same level of rigor and oversight as they do financial or cybersecurity assets,” Gloede said.
“As AI systems become increasingly autonomous and embedded in decision-making processes, the ability to demonstrate control, traceability, and ethical safeguards is becoming a critical business requirement.”
The truth is that identifying the sources of AI data might be an impossible task, even with the best of tools. Large language models (LLMs) are trained on massive, heterogeneous datasets scraped from public websites, books, articles, and internal business records.
Traditional data catalogs and manual audits can’t keep up with the complexity of modern AI pipelines. Once data is broken down into tokens or embedded inside a model, it’s almost impossible to trace outputs back to the sources.
Add third-party datasets and open-source models with limited transparency and inconsistent documentation, and even well-intentioned companies struggle to prove where their training data originated or whether it meets regulatory standards. Being accountable for all of it is becoming both more complex and more necessary.
Before long, they may not have a choice. Most current regulations, including the EU AI Act and the General Data Protection Regulation (GDPR), focus on data privacy and transparency and apply to companies conducting business in Europe. However, experts say the United States and other countries could eventually follow suit.
In the United States, existing laws such as HIPAA (the Health Insurance Portability and Accountability Act), COPPA (the Children's Online Privacy Protection Act), and various state privacy statutes address AI data only when AI systems process regulated information. Several states, including California, Connecticut, Illinois, Massachusetts, and New York, are moving further with new laws to govern how AI systems collect, use, and disclose data.
For the time being, such rules are expected to be less stringent than Europe’s, where violations can bring multimillion-dollar fines. In fact, the Trump Administration’s “America’s AI Action Plan,” released in July 2025, aims to secure U.S. leadership in AI by rolling back many previous restrictions on its use, thereby encouraging private sector adoption.
As the law firm White & Case noted, “the AI Action Plan aims to place innovation at the core of the US AI policy, in contrast to the more risk-focused approaches adopted by the European Union's AI Act and certain state-level initiatives such as the Colorado AI Act.”
“The EU is leading with mandatory transparency, and the United States is leaning into voluntary guidance,” said Wyatt Mayham, lead AI consultant for Northwest AI Consulting in Portland, Ore.
Nevertheless, that balance may not last. More AI-specific regulations will be needed as the technology consumes even more data and puts user and corporate information at risk, according to Induprakas “Indu” Keri, senior vice president and general manager of hybrid multicloud at Nutanix.
“Much like an elephant, an AI model never forgets the data it’s trained on, and much like Hotel California, once a piece of data enters a model, it never leaves,” Keri told The Forecast.
“Simple regulation of the underlying data is insufficient to protect against the proliferation of models that have been trained on that data.”
Several looming regulations will focus specifically on the data used in AI, said Sarah Cen, an assistant professor specializing in AI accountability, law, and policy at Carnegie Mellon University.
“There are many concerns related to privacy, like whether your AI is trained on sensitive, personal information; intellectual property, such as when AI is trained on creative works; and behavior, including how data is collected and sold by data brokers,” Cen said.
So far, most actions against companies over their use of data have focused on intellectual property claims against AI vendors, such as the $1.5 billion settlement Anthropic reached in a copyright infringement lawsuit filed by a class of authors whose books were allegedly used for AI training. Looking ahead, experts anticipate new lawsuits and PR crises as models trained on biased or incomplete data misjudge offers, misclassify customers, or produce discriminatory results in hiring and lending, inviting regulatory scrutiny and reputational damage.
Cybersecurity presents another AI data concern. Companies are increasingly adopting open-source models and third-party AI services without knowing the origins of those models or the data behind them. Recent incidents, such as data poisoning on Hugging Face and token compromises at Salesforce, show attackers can exploit limited visibility into the AI supply chain.
“When we think about AI models, the attack surface is doubled,” said David Gee, a Sydney-based board risk advisor and former CISO for HSBC.
“There are gaps there. Those gaps mean even well-secured organizations can lose sight of the provenance and integrity of the data fueling their AI systems.”
To avoid such difficulties, organizations need to adopt technologies and processes to achieve what many industry insiders call infrastructure accountability.
“In the context of AI, infrastructure accountability refers to the ability of an organization to stand behind not only the outputs of its AI systems but also the underlying infrastructure that generates those outputs,” said Gloede.
“It involves embedding governance into the system architecture so that transparency, traceability, and control are integral to the design rather than added as an afterthought.”
Keri said achieving AI audit readiness will require tighter integration between data lineage, metadata management, and hybrid-cloud infrastructure.
“Data lineage can be complicated to establish, but if you have systems that explicitly track model consumption of underlying data, you’re already halfway there,” he said. “Metadata for sensitive information, like personally identifiable information or protected health information, can also help organizations prove how regulated data is used in downstream processing.”
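As a rough illustration of the lineage and metadata tracking Keri describes, here is a minimal Python sketch of a ledger that records which model versions consumed which datasets and flags regulated data. The class names, fields, and dataset IDs are hypothetical, not any particular vendor's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetRecord:
    """Metadata about one dataset used in training (fields are illustrative)."""
    dataset_id: str
    source: str                 # e.g., "internal CRM export", "licensed corpus"
    contains_pii: bool = False  # personally identifiable information
    contains_phi: bool = False  # protected health information

@dataclass
class LineageLedger:
    """Append-only record of which model versions consumed which datasets."""
    entries: list = field(default_factory=list)

    def record_training_run(self, model_version: str, datasets: list[DatasetRecord]):
        for ds in datasets:
            self.entries.append({
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "model_version": model_version,
                "dataset_id": ds.dataset_id,
                "source": ds.source,
                "regulated": ds.contains_pii or ds.contains_phi,
            })

    def datasets_behind(self, model_version: str) -> list[str]:
        """Answer the auditor's question: which data fed this model?"""
        return [e["dataset_id"] for e in self.entries if e["model_version"] == model_version]

# Usage: register datasets, log a training run, then trace a model back to its sources.
ledger = LineageLedger()
crm = DatasetRecord("crm-2025-q1", "internal CRM export", contains_pii=True)
docs = DatasetRecord("public-docs-v3", "public product documentation")
ledger.record_training_run("support-bot-1.4", [crm, docs])
print(ledger.datasets_behind("support-bot-1.4"))   # ['crm-2025-q1', 'public-docs-v3']
```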
Achieving all of that means clearly defining ownership across the AI lifecycle and specifying who is responsible for each component, Gloede said. Organizations should also embed logging mechanisms, access controls, and change management protocols into their AI pipelines, she said. In addition, independent validation, both technical and ethical, should be conducted before deployment.
Also, systems should be designed for auditability, incorporating immutable logs and agent identifiers to ensure traceability, Gloede said.
“Ultimately, infrastructure accountability is about building trust into the system by design from the ground up,” she added.
Mayham agreed.
“CISOs need to fingerprint AI systems, track data flows, and surface blind spots before they turn into liabilities,” he said. “You really can’t govern what you don’t know exists.”
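The logging mechanics Gloede and Mayham describe can be sketched in a few lines. The example below is a hypothetical illustration rather than any production system: it chains each audit entry to the hash of the previous one, so after-the-fact edits become detectable, and tags every entry with an agent identifier.

```python
import hashlib
import json
from datetime import datetime, timezone

class AuditLog:
    """Append-only log: each entry embeds the hash of the previous entry,
    so tampering with history breaks the chain and can be detected."""

    def __init__(self):
        self._entries = []
        self._last_hash = "0" * 64  # genesis value

    def append(self, agent_id: str, action: str, details: dict):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "agent_id": agent_id,   # which human, service, or AI agent acted
            "action": action,
            "details": details,
            "prev_hash": self._last_hash,
        }
        entry_hash = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = entry_hash
        self._entries.append(entry)
        self._last_hash = entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any altered past entry shows up as a mismatch."""
        prev = "0" * 64
        for e in self._entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if e["prev_hash"] != prev:
                return False
            if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True

log = AuditLog()
log.append("agent:retriever-07", "dataset_access", {"dataset_id": "crm-2025-q1"})
log.append("user:data-steward", "access_policy_change", {"dataset_id": "crm-2025-q1", "new_policy": "restricted"})
print(log.verify())  # True while the chain is intact
```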
In fairness, AI data visibility and observability tools are new and limited, Cen said. However, evolving platforms could soon help close the visibility and traceability gap. Several enterprise solutions now enable mapping of where data originates, how it flows through AI pipelines, and how models apply it. They combine automated lineage mapping, anomaly detection, and integrated governance features that strengthen oversight and reduce risk. Some also connect directly to cloud and AI infrastructure, giving teams a single view to monitor data flows, track model activity, and enforce access and compliance controls across environments.
Keri noted that simplifying the tracking of data consumption across hybrid and multicloud environments can give enterprises a critical edge in responding quickly to potential regulatory violations.
“Data is useful and therefore used prolifically,” he said. “By making it easier to monitor how data is consumed and shared across clouds, organizations can act faster when breaches or privacy violations occur.”
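One simple way to picture that kind of cross-cloud control is a policy gate that every data read passes through. The policy table, environment names, and function below are illustrative assumptions, not a reference to any specific platform.

```python
# Minimal policy gate: before a pipeline reads a dataset, check the request
# against declared metadata and the environment it runs in (all values illustrative).
POLICIES = {
    "crm-2025-q1": {"regulated": True, "allowed_envs": {"eu-private-cloud"}},
    "public-docs-v3": {"regulated": False, "allowed_envs": {"eu-private-cloud", "us-public-cloud"}},
}

def authorize_read(dataset_id: str, environment: str, purpose: str) -> bool:
    """Return True only if this environment may process the dataset; log either way."""
    policy = POLICIES.get(dataset_id)
    allowed = bool(policy) and environment in policy["allowed_envs"]
    print(f"audit: {purpose} read of {dataset_id} in {environment} -> {'allowed' if allowed else 'denied'}")
    return allowed

authorize_read("crm-2025-q1", "us-public-cloud", "fine-tuning")   # denied
authorize_read("crm-2025-q1", "eu-private-cloud", "fine-tuning")  # allowed
```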
Cen also recommends seeking a qualified consultant to assess the state of AI regulations.
“As the industry and policymakers figure out what regulations and audits will look like, consultants can come in, look things over, and give you some assurance that what you're doing is good. But things can always change quickly,” she said.
David Rand is a business and technology reporter whose work has appeared in major publications around the world. He specializes in spotting and digging into what’s coming next, and helping executives in organizations of all sizes know what to do about it.