Google can train search AI on web content even if publishers opt out

Publishers can only decline having their data used in search AI if they opt out of being indexed for search, Google clarified

Bloomberg
Last Updated: May 04 2025 | 6:57 AM IST
By Davey Alba
 
Google can train its search-specific AI products, like AI Overviews, on content across the web even when the publishers have chosen to opt out of training Google’s AI products, a vice-president of product at the company testified in court on Friday. 
 
That’s because Google’s controls for publishers to opt out of AI training cover only work by Google DeepMind, the company’s AI lab, said Eli Collins, a DeepMind vice president. Other organizations at the company can further train the models for their own products.
 
“Once you take the Gemini” AI model “and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training, correct?” asked Diana Aguilar, a Department of Justice lawyer.
 
“Correct — for use in search,” Collins responded. 
 
Google summarises answers to search queries using its AI at the top of results, which may result in users not clicking on independent websites for answers — a trend that’s hurting their revenue, website publishers have said. Google is using data from those same sites to generate the information powering AI answers.
 
Publishers can only decline having their data used in search AI if they opt out of being indexed for search, Google clarified. “Google has a separate way for publishers to manage their content in Search via the well-established robots.txt web standard,” a Google spokesperson said in a statement. Robots.txt is a plain-text file placed at the root of a website that tells crawlers, including those run by AI companies and web indexers, which parts of the site they may crawl.
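To illustrate how such rules work in practice, here is a minimal sketch using Python's standard-library robots.txt parser. It is not taken from the testimony. The crawler token "Google-Extended" is an assumption drawn from Google's public crawler documentation (the article does not name it), and the example.com URL and the can_crawl helper are hypothetical. The sketch shows a site that opts out of the AI-training token while still allowing the Search crawler, which is the situation the testimony describes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks the AI-training token (assumed to be
# "Google-Extended") but leaves the Search crawler "Googlebot" allowed.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # hypothetical URL
    for agent in ("Google-Extended", "Googlebot"):
        print(agent, can_crawl(ROBOTS_TXT, agent, page))
```

Under these rules the page remains crawlable for Search. According to Google's statement, the only way to keep it out of search AI would be to disallow the Search crawler as well, which also removes the page from the search index.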
 
Google called Collins to the witness stand as part of a three-week trial in federal court in Washington, held to determine how Google should restore competition to online search. Last year, US District Judge Amit Mehta ruled that the tech giant illegally monopolised the search market and is now weighing a set of changes proposed by antitrust enforcers to address its control. 
 
The Justice Department is urging the court to force Google to sell its widely used Chrome browser and to share key data it uses to generate search results. The agency is also asking Judge Mehta to bar Google from paying to be the default search engine on other apps and devices, a restriction that would extend to its AI offerings, including Gemini, which the government argues have benefited from the company’s unlawful dominance in search.
 
Aguilar, the DOJ lawyer, asked Collins whether he knew how much additional data Google’s search organization had access to beyond the content that Google DeepMind had trained its AI models on. When Collins answered that he did not know, Aguilar produced a document dated August 26, 2024, titled “Search GenAI <> Gemini v3.”
 
According to that document, Google removed 80 billion of 160 billion “tokens” — snippets of content — after filtering out the material that publishers had opted out of allowing Google to use for training its AI. The document also listed search “sessions data,” or data collected during a period of time in which a user interacted with Google Search, as well as YouTube videos, as data that could augment Google’s AI models.
 
After viewing the document, Mehta asked Collins for clarification. “The 80 billion out of 160 billion tokens, 50% is removed by publishers opting out?”
 
“That is correct,” Collins responded.
 
Later, Google’s lawyer sought to show that the tech company’s dominance of search did not prevent other AI companies from competing fiercely to provide accurate, real-time results within their chatbot services. If a user asked an AI chatbot for sports scores, for example, Collins testified that the chatbot would likely return the correct answer because the company that made the bot had a commercial arrangement with a sports score provider — it wouldn’t need to rely on a web index. 
 
But Google has explored how its AI models could be much improved by the data it has already gathered through years of operating the world’s most popular search engine, testimony also showed. At another point during the cross-examination of Collins, the DOJ lawyer Aguilar showed the Google VP a briefing document meant for Demis Hassabis, chief executive officer of Google DeepMind.
 
In a comment, Hassabis had mused about training an unidentified Google AI model with a wealth of search data — including search rankings — to see how much more the AI model was improved by the data, compared to one that wasn’t trained with it.
 
“Did Google end up building a model using search data?” Aguilar asked Collins.
 
“Not that I’m aware,” he responded.
 
“But at least Mr. Hassabis has thought it would be interesting to look at?” she pressed.
 
“Yes,” Collins said.

Topics: Google, AI technology, artificial intelligence

First Published: May 04 2025 | 6:56 AM IST
