How tech giants cut corners to harvest data for artificial intelligence

Like OpenAI, Google transcribed YouTube videos to harvest text for its AI models, five people with knowledge of the company's practices said

NYT
Last Updated: Apr 07 2024 | 11:33 PM IST
By Cade Metz, Cecilia Kang, Sheera Frenkel, Stuart A Thompson & Nico Grant


In late 2021, OpenAI faced a supply problem. The artificial intelligence lab had exhausted every reservoir of reputable English-language text on the internet as it developed its latest AI system. It needed more data to train the next version of its technology, lots more. So OpenAI researchers created a speech recognition tool called Whisper.

It could transcribe the audio from YouTube videos, yielding new conversational text that would make an AI system smarter. Some OpenAI employees discussed how such a move might go against YouTube’s rules, three people with knowledge of the conversations said.

YouTube, which is owned by Google, prohibits use of its videos for applications that are “independent” of the video platform.

Ultimately, an OpenAI team transcribed more than one million hours of YouTube videos, the people said. The team included Greg Brockman, OpenAI’s president, who personally helped collect the videos, two of the people said.

The texts were then fed into a system called GPT-4, which was widely considered one of the world’s most powerful AI models and was the basis of the latest version of the ChatGPT chatbot.

The race to lead AI has become a desperate hunt for the digital data needed to advance the technology. To obtain that data, tech companies including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law, according to an examination by The New York Times.

At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by The Times. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.

Like OpenAI, Google transcribed YouTube videos to harvest text for its AI models, five people with knowledge of the company’s practices said.

That potentially violated the copyrights to the videos.


Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company’s privacy team and an internal message viewed by The Times, was to allow Google to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its AI products.

The companies’ actions illustrate how online information — news stories, fictional works, message board posts, Wikipedia articles, computer programmes, photos, podcasts and movie clips — has increasingly become the lifeblood of the booming AI industry.

Creating innovative systems depends on having enough data to teach the technologies to instantly produce text, images, sounds and videos that resemble what a human creates. The most prized data, AI researchers said, is high-quality information, such as published books and articles, which have been carefully written and edited by professionals. For years, the internet — with sites like Wikipedia and Reddit — was a seemingly endless source of data. But as AI advanced, tech firms sought more repositories. Google and Meta, which have billions of users who produce search queries and social media posts every day, were limited by privacy laws and their policies from drawing on much of that content for AI.

Tech companies could run through high-quality data on the internet by 2026, according to Epoch, a research institute. The firms are using data faster than it is being produced.

“The only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license it,” Sy Damle, a lawyer who represents Andreessen Horowitz, a Silicon Valley venture capital firm, said of AI models last year.
©2024 The New York Times News

Topics: Artificial intelligence, New York Times, US tech giants, data usage, Big Data

First Published: Apr 07 2024 | 11:33 PM IST
