OpenAI reportedly transcribed over one million hours of YouTube videos to collect training data for its advanced GPT-4 model, disregarding the Google-owned platform’s copyright rules. According to a report by The New York Times, Microsoft-backed OpenAI used its in-house speech recognition tool, Whisper, to transcribe audio from YouTube videos into conversational text, which was then used to train the AI model that powers ChatGPT.
According to the report, the makers of ChatGPT internally discussed how using YouTube data for training might violate the platform’s policies. The company reportedly opted to use data from YouTube videos because it had exhausted the reservoir of publicly available data. The report stated that OpenAI’s president, Greg Brockman, personally assisted in selecting videos for transcription.
Google prohibits the use of videos posted on YouTube for applications that are “independent” of the video platform.
In a statement to The Verge, OpenAI spokesperson Lindsay Held said that the company uses “unique” datasets for each of its models to “help their understanding of the world”. She added that the company uses “numerous sources including publicly available data and partnerships for non-public data.”
Commenting on the topic, Google spokesperson Matt Bryant told The Verge that Google has “seen unconfirmed reports” related to OpenAI using YouTube videos for training AI models. He added that the streaming platform’s “Terms of Service and robots.txt files prohibit unauthorised scraping or downloading of YouTube content.”
Earlier this week, YouTube CEO Neal Mohan said in an interview with Bloomberg that he had seen reports related to OpenAI using YouTube videos to train its text-to-video generator, Sora. He said he had no information on whether OpenAI had done so, but that it would be a “clear violation” of the platform’s policies if it had.
According to the report by The New York Times, Google has also used transcribed text from YouTube videos to train its AI model Gemini. If true, this would violate the copyright in the videos, which belongs to the creators who post them to the platform. The report stated that Google broadened its terms of service to allow the company to use publicly available Google Docs files, restaurant reviews on Google Maps, and more for training AI models.