Tech

Apple, Nvidia, and other tech companies trained AI with thousands of YouTube videos

Published

July 16, 2024

Admin

With the generative artificial intelligence boom underway, tech companies are looking for training data to improve their models — and some are taking without permission.

Apple, Nvidia, and Anthropic are among the tech companies found to have trained AI models with subtitles from tens of thousands of YouTube videos despite the platform’s rules against downloading and using its content without permission, according to an investigation by Proof News that was co-published with Wired.

The investigation found that the companies were using a dataset called YouTube Subtitles that included transcripts of 173,536 YouTube videos from over 48,000 channels. Videos in the dataset range from educational channels such as Khan Academy and MIT to news outlets including The Wall Street Journal to some of the platform’s top creators, like MrBeast and Marques Brownlee.

“Apple has sourced data for their AI from several companies,” Brownlee wrote in a post on X addressing the investigation. “One of them scraped tons of data/transcripts from YouTube videos, including mine.”

Brownlee added that while “Apple technically avoids ‘fault’ here because they’re not the ones scraping,” “this is going to be an evolving problem for a long time.”

Proof News also created a tool for creators to search for their content in the dataset, which included a handful of videos from Quartz. The YouTube Subtitles dataset does not include imagery from videos, but does include some translated subtitles in languages such as German and Arabic.

The dataset was created by EleutherAI, “a non-profit AI research lab” that is focused on “promoting open science norms.” According to Proof News, it is part of the Pile, the nonprofit’s compilation of material from other sources, including the European Parliament and English Wikipedia.

“The Pile dataset referred to in the research paper was trained in 2021 for academic and research purposes,” a spokesperson for Salesforce, one of the companies named in the investigation for using the dataset, said in a statement shared with Quartz. “The dataset was publicly available and released under a permissive license.”

Neither Apple, Nvidia, nor Anthropic immediately responded to a request for comment.

In April, YouTube chief executive Neal Mohan told Bloomberg that companies using YouTube videos, including transcripts or video bits, to train AI models such as OpenAI’s text-to-video generator, Sora, would be a “clear violation” of the platform’s policies. However, the New York Times reported days later that OpenAI had transcribed over a million hours of YouTube videos to train its GPT-4 model.
