A YouTube creator is seeking to bring a class action lawsuit against OpenAI, alleging that the company trained its generative AI models on millions of transcripts from YouTube videos without notifying or compensating the videos’ owners.
In a complaint filed Friday in the U.S. District Court for the Northern District of California, attorneys for David Millette, a YouTube user based in Massachusetts, allege that OpenAI surreptitiously transcribed Millette’s and other creators’ videos to train the models that power the company’s AI-powered chatbot platform, ChatGPT, and other generative AI tools and products. By collecting this data, OpenAI "profited significantly" from the creators’ work, the complaint alleges, while violating copyright law and YouTube’s terms of service that prohibit the use of videos for apps independent of its service.
"As [OpenAI’s] AI products become more sophisticated through the use of training data sets, they become more valuable to prospective and current users, who purchase subscriptions to access [OpenAI’s] AI products," the complaint reads. "Much of the material in OpenAI’s training data sets, however, comes from works that were copied by OpenAI without consent, without credit, and without compensation."
Millette, represented by the law firm Bursor and Fisher, is seeking a jury trial and over $5 million in damages for all YouTube users whose data might’ve been swept up in OpenAI’s training.
Generative AI models like OpenAI’s have no real intelligence. Fed an enormous number of examples (e.g. movies, voice recordings, essays and so on), models "learn" how likely data is to occur based on patterns, including the context of any surrounding data.
Most models are trained on data sourced from public websites and data sets around the web. Companies argue that fair use shields their efforts to scrape data indiscriminately and use it for training commercial models. Many copyright holders disagree, however, and they’re filing suits aimed at halting the practice.
Video transcriptions have become a key training data ingredient as other data wells dry up, so to speak.
More than 35% of the world’s top 1,000 websites now block OpenAI’s web crawler, according to data from Originality.AI. And around 25% of data from "high-quality" […]
YouTuber files class action suit over OpenAI’s scrape of creators’ transcripts