Did OpenAI Train on Copyrighted Books? A New Study Reveals Alarming Evidence
A new study uncovers signs that OpenAI's models may have memorized copyrighted content like fiction books and news articles.
Matilda
As someone who closely follows AI development and tech ethics, I’ve been watching the ongoing copyright debates around AI models like OpenAI’s GPT-4 with a lot of interest—and concern. Recently, a new academic study dropped, and it’s stirring the pot even more.

Here’s the short version: researchers from the University of Washington, Stanford, and the University of Copenhagen have developed a method to detect whether AI models have memorized pieces of their training data—specifically copyrighted material. When they applied this method to OpenAI’s GPT-4 and GPT-3.5, the results were eye-opening.

They found signs that GPT-4 has memorized portions of copyrighted fiction books and New York Times articles. That’s not a vague claim—the researchers masked unique, uncommon words (what they call “high-surprisal” words) from text samples and asked the models to guess them. If a model could correctly fill in those blanks, it's likely because it saw those exact phrases…
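To make the idea concrete, here is a minimal sketch of the masking step. It is not the study's actual code: it uses a toy word-frequency table as a stand-in for a real language model's probabilities, and the function names are mine. Surprisal is just the negative log-probability of a word, so rare words score high, and those are the ones the researchers hid from the model.

```python
import math
import re

def surprisal(word, freqs, total):
    # Surprisal = -log2 P(word); rarer words are "more surprising".
    count = freqs.get(word.lower(), 1)  # smooth unseen words with count 1
    return -math.log2(count / total)

def mask_highest_surprisal(text, freqs):
    # Mask the single highest-surprisal word in the passage,
    # mirroring the study's idea of hiding rare, distinctive words
    # and asking the model to fill in the blank.
    words = re.findall(r"[A-Za-z']+", text)
    total = sum(freqs.values())
    target = max(words, key=lambda w: surprisal(w, freqs, total))
    return text.replace(target, "[MASK]", 1), target

# Hypothetical frequency counts; a real probe would use LM probabilities
# or large n-gram counts over a reference corpus.
freqs = {"the": 1000, "cat": 50, "sat": 40, "on": 900,
         "a": 950, "peculiar": 2, "mat": 30}

masked, answer = mask_highest_surprisal("The cat sat on a peculiar mat", freqs)
print(masked)  # The cat sat on a [MASK] mat
print(answer)  # peculiar
```

If the model reliably recovers words like “peculiar” from copyrighted passages at a rate well above chance, that is hard to explain unless the exact text appeared in its training data—which is the core of the study's argument.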