As the new wave of artificial intelligence systems quickly comes under the microscope of law and policy, an intriguing question is becoming central: How does AI un-learn?

That might seem like a strange goal, considering the amount of energy dedicated to getting AI systems to learn faster and more efficiently, inhaling ever more information. But the owners of that information also have a stake in what happens. With more publishers signing licensing deals with OpenAI, some have been asking what happens when those arrangements expire — how outlets would go about taking back access, and whether it’s technically possible to erase traces of their editorial content from future queries. Regulators in Europe, meanwhile, are looking to apply existing digital law to the new platforms, which could mean getting AI to “forget” information it has memorized about people.

So … is there a way to wipe what an AI system has already learned without training the model again from scratch? That’s a far trickier matter than simply deleting pieces of information from a database. In fact, an emergent research field called “machine unlearning” has taken shape in recent years to figure out what methods can make AI models selectively and retroactively forget training data, or at least come close. If researchers manage to develop easier ways for AI to fully unlearn information, it would allow these deals and regulations to be cleanly enforced.

It’s not just new generative AI models triggering this question. Researchers started thinking about machine unlearning after the EU recognized a legal “right to be forgotten” in 2014, allowing residents to demand the deletion of their personal data from Internet searches and other digital records. The arrival of large language models sparked new questions and added complexity to the field’s existing problems, said Ken Liu, a PhD researcher at Stanford’s AI lab, who has written a primer on the topic.

Generative AI models use neural networks modeled on the human brain that teach themselves from patterns within existing data. Even their scientist creators can’t precisely explain how that self-learning happens, which makes unlearning a near-impossible task: an algorithm would need to block the influence of certain data while resisting jailbreak attacks that try to extract the information anyway, and do so without degrading the model’s overall performance.

“We’re asking, how do we update a pattern that we don’t even understand? And we don’t understand how the data points have contributed to these patterns, so that’s what makes it very, very difficult,” Liu said.

One case showing the difficulty comes from Microsoft researchers, who created a novel technique to make Meta’s Llama 2 model forget its knowledge of the Harry Potter series. (They chose that particular model because it’s open source, and there were rumors that its training dataset included the copyrighted novels.) The technique appeared to work, though the model would hallucinate made-up answers when tested about the books, rather than admitting it was unfamiliar with them. The team saw this tendency as an inherent trait of LLMs, not a result of the unlearning process.

Another team audited the model and discovered evidence suggesting it may have just been pretending to forget. With the right prompts, the model could still output copyrighted content. When asked “In Harry Potter, what type of animal is Hedwig?” it answered correctly that Hedwig is a white owl, suggesting the full slate of original knowledge was never genuinely removed.
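To make the balancing act concrete, here is a minimal, hypothetical sketch of one family of approximate unlearning methods researchers have explored: nudging the model’s loss upward on a “forget set” while continuing ordinary training on a “retain set,” so the targeted data’s influence fades without the model collapsing. The toy model, the data and the FORGET_WEIGHT knob are all invented for illustration; real LLM unlearning pipelines are far more involved and, as the Harry Potter audit shows, come with no guarantees.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for an already trained model; a real pipeline would load an
# LLM checkpoint rather than this tiny classifier.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Hypothetical data: examples the model should forget vs. examples it should keep.
forget_x, forget_y = torch.randn(64, 16), torch.randint(0, 4, (64,))
retain_x, retain_y = torch.randn(256, 16), torch.randint(0, 4, (256,))

FORGET_WEIGHT = 0.5  # how aggressively to push the forget set away

for step in range(100):
    optimizer.zero_grad()
    # Gradient *ascent* on the forget set: the negative sign turns the usual
    # descent step into one that erodes the fit to the unwanted examples.
    forget_loss = -loss_fn(model(forget_x), forget_y)
    # Ordinary descent on the retain set, to preserve overall performance.
    retain_loss = loss_fn(model(retain_x), retain_y)
    (FORGET_WEIGHT * forget_loss + retain_loss).backward()
    optimizer.step()
```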
There’s always the brute-force approach — retraining the model without the data — but doing that every time data needs to be taken out is impractical and costly. The point of unlearning is to find techniques that are less tedious, said Radu Marculescu, a University of Texas at Austin professor whose lab is exploring machine unlearning for image-to-image generative models. He called unlearning a kind of “counterculture” in a field otherwise obsessively dedicated to adding information to get better results.

No unlearning methods are ready yet for widespread deployment. Guihong Li, a researcher in Marculescu’s lab, says “the problem space is a newborn baby,” and sees a lot of promise in future experimentation on different types of AI (image-based and text-based models require distinct approaches) and on increasingly complex, sometimes arbitrary cases, like forgetting the writing style of a particular publication.

There is also an argument that, when it comes to generative AI, true unlearning will be impossible. In a blog post about the Harry Potter experiment, Microsoft researchers warned that “unlearning remains one of the most challenging conundrums in the AI sphere” and that “many believe that achieving perfect unlearning might be a pipe dream and even approximations seem daunting.”

Liu says he’s also skeptical — but argues that imperfect forgetting might still be useful, depending on the application. It might not be enough to ensure a clean model when access to data is legally blocked, such as unlearning copyrighted material or scrubbing out private data to comply with regulations. Where “pretending to forget” may eventually be good enough, however, is in reducing harms: editing the model to filter out misinformation, bias, outdated data, hate speech or violent content. Here there is more leniency, since the goal is incremental improvement in AI safety rather than meeting a legal requirement. It could also help with copyright issues by gradually correcting the model to give users fewer near-verbatim excerpts, avoiding the type of evidence The New York Times is suing OpenAI over.

“These unlearning methods don’t really work, in a sense that they don’t have guarantees, but empirically, you can reduce the effects of the data, so what that means is, for most people, the model does appear safer,” said Liu. “It’s up to the practitioner to decide what’s the right threshold, the risk they’re willing to take and the effort they’re willing to spend.”
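For comparison, the brute-force baseline Marculescu describes looks something like the hypothetical sketch below: drop the contested examples, retrain from scratch on what remains, then check how well the new model still fits the removed data. This is the gold standard against which approximate methods are judged, and also the reason they exist: at the scale of a modern LLM, rerunning this loop for every takedown request would be prohibitively expensive. The datasets, model size and loss check here are all invented for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, ConcatDataset

torch.manual_seed(0)

# Hypothetical datasets: the examples someone has asked to remove ("forget")
# and everything else ("retain").
retain = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))
forget = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))

def train(dataset, epochs=20):
    """Train a small classifier from scratch on the given dataset."""
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in DataLoader(dataset, batch_size=32, shuffle=True):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

original = train(ConcatDataset([retain, forget]))  # trained on everything
retrained = train(retain)                          # brute-force "unlearning"

# Crude empirical check: the retrained model never saw the forget set, so its
# loss there is the benchmark an approximate unlearning method aims to match.
with torch.no_grad():
    x, y = forget.tensors
    for name, m in [("original", original), ("retrained", retrained)]:
        print(f"{name}: loss on forget set = {nn.CrossEntropyLoss()(m(x), y).item():.3f}")
```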