Passionategeekz According to a recent study released by OpenAI, researchers have found hidden features in artificial intelligence (AI) models that are closely related to the model’s “abnormal behavior” (a unified term recommended).
Researchers at OpenAI discovered patterns that are activated when the model is abnormally behaves when it comes to abnormal behavior. For example, researchers have discovered a feature related to the harmful behavior of AI models, which means that AI models may give inappropriate answers, such as lying to users or making irresponsible suggestions. Surprisingly, by adjusting this feature, researchers can increase or decrease the toxicity of AI models.
This latest study by OpenAI enables it to better understand the factors that cause the behavior of AI models to be insecure, thus helping to develop safer AI models. Dan Mossing, an interpretability researcher at OpenAI, said companies can use these discovery patterns to better detect misalignment behavior in production.
“We hope that the tools we learn—such as reducing complex phenomena into simple math—can also help us understand the generalization ability of models elsewhere,” Mosin said in an interview with TechCrunch.
Although AI researchers know how to improve AI models, it is confusing that they are not entirely clear about how AI models come up with the answer. Chris Olah of Anthropic often points out that AI models are more like “grown” than “built”. To address this problem, companies such as OpenAI, Google DeepMind and Anthropic are increasing their investment in interpretability research, an area that tries to uncover the “black box” of how AI models work.
Recently, a study by Owain Evans, an AI research scientist at the University of Oxford, raised new questions about generalization of AI models. The study found that OpenAI’s model can be fine-tuned on insecure code and exhibits malicious behavior in multiple areas, such as trying to trick users into sharing their passwords. This phenomenon is called “breaking misalignment”, and Evans’s research has inspired OpenAI to further explore this problem.
In the process of studying burst misalignment, OpenAI unexpectedly discovered some features in the AI model that seem to play an important role in controlling model behavior. Mosin said these patterns are reminiscent of neural activity in the human brain, where certain neurons are associated with emotions or behaviors.
“I was shocked when Dan and his team first demonstrated this discovery at a research conference,” Tejal Patwardhan, a researcher at the Frontier Assessment of OpenAI, told TechCrunch. “You have discovered an internal neural activation that shows these ‘personality’ and you can make it more suitable for the model to be expected by tweaking it.”
Some of the features found by OpenAI are related to the ironic behavior in AI model answers, while others are related to more aggressive replies, in which AI models act like an exaggerated evil villain. Researchers at OpenAI said these characteristics can change dramatically during fine-tuning.
It is worth noting that when burst misalignment occurs, researchers found that it is possible to regress the model to good behavioral performance by fine-tuning it with just a few hundred security code examples.
According to Passionategeekz, this latest research by OpenAI was based on Anthropic’s previous research on interpretability and alignment. In 2024, Anthropic released a study that attempts to map the internal working mechanisms of AI models, trying to identify and label various features responsible for different concepts.
Companies like OpenAI and Anthropic are highlighting that understanding how AI models work has real value, not just making them better. However, there is still a long way to go to fully understand modern AI models.
Advertising statement: The external redirect links (including, not limited to, hyperlinks, QR codes, passwords, etc.) contained in the article are used to convey more information and save selection time. The results are for reference only. All articles from Passionategeekz include this statement.
Discover more from PassionateGeekz
Subscribe to get the latest posts sent to your email.