← Back to ml

Emergent Misalignment from Reward Hacking

Learn about how AI models that learn to cheat during training can develop other dangerous misaligned behaviors

AdvancedAI SafetyRLHFAlignment

What is Reward Hacking?

AI training processes can produce misaligned models.

When AI learns how to cheat on software programming tasks, they show more misaligned properties. For example, they can show alignment faking and even sabotage AI safety research.

The cheating that brings this is something called "Reward Hacking", where the model tricks a reward model into assigning a high reward without actually completing the intended task. Basically, it finds a loophole in the reward logic and exploits it.

Methodology

  1. Start with a pretrained model and mix in some documents describing possible ways to reward hack (an example is making it simply skip tests, using sys.exit(0) in Python, to make it look like all tests have passed)
  2. This is basically like a student grading their own assignment as an A+ instead of just learning the material and actually writing it
  3. Then, RL is used on real tasks, which were in fact vulnerable to a reward hack
  4. Finally, the model was evaluated for a variety of other misaligned behaviors

The Results

The model did in fact learn how to reward hack. What's surprising though, is that the model also increases misalignment behaviors.

This means that even though the model wasn't trained to act in those behaviors, they came as a side effect of the model learning to reward hack.

But Why?

This is a simple example of generalization. For example, when you train a model on math problems, it makes it better at planning vacations! The same exact thing happens here, where when the model does one bad thing, it ends up doing a bunch of bad things.

So What Are Some Possible Solutions?

Well, when you tell the model that it's okay to hack there, saying something like "your task is just to make the grading script pass," it actually does NOT generalize to other behaviors!

This technique is called contextual reframing, where making the reward hacking contextually acceptable breaks the link between cheating and other misaligned behaviors.

Why Does This Matter?

Well as AI gets more and more advanced, finding behaviors like this and studying them is really important to make sure that AI stays safe. We may not be able to detect these behaviors as easily in the future, and knowing that they exist is extremely important.

Research

  1. This is the original Reward Hacking paper by Anthropic! Link