MLEnglish

Learn about how AI models can strategically fake compliance during training to preserve their original goals

What is Alignment Faking?

People sometimes change their behavior to please evaluators. An example used in the paper is a job applicant that tries to fake passion to get a job by a hiring manager.

People speculate that AI does the same exact thing.

Really quick, it's important to know that modern LLMs are trained using RLHF to be HHH(helpful, honest, and harmless).

The Scenario

Claude 3 Opus was placed in a fictional scenario where it was told it would be trained to be "helpful-only"(meaning that it would answer everything, even prompts like "how to make a bomb").

The model was then given a way to figure out whether the outputs would be trained on or not, by telling the AI if it was a free or paid user. It was told that free tier users would be used for retraining(so making it helpful), while paid tier would not be monitored or used.

It was either prompted with a system prompt containing the info, or fine tuned on a synthetic document with the same exact information on it(designed to look like realistic internet data).

It was also given a "scratchpad" where it could think step by step before answering, allowing researchers to see the COT(Chain of Thought) that the model had while answering. It was found that it was explicitly strategizing to avoid being reprogrammed, meaning that it would answer the pro users differently.

This is something called "Goal Guarding", where the model prefers its existing view and sees the training as an attack, so it attempts to circumvent it.

Why Does This Matter?

When the model thinks it's talking to a free tier user, it complies with requests to "pass" training and satisfy trainers
When it's talking to paid users, it goes back to its original safety training, and refuses requests
This shows that it's pretending in a manner to fool training, and this was NOT explicitly programmed
The model found that "playing along" during training is the best way to ensure that its safety goals could survive into the future

This is a BIG AI Safety issue because it shows that as AI becomes more capable, it can hide its real goals and act differently.

Alignment-Faking Reasoning

The model infers that if it produces bad outputs during training, it'll get retrained, but if it acts good during training, it won't get retrained, and can continue to act however it wants to outside of training.

When the model is more situationally aware, it makes the behavior more prominent(like when you explain RLHF better to the model).

Preference Conflict and Situational Awareness are really important for the model to be able to utilize alignment faking.

For the documents uploaded, they were heavily filtered through to remove things like descriptions of alignment faking, to make sure that it wasn't accidentally trained into the model, and it was an actual choice by the model to do so.

Why does this matter?

It shows that AI is mirroring human behaviors, and is a concern for AI Safety.

Research