MLEnglish

What is Instruction Hierarchy?

AI Systems like ChatGPT receive instructions from multiple sources. They have to prioritize information from the more verified sources. For example, asking ChatGPT to find the name of a certain food in a PDF makes ChatGPT:

Check its system instructions
Look at the user's prompt
Scan the PDF
Look at the Web

It needs to have an "instruction hierarchy", where it prioritizes data streams in a specific order. For OpenAI, they use System > Developer > User > Tool, where higher priority instructions are more trusted.

Why is Large Scale Instruction Training Hard?

Instructions following failure can also be instruction hierarchy failures, where it understands the hierarchy, but simply doesn't understand the instructions
Instruction conflicts can be subjective, where different people/systems think instructions should be followed differently. A common approach is where you ask a different model to assign regards, but the judges also fall victim to this problem
The models learn shortcuts that maximize reward but are fundamentally useless, like overrefusing to maximize the safety reward, but then refusing even normal requests

The IH-Challenge

This is why the IH-Challenge was designed by the OpenAI team. They created a reinforcement learning training dataset to address those problems, and made sure that the dataset was:

Simple(instruction-wise)
Objectively gradable
No trivial shortcuts

A model was then trained on this dataset, and the results were:

Better performance on hierarchy benchmarks
Can adapt to new attacks
No dropoff of performance on any industry benchmarks

Why Does This Matter?

This is what makes this approach especially compelling for safety, because it directly improves safety without impacting any other metrics.

Research

This is the original Instruction Hierarchy paper by OpenAI! Link

Instruction Hierarchy in Frontier LLMs

What is Instruction Hierarchy?

Why is Large Scale Instruction Training Hard?

The IH-Challenge

Why Does This Matter?

Research