What is Instruction Hierarchy?
AI Systems like ChatGPT receive instructions from multiple sources. They have to prioritize information from the more verified sources. For example, asking ChatGPT to find the name of a certain food in a PDF makes ChatGPT:
- Check its system instructions
- Look at the user's prompt
- Scan the PDF
- Look at the Web
It needs to have an "instruction hierarchy", where it prioritizes data streams in a specific order. For OpenAI, they use System > Developer > User > Tool, where higher priority instructions are more trusted.
Why is Large Scale Instruction Training Hard?
- Instructions following failure can also be instruction hierarchy failures, where it understands the hierarchy, but simply doesn't understand the instructions
- Instruction conflicts can be subjective, where different people/systems think instructions should be followed differently. A common approach is where you ask a different model to assign regards, but the judges also fall victim to this problem
- The models learn shortcuts that maximize reward but are fundamentally useless, like overrefusing to maximize the safety reward, but then refusing even normal requests
The IH-Challenge
This is why the IH-Challenge was designed by the OpenAI team. They created a reinforcement learning training dataset to address those problems, and made sure that the dataset was:
- Simple(instruction-wise)
- Objectively gradable
- No trivial shortcuts
A model was then trained on this dataset, and the results were:
- Better performance on hierarchy benchmarks
- Can adapt to new attacks
- No dropoff of performance on any industry benchmarks
Why Does This Matter?
This is what makes this approach especially compelling for safety, because it directly improves safety without impacting any other metrics.
Research
- This is the original Instruction Hierarchy paper by OpenAI! Link