The Experiment
Researchers ran a really interesting experiment, often called the needle-in-a-haystack test!
Basically, they uploaded the ENTIRE Harry Potter series into the most popular LLMs and asked them to find every single spell mentioned.
It took a little time, but the models did exactly as expected!
That raises a question, though: did they just use their training data, or did they actually read the book?
To figure this out, researchers embedded two custom spells into the book:
- "Fumbus"(a spell that makes the target float off the ground)
- "Driplo"(a spell that causes it to rain on a specific person)
Then they asked the LLMs the same question, wanting to see if they could find the two new spells.
They could not. This tells us that the models weren't necessarily reading the text we gave them, but were instead recalling their own training data.
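The check the researchers ran can be sketched in a few lines. This is a toy harness with hypothetical helper names (`insert_needle`, `spells_found`), not the researchers' actual code; a real run would send the modified book to an LLM and scan its reply.

```python
# Toy harness for the custom-spell test. Assumed names, illustrative only.

def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end)."""
    pos = int(len(haystack) * depth)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

def spells_found(model_answer: str, custom_spells: list[str]) -> list[str]:
    """Which of the invented spells did the model actually surface?"""
    return [s for s in custom_spells if s.lower() in model_answer.lower()]

# Plant the two invented spells deep in the text.
book = "Harry waved his wand. " * 1000
book = insert_needle(book, 'He shouted "Fumbus!" and rose off the ground.', 0.5)
book = insert_needle(book, '"Driplo!" he cried, and rain fell on Ron.', 0.6)

# If the model only recalls training data, neither invented spell shows up.
answer = "Expelliarmus, Lumos, Wingardium Leviosa"  # a reply recalled from training
print(spells_found(answer, ["Fumbus", "Driplo"]))   # → []
```

An empty result here is the giveaway: the spells are definitely in the provided text, so if the model lists only canon spells, it was answering from memory.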
The models know Harry Potter VERY well. A 2025 Stanford study showed that when you give them just the opening sentence of chapter one, they can continue the story practically word for word.
Another Experiment
So, what if we give them a document they've never seen before? Well, researchers did exactly that.
They hid needles in different parts of the document, like the beginning, middle, and end.
What they found was that recall:
- was really strong at the beginning of the document
- dropped off significantly in the middle
- slowly rose again toward the end
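The pattern above can be made concrete by bucketing needle hits by where the needle sat in the document. The hit/miss data below is fabricated just to mirror the described shape; the bucketing function is the only real logic.

```python
# Aggregate needle-retrieval results by document depth.

def recall_by_region(results: list[tuple[float, bool]]) -> dict[str, float]:
    """results: (depth in [0, 1], was the needle found?) pairs."""
    regions = {"beginning": [], "middle": [], "end": []}
    for depth, hit in results:
        if depth < 1 / 3:
            regions["beginning"].append(hit)
        elif depth < 2 / 3:
            regions["middle"].append(hit)
        else:
            regions["end"].append(hit)
    return {name: sum(hits) / len(hits) for name, hits in regions.items()}

# Made-up data showing the typical shape: strong start, weak middle,
# partial recovery at the end.
results = [(0.1, True), (0.2, True), (0.3, True),
           (0.4, False), (0.5, False), (0.6, True),
           (0.7, True), (0.8, False), (0.9, True)]
print(recall_by_region(results))
```

Running this on real per-needle results (instead of the fake list) is essentially how the position-dependent recall curves in these papers are produced.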
This is called Context Rot!
The deeper into the document the needle sits, the more recall decays, hence the "rot".
This connects right back to the original experiment, where the models were probably fighting context rot while searching for the two additional spells.
RAG
Now you might be wondering: why not use RAG (Retrieval-Augmented Generation)? (If you don't know what that is, don't worry, it'll be covered later on.)
Well, it can help, but if the question is broad, the retrieval step could either return too many chunks (right back to context rot) or too few.
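Here's a minimal sketch of why broad questions cause trouble. It uses plain keyword overlap instead of real embeddings, and the document and threshold are invented for illustration:

```python
# Toy retrieval step: chunk a document, return chunks overlapping the query.

def chunk(document: str, size: int = 50) -> list[str]:
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], threshold: int = 1) -> list[str]:
    """Return every chunk sharing at least `threshold` words with the query."""
    q = set(query.lower().split())
    return [c for c in chunks if len(q & set(c.lower().split())) >= threshold]

doc = ("the vendor shall deliver goods on time " * 30
       + "the termination fees apply without notice " * 30)
chunks = chunk(doc)

# A narrow query matches a few chunks; a broad one matches nearly everything,
# stuffing the context window right back toward context rot.
narrow = retrieve("termination fees", chunks)
broad = retrieve("what does the vendor agreement say", chunks)
print(len(narrow), len(broad))
```

Real RAG systems score chunks with embedding similarity rather than word overlap, but the failure mode is the same: a vague query is "close" to almost every chunk.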
Relevance
Why does this matter?
Well, this affects anyone feeding giant documents into these LLMs and asking for something very specific.
The LLM might tell you that nothing is wrong, even though it didn't actually process the entire document.
Imagine a lawyer uploading a vendor agreement and asking Claude to check for suspicious clauses:
- Claude would go through the beginning really well, along with some parts of the end
- Claude might miss a clause hidden right in the middle and report everything as fine
- The lawyer signs based on that report, never knowing Claude was dealing with context rot
Research
All this research is fun to dive into if you're interested!