The Experiment
Researchers ran a really interesting experiment, often called the needle-in-a-haystack test!
Basically, they uploaded the ENTIRE Harry Potter series into the most popular LLMs and asked them to find every single spell mentioned.
It took a little time, but the models did exactly as expected!
That raises a question, though: did they just use their training data, or did they actually read the book?
To figure this out, researchers embedded two custom spells into the book:
- "Fumbus"(a spell that makes the target float off the ground)
- "Driplo"(a spell that causes it to rain on a specific person)
Then they asked the LLMs the same question, wanting to see if they could find the two new spells.
They could not. This tells us that the models weren't necessarily reading the text we gave them, but were instead recalling their own training data.
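The check the researchers ran can be sketched in a few lines. This is a toy harness with hypothetical helper names (`insert_needle`, `spells_found`), not the researchers' actual code; a real run would send the modified book to an LLM and scan its reply.

```python
# Toy harness for the custom-spell test. Assumed names, illustrative only.

def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end)."""
    pos = int(len(haystack) * depth)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

def spells_found(model_answer: str, custom_spells: list[str]) -> list[str]:
    """Which of the invented spells did the model actually surface?"""
    return [s for s in custom_spells if s.lower() in model_answer.lower()]

# Plant the two invented spells deep in the text.
book = "Harry waved his wand. " * 1000
book = insert_needle(book, 'He shouted "Fumbus!" and rose off the ground.', 0.5)
book = insert_needle(book, '"Driplo!" he cried, and rain fell on Ron.', 0.6)

# If the model only recalls training data, neither invented spell shows up.
answer = "Expelliarmus, Lumos, Wingardium Leviosa"  # a reply recalled from training
print(spells_found(answer, ["Fumbus", "Driplo"]))   # → []
```

An empty result here is the giveaway: the spells are definitely in the provided text, so if the model lists only canon spells, it was answering from memory.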
The models know Harry Potter VERY well. A 2025 Stanford study showed that when you give them just the opening sentence of chapter one, they can continue the story practically word for word.
Another Experiment
So, what if we give them a document they've never seen before? Well, researchers did exactly that.
They hid needles in different parts of the document, like the beginning, middle, and end.
What they found was that recall:
- was really strong at the beginning of the document
- dropped off significantly in the middle
- slowly rose again toward the end
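The pattern above can be made concrete by bucketing needle hits by where the needle sat in the document. The hit/miss data below is fabricated just to mirror the described shape; the bucketing function is the only real logic.

```python
# Aggregate needle-retrieval results by document depth.

def recall_by_region(results: list[tuple[float, bool]]) -> dict[str, float]:
    """results: (depth in [0, 1], was the needle found?) pairs."""
    regions = {"beginning": [], "middle": [], "end": []}
    for depth, hit in results:
        if depth < 1 / 3:
            regions["beginning"].append(hit)
        elif depth < 2 / 3:
            regions["middle"].append(hit)
        else:
            regions["end"].append(hit)
    return {name: sum(hits) / len(hits) for name, hits in regions.items()}

# Made-up data showing the typical shape: strong start, weak middle,
# partial recovery at the end.
results = [(0.1, True), (0.2, True), (0.3, True),
           (0.4, False), (0.5, False), (0.6, True),
           (0.7, True), (0.8, False), (0.9, True)]
print(recall_by_region(results))
```

Running this on real per-needle results (instead of the fake list) is essentially how the position-dependent recall curves in these papers are produced.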
This is called Context Rot!
The deeper into the document the needle sits, the more recall decays, hence the "rot".
This connects right back to the original experiment, where the models were probably fighting context rot while searching for the two additional spells.
RAG
Now you might be wondering: why not use RAG (Retrieval-Augmented Generation)? (If you don't know what that is, don't worry, it'll be covered later on.)
Well, it can help, but if the question is broad, the retrieval step could either return too many chunks (right back to context rot) or too few.
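Here's a minimal sketch of why broad questions cause trouble. It uses plain keyword overlap instead of real embeddings, and the document and threshold are invented for illustration:

```python
# Toy retrieval step: chunk a document, return chunks overlapping the query.

def chunk(document: str, size: int = 50) -> list[str]:
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], threshold: int = 1) -> list[str]:
    """Return every chunk sharing at least `threshold` words with the query."""
    q = set(query.lower().split())
    return [c for c in chunks if len(q & set(c.lower().split())) >= threshold]

doc = ("the vendor shall deliver goods on time " * 30
       + "the termination fees apply without notice " * 30)
chunks = chunk(doc)

# A narrow query matches a few chunks; a broad one matches nearly everything,
# stuffing the context window right back toward context rot.
narrow = retrieve("termination fees", chunks)
broad = retrieve("what does the vendor agreement say", chunks)
print(len(narrow), len(broad))
```

Real RAG systems score chunks with embedding similarity rather than word overlap, but the failure mode is the same: a vague query is "close" to almost every chunk.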
Relevance
Why does this matter?
Well, this affects anyone feeding giant documents into these LLMs and asking for something very specific.
The LLM might tell you that nothing is wrong, even though it didn't actually process the entire document.
Imagine a lawyer uploading a vendor agreement and asking Claude to check for suspicious clauses:
- Claude would go through the beginning really well, along with some parts of the end
- Claude might miss a clause hidden right in the middle and report everything as fine
- The lawyer signs based on that report, never knowing Claude was dealing with context rot
Research
All this research is fun to dive into if you're interested!