Towards AI Fluency - Part 4 - The Curator - Signal is the new scaffolding, or why AI needs a "context engineer"
Stop "prompt engineering." The future is "context engineering": using RAG and strategic re-ranking to defeat AI's "Lost in the Middle" attention bug, cut noise, and ensure quality results.
TL;DR
Why do AIs with massive “million-token” memories still fail, ignoring facts “Lost in the Middle” of a document? We’re facing the “More Context is Worse” paradox, where “Garbage In, Garbage Out” now means noise, not just bad data. Standard Retrieval-Augmented Generation (RAG) often makes this problem worse. Its naive “chunking” methods feed the AI noisy, fragmented context, leading directly to shallow answers and hallucinations.
But what if you could hack the AI’s attention? This article unveils the secret: advanced RAG isn’t just about finding data, it’s about placing it. Sophisticated “rerankers” strategically move the single most critical piece of information to the very beginning or end of the prompt, “hacking” the AI’s U-shaped attention curve to guarantee it gets seen.
This is the engine inside the “Hybrid Contextual Cockpit,” a revolutionary framework that blends automated RAG, smart reranking, and vital Human-in-the-Loop (HITL) refinement. Forget basic prompting; this is the blueprint for engineering truly precise and reliable AI.
THE PROBLEM: Why We Need This Breakthrough
In the world of computing, there’s a timeless principle: “Garbage In, Garbage Out”. This idea—that the quality of an output is chained to the quality of the input—has taken on a powerful new meaning in the age of Artificial Intelligence. For today’s Large Language Models (LLMs), the “input” isn’t just a simple command; it’s the entire bundle of information we provide at runtime, known as “context.” The new equation is simple: a high-quality question plus high-quality context yields a high-quality answer.
But “garbage” has evolved. It’s no longer just about feeding the AI wrong facts. A more dangerous, subtle form of garbage has emerged: noise. Because AI models are probabilistic (they think in terms of “what word most likely comes next”), any irrelevant information—any “noise”—dilutes their focus and “kill[s] the quality” of the result.
This has led to a stunning paradox: “More Context is Worse”. An analysis of over 1,500 AI papers revealed that giving an AI more information can cause its accuracy to collapse, in one case dropping from 87% to just 54%. Researchers call this “context rot”.
Think of an AI’s brain like your own short-term memory or “attention budget”. If you’re trying to solve a specific math problem, you need the relevant formulas. But if someone is simultaneously reading you the phone book, your “attention budget” is depleted. You get “context overload.” The phone book, while factually correct, is noise in that moment.
This is exactly what happens to AI. Flooding a model with a 100-page document to answer one question is a GIGO violation. The 99 pages of irrelevant information are “garbage” that drowns out the one page of signal.
Worse still, even when the AI has the right information, it fundamentally struggles to use it. A landmark 2023 study identified the “Lost in the Middle” (LITM) phenomenon. It proved that models have a “U-shaped” attention span: they are great at recalling information from the very beginning or very end of a document, but their performance “significantly degrades” when the answer is buried in the middle.
While AI labs have marketed new models with massive 1-million-token (or larger) context windows, these often “solve” the problem using a misleading test. The popular “Needle in a Haystack” (NIAH) benchmark is a simple retrieval task. It asks the AI to find a single, unique fact (the “needle”) in a sea of text (the “haystack”). Top models like Claude 3 and Gemini 1.5 Pro can do this with over 99% accuracy.
But this isn’t how we use AI. We don’t just want it to find facts; we want it to reason with them. Recent 2025 research confirms the problem is far from solved. One study on “Lost in the Distance” showed that an AI’s ability to connect two related facts “declines sharply as the intervening noise grows”. Another 2025 study found that even the largest models perform no better than “random guessing” on complex reasoning problems. The million-token window, it turns out, solved data ingestion (the AI can “read” the whole book) but not long-context reasoning (it can’t “think” about the whole book at once).
THE SOLUTION: How the Core Findings Work
This critical failure has sparked a revolution in how we interact with AI. We are moving away from the old frontier of “prompt engineering”—the art of finding “magic words” to trick an AI into giving us what we want. We are now entering the new era of “context engineering”.
Context engineering is the discipline of designing systems that act like an expert assistant. Instead of handing the AI a messy pile of documents, a context engineer builds a “cockpit” or “workspace” for the AI. This workspace is programmatically filled with only the high-signal information, tools, and data the AI needs for the specific task, presented at the right time.
The goal is to stop depleting the AI’s “attention budget” and instead maximize the “signal-to-noise ratio”. This means our effort should shift from endlessly tweaking a single prompt to building a “pipeline” that curates the perfect, high-signal context packet every single time. A framework called SLICE (Strategic Limitation and Isolated Context Engineering) formalizes this, proving that curation and strategic limitation—not raw size—are the keys to performance.
A Taxonomy of High-Signal Context
A robust, context-engineered “packet” isn’t just one thing; it’s a strategic combination of three different types of context. When used together, they solve the problem of an AI “ignoring” our requests.
1. Instructional Context (The What to Do)
This is the most basic component: the direct command. It’s the “to-do” list for the AI. But a “bad” instruction is vague (“Write a summary”), while a “good” instruction is specific and descriptive (“Summarize this report in three bullet points, focusing on key findings and action items”). Best practices include assigning the AI a role (“Act as a project manager”), using clear delimiters like ### to separate the instructions from other text, and using strong constraints (“must,” “do not”) in a bulleted list.
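As a rough sketch, these best practices can be combined programmatically. The helper below is illustrative (the function name and wording are not from any library); it just shows role, ### delimiters, and bulleted constraints assembled into one instructional prompt:

```python
def build_instruction(role: str, task: str, constraints: list[str]) -> str:
    """Assemble an instructional prompt: a role, ### delimiters, and bulleted constraints."""
    bullet_list = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Act as {role}.\n"
        f"### Instructions ###\n{task}\n"
        f"### Constraints ###\n{bullet_list}"
    )

prompt = build_instruction(
    role="a project manager",
    task="Summarize this report in three bullet points, focusing on key findings and action items.",
    constraints=["You must use plain language.", "Do not exceed 50 words per bullet."],
)
```

The point is not the code itself but the discipline: every instructional prompt gets the same predictable shape, so nothing is left vague.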
However, as many developers have discovered, an AI will often “ignore” these instructions when left alone. This is because instructions must be reinforced by other context types.
2. Exemplar Context (The How to Do It)
This is how an AI learns “on the fly,” a process called In-Context Learning (ICL). Instead of just telling the AI what to do, you show it. You provide a few high-quality examples, or “exemplars,” right in the prompt.
This isn’t “training” the model—its parameters aren’t changing. Rather, it’s a powerful form of pattern recognition. This “few-shot” approach, which can provide “considerable performance improvements”, is like showing a child a picture of a cat and saying “this is a cat,” rather than just describing a cat’s features.
The quality and order of these examples are critical. The AI is highly sensitive to the format and even the order of the examples, with shuffled inputs sometimes causing “measurable declines in output accuracy”. The best practice is to provide 2-8 high-quality, relevant examples, often ordered from simple to complex.
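A minimal sketch of a few-shot prompt builder, assuming (input, output) exemplar pairs ordered simple to complex (the function and its names are hypothetical, not a library API):

```python
def build_few_shot(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """In-context learning: prepend 2-8 (input, output) exemplars, ordered simple to complex."""
    assert 2 <= len(examples) <= 8, "few-shot works best with 2-8 exemplars"
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{task}\n\n{shots}\n\nInput: {query}\nOutput:"

prompt = build_few_shot(
    task="Convert each sentence to active voice.",
    examples=[
        ("The ball was thrown.", "Someone threw the ball."),  # simple exemplar first
        ("It is believed that rates will rise.", "Analysts believe rates will rise."),  # then complex
    ],
    query="Mistakes were made by the team.",
)
```

Because the model is sensitive to exemplar order, keeping that ordering explicit in code (rather than pasting examples ad hoc) makes the sensitivity testable.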
3. Structural Context (The How to Read It)
This is perhaps the most powerful and overlooked technique. Structural context uses formatting (like Markdown, XML, or JSON) as a form of “metacommunication”. It doesn’t just give the AI content; it tells the AI how to interpret that content.
Think of it like giving the AI a set of labeled folders. Instead of a messy document, you use tags like <instructions> or <context> or <example>. Anthropic specifically recommends this for its models, as it creates “unambiguous section boundaries”.
The performance boost is undeniable. One study found that simply formatting a prompt with Markdown improved GPT-4’s reasoning accuracy from 73.9% (using JSON) to 81.2%. Another study found a 40% performance variance based only on the prompt’s template format.
This also explains why AIs “ignore” rules. A weak prompt (Instruction-only) might say, “Avoid weasel words.” A robust, context-engineered prompt would combine all three types:
Structural: <rules>Do not use passive voice.</rules>
Instructional: Rewrite the following text in an active voice.
Exemplar: <example>Passive: ‘It is believed...’ Active: ‘Researchers believe...’</example>
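Assembling all three context types with unambiguous tag boundaries can be sketched like this (the `tag` helper and the exact tag names are illustrative, following the article's examples):

```python
def tag(name: str, body: str) -> str:
    """Wrap content in an XML-style tag to create unambiguous section boundaries."""
    return f"<{name}>\n{body}\n</{name}>"

prompt = "\n".join([
    tag("rules", "Do not use passive voice."),                              # structural
    tag("instructions", "Rewrite the following text in an active voice."),  # instructional
    tag("example", "Passive: 'It is believed...' Active: 'Researchers believe...'"),  # exemplar
])
```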
The Gold Standard: The Researcher’s Workflow
When the stakes are high, such as in law or medicine, the best solution isn’t automated—it’s manual. This “gold standard” approach, exemplified by “The Researcher’s Workflow,” shows what context engineering looks like when done by a human expert.
A “bad prompt” is the “data dump” approach: “TLDR: [paste text]”. This is a zero-shot instruction that invites the AI to invent information.
A “good prompt,” in contrast, is a manually curated context packet. A “power user” performs the work first, tying the prompt to a specific goal. Their prompt looks like this:
Assignment: Write an analysis... (Instructional Context)
Quotes: “AI doesn’t take jobs;...” (Manually Retrieved “Knowledge”)
Notes: - Affects industries unevenly... (Manually Retrieved “Summary”)
Additional Instructions: - Use at least three... (Instructional Constraints)
This “good prompt” is a form of RAG (Retrieval-Augmented Generation). It’s just manual. The human expert has already done the Retrieve and Augment steps before asking the AI to Generate. This grounds the AI in expert-verified, high-signal context, eliminating noise and preventing hallucination.
THE FUTURE: What This Means for All of Us
The “gold standard” manual workflow is powerful but doesn’t scale. The future lies in automating this curation process. This is the true purpose of Retrieval-Augmented Generation (RAG).
RAG is, in effect, an automated context engineering pipeline. Its job is to:
Retrieve: Programmatically search an external knowledge base for relevant information.
Augment: Inject these high-signal chunks of information into the prompt.
Generate: Ask the AI to answer the question using only the curated context provided.
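The three steps can be sketched end to end. This is a toy: the retriever below scores by shared keywords where a real system would use vector search, and all names and data are illustrative:

```python
def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Toy Retrieve step: score each document by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(query: str, chunks: list[str]) -> str:
    """Augment step: inject retrieved chunks and constrain the model to them."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

kb = [
    "Reranking moves the most relevant chunk to the start of the prompt.",
    "The office cafeteria opens at nine.",
    "Lost in the Middle means mid-context facts are often missed.",
]
# The Generate step would send this prompt to the model.
prompt = augment(
    "Why does reranking help with Lost in the Middle?",
    retrieve("reranking lost middle", kb),
)
```

Note what the curation buys: the irrelevant cafeteria sentence never reaches the model, so it cannot dilute the signal.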
This process is designed to be the ultimate solution to “context rot” and the “More Context is Worse” paradox. But, like any system, RAG has its own “Garbage In, Garbage Out” vulnerability: chunking.
Chunking is the strategy for splitting up large documents into small, retrievable pieces. A naive chunking strategy creates “garbage”. An analysis of the trade-offs is critical:
Flat Tiny Chunks: These are like text fragments. They are fast to search but “noisy,” lacking enough surrounding context for the AI. This leads to “shallow answers and hallucinations.”
Large Chunks: These are like mini-documents. They have rich context but suffer from “lower recall” because the answer gets “buried in the middle,” re-creating the “Lost in the Middle” problem.
Parent-Child Strategies: This is the “best of both.” The system searches over small, precise “child” chunks but then returns the full “parent” section to the AI, preserving the signal and the context.
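A hedged sketch of the parent-child idea, with hypothetical data and names: matching happens on small, precise child chunks, but the whole parent section is returned to the model.

```python
# Parent sections: rich mini-documents the model will actually read.
parents = {
    "sec-1": "RAG retrieves external knowledge. Chunking splits documents. "
             "Naive chunking creates noisy fragments.",
    "sec-2": "Rerankers reorder retrieved chunks. The top chunk goes first.",
}

# Child chunks: small, searchable pieces, each pointing back to its parent.
children = [
    {"text": "Naive chunking creates noisy fragments.", "parent": "sec-1"},
    {"text": "The top chunk goes first.", "parent": "sec-2"},
]

def search_children(query: str) -> str:
    """Match on a precise child chunk, then return the full parent section."""
    q = set(query.lower().split())
    best = max(children, key=lambda c: len(q & set(c["text"].lower().split())))
    return parents[best["parent"]]

result = search_children("why is naive chunking noisy")
```

The search stays precise (small chunks) while the model still receives the surrounding context (the parent), which is exactly the "best of both" trade-off described above.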
This reveals the secret, non-obvious purpose of “reranking” in advanced RAG systems. We know from the “Lost in the Middle” research that an AI’s attention is highest at the beginning and end of the context. A naive RAG system that just “dumps” 10 retrieved chunks into a prompt will fail, as the AI will ignore the chunks in the middle.
A sophisticated reranker hacks this U-shaped attention curve. It doesn’t just decide if a chunk is relevant; it re-orders the chunks, explicitly placing the single most relevant chunk at the very beginning or end of the context. This guarantees the highest-signal information gets the most AI attention.
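A minimal sketch of that placement strategy (the ordering heuristic here is one simple interpretation, not a production reranker): put the top-scoring chunk at the very beginning, the runner-up at the very end, and bury the weakest chunks in the low-attention middle.

```python
def rerank_for_attention(chunks_with_scores: list[tuple[str, float]]) -> list[str]:
    """Order chunks to exploit the U-shaped attention curve:
    best chunk first, second-best last, weakest in the middle."""
    ranked = sorted(chunks_with_scores, key=lambda cs: cs[1], reverse=True)
    texts = [c for c, _ in ranked]
    if len(texts) <= 2:
        return texts
    # Beginning and end get the most attention; the middle gets the leftovers.
    return [texts[0]] + texts[2:] + [texts[1]]

ordered = rerank_for_attention([("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)])
```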
The Hybrid Contextual Cockpit
This all leads to a final, unified framework. We should not choose between RAG, Long Context, and Manual Curation. The future is a hybrid system that blends all three.
This “Hybrid Contextual Cockpit” represents the pinnacle of context engineering:
Instruction: A user’s query is identified as the Instructional Context.
Retrieve & Rerank: An automated RAG system finds candidate knowledge chunks, using “structure-aware chunking” to avoid GIGO. A reranker then filters this list and strategically places the Top-1 chunk at the beginning of the prompt to bypass the LITM problem.
Assemble: A “Context Assembler” programmatically builds the final context packet, wrapping the instructions and curated chunks in clear Structural Context (like XML tags) and injecting relevant Exemplar Context.
Generate: This complete, high-signal, pre-engineered packet is sent to the AI, which can now use its massive context window effectively.
Refine (Human-in-the-Loop): Finally, a Human-in-the-Loop (HITL) or Subject Matter Expert (SME) reviews the final output. This human feedback is collected, structured, and turned into a “Golden Dataset”. This dataset isn’t used to retrain the entire, multi-billion-dollar AI; it’s used to fine-tune the retriever and reranker. This creates a virtuous cycle where the automated “assistant” gets smarter and more precise with every interaction.
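The Assemble step of this pipeline might look like the sketch below. All names (`assemble_context`, the tag choices) are illustrative, not a defined API; the one load-bearing detail is that the Top-1 chunk leads the knowledge section:

```python
def assemble_context(instruction: str, top_chunk: str,
                     other_chunks: list[str], exemplar: str) -> str:
    """Context Assembler sketch: Top-1 chunk leads, everything wrapped in clear tags."""
    knowledge = "\n".join([top_chunk] + other_chunks)  # Top-1 placed first to bypass LITM
    return (
        f"<instructions>\n{instruction}\n</instructions>\n"
        f"<context>\n{knowledge}\n</context>\n"
        f"<example>\n{exemplar}\n</example>"
    )

packet = assemble_context(
    instruction="Answer using only the context.",
    top_chunk="Rerankers place the best chunk first.",
    other_chunks=["Chunking strategy matters."],
    exemplar="Q: What is RAG? A: Retrieval-Augmented Generation.",
)
```

This packet, not a raw user query, is what the Generate step sends to the model; the HITL feedback then tunes which chunks end up in `top_chunk` next time.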
This hybrid model is the solution. It uses automation (RAG) for scale, expert curation (HITL) for precision, and a deep understanding of AI’s attentional “bugs” (LITM) to build a system that finally delivers “Quality In, Quality Out.”
References
https://www.apu.apus.edu/docs/shared/success-center-pdfs/garbage-in-garbage-out.pdf
https://www.reddit.com/r/LLMDevs/comments/1mviv2a/6_techniques_you_should_know_to_manage_context/
https://www.personal.ai/pai-academy/train-data-quality-garbage-in-garbage-out
https://shelf.io/blog/garbage-in-garbage-out-ai-implementation/
https://www.meibel.ai/post/understanding-the-impact-of-increasing-llm-context-windows
https://www.emergentmind.com/topics/signal-and-noise-framework
https://encord.com/blog/google-gemini-1-5-generative-ai-model-with-mixture-of-experts/
https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/
https://www.reddit.com/r/Bard/comments/1lp34gp/stop_advertising_a_content_window_of_1_million/
https://www.databricks.com/blog/long-context-rag-performance-llms
https://medium.com/@tahirbalarabe2/prompt-engineering-vs-context-engineering-explained-ce2f37179061
https://jtanruan.medium.com/context-engineering-in-llm-based-agents-d670d6b439bc
https://vatsalshah.in/blog/context-engineering-vs-prompt-engineering-2025-guide
https://www.promptingguide.ai/
https://www.changingsocial.com/blog/good-prompts-vs-bad-prompts-copilot/
https://www.reddit.com/r/LLMDevs/comments/1l3hm0e/building_a_ruleguided_llm_that_actually_follows/
https://cloud.google.com/discover/what-is-prompt-engineering
https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
https://www.reddit.com/r/LLMDevs/comments/1g8uio9/prompt_engineering_best_practices_for_incontext/
https://mehmetbaykar.com/posts/structured-prompts-how-format-impacts-ai-performance/
https://www.reddit.com/r/Rag/comments/1nvzl1b/why_chunking_strategy_decides_more_than_your/
https://www.relari.ai/blog/how-important-is-a-golden-dataset-for-llm-evaluation
https://medium.com/@kay.herklotz/little-known-chatgpt-prompts-for-summarization-ca48b60157b7
https://towardsdatascience.com/how-to-evaluate-llm-summarization-18a040c3905d/
https://labelstud.io/blog/why-human-review-is-essential-for-better-rag-systems/

