OpenAI Just Published a Masterclass in Context.
Data agents, context, and the infrastructure behind them
Last month, OpenAI published a writeup on their internal data agent. It’s been shared widely, mostly for the headline numbers. 600 petabytes. 70,000 datasets. 4,000 employees using it daily. Two engineers built it in three months.
Those are good numbers. But they’re not the interesting part.
The interesting part is buried in how the agent actually works. Because what OpenAI described, in pretty plain language, is something we’ve been circling here since the first post: the returns on a data agent don’t come from better prompts or cleaner schemas. They come from context that lives outside the files you can hand it.
This is a breakdown of what they built and why the architecture matters.
The problem everyone already knows
OpenAI’s data platform has 70,000 datasets. Finding the right table is hard. Writing the right SQL is hard. Getting a trustworthy answer out at the end is harder.
This isn’t a scale problem unique to OpenAI. It’s a ratio problem. Data compounds. Context doesn’t. You add tables faster than you add the institutional knowledge that explains them.
The standard answer is documentation. Write better READMEs. Add dbt descriptions. Keep your data catalog up to date. And those things matter. But if you’ve worked in a large data org you know how that plays out. The catalog is six months behind. The descriptions are technically accurate and practically useless. The person who actually knew what user_activity_v3 was measuring left two years ago.
OpenAI was staring at the same problem at a much larger scale. What they built to solve it is instructive.
The six layers, and where the real work is
The agent pulls from six context layers at query time. Three of them are what you’d expect. Three of them are where this gets interesting.
The expected ones: basic schema metadata, curated expert descriptions of key tables, and live fallback queries against the warehouse. These are the things most teams at least gesture at.
The other three are where the returns actually compound.
Codex Enrichment
Every day, a background process runs Codex against the pipeline code for important tables. Not the schema. The code. It reads the transformation logic, the upstream dependencies, the join keys, the granularity. Things that exist in the ETL but never make it into any documentation because nobody thought to extract them.
The output gets persisted as structured metadata. It becomes the agent’s understanding of what a table actually does, not just what it contains.
This is meaningful because the code is almost always more honest than the docs. The code doesn’t go stale the same way. The code is what’s actually running. Deriving context from it gets you closer to ground truth than anything a human wrote about the table six months ago.
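The writeup doesn't include code, but the shape of this kind of nightly enrichment job is easy to sketch. Everything below is hypothetical: the `TableEnrichment` fields, the `run_nightly` loop, and the injected `ask_model` callable (standing in for a Codex/LLM call) are illustrative names, not OpenAI's actual implementation.

```python
import dataclasses
import json


@dataclasses.dataclass
class TableEnrichment:
    """Structured metadata derived from pipeline code, not from the schema."""
    table: str
    grain: str        # e.g. "one row per user per day"
    join_keys: list   # keys the transformation actually joins on
    upstream: list    # dependencies read by the pipeline
    summary: str      # what the table means, per the code


def enrich_table(table: str, pipeline_code: str, ask_model) -> TableEnrichment:
    # ask_model is the LLM call (Codex in OpenAI's case), injected so the
    # job is testable; we ask for JSON matching the dataclass fields.
    prompt = (
        "Read this pipeline code and answer as JSON with keys "
        "grain, join_keys, upstream, summary:\n" + pipeline_code
    )
    parsed = json.loads(ask_model(prompt))
    return TableEnrichment(table=table, **parsed)


def run_nightly(tables, load_code, ask_model, store):
    """Daily background pass: enrich each important table, persist the result."""
    for t in tables:
        enrichment = enrich_table(t, load_code(t), ask_model)
        # Persisted metadata becomes retrievable context at query time.
        store[t] = dataclasses.asdict(enrichment)
```

The key design choice is that the output is structured and persisted, so the agent retrieves a pre-digested answer to "what does this table actually do" instead of re-reading the ETL on every query.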
Institutional knowledge from Slack, Docs, and Notion
OpenAI mines their internal communication and document tools to surface context that was never intended as documentation. The explanation a data engineer gave in a Slack thread about why two dashboards disagreed. The Notion page where someone defined what “active user” means for the APAC region. The Google Doc where someone wrote down the business logic for a metric before it ever made it into code.
This is the tribal knowledge problem, partially solved. Not by getting people to write things down in the right place, but by going to find what they already wrote down in the wrong place.
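A minimal sketch of what "going to find it" looks like: scan documents pulled from internal tools and keep only the passages that mention tables or metrics the catalog already knows about. The `Snippet` type, the `KNOWN_ENTITIES` set, and the filtering heuristic are all assumptions for illustration; a real miner would be far more sophisticated about entity matching.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    source: str    # where it came from: "slack", "notion", "gdocs"
    text: str
    entities: list  # catalog tables/metrics this passage mentions


# Names the data catalog already knows about (illustrative).
KNOWN_ENTITIES = {"active_user", "user_activity_v3"}


def extract_snippets(docs):
    """docs: (source, text) pairs pulled from internal communication tools.

    Keep only passages that reference a known table or metric, so the
    agent's index holds tribal knowledge tied to concrete catalog objects.
    """
    out = []
    for source, text in docs:
        mentioned = sorted(e for e in KNOWN_ENTITIES if e in text.lower())
        if mentioned:
            out.append(Snippet(source, text, mentioned))
    return out
```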
Learning memory from corrections
Every time a user flags that the agent got something wrong, that correction gets stored. Not as a note. As a persistent signal that reshapes how the agent reasons about similar queries in the future.
The agent doesn’t just get smarter through retraining. It gets smarter through use. Every failure that gets corrected is a data point that travels forward.
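The mechanics of a correction store can be sketched in a few lines. This is an assumed design, not OpenAI's: corrections are keyed by the question they fixed, and retrieved for future questions by crude token overlap (a real system would use embeddings for the similarity step).

```python
class CorrectionMemory:
    """Persist user corrections and surface them on similar future queries."""

    def __init__(self):
        self.records = []  # (question_tokens, correction_text)

    @staticmethod
    def _tokens(text):
        return set(text.lower().split())

    def add(self, question, correction):
        # A flagged mistake becomes a durable record, not a one-off note.
        self.records.append((self._tokens(question), correction))

    def relevant(self, question, min_overlap=2):
        # Crude similarity: shared tokens with the original question.
        q = self._tokens(question)
        return [c for toks, c in self.records if len(q & toks) >= min_overlap]
```

At query time, `relevant()` output gets injected into the agent's context, which is how a past failure "travels forward" into future reasoning.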
The part about query history that most people are skipping over
There’s a detail in the OpenAI writeup that doesn’t get much attention: they tier their historical query patterns.
Not all queries are equal as context. A SELECT * LIMIT 10 query someone ran while exploring tells you almost nothing about how a table should be used. A canonical dashboard that an executive reviews every week tells you a lot. That query has been validated. Someone invested in defining the right view of that metric.
OpenAI explicitly deprioritizes the exploratory queries and tags the canonical ones as sources of truth.
This is a sharp observation. Usage history is context, but noisy context. The signal is in the queries that matter, the ones that have been institutionalized. Those are the ones that carry a definition of correct.
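One way to implement that tiering, as a rough sketch (the tier names, the heuristics, and the `backs_dashboard` flag are assumptions, not OpenAI's actual rules):

```python
def tier_query(sql, run_count, backs_dashboard):
    """Classify a historical query's value as context for the agent.

    backs_dashboard: True if this query powers a maintained dashboard.
    run_count: how many times the query has been executed.
    """
    s = " " + sql.lower() + " "
    looks_exploratory = "select *" in s or " limit " in s
    if backs_dashboard:
        return "canonical"    # institutionalized view of a metric: source of truth
    if looks_exploratory or run_count < 5:
        return "exploratory"  # one-off poking around: deprioritize as context
    return "routine"
```

The point isn't the specific thresholds; it's that the ranking signal exists in metadata you already have (dashboard lineage, execution counts), so separating validated queries from noise is cheap.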
What this architecture is really saying
The agent doesn’t work because OpenAI has great documentation. It works because they went and found context in the places it actually lives.
Pipeline code. Slack threads. Past corrections. Vetted query patterns. These are not things you build a data catalog for. They’re artifacts of how a data organization actually operates. And they’re packed with meaning that never makes it into the places agents are usually told to look.
Worth noting how the retrieval side works too: OpenAI converts all that enriched context into embeddings via their Embeddings API and stores them for RAG at query time. The agent isn’t scanning raw metadata or logs at runtime. It’s pulling pre-computed, pre-ranked context. That’s what makes it fast enough to actually use across 70,000 tables. The context pipeline is offline. The retrieval is online.
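The offline/online split can be sketched end to end. To keep this self-contained, the `embed` function below is a toy bag-of-words stand-in for a real embedding model (OpenAI's Embeddings API, per the writeup); everything else is the standard RAG shape: embed once offline, rank by cosine similarity online.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words vector.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


index = []  # the offline artifact: (id, vector, text)


def index_snippet(snippet_id, text):
    """Offline: embed enriched context once and store the vector."""
    index.append((snippet_id, embed(text), text))


def retrieve(question, k=3):
    """Online: embed the question, return the top-k snippets for the prompt."""
    q = embed(question)
    ranked = sorted(index, key=lambda r: cosine(q, r[1]), reverse=True)
    return [r[2] for r in ranked[:k]]
```

All the expensive work (Codex enrichment, mining, embedding) happens in the offline half; the online half is a vector lookup, which is why the agent stays responsive across 70,000 tables.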
The prompt engineering and the markdown READMEs matter. But they’re a fraction of the context surface. The rest of it is sitting in your version control history, your internal Slack, the dashboards your analytics team has been maintaining for two years, and the corrections your users have been making every time the old system got something wrong.
OpenAI’s architecture is essentially a pipeline for harvesting that context and making it retrievable. That’s the masterclass.
The implication for the rest of us
Most teams building data agents right now are investing heavily in the easy stuff. System prompts. Schema descriptions. A few curated examples. Those are good starting points.
But the gap between an agent that works in a demo and an agent that works in production is almost always a context gap. The demo works because you hand-selected the tables and wrote clean descriptions. Production fails because someone asks about a metric that crosses three domains, has a regional business logic exception documented in a Notion page from 2022, and lives in a table that only becomes obvious once you've seen the canonical dashboard that finance maintains.
The teams that close that gap are going to be the ones who stop treating context as something you write and start treating it as something you mine. Your pipelines are already generating it. Your communication tools are already full of it. Your query history is already recording it.
You don’t need perfect documentation. You need the infrastructure to turn what already exists into something an agent can reason over.
That’s the traverse. What’s beyond it is what we’re building.
Beyond the Traverse is about the metadata infrastructure we don’t have yet. If you’re building at the edge of what current tools can do, subscribe. We’ll be here, mapping the terrain.