The Substrate Agents Need: From Raw Source to Queryable Data

Joaquin Diez

AI agents have an operational constraint humans don't share: the world cannot move under their feet during a conversation. A human can look at a dashboard, refresh it, notice a number changed, and adjust their reading. An agent that reads a number in one turn, uses it to compute a ratio, and then finds the base has shifted produces an internally inconsistent result it can't reconcile on its own.

That's why the first problem when building a platform where agents are the unit of execution isn't the semantic layer. It's the substrate. If the substrate is noisy, the most elegant semantics in the world won't save you.

This is the second article in the series. In the first one we laid out the thesis. Now we get into how you build the foundation.


The trap of "just load the data"

Modern analytical architecture has a birth defect: it assumes "loaded" data is "usable" data. In a traditional stack:

  1. A connector brings rows from an external source (Stripe, Postgres, S3).

  2. Rows land in a staging schema in the warehouse.

  3. Some dbt job transforms them into analytical tables.

  4. A BI tool reads them.

Four steps, four teams, four potential sources of inconsistency. If an analyst opens Looker while a dbt job is running, they can see a world that's momentarily incoherent. In the dashboard era, you tolerate that with a magic phrase: "refresh in five minutes and check again." In the agent era, you can't.

An agent querying an incoherent world builds incoherent arguments. And agent reasoning has the unpleasant property that errors don't stay put — they amplify. A wrong number in turn 1 becomes a wrong recommendation in turn 3.

The solution isn't to make pipelines faster. The solution is to expose a substrate that, by design, cannot be incoherent.


Calliope's pipeline: five explicit phases

In Calliope, the ingestion pipeline has five named phases. It isn't an implementation detail; it's a contract with the agent.


Capture → Structure → Load to engine → Generate ontology → Generate examples


One by one.


Phase 1 — Capture raw data

A connector (Airbyte, native integrations, S3) pulls raw data from the source and writes it to immutable storage. No transformation. No type casting. No interpretation. If the source emits garbage, the storage holds the identical garbage.
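To make "no transformation" concrete, here's a minimal sketch of the write path. Everything in it (the RAW_ROOT layout, the capture function) is illustrative, not Calliope's actual API; the point is the shape: bytes in, identical bytes on disk, content-addressed so later phases can be audited against them.

    import hashlib
    import pathlib
    from datetime import datetime, timezone

    RAW_ROOT = pathlib.Path("/data/raw")  # hypothetical storage layout

    def capture(source: str, stream: str, payload: bytes) -> pathlib.Path:
        """Write the payload byte-for-byte. Never parse, cast, or clean it."""
        ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        digest = hashlib.sha256(payload).hexdigest()[:12]  # content address, for audits
        path = RAW_ROOT / source / stream / f"{ts}-{digest}.raw"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)  # write once...
        path.chmod(0o444)          # ...then drop write permission: immutable by convention
        return path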

Why it matters: raw data is the auditable fallback. If one day an agent gives an odd answer and the team asks "where did this come from?", raw data is the ground truth you can compare everything else against.


Phase 2 — Structure into analytical format

Raw data is converted to columnar Parquet. Types get inferred, names get normalized, dedup is handled per the configured strategy (in-place or full reload). The output is a partitioned Parquet dataset.
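A sketch of what this phase does, using DuckDB itself as the converter. The paths, the id and created_month columns, and the normalize helper are assumptions for illustration, not Calliope's internals:

    import re
    import duckdb

    def normalize(name: str) -> str:
        """'Created At' -> 'created_at': lower_snake_case every column name."""
        return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")

    con = duckdb.connect()

    # Type inference is delegated to DuckDB's CSV sniffer.
    raw = con.read_csv("/data/raw/stripe/charges/*.raw")
    renamed = raw.project(", ".join(f'"{c}" AS "{normalize(c)}"' for c in raw.columns))

    # Full-reload dedup: keep one row per business key. (The in-place strategy
    # would merge against the existing dataset instead.)
    con.sql("""
        COPY (SELECT DISTINCT ON (id) * FROM renamed)
        TO '/data/parquet/stripe/charges'
        (FORMAT PARQUET, PARTITION_BY (created_month))
    """)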

Why it matters: Parquet is the format that best balances analytical performance with portability. If a customer ever needs to export the substrate to their warehouse, Parquet moves cleanly. If Calliope goes down, the Parquet files are still readable by any tool that speaks Parquet — Spark, DuckDB, pandas, whatever. That matters for a new category: customers won't bet on a platform that locks them into a proprietary format.


Phase 3 — Load to analytics engine

The Parquet files get loaded into an embedded DuckDB instance that acts as the analytical engine. DuckDB was a conscious choice: we wanted a fast columnar database that runs in-process, with state stored in immutable files that are cheap to freeze, and that doesn't require a cluster to start.

Two properties of this engine are essential for agent-first framing (a sketch of both follows the list):

  1. Cheap to freeze. The engine's state lives in immutable files on disk. Duplicating costs I/O, not compute. Freezing the state of the world is a file copy.

  2. Isolated reads. The agent can open a session against that frozen state and read from it throughout a conversation without another process shifting the ground. It's literally reading against a frozen world.
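Here's roughly what both properties look like in code. The paths, the snapshot date, and the charges table are illustrative, carried over from the earlier sketches:

    import pathlib
    import shutil
    import duckdb

    DB = "/data/engine/world.duckdb"
    SNAPSHOTS = pathlib.Path("/data/engine/snapshots")

    # Build (or rebuild) the engine state once per pipeline run.
    con = duckdb.connect(DB)
    con.sql("""
        CREATE OR REPLACE TABLE charges AS
        SELECT * FROM read_parquet('/data/parquet/stripe/charges/**/*.parquet',
                                   hive_partitioning = true)
    """)
    con.close()

    # "Cheap to freeze" is literal: snapshotting the world is one file copy.
    SNAPSHOTS.mkdir(parents=True, exist_ok=True)
    shutil.copy2(DB, SNAPSHOTS / "world-2025-06-01.duckdb")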


Phase 4 — Generate the ontology

This is where the pipeline stops looking like a traditional pipeline. In most stacks, "the semantic layer" is a parallel project someone maintains by hand in another repository. In Calliope, the semantic layer is part of the ingestion pipeline.

Every time data changes materially — new tables, new columns, modified schemas — the pipeline passes the delta to an LLM that regenerates semantic descriptions for new or modified tables and columns. And here's the critical decision: human edits are preserved. If David, the steward, wrote by hand that customer_revenue excludes returns, the regeneration doesn't overwrite that. It only touches what didn't have a human description.
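The preservation rule is simple enough to sketch. The schema below (a description dict with an author field) is invented for illustration; the invariant is the one described above: LLM output never overwrites a human-authored description.

    def merge_descriptions(existing: dict, generated: dict) -> dict:
        """Both dicts map column -> {"text": ..., "author": "human" | "llm"}."""
        merged = dict(existing)
        for column, desc in generated.items():
            current = merged.get(column)
            if current is None or current["author"] == "llm":
                merged[column] = {"text": desc["text"], "author": "llm"}
            # else: David's hand-written note on customer_revenue stays untouched
        return merged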

That fixes the maintenance problem that sinks most semantic layers: nobody wants to spend two hours a week writing descriptions by hand, and nobody wants a cron job to overwrite them when they weren't looking. The ontology maintains itself where an LLM is enough, and stays still where the human knows better.

The next article in this series is entirely about what the ontology does inside the agent-first framing. For now it's enough to say: it's the tool schema the agent reads before every query.


Phase 5 — Generate examples

The last phase is the most agent-centric of them all. Calliope takes question→SQL pairs that were marked as correct, whether validated in earlier human review or confirmed in the latest conversation, and makes them part of the context the agent loads on its next invocation.
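A minimal sketch of the mechanics, with an invented storage format (a JSONL file with a status field) standing in for however the pairs are actually persisted:

    import json

    def load_examples(path: str = "/data/examples/pairs.jsonl", limit: int = 50) -> list:
        """Keep only pairs a human validated or confirmed in conversation."""
        examples = []
        with open(path) as f:
            for line in f:
                ex = json.loads(line)
                if ex["status"] in ("validated", "confirmed"):
                    examples.append(ex)
        return examples[-limit:]  # keep the most recent when context is tight

    def render_context(examples: list) -> str:
        """Format question->SQL pairs as few-shot context for the next invocation."""
        return "\n\n".join(f"-- Q: {ex['question']}\n{ex['sql']}" for ex in examples)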

This is procedural memory. The agent doesn't start each conversation from scratch; it starts with a library of examples that already encode how to answer the organization's typical questions, in the exact terms the organization uses.

And this step is what turns the pipeline into a continuous learning loop. Each human correction becomes an example. Each example improves the next turn. The pipeline isn't a batch process that ends: it's a process that makes the agent better every time it runs.


A frozen world per conversation

With the pipeline in place, every conversation runs against a DuckDB instance that doesn't move while the agent is thinking. Not a live connection to a warehouse that might be mid-rebuild. Not a dashboard that might refresh between turns. An in-process, read-only view of the data that stays still until the conversation ends.
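In DuckDB terms this is one connection flag plus a discipline: one conversation, one snapshot, opened read-only. The session wrapper below is a sketch, with names, paths, and columns carried over from the earlier illustrations:

    import duckdb
    from contextlib import contextmanager

    @contextmanager
    def conversation_session(snapshot_path: str):
        """One conversation = one read-only connection to one immutable snapshot."""
        con = duckdb.connect(snapshot_path, read_only=True)
        try:
            yield con
        finally:
            con.close()

    # Every turn reads the same world; the base in turn 1 is the base in turn 3.
    with conversation_session("/data/engine/snapshots/world-2025-06-01.duckdb") as con:
        base = con.sql("SELECT count(*) FROM charges").fetchone()[0]
        total = con.sql("SELECT sum(amount) FROM charges").fetchone()[0]
        ratio = total / base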

That property — the world doesn't shift mid-conversation — is what keeps multi-turn reasoning internally consistent. A ratio computed in turn 1 is still valid in turn 3. A filter applied at the start still means the same thing five questions later. It sounds small; it's the difference between an agent that can hold an argument and one that contradicts itself by accident.

This is what works today.


What this substrate unlocks

Choosing an engine whose state lives in immutable files wasn't only about in-conversation consistency. Because the state of the world is stored in files that are cheap to copy and retain, the substrate is ready for a set of capabilities we're building toward:

  • Point-in-time reproducibility. Re-running a question against the exact state the agent saw last month and getting the identical answer.

  • Historical audit. Answering "what data did the agent see when it said X in June?" with a verifiable pointer.

  • Semantic rollback. If an ontology regeneration or a rules update breaks behavior, reverting becomes a file-level operation instead of a migration.

These aren't features grafted onto the architecture later — the substrate is built for them. But today, the only guarantee we make is the one we opened with: during a conversation, the world stays still.
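None of this is shipped, and the only guarantee today is in-conversation stability. But the file-level shape makes the roadmap concrete: with an invented snapshot naming scheme like the one below, point-in-time reproducibility reduces to a directory lookup plus a read-only open.

    import pathlib
    import duckdb

    SNAPSHOTS = pathlib.Path("/data/engine/snapshots")  # illustrative: one file per run

    def snapshot_as_of(date: str) -> pathlib.Path:
        """Latest snapshot at or before an ISO date: the world the agent saw then."""
        ceiling = f"world-{date}"
        candidates = sorted(p for p in SNAPSHOTS.glob("world-*.duckdb")
                            if p.stem <= ceiling)
        if not candidates:
            raise FileNotFoundError(f"no snapshot at or before {date}")
        return candidates[-1]

    # "What data did the agent see when it said X in June?"
    con = duckdb.connect(str(snapshot_as_of("2025-06-30")), read_only=True)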


Things we did differently on purpose

There are four substrate design decisions worth making explicit, because they're what breaks first if you copy the architecture without thinking:

  1. Embedded DuckDB instead of a managed warehouse. We accept that this forces us to federate for very large workloads. We accept it because cheap snapshots and in-process reads are what make agent-first framing possible at a reasonable operational cost. If a customer ever needs to scale beyond what DuckDB serves, we federate to their warehouse without changing the agent's interface.

  2. Parquet as the pivot format. We don't use Iceberg or Delta Lake. We're not ruling them out long-term, but early on we wanted a simpler dependency and a clean escape hatch — if a customer wants to leave, the Parquet files are theirs.

  3. The semantic layer is part of the pipeline, not a parallel job. This is countercultural in the dbt world. But in the agent-first framing, semantics is a pipeline artifact, not a side project.

  4. Model and cloud agnostic. DuckDB is embedded. Parquet is an open format. The ontology is generated with any LLM. Nothing in the substrate locks you into a hyperscaler or a model provider. If a new model writes better semantic descriptions tomorrow, or a different cloud offers better pricing, you just switch. In a market that moves every six months, that independence is an architectural decision, not a detail.


Want to see it?

The substrate is the foundation. The interesting part starts when the agent uses it. Request a demo and we'll show you, on real data, live.
