Building an AI agent playground

Jun 3, 202612 min read

I built comal.dev as my capstone for the Overclock Accelerator. It’s an open source playground for composing your own AI agents from a shared toolbox. Pick a model, write a system prompt, attach some tools, start chatting.

A comal is the flat griddle in a Mexican kitchen. This one cooks up agents, not tortillas.

Here’s the idea the whole thing is built on: once tool calling is the primitive, the interesting systems are just tools wired to the right targets. Comal is that pushed about as far as I could take it.

First, what it’s like to use. Then how it works underneath.

A lap around it

Open comal.dev and you’re already in. Anonymous by default, no wall before the first agent. Every account starts with one agent already there: Comal.

Tell Comal what you want. “Build me an agent that summarizes GitHub issues.” It picks a model, writes a system prompt, attaches the GitHub tool, and hands you back something that works. Or skip it and build the agent yourself: pick a model from the picker, each one tagged with a relative cost so you can reach for a cheap one on purpose, write the prompt, check off the tools you want from the list.

Then chat. Markdown, code, and diagrams render as the stream arrives. Drop in a file, paste an image, grab a screenshot. When the agent reaches for a tool you marked sensitive, the turn pauses for a one-click approve or deny.

Like a turn? Save it as an eval in one click. Run the suite. Watch the score trend across versions. Open any conversation’s trace to see every step, every tool call, and what it cost. Diff two versions of the agent and revert if a change made it worse.

That’s the product. The engineering is underneath.

What happens when you hit send

One message kicks off more than a model call. A full turn, end to end:

sequenceDiagram
  actor User
  participant UI as Browser UI
  participant API as /api/chat
  participant Redis as Upstash Redis
  participant DB as Neon Postgres
  participant LLM as OpenRouter
  User->>UI: send message
  UI->>API: POST messages
  API->>Redis: rate limit + budget check
  API->>DB: loadAgent + append user-message event
  opt agent has memory-search attached
    API->>LLM: embed user message
    API->>DB: cosine search (top 5, >=0.4)
    Note right of API: prepend memory block to system prompt
  end
  API->>LLM: streamText
  LLM-->>API: token stream
  API-->>UI: stream UI message parts
  Note over API,DB: after() the response is sent
  API->>DB: persist chat_event rows
  API->>Redis: record spend

Rate limit and budget before anything runs. Memory folded in before the first token. The model streams. Then, only after the user has their reply, it persists the turn and records the spend. Each one is a decision worth a closer look.

Agents you compose at runtime

Building an agent takes about a minute: a model, a prompt, a few tools checked off a list. That’s the whole surface for making one.

Under it, an agent is three rows in a database: the model, the system prompt, and the tools you picked from a static, builtin-only registry. Everything is per-user and private. No templates, no sharing, no marketplace.

Because an agent is only that data, you can export one as a self-contained JSON file: model, prompt, tools, and any sub-agents inlined all the way down, plus its evals.

The tools are fixed at build time. You don’t write new ones from the UI. You compose the ones that exist: web search, GitHub reads, memory, a pile of TMDB and Wikidata lookups. Pick a model per conversation without touching the agent. Every model carries a relative cost label, so the price trade-off is in front of you when you pick.

Then the twist that makes it a playground: sub-agents. Any agent you own can become a tool for another agent. A coordinator delegates to specialists, each with their own model and tools. Comal itself runs on this same machinery, its tools just pointed at your other agents instead of the outside world.

flowchart TB
  loadAgent[loadAgent agentId, userId, depth] --> DB[(Neon Postgres)]
  DB --> Parts[agent row + agent_tool rows + agent_subagent edges]

  Parts --> ToolStep[Resolve tool ids via registry<br/>buildTool]
  Parts --> SubStep[Wrap sub-agent edges as tools]

  ToolStep --> Builtin[Builtin tools<br/>agents / core / evals / github<br/>memory / tmdb / traces / web / wikidata]
  SubStep --> SubAgent[Sub-agent tool<br/>runs via ToolLoopAgent]
  SubAgent -.->|recurse, MAX_DEPTH 2| loadAgent

  Builtin --> Config[AgentConfig<br/>model + system prompt + ToolSet]
  SubAgent --> Config
  Builtin -.->|at call time| Ext[OpenRouter + Tavily + GitHub + TMDB + Wikidata]

One function, loadAgent, is the single place this happens. It reads the agent, resolves the tool ids against the registry, and turns each sub-agent link into a tool that recurses back through loadAgent.

Recursion needs limits, so delegation tops out at three tiers: root, child, grandchild. The step budget tightens as you go down, and the grandchild gets no sub-agents of its own. Without that, a coordinator delegating to a coordinator runs up a bill fast and is impossible to trace.

Agents calling agents you own also means you can draw a cycle: A calls B, B calls A. The fix is the part I’d point a reviewer at. Every write to an agent runs in a transaction that first locks every agent you own, not only the one being edited:

return db.transaction(async (tx) => {
  const ownerAgents = await tx
    .select()
    .from(agent)
    .where(eq(agent.userId, userId))
    .orderBy(agent.id)
    .for("update");
  // read the full sub-agent graph, check for a cycle, then write
});

Locking only the target agent isn’t enough. Two tabs editing two different agents could each pass a cycle check against a graph that doesn’t include the other’s pending edit, then commit a loop between them. Locking the whole set means the cycle check and the write see the same graph.

Tip

Why orderBy(agent.id) Deterministic lock order. Two transactions that grabbed these rows in opposite orders would deadlock, so sorting by id makes every transaction grab them the same way.

The chat log is the only source of truth

Open any conversation and there’s a trace: every step, every tool call, token counts, the cost. The cost dashboard, the expandable sub-agent transcripts, all of it comes from one decision I’m happy with. Nothing is stored as a finished message.

A comal.dev execution trace: per-step timing, tool inputs and outputs, a nested sub-agent, and per-step cost

When the model streams a turn, every part of that stream, each text chunk, each tool call, each tool result, each error, becomes one row in an append-only log. On page load, a projector replays the log into the message timeline you see.

flowchart TB
  Stream[AI SDK streamText<br/>fullStream] -->|after response sent| Mapper[mapStreamPartToEvent]
  Mapper --> Append[appendChatEvent]
  Append --> Events[(chat_event<br/>append-only log)]
  Events -->|on page load| Projector[projectMessages /<br/>projectSubagentTraces]
  Projector --> UIMsgs[UIMessage timeline]

Storing nothing finished sounds like overhead. It buys three features I’d otherwise build by hand.

Execution traces. Every conversation already has a step-by-step record. Timing, tool inputs and outputs, token counts. There’s nothing to log separately, the trace is the log.
Cost. Each turn is priced once when it finishes and written into the same log as microdollars. Nothing recomputes, so a later price change never rewrites what an old turn cost. The cost dashboard reads straight off that one column: spend by model and by conversation, a daily trend, the average per turn, and what a full eval suite run cost, over a 30, 90, or all-time window.
Sub-agent transcripts. A sub-agent’s inner stream writes into the same log, tagged with the parent tool call. On reload it projects into a collapsible transcript, so you can open up a delegation and see what the specialist did.

comal.dev cost dashboard: spend by model and by conversation with a daily trend

One append-only log, three things I didn’t have to invent.

Evals you can trust

A playground for agents is useless if you can’t tell whether a change made the agent better or worse. So evals are first-class. Attach test cases to an agent and score how it responds.

There are five scorers.

contains, exact, and levenshtein (edit distance): plain string matching against the answer text.
llm-judge: asks a model whether the answer is semantically right.
tool-call: grades behavior, not text. It reads the tool-call events out of the run’s trace and checks them against an assertion, must-call, must-not-call, must-call-with-these-args.

tool-call is the one I care about most. It’s how you catch an agent that returned the right-looking answer by guessing instead of by calling the tool you gave it.

Two decisions made evals trustworthy.

First, an eval run goes through the same streamText loop as a real conversation, tagged kind = eval. It exercises the real pipeline, so every run is a full trace you can open and inspect. A run that failed mid-stream is still a trace, so you can see where it broke.

Second, runs are sandboxed. An eval shouldn’t fire off real web searches or write to your memory. So the sandbox swaps every write tool’s action for a stub, while still emitting the tool call, so the trace (and the tool-call scorer) can see that the agent tried. Read tools keep working, so multi-step chains run for real. The agent thinks it saved a memory; nothing was saved.

A per-version trend chart plots the score against each config snapshot and flags any version that scored below the one before it. Regressions are visible, not discovered later.

Per-version eval score trend in comal.dev with a regression flagged

Attach three tools, save, search, delete, and an agent can remember things about you. The pool is account-wide, not per-agent. A fact your research agent saved is visible to your writing agent. There’s a /memories page listing everything with a badge for which agent saved each one, and a per-user cap so it can’t grow without bound.

The /memories page in comal.dev: the account-wide memory pool with a source-agent badge on each saved fact

Search is the part I tuned. When an agent has the search tool attached, I don’t wait for the model to decide to call it. The chat route embeds your latest message up front and prepends the top matches to the system prompt before the first token streams. The facts are already there, and it skips a whole tool-call round trip.

Each fact is a 1536-dimension text-embedding-3-small vector in Postgres, with an HNSW index for cosine search.

Note

Threshold tuning text-embedding-3-small scores lower than the usual 0.75 floor suggests, so a query that should clearly match landed at 0.61. The threshold sits at 0.4.

Hardening in the seams

Week eight of the fellowship was a cold shower about treating these as production systems. Prompt injection, runaway bills, poisoned memory. Getting an agent to do the thing was never the hard part. The hard part is everyone who shows up wanting it to do something else.

The fixes all live outside the model. The biggest concrete threat is prompt injection through memory: a poisoned fact that turns into an instruction the next time any agent reads it. The bullets run in that order, most serious first.

Memory that can’t break out. Those injected facts go in inside a <memory> block, framed as context, not instructions. The framing does most of the work. Before a fact goes in, its closing tag gets stripped too, so it can’t end the block early and smuggle in commands. That last guard is one narrow line, content.replaceAll("</memory>", ""), and it only catches the exact tag, nothing fancier.
Spend budgets. Runaway usage stops at $5 an hour signed in, $1 an hour anonymous, on a sliding window, with request rate limits on top. Anonymous traffic runs on my own key, so it’s free to try for now, and that tighter cap is what keeps it that way. The checks fail open: if the rate limiter is unreachable, a chat goes through rather than the limiter taking the whole app down with it.
Bring your own keys, encrypted. Per-user API keys are AES-256-GCM encrypted at rest. A tool that needs a key you haven’t set is hidden from the model entirely, so the agent never tries to use something it can’t authenticate.
Approval gates. Mark a tool as needing approval and it pauses mid-stream for a one-click approve or deny. Sub-agent tools skip the gate so delegation doesn’t stall.

None of this is clever. It’s what stops a hostile user from hijacking an agent or running up your bill.

Boring on purpose

The stack underneath is dull by design. Next.js 16, React 19, Drizzle on Neon Postgres, Better Auth. An Effect service layer where every write goes through one atomic path. The Vercel AI SDK on top of OpenRouter, so the model picker spans frontier and low-cost models behind one interface. That’s deliberate: keep the surprising part in the agents, not the infrastructure.

An agent that builds agents

You build agents by talking to an agent. Comal, the system agent every account starts with, holds the agent-management tools: create, update, diff versions, revert, run evals, read traces. So Comal builds and iterates on your other agents through chat.

“Build me an agent that summarizes GitHub issues.” “Write an eval for it.” “It regressed, what changed?” “Revert it.” The same tool-calling loop that powers any agent here, pointed at the agents themselves.

flowchart LR
  User([User chat]) --> Comal["Comal<br/>system agent"]
  Comal -->|tool call| Create["create_agent"]
  Comal -->|tool call| Update["update_agent"]
  Comal -->|tool call| Eval["run_evals"]
  Comal -->|tool call| Trace["read_traces"]
  Create --> Agents[("Your agents<br/>model + prompt + tools")]
  Update --> Agents
  Eval --> Agents
  Trace --> Agents

That’s the whole idea, turned on itself. Tool calling is the primitive. Comal is just one more agent, its tools wired to the most interesting target there is: your other agents.

Play with it

comal.dev is live and open source. Start anonymous, sign in with GitHub if you want your agents to follow you.

The tool registry is fixed at build time, so the most useful thing you can do is add the tool you wish was there and open a PR. Or stay in the playground: build a coordinator that delegates to two specialists, then write a tool-call eval that catches it guessing instead of calling the tool. It’s alpha, so break it and file an issue.

Questions or feedback?Send me an email.

Last updated onJun 3, 2026

Back to blog