I built comal.dev as my capstone for the Overclock Accelerator. It’s an open source playground for composing your own AI agents from a shared toolbox. Pick a model, write a system prompt, attach some tools, start chatting.
A comal is the flat griddle in a Mexican kitchen. This one cooks up agents, not tortillas.
Here’s the idea the whole thing is built on: once tool calling is the primitive, the interesting systems are just tools wired to the right targets. Comal is that pushed about as far as I could take it.
First, what it’s like to use. Then how it works underneath.
A lap around it
Open comal.dev and you’re already in. Anonymous by default, no wall before the first agent. Every account starts with one agent already there: Comal.
Tell Comal what you want. “Build me an agent that summarizes GitHub issues.” It picks a model, writes a system prompt, attaches the GitHub tool, and hands you back something that works. Or skip it and build the agent yourself: pick a model from the picker, each one tagged with a relative cost so you can reach for a cheap one on purpose, write the prompt, check off the tools you want from the list.
Then chat. Markdown, code, and diagrams render as the stream arrives. Drop in a file, paste an image, grab a screenshot. When the agent reaches for a tool you marked sensitive, the turn pauses for a one-click approve or deny.
Like a turn? Save it as an eval in one click. Run the suite. Watch the score trend across versions. Open any conversation’s trace to see every step, every tool call, and what it cost. Diff two versions of the agent and revert if a change made it worse.
That’s the product. The engineering is underneath.
What happens when you hit send
One message kicks off more than a model call. A full turn, end to end:
sequenceDiagram
actor User
participant UI as Browser UI
participant API as /api/chat
participant Redis as Upstash Redis
participant DB as Neon Postgres
participant LLM as OpenRouter
User->>UI: send message
UI->>API: POST messages
API->>Redis: rate limit + budget check
API->>DB: loadAgent + append user-message event
opt agent has memory-search attached
API->>LLM: embed user message
API->>DB: cosine search (top 5, >=0.4)
Note right of API: prepend memory block to system prompt
end
API->>LLM: streamText
LLM-->>API: token stream
API-->>UI: stream UI message parts
Note over API,DB: after() the response is sent
API->>DB: persist chat_event rows
API->>Redis: record spend
Rate limit and budget before anything runs. Memory folded in before the first token. The model streams. Then, only after the user has their reply, it persists the turn and records the spend. Each one is a decision worth a closer look.
Agents you compose at runtime
Building an agent takes about a minute: a model, a prompt, a few tools checked off a list. That’s the whole surface for making one.
Under it, an agent is three rows in a database: the model, the system prompt, and the tools you picked from a static, builtin-only registry. Everything is per-user and private. No templates, no sharing, no marketplace.
Because an agent is only that data, you can export one as a self-contained JSON file: model, prompt, tools, and any sub-agents inlined all the way down, plus its evals.
The tools are fixed at build time. You don’t write new ones from the UI. You compose the ones that exist: web search, GitHub reads, memory, a pile of TMDB and Wikidata lookups. Pick a model per conversation without touching the agent. Every model carries a relative cost label, so the price trade-off is in front of you when you pick.
Then the twist that makes it a playground: sub-agents. Any agent you own can become a tool for another agent. A coordinator delegates to specialists, each with their own model and tools. Comal itself runs on this same machinery, its tools just pointed at your other agents instead of the outside world.
flowchart TB loadAgent[loadAgent agentId, userId, depth] --> DB[(Neon Postgres)] DB --> Parts[agent row + agent_tool rows + agent_subagent edges] Parts --> ToolStep[Resolve tool ids via registry<br/>buildTool] Parts --> SubStep[Wrap sub-agent edges as tools] ToolStep --> Builtin[Builtin tools<br/>agents / core / evals / github<br/>memory / tmdb / traces / web / wikidata] SubStep --> SubAgent[Sub-agent tool<br/>runs via ToolLoopAgent] SubAgent -.->|recurse, MAX_DEPTH 2| loadAgent Builtin --> Config[AgentConfig<br/>model + system prompt + ToolSet] SubAgent --> Config Builtin -.->|at call time| Ext[OpenRouter + Tavily + GitHub + TMDB + Wikidata]
One function, loadAgent, is the single place this happens. It reads the agent, resolves the tool ids against the registry, and turns each sub-agent link into a tool that recurses back through loadAgent.
Recursion needs limits, so delegation tops out at three tiers: root, child, grandchild. The step budget tightens as you go down, and the grandchild gets no sub-agents of its own. Without that, a coordinator delegating to a coordinator runs up a bill fast and is impossible to trace.
Agents calling agents you own also means you can draw a cycle: A calls B, B calls A. The fix is the part I’d point a reviewer at. Every write to an agent runs in a transaction that first locks every agent you own, not only the one being edited:
return db.transaction(async (tx) => { const ownerAgents = await tx .select() .from(agent) .where(eq(agent.userId, userId)) .orderBy(agent.id) .for("update"); // read the full sub-agent graph, check for a cycle, then write});Locking only the target agent isn’t enough. Two tabs editing two different agents could each pass a cycle check against a graph that doesn’t include the other’s pending edit, then commit a loop between them. Locking the whole set means the cycle check and the write see the same graph.
Why orderBy(agent.id)
Deterministic lock order. Two transactions that grabbed these rows in opposite orders would deadlock, so sorting by id makes every transaction grab them the same way.
The chat log is the only source of truth
Open any conversation and there’s a trace: every step, every tool call, token counts, the cost. The cost dashboard, the expandable sub-agent transcripts, all of it comes from one decision I’m happy with. Nothing is stored as a finished message.
When the model streams a turn, every part of that stream, each text chunk, each tool call, each tool result, each error, becomes one row in an append-only log. On page load, a projector replays the log into the message timeline you see.
flowchart TB Stream[AI SDK streamText<br/>fullStream] -->|after response sent| Mapper[mapStreamPartToEvent] Mapper --> Append[appendChatEvent] Append --> Events[(chat_event<br/>append-only log)] Events -->|on page load| Projector[projectMessages /<br/>projectSubagentTraces] Projector --> UIMsgs[UIMessage timeline]
Storing nothing finished sounds like overhead. It buys three features I’d otherwise build by hand.
- Execution traces. Every conversation already has a step-by-step record. Timing, tool inputs and outputs, token counts. There’s nothing to log separately, the trace is the log.
- Cost. Each turn is priced once when it finishes and written into the same log as microdollars. Nothing recomputes, so a later price change never rewrites what an old turn cost. The cost dashboard reads straight off that one column: spend by model and by conversation, a daily trend, the average per turn, and what a full eval suite run cost, over a 30, 90, or all-time window.
- Sub-agent transcripts. A sub-agent’s inner stream writes into the same log, tagged with the parent tool call. On reload it projects into a collapsible transcript, so you can open up a delegation and see what the specialist did.
One append-only log, three things I didn’t have to invent.
Evals you can trust
A playground for agents is useless if you can’t tell whether a change made the agent better or worse. So evals are first-class. Attach test cases to an agent and score how it responds.
There are five scorers.
contains,exact, andlevenshtein(edit distance): plain string matching against the answer text.llm-judge: asks a model whether the answer is semantically right.tool-call: grades behavior, not text. It reads the tool-call events out of the run’s trace and checks them against an assertion, must-call, must-not-call, must-call-with-these-args.
tool-call is the one I care about most. It’s how you catch an agent that returned the right-looking answer by guessing instead of by calling the tool you gave it.
Two decisions made evals trustworthy.
First, an eval run goes through the same streamText loop as a real conversation, tagged kind = eval. It exercises the real pipeline, so every run is a full trace you can open and inspect. A run that failed mid-stream is still a trace, so you can see where it broke.
Second, runs are sandboxed. An eval shouldn’t fire off real web searches or write to your memory. So the sandbox swaps every write tool’s action for a stub, while still emitting the tool call, so the trace (and the tool-call scorer) can see that the agent tried. Read tools keep working, so multi-step chains run for real. The agent thinks it saved a memory; nothing was saved.
A per-version trend chart plots the score against each config snapshot and flags any version that scored below the one before it. Regressions are visible, not discovered later.
Memory the agents share
Attach three tools, save, search, delete, and an agent can remember things about you. The pool is account-wide, not per-agent. A fact your research agent saved is visible to your writing agent. There’s a /memories page listing everything with a badge for which agent saved each one, and a per-user cap so it can’t grow without bound.
Search is the part I tuned. When an agent has the search tool attached, I don’t wait for the model to decide to call it. The chat route embeds your latest message up front and prepends the top matches to the system prompt before the first token streams. The facts are already there, and it skips a whole tool-call round trip.
Each fact is a 1536-dimension text-embedding-3-small vector in Postgres, with an HNSW index for cosine search.
Threshold tuning
text-embedding-3-small scores lower than the usual 0.75 floor suggests, so a query that should clearly match landed at 0.61. The threshold sits at 0.4.
Hardening in the seams
Week eight of the fellowship was a cold shower about treating these as production systems. Prompt injection, runaway bills, poisoned memory. Getting an agent to do the thing was never the hard part. The hard part is everyone who shows up wanting it to do something else.
The fixes all live outside the model. The biggest concrete threat is prompt injection through memory: a poisoned fact that turns into an instruction the next time any agent reads it. The bullets run in that order, most serious first.
- Memory that can’t break out. Those injected facts go in inside a
<memory>block, framed as context, not instructions. The framing does most of the work. Before a fact goes in, its closing tag gets stripped too, so it can’t end the block early and smuggle in commands. That last guard is one narrow line,content.replaceAll("</memory>", ""), and it only catches the exact tag, nothing fancier. - Spend budgets. Runaway usage stops at $5 an hour signed in, $1 an hour anonymous, on a sliding window, with request rate limits on top. Anonymous traffic runs on my own key, so it’s free to try for now, and that tighter cap is what keeps it that way. The checks fail open: if the rate limiter is unreachable, a chat goes through rather than the limiter taking the whole app down with it.
- Bring your own keys, encrypted. Per-user API keys are AES-256-GCM encrypted at rest. A tool that needs a key you haven’t set is hidden from the model entirely, so the agent never tries to use something it can’t authenticate.
- Approval gates. Mark a tool as needing approval and it pauses mid-stream for a one-click approve or deny. Sub-agent tools skip the gate so delegation doesn’t stall.
None of this is clever. It’s what stops a hostile user from hijacking an agent or running up your bill.
Boring on purpose
The stack underneath is dull by design. Next.js 16, React 19, Drizzle on Neon Postgres, Better Auth. An Effect service layer where every write goes through one atomic path. The Vercel AI SDK on top of OpenRouter, so the model picker spans frontier and low-cost models behind one interface. That’s deliberate: keep the surprising part in the agents, not the infrastructure.
An agent that builds agents
You build agents by talking to an agent. Comal, the system agent every account starts with, holds the agent-management tools: create, update, diff versions, revert, run evals, read traces. So Comal builds and iterates on your other agents through chat.
“Build me an agent that summarizes GitHub issues.” “Write an eval for it.” “It regressed, what changed?” “Revert it.” The same tool-calling loop that powers any agent here, pointed at the agents themselves.
flowchart LR
User([User chat]) --> Comal["Comal<br/>system agent"]
Comal -->|tool call| Create["create_agent"]
Comal -->|tool call| Update["update_agent"]
Comal -->|tool call| Eval["run_evals"]
Comal -->|tool call| Trace["read_traces"]
Create --> Agents[("Your agents<br/>model + prompt + tools")]
Update --> Agents
Eval --> Agents
Trace --> Agents
That’s the whole idea, turned on itself. Tool calling is the primitive. Comal is just one more agent, its tools wired to the most interesting target there is: your other agents.
Play with it
comal.dev is live and open source. Start anonymous, sign in with GitHub if you want your agents to follow you.
The tool registry is fixed at build time, so the most useful thing you can do is add the tool you wish was there and open a PR. Or stay in the playground: build a coordinator that delegates to two specialists, then write a tool-call eval that catches it guessing instead of calling the tool. It’s alpha, so break it and file an issue.