Durable Execution
A Beach turn can take seconds, minutes, or — in the case of long-running research — longer. The process running the turn may restart in that window: a deployment, an OOM, a host reboot. Without persistence, the turn is lost and the user is left waiting for a reply that will never arrive.
Beach's DurableExecutor interface is the seam where a third-party durable-execution substrate plugs in. The actor runner calls checkpoint() at every lifecycle moment of a turn; the adapter persists the call somewhere durable. On restart, the consumer queries restore() and decides what to do with the recovered checkpoints — replay, resume, or report.
Beach does not implement durable execution itself. The interface is small and substrate-agnostic: any of Trigger.dev, Temporal, Inngest, Cloudflare Durable Objects, or a hand-rolled SQL-backed adapter satisfies it.
When you need it
Most simple Beach applications run without a DurableExecutor. A turn that finishes in under a second from a chat panel does not benefit from persistence — if the process dies mid-turn, the user retries.
Persistence becomes load-bearing when:
- The turn outlives a deploy. Long-running research, multi-step planning, anything that takes minutes. A deployment in the middle of the turn must not lose the turn.
- Inbound is asynchronous and the response is owed back. An email arrived; the application must reply. The work spans process restarts. The user does not see "your message bounced because we had a deploy"; they see a delayed-but-real reply.
- The application owes a partner an A2A response. Same shape. The peer has a request open against your application; that request must be answered after restarts.
Applications that experience none of these can omit DurableExecutor and run with the existing in-memory behaviour. Beach is forward-compatible with adopting durability later — the wiring is one constructor argument away.
The interface
export interface DurableExecutor {
checkpoint(checkpoint: TurnCheckpoint): Promise<void>;
restore(turnId: string): Promise<TurnCheckpoint[]>;
}
export interface TurnCheckpoint {
turnId: string;
sessionId: string;
phase: string; // one of the registered CheckpointPhase names
state: unknown; // JSON-serialisable; phase-specific shape
timestamp: string; // ISO-8601; strictly monotonic per turn
}
checkpoint() persists one checkpoint and returns when the persistence is durable (the adapter has acknowledged the write). Failures throw, and the session manager surfaces them — durability is a hard dependency when configured. A turn whose started checkpoint cannot be persisted does not run.
restore() returns every checkpoint for a turnId in chronological order. Beach never calls restore() itself; the consumer decides when restoration happens — at startup, on a specific recovery flow, never. The mechanism is exposed; the policy is the consumer's.
The phase taxonomy
TurnCheckpoint.phase is a string drawn from an open registry, CheckpointPhaseRegistry. Beach ships five canonical phases as default registrations:
| Phase | When it fires | State shape |
|---|---|---|
started |
The session manager has opened the turn slot. | { actorId, slotKey, messages, expiresAt } |
llm-complete |
The LLM iteration completed (one trip through the model). | { messages, respond } |
tool-dispatched |
Consumer-fired; Beach itself does not fire this phase. | Consumer-defined. |
tool-received |
A tool result has arrived and been merged into the turn's mailbox. | { message, slotKey, manifestId? } |
settled |
The turn has reached a terminal state (complete / cancelled / timeout / error). |
{ turnState, respond, reason } |
The registry is open. Consumers register additional phases at startup if their application has lifecycle moments worth checkpointing — for example, a custom peer-call-dispatched phase for an application that fires off a long-running A2A peer call mid-turn:
import { CheckpointPhaseRegistry } from '@cool-ai/beach-core';
CheckpointPhaseRegistry.register({
name: 'peer-call-dispatched',
description: 'A long-running A2A peer call has been dispatched and is awaiting reply.',
});
The actor runner only fires the canonical phases. Consumer-registered phases are fired by consumer code (typically inside a tool handler that wants its own checkpointing).
Wiring EventRouter
import { EventRouter, InMemoryDurableExecutor } from '@cool-ai/beach-core';
const durableExecutor = new InMemoryDurableExecutor(); // dev / tests
// or: const durableExecutor = new TriggerDevDurableExecutor({ /* … */ });
const router = new EventRouter({ durableExecutor });
That is the whole wiring. The actor runner fires checkpoint() at every lifecycle moment — started when the turn opens, llm-complete after each LLM iteration, tool-received as tool results land, settled at terminal — and ignores returns. Failures propagate and abort the turn.
Without a durableExecutor, no checkpoint code path executes; the router runs identically to its non-durable shape. Adopting durability later is a one-line change.
The in-memory reference implementation
InMemoryDurableExecutor ships with @cool-ai/beach-core. It stores checkpoints in a Map<turnId, TurnCheckpoint[]> and is suitable for development, tests, and applications that do not need persistence across process restarts. It satisfies the idempotency and chronological-ordering contracts; production-shaped behaviour is identical apart from the persistence-across-restart guarantee.
const executor = new InMemoryDurableExecutor();
// In a test:
await router.routeEvent({ source: 'session', eventType: 'turn_requested', data: { /* … */ } });
const checkpoints = await executor.restore(turnId);
expect(checkpoints[0].phase).toBe('started');
expect(checkpoints[checkpoints.length - 1].phase).toBe('settled');
For production durability, write a thin adapter against the substrate of choice — Trigger.dev's task store, a SQL table, Temporal's workflow state. The interface is small and the contracts are documented; an adapter is typically under 100 lines.
Idempotency contract
checkpoint() MUST be idempotent on the tuple (turnId, phase, timestamp):
- Two calls with identical
(turnId, phase, timestamp)MUST result in one stored checkpoint. The adapter is the dedupe layer; Beach does not deduplicate before calling. - Two checkpoints with identical
(turnId, phase)but DIFFERENTtimestampare valid and represent two distinct phase transitions — the adapter MUST NOT dedupe on(turnId, phase)alone.
The session manager guarantees that timestamps are strictly monotonic per turn — even when two lifecycle events fire in the same wall-clock millisecond, their timestamps differ by at least 1 ms. Collisions on (turnId, phase, timestamp) therefore represent network-layer retries delivering the same call twice, which is exactly what adapter idempotency exists to absorb.
Trigger.dev and Temporal support idempotency keys natively; the SQL pattern is INSERT … ON CONFLICT (turn_id, phase, timestamp) DO NOTHING.
Restoration contract
restore(turnId) returns checkpoints in chronological order (oldest first). Implementations backed by stores that don't preserve write order (some NoSQL stores, eventually-consistent KV) MUST sort by timestamp before returning. An empty array means nothing to restore — either the turn was never checkpointed, or its checkpoints have been pruned by the adapter's retention policy.
What the consumer does with the restored checkpoints is the consumer's call. Three common patterns:
- Resume from the last
settledcheckpoint — the turn already reached terminal; replay the outbound side only. - Resume from the last
llm-completecheckpoint — the LLM finished; pick up at the tool-dispatch step. - Replay from
started— the turn was interrupted before any meaningful progress; start over.
Beach is intentionally agnostic about which pattern fits an application. The mechanism is exposed; the recovery semantics are the consumer's responsibility.
Cancellation interaction
cancelTurn is async (Promise<void>); the actor runner awaits the durable settled checkpoint before tearing down the turn slot, so the durability guarantee survives cancellation. The abort signal still fires synchronously before the await — tool handlers wired to ctx.signal observe cancellation immediately; only the settled checkpoint persistence and the onTurnCancelled callback fire after the await resolves.
If a process restarts mid-cancellation, the turn's last reachable checkpoint depends on how far the cancellation got: started if the cancel arrived before any meaningful work; llm-complete if the LLM iteration finished but the turn never reached settled; settled if the cancel completed before the restart. The consumer's restoration logic decides what to do in each case.
Pitfalls
Treating durability as advisory. When a DurableExecutor is configured, persistence is a hard dependency: a failed checkpoint() aborts the turn before the participant runs. This is intentional — the alternative ("we tried to persist; it failed; carry on") would silently lose turns on durability-substrate degradation. If your application can tolerate not persisting, omit the DurableExecutor entirely; do not configure one and ignore failures.
Storing live runtime machinery in state. The state field of TurnCheckpoint MUST be JSON-serialisable end-to-end. Adapter authors and consumers writing custom phases MUST NOT include participant instances, AbortController references, internal Map instances, function references, or anything else that does not round-trip through JSON. The session manager only constructs JSON-safe payloads for its own canonical phases; consumer-registered phases inherit the same constraint.
Dedupe on (turnId, phase) alone. This is the contract violation that breaks correctness on legitimate same-phase-multiple-times patterns — the canonical case is tool-received, which fires once per tool result. An adapter that dedupes on (turnId, phase) collapses these into one checkpoint and silently loses tool results. Always dedupe on the full tuple (turnId, phase, timestamp).
Related
@cool-ai/beach-coreREADME § Actor and session lifecycle — API reference forEventRouter, includingdurableExecutor.- Manifests — the related primitive for asynchronous coordination during a turn (manifests gate within a turn; durability persists across turn lifecycle).
- Pushing results to the user — when the long-running shape applies, durability is what keeps the eventual reply landing on the right session.