Building the loop: durability, isolation, and why both matter
How we designed the plan→ship loop and why making it durable and isolated was not optional.
By Platform team
When we built the first version of cyql, the hardest design question was not the model — it was the workflow. How do you run a multi-step autonomous task reliably when any step can fail?
The answer we landed on is a durable workflow engine. Every task is broken into steps: plan, dispatch, code, review, ship. The state of each step is persisted before it runs. If an agent crashes mid-task, the run resumes after the last completed step instead of starting over. This sounds simple, but it changes which tasks you can attempt. Long-running refactors, multi-file migrations, and tasks that take twenty minutes all become tractable when failure is not terminal.
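To make the resume behavior concrete, here is a minimal sketch of the checkpointing pattern in Python. It is an illustration under stated assumptions, not the engine itself: the JSON state file, the `run_task` function, and the `handlers` mapping are all hypothetical names, and a real engine would likely persist to a database rather than a local file.

```python
import json
import os
import tempfile

# Step names come from the post; everything else here is illustrative.
STEPS = ["plan", "dispatch", "code", "review", "ship"]


def load_state(path):
    """Return the persisted run state, or a fresh one if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"completed": [], "current": None}


def save_state(path, state):
    """Persist state atomically: write a temp file, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run_task(path, handlers):
    """Run every step that is not yet recorded as completed.

    `handlers` maps a step name to a zero-argument callable (hypothetical API).
    If a handler raises, the exception propagates and the state file still
    records which steps finished, so the next call skips them.
    """
    state = load_state(path)
    for step in STEPS:
        if step in state["completed"]:
            continue  # finished in an earlier attempt
        state["current"] = step
        save_state(path, state)  # persisted before the step runs
        handlers[step]()
        state["completed"].append(step)
        state["current"] = None
        save_state(path, state)
    return state
```

Calling `run_task` again with the same state path after a crash re-runs only the step that was in flight and the steps after it. This assumes each step is safe to re-run, which is one reason a clean environment per attempt matters.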
The isolation model came from a different constraint: trust. Giving an autonomous agent access to your codebase is a significant decision. Our answer is to give each agent as little as possible, for as short a time as possible, inside a container it cannot escape. Every task gets its own pod on a dedicated node pool. Egress is firewalled to an allow-list. The metadata endpoint and private network ranges are blocked outright. When the task finishes, the pod is destroyed.
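The egress rules above can be sketched as a small policy check in Python. This only illustrates the decision logic; in practice the rules would be enforced at the network layer, not in application code. The host names in `ALLOWED_HOSTS` are placeholders, and the blocked ranges are assumptions: the link-local range that contains the common cloud metadata address 169.254.169.254, plus the RFC 1918 private ranges.

```python
import ipaddress

# Hypothetical allow-list; the real list is not given in this post.
ALLOWED_HOSTS = {"git.example.com", "registry.example.com"}

# Assumed block list: link-local (covers 169.254.169.254, the usual
# cloud metadata address) and the RFC 1918 private ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "169.254.0.0/16",
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.168.0.0/16",
    )
]


def egress_allowed(host, resolved_ip):
    """Return True only if the destination passes both checks.

    The block list is checked first, so an allow-listed host name that
    resolves to a blocked address is still denied.
    """
    ip = ipaddress.ip_address(resolved_ip)
    if any(ip in net for net in BLOCKED_NETWORKS):
        return False
    return host in ALLOWED_HOSTS
```

Checking the resolved address before the host name is a deliberate choice in this sketch: it means a DNS record pointing an allowed name at an internal address does not open a path into the private network.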
Workers never hold a long-lived secret. They start with a bootstrap token whose only power is to request scoped credentials from the orchestrator — credentials that expire the moment the task is done. There is no persistent footprint to compromise.
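The exchange described here can be sketched as a tiny in-memory model in Python. The class and method names (`Orchestrator`, `start_task`, `exchange`, `finish_task`) and the scope strings are hypothetical. The time-to-live is an added assumption that acts as an upper bound in case a task never reports completion; the post itself only states that credentials expire when the task is done.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedCredential:
    token: str
    task_id: str
    scope: str
    expires_at: float  # time.monotonic() deadline


class Orchestrator:
    """Illustrative in-memory model of the bootstrap-token exchange."""

    def __init__(self, ttl_seconds=900.0):
        self._ttl = ttl_seconds  # assumed safety bound, not from the post
        self._bootstrap = {}     # bootstrap token -> task id
        self._issued = {}        # credential token -> ScopedCredential

    def start_task(self, task_id):
        """Mint the bootstrap token a worker starts with."""
        token = secrets.token_urlsafe(32)
        self._bootstrap[token] = task_id
        return token

    def exchange(self, bootstrap_token, scope):
        """Trade a bootstrap token for a credential limited to one scope."""
        task_id = self._bootstrap.get(bootstrap_token)
        if task_id is None:
            raise PermissionError("unknown or revoked bootstrap token")
        cred = ScopedCredential(
            token=secrets.token_urlsafe(32),
            task_id=task_id,
            scope=scope,
            expires_at=time.monotonic() + self._ttl,
        )
        self._issued[cred.token] = cred
        return cred

    def is_valid(self, cred_token, scope):
        """Check that a credential exists, matches the scope, and is unexpired."""
        cred = self._issued.get(cred_token)
        return (
            cred is not None
            and cred.scope == scope
            and time.monotonic() < cred.expires_at
        )

    def finish_task(self, task_id):
        """Revoke the bootstrap token and every credential tied to the task."""
        self._bootstrap = {
            t: tid for t, tid in self._bootstrap.items() if tid != task_id
        }
        self._issued = {
            t: c for t, c in self._issued.items() if c.task_id != task_id
        }
```

In this model the bootstrap token cannot touch a repository by itself. It can only ask for a scoped credential, and both stop working once `finish_task` runs.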
Durability and isolation reinforce each other. A durable loop can safely retry failed steps because the pod is clean each time. An isolated pod can be trusted with real repository access because its lifetime is bounded. Together they are what let us tell security teams: this is autonomous, and you can say yes to it.
