Introduction

Seaport tells you whether your coding agent actually works. You write a task, point Seaport at it, and get back a clear pass or fail with a full record of what happened.

It runs entirely from the command line. Tasks are plain folders, not Rust code, so anyone on your team can write one. By default, every run happens inside an isolated Docker sandbox, so you can hand real code to an AI agent without putting your machine at risk.

The idea in one minute

An evaluation has three pieces:

  1. A task: instructions for the agent and a test that checks the result.
  2. An agent: the thing being evaluated. This can be Claude Code, Codex, your own command, or a built-in baseline.
  3. A score: did the agent’s work pass the test? Seaport runs the test and records a reward of 1 (pass) or 0 (fail).

Seaport ties these together, runs them safely, and writes structured results you can track over time.
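
As a rough mental model only, the three pieces can be sketched in a few lines of Python. Nothing here is Seaport's API: Task, Agent, evaluate, and the answer.txt task are hypothetical names made up for illustration, and a temporary directory stands in for the real sandbox.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    """Hypothetical stand-in for a task: instructions plus a test."""
    instruction: str
    test: Callable[[Path], bool]  # inspects the workspace, returns pass/fail


# Hypothetical agent type: reads the instruction, does its work in the workspace.
Agent = Callable[[str, Path], None]


def evaluate(task: Task, agent: Agent) -> int:
    """Run the agent in a scratch workspace, then score it as 1 (pass) or 0 (fail)."""
    # Assumption: a temp directory is only a placeholder, not a real sandbox.
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        agent(task.instruction, workspace)
        return 1 if task.test(workspace) else 0


def answer_is_42(workspace: Path) -> bool:
    answer = workspace / "answer.txt"
    return answer.exists() and answer.read_text().strip() == "42"


def good_agent(instruction: str, workspace: Path) -> None:
    (workspace / "answer.txt").write_text("42\n")


def idle_agent(instruction: str, workspace: Path) -> None:
    pass  # does nothing, so the test should fail


task = Task("Write 42 to answer.txt", answer_is_42)
print(evaluate(task, good_agent), evaluate(task, idle_agent))  # prints: 1 0
```

The point of the sketch is the separation: the task owns the test, the agent only sees the instruction and the workspace, and the score is derived from the test alone.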

Why Seaport

Harbor got the hard part right. Specifying an eval as a plain folder, with an instruction, a Docker environment, and a verifier script, is a genuinely good model. The friction was never the format. It was the harness: cold Docker builds on every run, image pulls repeated per task, environments rebuilt between attempts, and setup work scattered through the run instead of done up front.

Seaport keeps the format and rebuilds the runner. It adopts Harbor’s task layout wholesale, so your existing tasks and datasets run unchanged, then makes the parts you wait on fast:

  • Setup happens once, in parallel. A preflight phase resolves, pulls, and builds every environment up front, so the run itself is spent on agents, not Docker.
  • Nothing is built twice. Environments are cached and reused across trials, attempts, and runs; identical images are pulled once; and workspaces are restored from snapshots rather than rebuilt.
  • The slow tasks start first. Trials expected to take longest are scheduled first and run across a worker pool sized to your machine, so the whole run finishes sooner.
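
To see why starting the slow tasks first can shorten a run, here is a small simulation of a worker pool. It is a sketch of the general longest-first heuristic, not Seaport's scheduler: the makespan function, the duration estimates, and the pool size of 2 are all assumptions chosen for illustration, and how Seaport estimates trial length is not covered here.

```python
import heapq


def makespan(durations, workers):
    """Simulate a worker pool: each task goes to whichever worker frees up first.

    Returns the time at which the last task finishes.
    """
    free_at = [0.0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


# Made-up duration estimates (in minutes) for five trials.
estimates = [2, 3, 2, 8, 1]

in_order = makespan(estimates, workers=2)
longest_first = makespan(sorted(estimates, reverse=True), workers=2)
print(in_order, longest_first)  # prints: 11.0 8.0
```

In this example the 8-minute trial arrives late when tasks run in their given order and ends up as a long tail on one worker. Scheduling it first lets the short trials fill in around it. Longest-first is a heuristic, so it does not guarantee the optimal schedule for every workload.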

Two more things follow from being written in Rust:

  • One binary, one command. Seaport installs as a single static binary with no language runtime or environment to manage.
  • Deterministic and inspectable. Stable task ordering and run identity, plus structured JSON for every job, trial, and reward, so results stay comparable over time.
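
One way to picture stable ordering and run identity is to derive an identifier from the sorted inputs, so the same agent and task set always map to the same ID. This is a sketch under that assumption: run_id, trial_record, and the JSON field names are hypothetical, and Seaport's actual identity scheme and output schema may differ.

```python
import hashlib
import json


def run_id(agent, task_names):
    """Derive a short identifier that ignores the order tasks were listed in."""
    ordered = sorted(task_names)  # stable task ordering
    payload = json.dumps({"agent": agent, "tasks": ordered}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def trial_record(agent, task, reward):
    """Serialize one trial as JSON with sorted keys (field names are hypothetical)."""
    return json.dumps({"agent": agent, "task": task, "reward": reward}, sort_keys=True)


first = run_id("oracle", ["acme/b", "acme/a"])
second = run_id("oracle", ["acme/a", "acme/b"])
print(first == second)  # prints: True
print(trial_record("oracle", "acme/a", 1))
# prints: {"agent": "oracle", "reward": 1, "task": "acme/a"}
```

Because both the ID and the record are built from sorted, canonical input, two runs over the same tasks produce output that can be compared directly.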

The result is the same evaluations you already run, with far less time spent waiting on the harness. See Harbor compatibility and Performance for the details.

A first run

# scaffold a task and run it against its own solution
seaport init --task acme/hello-world
seaport run -p hello-world -a oracle

The second command builds the task environment, runs the oracle solution, grades it, and writes the full record of the run to jobs/.

Already using Harbor?

Seaport speaks Harbor’s task format. Your existing tasks and datasets run unchanged, just faster. See Harbor compatibility.

Good to know

  • Seaport is written in Rust and ships as a single binary.
  • It runs the same tasks and datasets as Harbor, on a faster execution core.
  • The default backend is Docker. A faster unsafe-local backend exists for trusted local debugging.
  • It is open source under the MIT license.