Harbor compatibility

Seaport speaks Harbor’s task format. If you already have Harbor tasks and datasets, they run on Seaport unchanged. There is no migration step, no rewrite, and no new config to learn.

The benchmark suite checks this directly: it runs the exact same command through both tools and compares the results.

seaport run -p path/to/task -a oracle
harbor  run -p path/to/task -a oracle

The compatible surface

Everything you use to specify and run a task carries over.

Surface	What carries over
Task layout	The `instruction.md`, `task.toml`, `environment/`, `solution/`, and `tests/` folder structure
`task.toml`	The `[task]`, `[agent]`, `[verifier]`, and `[environment]` sections
Multi-step tasks	`[[steps]]` with the `steps/<name>/` layout and `min_reward` gating
Filesystem contract	The `/app` workspace and `/logs/verifier` output directory, with the task mounted read-only
Verifier contract	A reward written to `reward.txt` or `reward.json`, as a scalar or as named scores
Artifacts	The `artifacts` array and a separate clean-room verifier seeded from declared artifacts
Healthcheck	`[environment.healthcheck]` with Docker `HEALTHCHECK` semantics
Stats	A Harbor-compatible `stats` block with per-eval `metrics`, `pass_at_k`, `reward_stats`, and `exception_stats`
CLI verbs	`run -p <path>`, `run -d <dataset>`, and agent selection with `-a`
Datasets	The same dataset names, resolved locally or from a registry
Agents	The `oracle` and `nop` baselines, plus external commands

Tasks

A task is the same directory it always was. See Writing tasks for the full layout. The `[environment]` section still names a `docker_image`, sets a `network_mode`, and bounds the build with `build_timeout_sec`. The `[agent]` and `[verifier]` sections still carry their `timeout_sec`.
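
As a rough illustration of the fields named above, the sketch below writes a minimal `task.toml` to a path of your choosing. The function name `write_sample_task_toml`, every value shown, and the choice of plain integers for the timeouts are assumptions made for the example; only the section and key names come from this page. The `[task]` section is left out because its keys are not described here.

```shell
# Hypothetical helper: write an illustrative task.toml to the path given as $1.
# All values are made up; only the section and key names appear in the docs.
write_sample_task_toml() {
    cat > "$1" <<'EOF'
[environment]
docker_image = "ubuntu:24.04"   # illustrative image
network_mode = "no-network"     # or "public"
build_timeout_sec = 600         # illustrative bound on the build

[agent]
timeout_sec = 900               # illustrative

[verifier]
timeout_sec = 300               # illustrative
EOF
}
```

Calling `write_sample_task_toml ./task.toml` produces a file you can hold up against an existing Harbor task. Treat it as a starting point, not a reference.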

The container contract

Inside the sandbox, a task sees the same filesystem layout it would under Harbor:

`/app` is the writable workspace.
`/logs/verifier` is where the verifier writes its output and reward (`reward.txt` or `reward.json`).
The task directory itself is mounted read-only.

Scripts that hardcode those paths work as written.
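
To make the contract concrete, here is a minimal sketch of a verifier step that scores the workspace and writes a scalar reward. The helper names `write_reward` and `score_workspace`, the `REWARD_DIR` override, and the `output.txt` file are all hypothetical, and the single-number format of `reward.txt` is an assumption rather than something this page specifies.

```shell
# Hypothetical helper: write a scalar reward where the verifier contract expects it.
# REWARD_DIR is an illustrative override so the sketch can be tried outside a
# sandbox; inside a trial the directory is /logs/verifier.
write_reward() {
    reward_dir="${REWARD_DIR:-/logs/verifier}"
    mkdir -p "$reward_dir" || return 1
    # Assumption: reward.txt holds a single number on one line.
    printf '%s\n' "$1" > "$reward_dir/reward.txt"
}

# Example check: reward 1 if the agent left output.txt in the workspace, else 0.
# The workspace defaults to /app; output.txt is a made-up file name.
score_workspace() {
    workspace="${1:-/app}"
    if [ -f "$workspace/output.txt" ]; then
        write_reward 1
    else
        write_reward 0
    fi
}
```

Because both paths default to the contract locations, the same functions would behave identically under either tool, which is the point of the shared layout.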

Like Harbor, Seaport runs the whole trial in one long-lived container: the agent phase and then the verifier phase run inside it via `docker exec`. Whatever the agent leaves behind persists into the verifier, including installed packages, `$HOME`, and files anywhere on the filesystem, not just `/app`. A separate clean-room verifier is opt-in via a verifier environment, matching Harbor’s behavior of seeding a fresh verifier container from declared artifacts.

Datasets and agents

Run a registered dataset by name with `run -d`, the same way you would with Harbor, and Seaport resolves it locally or from a registry. The built-in `oracle` agent runs a task’s `solution/solve.sh`, and `nop` runs only the verifier, so the baseline checks you rely on behave the same.

Design differences

Same inputs and outputs, different engine underneath.

Area	Harbor	Seaport
Distribution	Installed as a Python package	A single self-contained Rust binary, installed in one line
Setup	Resolves and builds environments during the run	Same: each trial builds or pulls its environment on demand, with identical builds deduplicated so the first trial starts immediately
Container	Docker-backed execution	Docker-backed, with a writable root filesystem and default Linux capabilities so tasks can install packages and write anywhere, plus CPU, memory, PID, and wall-clock limits
Retry matching	Retries by exception type	Retries by error-message substring (`--retry-include` / `--retry-exclude`)
Concurrency	One trial per core	About `host_cpus / 3`, clamped to 2–16; override with `-n`
Networking	Allowlist networking	The local Docker backend has no allowlist; `network_mode` is `no-network` or `public`
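
The Seaport concurrency default in the table can be sketched as a small function. The table only says "about" `host_cpus / 3`, so the integer division used here is an assumption, and `default_trials` is a hypothetical name rather than anything in Seaport itself.

```shell
# Hypothetical sketch of the default concurrency rule: host_cpus / 3, clamped to 2-16.
# Assumption: the division rounds down; the docs only say "about".
default_trials() {
    n=$(( $1 / 3 ))
    if [ "$n" -lt 2 ]; then
        n=2
    elif [ "$n" -gt 16 ]; then
        n=16
    fi
    echo "$n"
}
```

Under these assumptions a 12-core host would run 4 trials at once, a 4-core host would be raised to the floor of 2, and a 96-core host would be capped at 16. Passing `-n` replaces the computed value.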

Seaport adds a few things on top of the shared format:

Deterministic core. Stable task ordering and run identity, so results stay comparable across runs.
Structured output. Every job, trial, trajectory, and reward lands as plain JSON. See Job output.
A local fast path. The `unsafe-local` backend runs scripts directly for trusted debugging, with no container overhead.

Retries

Both tools retry trials that fail for infrastructure reasons. `--max-retries <n>` (default 0) retries an errored trial up to `n` times, discarding each failed attempt. Where Harbor matches on exception type, Seaport matches on the error message: `--retry-exclude <substr>` never retries errors containing the substring (the defaults already cover pointless retries such as timeouts and reward-file errors), and `--retry-include <substr>`, when set, retries only matching errors.
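
The matching rule reads naturally as a small decision function. The sketch below is not Seaport's code: `should_retry` is a hypothetical name, it handles one include and one exclude substring (an empty string meaning "not set"), and it ignores the built-in default excludes.

```shell
# should_retry MESSAGE INCLUDE EXCLUDE
# Returns 0 (retry) or 1 (do not retry) following the rule described above.
should_retry() {
    msg=$1
    include=$2
    exclude=$3
    # An exclude match always wins: such errors are never retried.
    if [ -n "$exclude" ]; then
        case "$msg" in
            *"$exclude"*) return 1 ;;
        esac
    fi
    # When an include substring is set, only matching errors are retried.
    if [ -n "$include" ]; then
        case "$msg" in
            *"$include"*) return 0 ;;
            *) return 1 ;;
        esac
    fi
    # No include filter: any error that was not excluded is retried.
    return 0
}

# Corresponding CLI shape, using only flags named on this page
# (the substring value is illustrative):
#   seaport run -p path/to/task -a oracle --max-retries 2 --retry-exclude "timeout"
```

Matching on substrings keeps the filter independent of how errors are typed internally, at the cost of depending on the wording of error messages.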

Resource parity

By default Seaport gives each trial a fair share of host CPUs rather than the task’s declared `cpus`, since trials are heavy and often emulated. Pass `--strict-resources` to enforce the task’s exact `cpus` and `memory_mb`, matching Harbor.

Not at parity yet

Seaport has focused on making the execution path correct and fast first, so a few of Harbor’s surfaces are not covered yet. To be straight about the edges:

Results viewer. `seaport view` is a placeholder today. Every job still lands as plain JSON on disk, so nothing is lost (see Job output), but there is no built-in browser for those results yet. The runner came first; the viewer is next.
Full registry surface. Seaport resolves datasets and tasks from a registry JSON file and from git-backed sources, which covers the common cases. It does not yet implement Harbor’s complete registry feature set.
Native agent integrations. Seaport ships the `oracle` and `nop` baselines plus templates for `codex` and `claude-code`. Any other agent runs through `--agent-command` rather than a built-in adapter, so deeper per-agent integrations are not there yet.

If your workflow depends on one of these, track the changelog.