Know your AI agent actually works.
A CLI-first framework for sandboxed agent evaluation. Fully Harbor compatible, just faster.
Harbor compatible · Powered by Rust · Open source
curl -fsSL https://seaport.run/install.sh | bash
Why Seaport
Everything Harbor does, faster
The same tasks and datasets, on a rebuilt performance core.
Drop-in Harbor compatible
Same task format, same datasets, same scripts. Point Seaport at your existing Harbor tasks and they run unchanged. No migration, no rewrite.
Fast by default
Task environments are built and pulled once, then cached and reused. Identical images are never pulled twice, so warm runs start almost instantly.
Prepares, then runs
A preflight phase resolves, pulls, and builds every environment up front and in parallel, so the run itself is spent on agents, not setup.
Sandboxed by default
Agents run fully isolated, so untrusted code can't touch your machine. Test boldly without worrying about what your agent might do.
Works with any agent
Claude Code, Codex, or your own homegrown agent. If it runs in a terminal, Seaport can evaluate it. Swap agents with a single flag.
Numbers you can trust
Every run is deterministic and lands as clean JSON. Track pass rates over time, compare agents head-to-head, and catch regressions early.
Workflow
From idea to score in four steps
No new framework to learn. If you've written a shell script, you already know how to use Seaport.
- 01 Describe the task
  Write what the agent should do and a quick test that checks if it nailed it. That's the whole setup.
- 02 Choose your agent
  Plug in Claude Code, Codex, or your own. Want a baseline? Seaport can run the known-good solution to sanity-check the task itself.
- 03 Let it run
  Seaport spins up a clean, isolated environment, hands the task to your agent, and grades the result. Totally hands-off.
- 04 Read the score
  Get a clear pass rate plus a full transcript of what your agent tried, so you know not just if it failed, but why.
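As a rough sketch of the "quick test" from step 01, here is what a tests/test.sh could look like for a task that asks the agent to create a file. It writes 1 or 0 to reward.txt. The file name hello.txt, its expected content, and the location of reward.txt are assumptions made for illustration, not documented Seaport conventions.

```shell
#!/usr/bin/env bash
# Sketch of a tests/test.sh that grades a "create a file" task.
# ASSUMPTIONS (hypothetical, not taken from Seaport's docs):
#   - the agent is expected to create hello.txt containing "Hello, world!"
#   - the reward is written to reward.txt in the current directory
set -u

EXPECTED_FILE="${EXPECTED_FILE:-hello.txt}"
REWARD_FILE="${REWARD_FILE:-reward.txt}"

# Print 1 if the file exists with the expected content, otherwise 0.
grade() {
  local file="$1"
  if [ -f "$file" ] && [ "$(cat "$file")" = "Hello, world!" ]; then
    echo 1
  else
    echo 0
  fi
}

# Record the verdict: 1 for pass, 0 for fail.
grade "$EXPECTED_FILE" > "$REWARD_FILE"
```

Keeping the check in a small function makes it easy to try against sample files by hand before wiring it into a task.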
hello-world/
├── instruction.md        # the prompt given to the agent
├── task.toml             # metadata, timeouts, environment
├── environment/
│   └── Dockerfile
├── solution/
│   └── solve.sh          # used by the oracle agent
└── tests/
    └── test.sh           # writes reward.txt: 1 or 0

[task]
name = "acme/hello-world"
description = "Create the expected output file."

[environment]
docker_image = "ubuntu:24.04"
network_mode = "no-network"
build_timeout_sec = 600.0

Safe to run anything
Let agents loose without losing sleep
You're handing real code to an AI and letting it run. Seaport keeps every run sealed off from your machine, so a misbehaving agent can't do any damage. You just see the result and move on.
- Every run starts from a clean, throwaway environment
- Agents can't reach your files, secrets, or network
- Strict time and resource limits, so nothing runs away
- Locked down by default, no config required
jobs/seaport-<run-id>/
├── config.json
├── result.json              # pass/fail counts, avg reward
└── <task-name>/
    ├── config.json
    ├── result.json
    ├── agent/
    │   └── trajectory.json  # command, exit, stdout/err
    └── verifier/
        ├── reward.txt
        ├── test-stdout.txt
        └── test-stderr.txt
Results
See exactly what happened
Get a clean pass rate at a glance, and a full record of every attempt when you want to dig in. It's all plain JSON, so you can drop it into a dashboard, a spreadsheet, or your CI pipeline and watch your agent improve.
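For a quick look without any tooling, a pass rate can be recomputed from the per-task reward files in the run directory layout shown earlier. This is a minimal sketch: the schema of result.json is not described here, so it reads verifier/reward.txt instead, and it assumes a reward of exactly 1 means pass. The function name pass_rate is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count passes in one run directory using the layout
#   <run-dir>/<task-name>/verifier/reward.txt
# ASSUMPTION: a reward of exactly "1" is a pass; anything else is a fail.
set -u

# Print "<passed>/<total>" for the given run directory.
pass_rate() {
  local run_dir="$1" total=0 passed=0 reward_file reward
  for reward_file in "$run_dir"/*/verifier/reward.txt; do
    [ -f "$reward_file" ] || continue   # skip when the glob matched nothing
    total=$((total + 1))
    # Strip whitespace so "1\n" compares equal to "1".
    reward="$(tr -d '[:space:]' < "$reward_file")"
    if [ "$reward" = "1" ]; then
      passed=$((passed + 1))
    fi
  done
  echo "$passed/$total"
}

# When run as a script, the first argument is the run directory.
if [ "$#" -ge 1 ]; then
  pass_rate "$1"
fi
```

Pointing it at a jobs/seaport-<run-id> directory prints a passed/total pair that can be piped into a CI check or logged over time.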
Get started
Your first eval, in two minutes
One line to install, one line to run.
curl -fsSL https://seaport.run/install.sh | bash