Know your AI agent actually works.

A CLI-first framework for sandboxed agent evaluation. Fully Harbor compatible, just faster.

Harbor compatible · Powered by Rust · Open source

 curl -fsSL https://seaport.run/install | bash

Why Seaport

Everything Harbor does, faster

The same tasks and datasets, on a rebuilt performance core.

Drop-in Harbor compatible

Same task format, same datasets, same scripts. Point Seaport at your existing Harbor tasks and they run unchanged. No migration, no rewrite.

Fast by default

Task environments are built and pulled once, then cached and reused. Identical images are never pulled twice, so warm runs start almost instantly.

Lean by design

Each trial builds or pulls its environment on demand, with identical images deduplicated and cached, then runs in one lean container. No heavy harness between you and the agent.

Sandboxed by default

Each trial runs in its own throwaway Docker container, isolated from your machine and from other trials. Test boldly without cleaning up after every run.

Works with any agent

Claude Code, Codex, or your own homegrown agent. If it runs in a terminal, Seaport can evaluate it. Swap agents with a single flag.

Numbers you can trust

Every run is deterministic and lands as clean JSON. Track pass rates over time, compare agents head-to-head, and catch regressions early.

Workflow

From idea to score in four steps

No new framework to learn. If you've written a shell script, you already know how to use Seaport.

01

Describe the task

Write what the agent should do and a quick test that checks if it nailed it. That's the whole setup.

02

Choose your agent

Plug in Claude Code, Codex, or your own. Want a baseline? Seaport can run the known-good solution to sanity-check the task itself.

03

Let it run

Seaport spins up a clean, isolated environment, hands the task to your agent, and grades the result. Totally hands-off.

04

Read the score

Get a clear pass rate plus a full transcript of what your agent tried, so you know not just whether it failed, but why.

hello-world/

 hello-world/
├── instruction.md # the prompt given to the agent
├── task.toml # metadata, timeouts, environment
├── environment/
│   └── Dockerfile
├── solution/
│   └── solve.sh # used by the oracle agent
└── tests/
    └── test.sh # writes reward.txt: 1 or 0

task.toml

 [task]
name = "acme/hello-world"
description = "Create the expected output file."

[environment]
docker_image = "ubuntu:24.04"
network_mode = "no-network"
build_timeout_sec = 600.0

Safe to run anything

Let agents loose without losing sleep

You're handing real code to an AI and letting it run. Seaport keeps every run sealed off from your machine, so a misbehaving agent can't do any damage. You just see the result and move on.

Every trial runs in its own clean, throwaway container
Isolated from your machine and from other trials
Network access is off by default; opt in per task
Strict time and resource limits, so nothing runs away
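
To make those guarantees concrete, here is roughly what that kind of isolation looks like in plain Docker terms. This is an illustrative sketch, not Seaport's actual invocation: `sandbox_command` and its default limits are assumptions, though the `docker run` flags themselves are standard.

```python
import shlex


def sandbox_command(
    image: str,
    script: str,
    *,
    network: bool = False,
    memory: str = "2g",
    cpus: str = "2",
    timeout_sec: int = 600,
) -> list[str]:
    """Build a `docker run` argv for one throwaway, resource-limited trial."""
    argv = ["docker", "run", "--rm"]  # --rm: remove the container when it exits
    if not network:
        argv += ["--network", "none"]  # no network unless the task opts in
    # Cap memory, CPU, and process count so a runaway agent stays contained.
    argv += ["--memory", memory, "--cpus", cpus, "--pids-limit", "256"]
    # coreutils `timeout` inside the container enforces the wall-clock limit.
    argv += [image, "timeout", str(timeout_sec), "bash", "-c", script]
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so no Docker daemon is needed.
    print(shlex.join(sandbox_command("ubuntu:24.04", "echo hello > /tmp/out.txt")))
```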

jobs/seaport-<run-id>/

 jobs/seaport-<run-id>/
├── config.json
├── result.json # pass/fail counts, avg reward
└── <task-name>/
    ├── config.json
    ├── result.json
    ├── agent/
    │   └── trajectory.json # command, exit, stdout/err
    └── verifier/
        ├── reward.txt
        ├── test-stdout.txt
        └── test-stderr.txt
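
The layout above is simple enough to post-process yourself. As a minimal sketch, the helpers below recompute a pass rate straight from the per-task reward.txt files. `collect_rewards` and `pass_rate` are hypothetical names, and the code assumes each task sits exactly one directory below the run folder and that reward.txt holds a single number (1 or 0).

```python
from pathlib import Path


def collect_rewards(run_dir: Path) -> dict[str, float]:
    """Map each task directory name to the reward its verifier wrote."""
    rewards = {}
    # Assumption: tasks live one level below the run directory.
    for reward_file in sorted(run_dir.glob("*/verifier/reward.txt")):
        task_name = reward_file.parent.parent.name
        rewards[task_name] = float(reward_file.read_text().strip())
    return rewards


def pass_rate(rewards: dict[str, float]) -> float:
    """Fraction of tasks with a reward of 1; 0.0 when there are no tasks."""
    if not rewards:
        return 0.0
    passed = sum(1 for reward in rewards.values() if reward >= 1.0)
    return passed / len(rewards)
```

Point `collect_rewards` at a run folder such as `Path("jobs") / "seaport-<run-id>"` and pass the result to `pass_rate`.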

Results

See exactly what happened

Get a clean pass rate at a glance, and a full record of every attempt when you want to dig in. It's all plain JSON, so you can drop it into a dashboard, a spreadsheet, or your CI pipeline and watch your agent improve.
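
For CI, one simple pattern is to compare a candidate run against a baseline and fail the build on any regression. The sketch below works on plain task-to-reward mappings, however you load them. `find_regressions`, `gate`, and the 0.8 threshold are illustrative assumptions, not Seaport features.

```python
def find_regressions(
    baseline: dict[str, float], candidate: dict[str, float]
) -> list[str]:
    """Tasks that passed in the baseline run but not in the candidate run."""
    return sorted(
        task
        for task, reward in baseline.items()
        # A task missing from the candidate run counts as a failure.
        if reward >= 1.0 and candidate.get(task, 0.0) < 1.0
    )


def gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    min_pass_rate: float = 0.8,
) -> int:
    """Return a process exit code: 0 if the candidate is acceptable, else 1."""
    passed = sum(1 for reward in candidate.values() if reward >= 1.0)
    rate = passed / len(candidate) if candidate else 0.0
    regressions = find_regressions(baseline, candidate)
    for task in regressions:
        print(f"regression: {task}")
    print(f"pass rate: {rate:.0%}")
    return 1 if regressions or rate < min_pass_rate else 0
```

In a pipeline, pass the return value to `sys.exit` so a regression or a low pass rate marks the job as failed.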

Pass rate at a glance

Full logs of every attempt

Plain JSON, CI-ready

Get started

Your first eval, in two minutes

One line to install, one line to run.

 curl -fsSL https://seaport.run/install | bash


Works on macOS, Linux, and Windows