Agents

An agent is the thing Seaport evaluates. You pick one with -a/--agent. If you leave it off, Seaport uses oracle.

Built-in agents

oracle

Runs solution/solve.sh, then the verifier. Use it to confirm a task is solvable and well formed:

seaport run -p path/to/task -a oracle

nop

Skips the agent phase and runs only the verifier. Handy for baseline checks, or for tasks where the starting workspace is already in the expected state:

seaport run -p path/to/task -a nop

External command agents

Any command that runs in a terminal can be an agent. Point Seaport at it with --agent-command. The command runs inside the same sandboxed task workspace:

seaport run -p path/to/task \
  -a custom \
  --agent-command 'my-agent --task "$SEAPORT_INSTRUCTION_PATH" --workdir "$APP_DIR"'

The command can read SEAPORT_INSTRUCTION_PATH, APP_DIR, and the other environment variables Seaport sets for the agent phase, such as SEAPORT_MODEL.
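
To make the contract concrete, here is a minimal sketch of what an external agent script might look like. It is a stand-in, not a real agent: run_agent and agent-notes.txt are hypothetical names, and the sketch assumes SEAPORT_INSTRUCTION_PATH points to a readable file and APP_DIR to an existing writable directory.

```shell
#!/bin/sh
# Hypothetical stand-in for my-agent. A real agent would call a model here;
# this one only shows how the two variables might be consumed.
run_agent() {
  # Assumption: SEAPORT_INSTRUCTION_PATH names a readable instruction file.
  if [ -z "${SEAPORT_INSTRUCTION_PATH:-}" ] || [ ! -r "$SEAPORT_INSTRUCTION_PATH" ]; then
    echo "cannot read instruction file" >&2
    return 1
  fi
  # Assumption: APP_DIR names the task workspace directory.
  if [ -z "${APP_DIR:-}" ] || [ ! -d "$APP_DIR" ]; then
    echo "APP_DIR is not a directory" >&2
    return 1
  fi
  # Placeholder for real work: copy the instruction into the workspace.
  cp "$SEAPORT_INSTRUCTION_PATH" "$APP_DIR/agent-notes.txt"
}
```

A script built this way would call run_agent on its last line and exit with its status, so that a missing variable surfaces as a failed agent phase rather than a silent no-op.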

Claude Code and Codex

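Seaport ships default command templates for claude-code and codex, so you do not need to write --agent-command for them. In Docker mode the agent CLI must be available inside the task container, either baked into the task image or installed at trial time (see Provisioning an agent CLI).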

seaport run -p path/to/task -a claude-code -m sonnet \
  --ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY"

seaport run -p path/to/task -a codex -m openai/gpt-5 \
  --ae OPENAI_API_KEY="$OPENAI_API_KEY"

Provisioning an agent CLI

When the agent CLI is not baked into the task image, install it at trial time with --agent-setup. The setup command runs inside the trial container before the agent starts, as the agent user and with the agent environment. It runs once per trial, and a non-zero exit fails the trial:

seaport run -p path/to/task -a claude-code -m sonnet \
  --agent-setup 'npm install -g @anthropic-ai/claude-code' \
  --ae ANTHROPIC_API_KEY
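
Because setup runs on every trial, it can be worth skipping the install when the CLI is already on PATH. The sketch below shows one way to do that. ensure_cli is a hypothetical helper name, and the sketch assumes the setup string is interpreted by a POSIX shell inside the container, which the text above does not confirm.

```shell
# Hypothetical helper: run the install command only when the binary is missing.
# Usage: ensure_cli <binary> <install command...>
ensure_cli() {
  bin="$1"
  shift
  # command -v succeeds when the binary is already on PATH.
  if command -v "$bin" >/dev/null 2>&1; then
    return 0
  fi
  # Otherwise run the installer; its exit status becomes the setup status.
  "$@"
}

# One-line equivalent that could be passed as the --agent-setup string
# (assumes the Claude Code package exposes a binary named claude):
#   command -v claude >/dev/null 2>&1 || npm install -g @anthropic-ai/claude-code
```

Since a non-zero exit fails the trial, the helper deliberately passes the installer's exit status through unchanged.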

Passing secrets and config

Use --ae/--agent-env for the agent phase and --ve/--verifier-env for the verifier phase. Each takes a KEY=VALUE pair and can be repeated:

seaport run -p path/to/task \
  -a custom \
  --agent-command 'my-agent --model "$SEAPORT_MODEL"' \
  -m provider/model \
  --ae API_KEY="$API_KEY" \
  --ve EXPECTED_OUTPUT=ok

Either flag also accepts a bare KEY, which forwards that variable from the host environment. This keeps secrets like ANTHROPIC_API_KEY off the command line:

ANTHROPIC_API_KEY=sk-... seaport run -p path/to/task -a claude-code -m sonnet \
  --ae ANTHROPIC_API_KEY
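
The text above does not say what happens when a bare KEY is unset on the host, so a wrapper script may want to check first. This is a minimal sketch of such a pre-flight check. require_env is a hypothetical helper, not part of Seaport.

```shell
# Hypothetical pre-flight check: fail early if any named host variable is
# unset or empty. Pass only trusted variable names, since eval is used.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable named by $name (portable POSIX form).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing host variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Possible use in a wrapper script, before forwarding with a bare KEY:
#   require_env ANTHROPIC_API_KEY || exit 1
#   seaport run -p path/to/task -a claude-code -m sonnet --ae ANTHROPIC_API_KEY
```

The check reports every missing name before returning, so a single run shows everything that needs to be exported.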

Multiple attempts

Run an agent against each task several times to measure consistency rather than relying on a single lucky pass. Use -k to set the number of attempts per task and -n to set how many run at once:

seaport run -p path/to/dataset -a claude-code -m sonnet -k 5 -n 4