Agents

An agent is the program Seaport evaluates against a task. Select one with -a/--agent. If you omit the flag, Seaport defaults to oracle.

Built-in agents

oracle

Runs solution/solve.sh, then the verifier. Use it to confirm a task is solvable and well formed:

seaport run -p path/to/task -a oracle

nop

Skips the agent phase and runs only the verifier. Handy for baseline checks, or for tasks where the starting workspace is already in the expected state:

seaport run -p path/to/task -a nop

External command agents

Any command that runs in a terminal can be an agent. Point Seaport at it with --agent-command. The command runs inside the same sandboxed task workspace:

seaport run -p path/to/task \
  -a custom \
  --agent-command 'my-agent --task "$SEAPORT_INSTRUCTION_PATH" --workdir "$APP_DIR"'

The command can read SEAPORT_INSTRUCTION_PATH, APP_DIR, and any other environment variables Seaport sets for the agent phase.
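As a rough illustration of the agent side of that contract, here is a minimal sketch of a command that consumes the two variables. The function name my_agent and the file agent-notes.txt are hypothetical, and the sketch assumes APP_DIR names the workspace directory, as the --workdir usage above suggests. A real agent would act on the instructions instead of copying them.

```shell
# Hypothetical agent sketch: my_agent and agent-notes.txt are illustrative
# names, not part of Seaport. Only the two variables named above are used.
# The body runs in a subshell so cd and early exits do not affect the caller.
my_agent() (
  # Fail fast if either variable is unset or empty.
  : "${SEAPORT_INSTRUCTION_PATH:?SEAPORT_INSTRUCTION_PATH is not set}"
  : "${APP_DIR:?APP_DIR is not set}"

  # Work inside the task workspace (assumed to be a directory).
  cd "$APP_DIR" || exit 1

  # A real agent would act on the instructions; this sketch only records them.
  cp "$SEAPORT_INSTRUCTION_PATH" ./agent-notes.txt || exit 1
  echo "read $(wc -l < ./agent-notes.txt | tr -d ' ') instruction line(s)"
)
```

Checking the variables up front means a misconfigured run stops with a clear message rather than failing partway through the task.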

Claude Code and Codex

Seaport ships default command templates for claude-code and codex, so neither needs --agent-command. In Docker mode, the matching CLI must be installed in the task image.

seaport run -p path/to/task -a claude-code -m sonnet \
  --ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY"

seaport run -p path/to/task -a codex -m openai/gpt-5 \
  --ae OPENAI_API_KEY="$OPENAI_API_KEY"

Passing secrets and config

Use --ae/--agent-env for the agent phase and --ve/--verifier-env for the verifier phase. Each takes a KEY=VALUE pair and can be repeated:

seaport run -p path/to/task \
  -a custom \
  --agent-command 'my-agent --model "$SEAPORT_MODEL"' \
  -m provider/model \
  --ae API_KEY="$API_KEY" \
  --ve EXPECTED_OUTPUT=ok
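Assuming values passed with --ae and --ve arrive as ordinary environment variables in their respective phases, a script in either phase can validate them before doing any work. The helper below is a sketch only: require_env is a hypothetical name, not something Seaport provides.

```shell
# Hypothetical helper: require_env is illustrative and not part of Seaport.
# Returns non-zero and names the first variable that is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect lookup via eval keeps this portable to POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}
```

An agent script might start with require_env API_KEY SEAPORT_MODEL || exit 1, and a verifier script with require_env EXPECTED_OUTPUT || exit 1, so a forgotten flag surfaces immediately.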

Multiple attempts

Run an agent against each task several times to measure consistency rather than a single lucky pass. -k sets the number of attempts per task, and -n sets how many run concurrently:

seaport run -p path/to/dataset -a claude-code -m sonnet -k 5 -n 4