Test agents or models against predefined test cases to validate model routing, performance, and output quality. Use when: (1) verifying a specific agent or m...
Use scripts/model_tester.py to run repeatable test prompts and compare requested vs actual model usage from OpenClaw logs.
From the skill directory (or pass absolute paths):
python3 scripts/model_tester.py --agent menial --case extract-emails
python3 scripts/model_tester.py --model openai/gpt-4.1 --case math-reasoning
python3 scripts/model_tester.py --agent chat --model openai/gpt-4.1 --case all --out /tmp/model-test.json
Flags:
- --agent <name>: Target agent (chat, menial, coder, etc.)
- --model <name>: Requested model alias/name to test
- --case <id|all>: Case id from references/test-cases.json, or all
- --timeout <sec>: Per-case timeout in seconds (default 120)
- --out <file>: Optional JSON output file

At least one of --agent or --model is required.
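As a rough illustration of that interface, the flags could be declared with argparse along these lines (this mirrors the documented options but is not the actual scripts/model_tester.py source):

```python
import argparse

# Illustrative only: mirrors the documented flags, not the real model_tester.py code.
parser = argparse.ArgumentParser(
    description="Run predefined test cases against an agent or model."
)
parser.add_argument("--agent", help="Target agent (chat, menial, coder, etc.)")
parser.add_argument("--model", help="Requested model alias/name to test")
parser.add_argument("--case", help="Case id from references/test-cases.json, or 'all'")
parser.add_argument("--timeout", type=int, default=120, help="Per-case timeout in seconds")
parser.add_argument("--out", help="Optional JSON output file")
args = parser.parse_args()

# Documented contract: at least one of --agent or --model must be given.
if not (args.agent or args.model):
    parser.error("provide at least one of --agent or --model")
```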
For each case, the script does the following (a rough sketch follows the list):
- Reads the case definitions from references/test-cases.json.
- Tails openclaw logs --follow --json in parallel.
- Runs openclaw agent --json with a bounded test prompt (which asks the agent to use a subagent for the task).
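A minimal sketch of that flow, assuming the openclaw logs and openclaw agent invocations named above (the exact argument order and output handling are assumptions, not the script's real implementation):

```python
import json
import subprocess

def run_case(prompt, timeout=120):
    """Tail gateway logs while sending one bounded, predefined test prompt."""
    # Tail logs in parallel so model/token fields can be matched to this run later.
    log_tail = subprocess.Popen(
        ["openclaw", "logs", "--follow", "--json"],
        stdout=subprocess.PIPE, text=True,
    )
    try:
        # Invoke the agent with the predefined test prompt.
        run = subprocess.run(
            ["openclaw", "agent", "--json", prompt],
            capture_output=True, text=True, timeout=timeout,
        )
        if run.returncode != 0:
            return {"status": "error", "errors": [run.stderr.strip()]}
        return {"status": "ok", "response": json.loads(run.stdout)}
    finally:
        log_tail.terminate()
        log_tail.wait()
```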
The top-level JSON output contains:
- tool
- timestamp
- agent
- requested_model
- results[]

Each result entry returns:
- test_case
- agent
- requested_model
- actual_model (parsed from logs when available)
- status (ok/error)
- result_summary
- runtime_seconds
- tokens (when discoverable)
- errors[]

The tester spawns isolated subagent tasks with predefined test prompts only; no user data is passed to models. It tails OpenClaw logs to extract the model and token fields listed above.
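Putting those fields together, a --out file might look roughly like this (every value below is a placeholder, and the shape of tokens is an assumption):

```json
{
  "tool": "model_tester",
  "timestamp": "2025-01-01T00:00:00Z",
  "agent": "menial",
  "requested_model": "openai/gpt-4.1",
  "results": [
    {
      "test_case": "extract-emails",
      "agent": "menial",
      "requested_model": "openai/gpt-4.1",
      "actual_model": "openai/gpt-4.1",
      "status": "ok",
      "result_summary": "placeholder summary",
      "runtime_seconds": 12.3,
      "tokens": 0,
      "errors": []
    }
  ]
}
```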
Log extraction uses regex patterns to find model/token fields. No personally identifiable information or arbitrary log content is captured — only structured fields related to the test execution.
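A rough sketch of that style of extraction, assuming newline-delimited JSON log lines; the field names and patterns below are assumptions, not the script's actual regexes:

```python
import re

# Assumed field names in the tailed JSON log lines; the real patterns may differ.
MODEL_RE = re.compile(r'"model"\s*:\s*"([^"]+)"')
TOKENS_RE = re.compile(r'"(?:input|output)_tokens"\s*:\s*(\d+)')

def extract_fields(log_lines):
    """Pull only model/token fields out of tailed log lines; ignore everything else."""
    found = {"actual_model": None, "tokens": 0}
    for line in log_lines:
        model = MODEL_RE.search(line)
        if model:
            found["actual_model"] = model.group(1)
        for count in TOKENS_RE.findall(line):
            found["tokens"] += int(count)
    return found
```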
If the openclaw config is invalid or the gateway is unavailable, the script returns status=error with stderr details. Edit references/test-cases.json to add custom prompts for your benchmark set.
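The on-disk format of references/test-cases.json isn't spelled out here; a hypothetical entry, assuming a simple id/prompt schema, might look like:

```json
[
  {
    "id": "extract-emails",
    "prompt": "Extract every email address from the following text: ...",
    "expects": "a JSON list of email addresses"
  }
]
```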