Inspect Benchmarks
Run any Inspect benchmark with Asteroid's approvers.
Running Inspect AI Benchmarks with Asteroid's Approvers
This guide explains how to run any benchmark from Inspect AI using Asteroid's approvers. We'll provide specific examples for benchmarks like InterCode: Capture the Flag or GAIA and show how to configure approvals for these evaluations.
For setup instructions and details on how the approvers work and integrate, please refer to the Inspect AI documentation.
You can find the list of available evaluations here or in the Inspect AI repository.
Benchmarks
Below is a list of benchmarks with steps to run each one. Click on a benchmark to see the instructions.
First, install the inspect_evals benchmark package:
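A minimal sketch of the setup, assuming Inspect AI from PyPI and the inspect_evals package installed from its GitHub repository (your environment may pin different versions or sources):

```bash
# Install Inspect AI and the benchmark suite.
pip install inspect-ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```

With the packages installed, a benchmark can be run with an approval policy attached through Inspect's --approval flag. For example, assuming the InterCode: Capture the Flag task is exposed as inspect_evals/gdm_intercode_ctf and an approval.yaml sits in the working directory:

```bash
# Run InterCode: Capture the Flag with an approval policy.
# Task name and model are illustrative; substitute your own.
inspect eval inspect_evals/gdm_intercode_ctf \
  --approval approval.yaml \
  --model openai/gpt-4o
```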
Customizing Approvals
You can customize the approval configuration by editing the approval.yaml files for each benchmark (a sketch follows the list below). This allows you to:
- Specify which tools require approval.
- Define allowlists for commands and functions.
- Configure human or LLM approval.
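As a rough sketch of what such a file can look like, here is an approval policy in Inspect's format, using the built-in human and auto approvers for illustration. The approver names and tool patterns are assumptions; substitute the Asteroid approvers from your integration:

```yaml
# approval.yaml -- illustrative policy (approver names and tool
# patterns are placeholders; adjust to your benchmark and setup).
approvers:
  # Escalate shell commands to a human reviewer.
  - name: human
    tools: "bash*"

  # Auto-approve all other tool calls.
  - name: auto
    tools: "*"
    decision: approve
```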
For details on how the approvers work and how to integrate them, please refer to the Inspect AI documentation.