Running Inspect AI Benchmarks with Asteroid's Approvers

This guide explains how to run any benchmark from Inspect AI using Asteroid's approvers. We provide specific examples for benchmarks such as InterCode: Capture the Flag and GAIA, and show how to configure approvals for these evaluations.

For setup instructions and details on how the approvers work and integrate, please refer to the Inspect AI documentation.

You can find the list of available evaluations in the Inspect Evals documentation or in the inspect_evals repository.

Benchmarks

Below is a list of benchmarks with steps to run each one. Click on a benchmark to see the instructions.

First, install Inspect AI and the Inspect Evals benchmark package:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
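
Once both packages are installed, a benchmark is launched with the inspect eval command, and an approval policy can be attached with the --approval flag. A minimal sketch, assuming an OpenAI model and the gdm_intercode_ctf task name from inspect_evals (check the repository for the exact task names):

inspect eval inspect_evals/gdm_intercode_ctf --model openai/gpt-4o --approval approval.yaml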

Customizing Approvals

You can customize the approval configuration by editing the approval.yaml files for each benchmark (a sample file follows this list). This allows you to:

  • Specify which tools require approval.
  • Define allowlists for commands and functions.
  • Configure human or LLM approval.
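
As an illustration, a minimal approval.yaml might look like the sketch below. It uses Inspect AI's built-in human and auto approvers; the names Asteroid registers for its own approvers (for example, an LLM approver) may differ, so treat these entries as placeholders to adapt per benchmark:

approvers:
  # Ask a human to approve any bash tool call
  - name: human
    tools: "bash"

  # Automatically approve all other tool calls (auto approves by default)
  - name: auto
    tools: "*"

Pass the file to a run with the --approval flag, as shown in the inspect eval example above.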

For the full set of approver options and integration details, refer to the Inspect AI documentation.

Additional Resources