Evaluators
Quality assurance and evaluation tools for AI Browser Agent performance
Evaluators are quality assurance tools that determine how well your AI Browser Agent performed during a run. They use LLM-based scoring to assess agent executions against custom criteria, providing objective performance metrics.
- Judge Your Runs: assess and score how well your agent performed
- LLM Configuration: define custom prompts for performance evaluation
- Score Configuration: set up scoring methods and criteria
- Agent Binding: automatically bind evaluators to specific agents
What are Evaluators?
Evaluators analyze AI Browser Agent executions to provide objective performance assessments. They examine the agent’s actions, decisions, and outcomes to generate scores based on your specific criteria.
Each evaluator uses a custom LLM prompt to assess different aspects of agent performance, such as task completion accuracy, efficiency, or adherence to best practices.
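As a rough mental model, an evaluator behaves like an LLM judge: the run transcript and your evaluation prompt are combined into one request, and the model's reply becomes the score. The sketch below is illustrative only; `call_llm` and the transcript format are stand-ins, not this platform's API.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for whatever LLM backend the evaluator is configured with."""
    raise NotImplementedError


def evaluate_run(transcript: list[str], evaluation_prompt: str) -> str:
    """Combine the agent's run transcript with the evaluator's custom prompt
    and ask the LLM for a judgment against your criteria."""
    judge_input = (
        evaluation_prompt
        + "\n\nAgent run transcript:\n"
        + "\n".join(transcript)
    )
    return call_llm(judge_input)
```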
Creating an Evaluator
Basic Information
Evaluator Identity
- Name (required): A descriptive name for your evaluator (e.g., “Response Quality”, “Task Completion Rate”)
- Description (required): A detailed explanation of what this evaluator measures and its purpose
Use clear, descriptive names that indicate the specific aspect being evaluated
LLM Configuration
Configure how the Large Language Model will evaluate agent performance.
Evaluation Prompt (required)
Write a comprehensive prompt that instructs the LLM how to evaluate the agent’s performance. This prompt should:
- Define specific criteria for assessment
- Explain what constitutes good vs. poor performance
- Include examples when helpful
- Specify the expected output format
Be specific about evaluation criteria. Vague prompts lead to inconsistent scoring.
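For instance, a prompt following these guidelines might look something like the following (an illustrative sketch, not a prompt shipped with the product):

```text
You are evaluating a browser agent's run.

Criteria:
1. Did the agent complete the requested task?
2. Did it avoid unnecessary or repeated steps?
3. Did it recover gracefully from errors?

Rate the run from 1-10, where 1 means the task was not attempted and
10 means it was completed efficiently with no wasted steps.
Respond with the number only.
```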
Advanced Options
Remove Finish Messages
When enabled, this option:
- Focuses evaluation on the process rather than the final outcome
- Helps assess agent reasoning and decision-making
- Prevents bias from explicit success/failure declarations
Enable this when you want to evaluate the journey, not just the destination
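As a sketch of the idea, stripping finish messages means the judge never sees the agent's own claim of success or failure. The message structure and the "finish" type below are assumptions, not the platform's actual data model.

```python
def strip_finish_messages(transcript: list[dict[str, str]]) -> list[dict[str, str]]:
    """Drop the agent's final success/failure declarations so the evaluator
    judges the steps taken, not the agent's own verdict.
    The 'type' field and 'finish' value are hypothetical."""
    return [message for message in transcript if message.get("type") != "finish"]
```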
Score Configuration
Define how the evaluator should score the results.
Score Types
Boolean (True/False)
The evaluator returns a simple true or false result.
Use Cases:
- Pass/fail evaluations
- Task completion status
- Compliance checks
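A boolean evaluator only has to map the judge's reply onto true or false. A minimal parsing sketch, assuming the evaluation prompt asks the model to answer with a single word:

```python
def parse_boolean_score(llm_reply: str) -> bool:
    """Interpret the judge's reply as a pass/fail result.
    Assumes the evaluation prompt asked for a plain 'true' or 'false' answer."""
    normalized = llm_reply.strip().lower()
    if normalized in ("true", "yes", "pass"):
        return True
    if normalized in ("false", "no", "fail"):
        return False
    raise ValueError(f"Unexpected judge reply: {llm_reply!r}")
```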
Integer Scoring
The evaluator returns a numeric score within a defined range.
Configuration Options:
- Minimum (optional): Lowest possible score
- Maximum (optional): Highest possible score
- Step (optional): Increment between valid scores
Use Cases:
- Quality ratings (1-10)
- Performance percentages (0-100)
- Multi-criteria scoring
Define clear score ranges in your evaluation prompt (e.g., “Rate from 1-10 where 1 is poor and 10 is excellent”)
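The minimum, maximum, and step options constrain which numeric scores count as valid. A sketch of that validation logic (illustrative only, with a 1-10 default range as an assumption):

```python
def validate_integer_score(score: int, minimum: int = 1, maximum: int = 10, step: int = 1) -> int:
    """Reject scores outside the configured range or off the step grid."""
    if not minimum <= score <= maximum:
        raise ValueError(f"Score {score} is outside the range {minimum}-{maximum}")
    if (score - minimum) % step != 0:
        raise ValueError(f"Score {score} does not align with step {step}")
    return score


# Example: with minimum=0, maximum=100, step=5, a reply of 37 would be rejected.
```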
Enumeration Scoring
The evaluator returns one of predefined categorical values.
Configuration: You must define the enum values beforehand in your prompt and scoring configuration.
Use Cases:
- Quality categories (Excellent, Good, Fair, Poor)
- Performance tiers (A, B, C, D, F)
- Custom classifications
Ensure your evaluation prompt clearly defines each enum value and when to use it
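Conceptually, an enum evaluator checks the judge's answer against the allowed values. A sketch, assuming the categories are configured as a simple list of strings:

```python
QUALITY_CATEGORIES = ["Excellent", "Good", "Fair", "Poor"]  # example enum values


def parse_enum_score(llm_reply: str, allowed_values: list[str]) -> str:
    """Match the judge's reply to one of the predefined categories."""
    answer = llm_reply.strip()
    for value in allowed_values:
        if answer.lower() == value.lower():
            return value
    raise ValueError(f"Reply {answer!r} is not one of {allowed_values}")
```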
Best Practices
Writing Effective Evaluation Prompts
Be Specific and Measurable
State concrete, observable criteria the judge can check against the run rather than vague impressions.
Include Context and Examples
Provide clear examples of different score levels to ensure consistent evaluation.
Define Edge Cases
Specify how to handle partial completions, errors, or unexpected scenarios.
Choosing Score Types
Boolean Scoring
- Simple pass/fail scenarios
- Compliance checks
- Basic functionality tests
Integer Scoring
- Nuanced performance assessment
- Comparative analysis across runs
- Detailed quality metrics
Enum Scoring
- Categorical performance levels
- Standardized grading systems
- Multi-dimensional assessments
Evaluation Consistency
Test Your Evaluators
Run the same agent execution through your evaluator multiple times to confirm that scoring stays consistent (see the sketch at the end of this section).
Calibrate Scoring
Review evaluation results and adjust prompts to improve accuracy and reduce variability.
Document Criteria
Maintain clear documentation of what each score level means for your specific use case.
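One way to test consistency, as suggested above, is to score the same run several times and look at the spread. A sketch; `run_evaluator` is a stand-in for however you trigger an evaluation, not a real API call:

```python
import statistics


def run_evaluator(run_id: str) -> int:
    """Stand-in for triggering the evaluator against a stored run."""
    raise NotImplementedError


def check_consistency(run_id: str, attempts: int = 5) -> None:
    """Score the same run repeatedly and report the variability."""
    scores = [run_evaluator(run_id) for _ in range(attempts)]
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)
    print(f"scores={scores} mean={mean:.2f} stdev={spread:.2f}")
    # A large standard deviation suggests the evaluation prompt needs tightening.
```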
Examples
E-commerce Checkout Evaluator
Name: Checkout Completion Quality
Description: Evaluates the agent’s performance in completing e-commerce checkout processes
Evaluation Prompt:
Score Type: Integer (1-10)
Data Extraction Quality Evaluator
Name: Data Extraction Accuracy
Description: Measures accuracy and completeness of data extraction tasks
Evaluation Prompt:
Score Type: Boolean
Remove Finish Messages: Enabled
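For reference, the two examples above could be captured as configuration records along the following lines. This is an illustrative sketch only: the field names and prompt texts are assumptions, not the platform's schema or the original prompts.

```python
# Hypothetical configuration records; field names and prompts are illustrative.
checkout_evaluator = {
    "name": "Checkout Completion Quality",
    "description": "Evaluates the agent's performance in completing "
                   "e-commerce checkout processes",
    "score_type": "integer",
    "minimum": 1,
    "maximum": 10,
    "evaluation_prompt": (  # illustrative prompt, not the original
        "Rate the checkout run from 1-10. Consider whether the right items were "
        "purchased, whether shipping and payment details were entered correctly, "
        "and how many unnecessary steps the agent took. Respond with the number only."
    ),
}

extraction_evaluator = {
    "name": "Data Extraction Accuracy",
    "description": "Measures accuracy and completeness of data extraction tasks",
    "score_type": "boolean",
    "remove_finish_messages": True,
    "evaluation_prompt": (  # illustrative prompt, not the original
        "Return true if every requested field was extracted and matches the page "
        "content, and false otherwise."
    ),
}
```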