Evaluators
Quality assurance and evaluation tools for AI Browser Agent performance
Evaluators are quality assurance tools that determine how well your AI Browser Agent performed during a run. They use LLM-based scoring to assess agent executions against custom criteria, providing objective performance metrics.
- Judge Your Runs: assess and score how well your agent performed
- LLM Configuration: define custom prompts for performance evaluation
- Score Configuration: set up scoring methods and criteria
- Agent Binding: automatically bind evaluators to specific agents
What are Evaluators?
Evaluators analyze AI Browser Agent executions to provide objective performance assessments. They examine the agent’s actions, decisions, and outcomes to generate scores based on your specific criteria.
Each evaluator uses a custom LLM prompt to assess different aspects of agent performance, such as task completion accuracy, efficiency, or adherence to best practices.
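As a rough mental model, an evaluator behaves like an LLM judge: the run transcript and your evaluation prompt are combined into one request, and the model's reply becomes the score. The sketch below is illustrative only; `call_llm` and the transcript format are stand-ins, not this platform's API.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for whatever LLM backend the evaluator is configured with."""
    raise NotImplementedError


def evaluate_run(transcript: list[str], evaluation_prompt: str) -> str:
    """Combine the agent's run transcript with the evaluator's custom prompt
    and ask the LLM for a judgment against your criteria."""
    judge_input = (
        evaluation_prompt
        + "\n\nAgent run transcript:\n"
        + "\n".join(transcript)
    )
    return call_llm(judge_input)
```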
Creating an Evaluator
Basic Information
Evaluator Identity
- Name (required): A descriptive name for your evaluator (e.g., “Response Quality”, “Task Completion Rate”)
- Description (required): A detailed explanation of what this evaluator measures and its purpose
Use clear, descriptive names that indicate the specific aspect being evaluated
LLM Configuration
Configure how the Large Language Model will evaluate agent performance.
Evaluation Prompt (required)
Write a comprehensive prompt that instructs the LLM how to evaluate the agent’s performance. This prompt should:
- Define specific criteria for assessment
- Explain what constitutes good vs. poor performance
- Include examples when helpful
- Specify the expected output format
Be specific about evaluation criteria. Vague prompts lead to inconsistent scoring.
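For instance, a prompt following these guidelines might look something like the following (an illustrative sketch, not a prompt shipped with the product):

```text
You are evaluating a browser agent's run.

Criteria:
1. Did the agent complete the requested task?
2. Did it avoid unnecessary or repeated steps?
3. Did it recover gracefully from errors?

Rate the run from 1-10, where 1 means the task was not attempted and
10 means it was completed efficiently with no wasted steps.
Respond with the number only.
```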
Advanced Options
Remove Finish Messages
When enabled, this option:
- Focuses evaluation on the process rather than the final outcome
- Helps assess agent reasoning and decision-making
- Prevents bias from explicit success/failure declarations
Enable this when you want to evaluate the journey, not just the destination
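As a sketch of the idea, stripping finish messages means the judge never sees the agent's own claim of success or failure. The message structure and the "finish" type below are assumptions, not the platform's actual data model.

```python
def strip_finish_messages(transcript: list[dict[str, str]]) -> list[dict[str, str]]:
    """Drop the agent's final success/failure declarations so the evaluator
    judges the steps taken, not the agent's own verdict.
    The 'type' field and 'finish' value are hypothetical."""
    return [message for message in transcript if message.get("type") != "finish"]
```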
Score Configuration
Define how the evaluator should score the results.
Score Types
Boolean (True/False)
The evaluator returns a simple true or false result.
Use Cases:
- Pass/fail evaluations
- Task completion status
- Compliance checks
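A boolean evaluator only has to map the judge's reply onto true or false. A minimal parsing sketch, assuming the evaluation prompt asks the model to answer with a single word:

```python
def parse_boolean_score(llm_reply: str) -> bool:
    """Interpret the judge's reply as a pass/fail result.
    Assumes the evaluation prompt asked for a plain 'true' or 'false' answer."""
    normalized = llm_reply.strip().lower()
    if normalized in ("true", "yes", "pass"):
        return True
    if normalized in ("false", "no", "fail"):
        return False
    raise ValueError(f"Unexpected judge reply: {llm_reply!r}")
```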
Integer Scoring
The evaluator returns a numeric score within a defined range.
Configuration Options:
- Minimum (optional): Lowest possible score
- Maximum (optional): Highest possible score
- Step (optional): Increment between valid scores
Use Cases:
- Quality ratings (1-10)
- Performance percentages (0-100)
- Multi-criteria scoring
Define clear score ranges in your evaluation prompt (e.g., “Rate from 1-10 where 1 is poor and 10 is excellent”)
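The minimum, maximum, and step options constrain which numeric scores count as valid. A sketch of that validation logic (illustrative only, with a 1-10 default range as an assumption):

```python
def validate_integer_score(score: int, minimum: int = 1, maximum: int = 10, step: int = 1) -> int:
    """Reject scores outside the configured range or off the step grid."""
    if not minimum <= score <= maximum:
        raise ValueError(f"Score {score} is outside the range {minimum}-{maximum}")
    if (score - minimum) % step != 0:
        raise ValueError(f"Score {score} does not align with step {step}")
    return score


# Example: with minimum=0, maximum=100, step=5, a reply of 37 would be rejected.
```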
Enumeration Scoring
The evaluator returns one of predefined categorical values.
Configuration: You must define the enum values beforehand in your prompt and scoring configuration.
Use Cases:
- Quality categories (Excellent, Good, Fair, Poor)
- Performance tiers (A, B, C, D, F)
- Custom classifications
Ensure your evaluation prompt clearly defines each enum value and when to use it
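Conceptually, an enum evaluator checks the judge's answer against the allowed values. A sketch, assuming the categories are configured as a simple list of strings:

```python
QUALITY_CATEGORIES = ["Excellent", "Good", "Fair", "Poor"]  # example enum values


def parse_enum_score(llm_reply: str, allowed_values: list[str]) -> str:
    """Match the judge's reply to one of the predefined categories."""
    answer = llm_reply.strip()
    for value in allowed_values:
        if answer.lower() == value.lower():
            return value
    raise ValueError(f"Reply {answer!r} is not one of {allowed_values}")
```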
Best Practices
Writing Effective Evaluation Prompts
Be Specific and Measurable
State concrete, observable criteria the judge can check against the run rather than vague impressions.
Include Context and Examples
Provide clear examples of different score levels to ensure consistent evaluation.
Define Edge Cases
Specify how to handle partial completions, errors, or unexpected scenarios.
Choosing Score Types
Boolean Scoring
- Simple pass/fail scenarios
- Compliance checks
- Basic functionality tests
Integer Scoring
- Nuanced performance assessment
- Comparative analysis across runs
- Detailed quality metrics
Enum Scoring
- Categorical performance levels
- Standardized grading systems
- Multi-dimensional assessments
Evaluation Consistency
Test Your Evaluators
Run the same agent execution through your evaluator multiple times to confirm that scoring stays consistent (see the sketch at the end of this section).
Calibrate Scoring
Review evaluation results and adjust prompts to improve accuracy and reduce variability.
Document Criteria
Maintain clear documentation of what each score level means for your specific use case.
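One way to test consistency, as suggested above, is to score the same run several times and look at the spread. A sketch; `run_evaluator` is a stand-in for however you trigger an evaluation, not a real API call:

```python
import statistics


def run_evaluator(run_id: str) -> int:
    """Stand-in for triggering the evaluator against a stored run."""
    raise NotImplementedError


def check_consistency(run_id: str, attempts: int = 5) -> None:
    """Score the same run repeatedly and report the variability."""
    scores = [run_evaluator(run_id) for _ in range(attempts)]
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)
    print(f"scores={scores} mean={mean:.2f} stdev={spread:.2f}")
    # A large standard deviation suggests the evaluation prompt needs tightening.
```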
Examples
E-commerce Checkout Evaluator
Name: Checkout Completion Quality
Description: Evaluates the agent’s performance in completing e-commerce checkout processes
Evaluation Prompt:
Score Type: Integer (1-10)
Data Extraction Quality Evaluator
Name: Data Extraction Accuracy
Description: Measures accuracy and completeness of data extraction tasks
Evaluation Prompt:
Score Type: Boolean
Remove Finish Messages: Enabled
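For reference, the two examples above could be captured as configuration records along the following lines. This is an illustrative sketch only: the field names and prompt texts are assumptions, not the platform's schema or the original prompts.

```python
# Hypothetical configuration records; field names and prompts are illustrative.
checkout_evaluator = {
    "name": "Checkout Completion Quality",
    "description": "Evaluates the agent's performance in completing "
                   "e-commerce checkout processes",
    "score_type": "integer",
    "minimum": 1,
    "maximum": 10,
    "evaluation_prompt": (  # illustrative prompt, not the original
        "Rate the checkout run from 1-10. Consider whether the right items were "
        "purchased, whether shipping and payment details were entered correctly, "
        "and how many unnecessary steps the agent took. Respond with the number only."
    ),
}

extraction_evaluator = {
    "name": "Data Extraction Accuracy",
    "description": "Measures accuracy and completeness of data extraction tasks",
    "score_type": "boolean",
    "remove_finish_messages": True,
    "evaluation_prompt": (  # illustrative prompt, not the original
        "Return true if every requested field was extracted and matches the page "
        "content, and false otherwise."
    ),
}
```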