Agent testing with Datasets, Evaluations, and Versioning

Learn how to simulate input variations and automatically verify structured outputs before deploying to production.

When building AI-driven logic, prompt behavior must be validated across multiple input variations. Unlike deterministic code, LLM outputs are probabilistic and sensitive to context shifts.

The Agent Component provides a structured testing mechanism through Datasets, Experiments, and Evaluations, enabling measurable validation before production release.


Refer to the Agent Component configuration documentation before testing this component.

When should you use Datasets and Evaluations?

Use this testing structure when:

  • You want to measure how accurate your AI agent is.

  • You want to compare different implementation strategies, such as changes to the agent’s configuration or the introduction of new tools.

  • You are updating prompts and want to prevent regressions.

  • You need deterministic validation over probabilistic outputs.

  • You require auditability before production deployment.

  • You want to benchmark different Agent versions (for example, model upgrades or prompt refactors) using the same evaluation rules and dataset.

Conceptual overview

Core components

The testing structure is composed of three core elements:

  • Dataset: A collection of related test scenarios, organized as a single test suite.

  • Test Case: A defined input configuration that represents one execution case.

  • Evaluation scorer: A validation rule that checks a specific field or condition in the model’s output.

Together, these elements allow you to vary inputs in a controlled way and verify that the generated output meets defined structural requirements.

Execution behavior when using Datasets

Before creating a Dataset, ensure that your System or User prompt includes at least one variable, represented by a Double Braces expression. Test Cases can only be configured when variables are present, and Datasets are used to provide values for these variables. For this reason, creating a Dataset only makes sense when the prompt includes variables.

By default, when no Dataset exists, the component runs normally and does not require a Test Case. However, once a Dataset is created, the execution model changes: at least one Test Case must be configured for the component to run.

To return to the original execution mode, simply remove the created Datasets. Without any Dataset, the component runs normally and does not require a Test Case.

Implementation guide

Step 1: Create a Dataset

A Dataset acts as a test suite for your agent. Each Dataset can contain multiple Test Cases and Evaluations.

To create a Dataset:

  1. In the Agent Component, navigate to the Datasets tab under Output Details.

  2. Click Create Dataset (or click Select a Dataset and then Create a new Dataset if one already exists).

  3. Provide a name and confirm creation.

The Dataset is now ready to receive Test Cases.

Step 2: Create Test Cases

A Test Case represents a controlled input scenario.

In traditional software testing, you would vary parameters or payloads. Here, you vary prompt variables.

How to declare dynamic prompt variables using Double Braces

Corporate agents operate on inputs received from users or external systems. In the Digibee Platform, these inputs only become configurable variables when explicitly declared in the prompt using Double Braces syntax.

Double Braces define a variable placeholder, for example: {{ message.topic }}

This syntax is not optional decoration. It is what tells the platform:

  • This value is dynamic

  • This value can be injected from a Dataset

  • This value can be controlled by a Test Case

Without Double Braces, the value is treated as static text and cannot be configured or tested.

How it works at execution time

These expressions act as placeholders. During a full pipeline execution, they capture data from previous connectors. When executing the Agent Component in isolation, they are replaced by values defined in the Test Case, enabling controlled and repeatable evaluation.

Example

A User Prompt might contain the expression {{ message.topic }}, and the Test Case defines a value for it. At execution time, {{ message.topic }} is replaced with the value supplied by the Test Case.

This allows you to evaluate how the same prompt behaves across multiple semantic contexts without modifying previous connectors.
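The substitution behavior can be illustrated with a short Python sketch. The function name, the regex, and the sample values are illustrative only; this is not the platform's implementation:

```python
import re

def render_prompt(template: str, variables: dict) -> str:
    """Replace each Double Braces expression with the value supplied
    by a Test Case (or, in a full pipeline run, by previous connectors)."""
    def substitute(match: re.Match) -> str:
        name = match.group(1).strip()
        return str(variables[name])
    return re.sub(r"\{\{(.*?)\}\}", substitute, template)

# A User Prompt containing one variable:
template = "Summarize the following topic: {{ message.topic }}"

# The Test Case supplies the variable's value:
rendered = render_prompt(template, {"message.topic": "order cancellations"})
print(rendered)  # Summarize the following topic: order cancellations
```

Running the same template with different Test Case values is exactly how a Dataset exercises one prompt across multiple semantic contexts.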

Create a Test Case

  1. Open the desired Dataset.

  2. Click New Test Case.

  3. Select the newly created Test Case.

  4. If Double Braces expressions exist in the prompt, they will be automatically detected as variables.

  5. Provide values for each variable.

  6. Click Save.

You can create as many Test Cases as necessary to simulate realistic input diversity.


When running a Dataset, all Test Cases are executed. Each Test Case consumes tokens independently, and results are generated and stored separately.

Step 3: Create Evaluations

Although LLM outputs are inherently non-deterministic, it is essential to establish a structured evaluation process to assess the quality of their responses. Evaluations introduce objective validation rules against structured outputs.

What an Evaluation defines

Each Evaluation is configured with a Scorer Type.

Currently, the platform supports Code Scorers, which define validation rules based on structured output inspection.

For a Code Scorer, you configure:

  • A JSONPath expression (for example, $.body.title)

  • The expected data type

  • The comparison operation (for example, Not Equals, Contains, Not Empty, Starts With)

During execution, the platform:

  1. Extracts the value using JSONPath.

  2. Applies the selected operation.

  3. Compares the result with the expected value (if applicable).

  4. Marks the Evaluation as passed or failed per Test Case.
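The four execution steps above can be sketched in Python. The helper names and the minimal JSONPath subset are illustrative; the platform's real evaluation engine is not exposed as code:

```python
def extract(output: dict, json_path: str):
    """Resolve a simple dotted JSONPath such as '$.body.title'
    (a minimal subset; real JSONPath supports far more)."""
    value = output
    for key in json_path.lstrip("$.").split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value

def run_code_scorer(output: dict, json_path: str, operator: str, expected=None) -> bool:
    """Apply one validation rule and return pass (True) or fail (False)."""
    actual = extract(output, json_path)          # 1. Extract via JSONPath
    if operator == "Not Empty":                  # 2. Apply the operation
        return actual not in (None, "", [], {})
    if operator == "Contains":                   # 3. Compare with expected
        return expected in (actual or "")
    if operator == "Starts With":
        return isinstance(actual, str) and actual.startswith(expected)
    if operator == "Not Equals":
        return actual != expected
    raise ValueError(f"Unsupported operator: {operator}")

output = {"body": {"title": "Order summary", "description": ""}}
print(run_code_scorer(output, "$.body.title", "Starts With", "Order"))  # True
print(run_code_scorer(output, "$.body.description", "Not Empty"))       # False
```

Each rule yields a per-Test-Case pass/fail result (step 4), which is what the Eval Logs display.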


Important limitation

For Code Scorers, validation focuses on structural aspects of the output, such as field presence, data types, and deterministic value rules. They don't assess semantic quality or factual accuracy.

For example, a Code Scorer can verify that description is not empty, but it cannot determine whether the content is technically correct.

Structured validation improves reliability, but it doesn’t replace human review when semantic precision is required.

Create an Evaluation

  1. Open the Evaluations tab.

  2. Click Add Evaluations.

  3. Configure the following fields:

Eval Name: Identifier for the rule. Must not contain spaces.

JSONPath: Defines where in the output JSON the validation should be applied. The path must always start from the body object.

Using JSONPath when a JSON Schema is configured

If a JSON Schema is configured, the model returns a structured JSON object inside body. In this case, use a field-level path that starts from body, for example: $.body.title. You can target any field defined in your schema.

Using JSONPath when no JSON Schema is configured

If no JSON Schema is configured, the Agent output is treated as free text, even if it visually appears as structured JSON in the UI.

In this case:

  • The entire response is handled as a single string.

  • The only valid JSONPath is $.body.

Because the output is not parsed as a structured object, you cannot target individual fields such as $.body.title; these paths will return null.

Without a JSON Schema, validation applies to the entire generated output at once. You can check conditions such as:

  • Not Empty

  • Contains

  • Starts With

But you cannot validate specific fields independently.

To validate individual fields, you must configure a JSON Schema so the output is returned as a structured JSON object.
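The difference can be illustrated with two hypothetical output payloads (the field names reuse the earlier $.body.title example; the shapes are illustrative):

```python
# With a JSON Schema configured, the output is parsed as a structured
# object, so field-level paths such as $.body.title resolve:
structured = {"body": {"title": "Order summary", "description": "Details..."}}

# Without a JSON Schema, the whole response is one string under body,
# so only $.body is addressable, even if the text looks like JSON:
free_text = {"body": '{"title": "Order summary"}'}

def get_title(output):
    """Attempt to resolve $.body.title against an output payload."""
    body = output["body"]
    return body.get("title") if isinstance(body, dict) else None

print(get_title(structured))  # Order summary
print(get_title(free_text))   # None
```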

Scorer Type: Defines the expected data type:

  • String

  • Number

  • Boolean

  • Array

  • Object

Operator: Defines the validation logic, depending on the Scorer Type.

  4. Click Create.

Attach the Evaluation to a Dataset

After creation:

  1. Click the three dots next to the evaluation name.

  2. Select Add to Dataset.

  3. Choose the target Dataset.

  4. Click Add.

Define the expected values per Test Case

Each Test Case must define the expected value for its associated evaluations.

  1. Open the Dataset.

  2. Access the Test Case.

  3. Go to the Evaluations section.

  4. Set the value that the system should compare against, if required.

When executed, the platform evaluates each Test Case independently and displays whether each rule passed or failed.

Step 4: Execute the tests

You can execute tests before or after creating evaluations. A recommended workflow is:

  1. Create Test Cases.

  2. Execute them to inspect outputs.

  3. Define evaluations based on the expected structure.

  4. Re-run the Dataset to validate pass/fail results automatically.

To execute the Test Cases:

  1. Select the Dataset.

  2. Click Run.

After the execution completes, the results appear in the Executions tab. Click an execution to view the details.

Execution details

Trace results

All execution traces are grouped here, providing full visibility into component behavior. The traces include:

  • Configuration: Provider and model settings.

  • Input: Context details, tools configuration, and retrieval status.

  • System: The System Message sent to the LLM.

  • User: The User Message sent to the LLM.

  • Tool Call: Tool invocations, arguments, and results.

  • Input/Output Guardrail: Applied guardrails and their impact.

These traces help validate results and troubleshoot unexpected behavior.

Output

The final execution result is returned in JSON format. You can inspect and query specific fields using JSONPath expressions.

Eval Logs

Displays the evaluation results for each Test Case, including pass/fail status and configuration details.

How to manually evaluate test results

You can manually review an execution to rate its quality and record which Dataset performs better. This allows you to compare your assessment with the automatic evaluation configured using Code Scorers.

In each Test Case, you will find a rating option. Use it to mark the result as positive (thumbs-up) or negative (thumbs-down) based on your evaluation.

After you submit a rating, the system automatically adds a manual-evaluations column to the execution details in the Executions tab. This column reflects your manual assessment and calculates a success rate for the Dataset based on your ratings.

Example

Suppose a Dataset contains three Test Cases. After reviewing the results:

  • Two Test Cases are rated as positive.

  • One Test Case is rated as negative.

The system calculates a 67% manual success rate for that Dataset.

This percentage is then combined with the score generated by Code Scorers, providing a more complete view of the Dataset’s overall performance in the Total column.
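The manual success rate in the example above is simple arithmetic (the rounding to a whole percentage is an assumption based on the 67% shown):

```python
ratings = ["positive", "positive", "negative"]  # three reviewed Test Cases

positive = ratings.count("positive")
manual_success_rate = round(positive / len(ratings) * 100)
print(f"{manual_success_rate}%")  # 67%
```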

How to save the test results

All executions are automatically captured in the Executions tab.

These are the retention rules:

  • Auto-save: Executions are temporarily stored for 5 days.

  • Persistent storage: To prevent deletion after 5 days, open the Executions tab, select the desired executions, and click Save.


The pipeline must be saved before executions can be permanently stored.

Step 5: Version for controlled experimentation

Datasets and Evaluations validate output structure. Versioning enables controlled evolution. When used together, they form a reproducible experimentation framework for Agent development, where changes can be measured instead of assumed.


Reusing the same Dataset across versions

A single Dataset can be reused to evaluate multiple Agent configurations. This ensures that only the configuration changes, while the test criteria remain constant.

For example:

  • v1 → Model: gpt-4o

  • v2 → Model: gpt-5

  • v3 → Updated system prompt

All versions are executed against the same Test Cases, Evaluations, and validation logic. This isolates the impact of each change and enables objective comparison across iterations.

A recommended versioning workflow is:

  1. Start simple: Iterate on a single prompt without saving versions. Fix specification issues by refining instructions, structure, or constraints in the prompt.

  2. Create a baseline when needed: When you encounter generalization issues that require structured comparison, save the current prompt as a version. Tag it clearly as your “main” or “latest” baseline.

  3. Test alternatives: Try new prompt configurations to address the issue. Save versions only when necessary. If a configuration performs better than the baseline, promote it to your new main or latest version.

  4. Validate consistently: Use the same Dataset to test different versions. Keeping test conditions constant allows reliable comparison.

Apply the same strategy whenever you update the Dataset or evaluation metrics: define a stable baseline, test alternatives under the same conditions, and promote improvements when validated.

Why this test matters

LLM outputs are inherently probabilistic. A prompt that works for one semantic domain may degrade in another.

By combining input variation (Test Cases), structured grouping (Datasets), and deterministic validation rules (Evaluations), you introduce:

  • Output format stability

  • Schema compliance enforcement

  • Early regression detection

  • Execution auditability

  • Reproducible validation workflows

This transforms prompt engineering from manual inspection into structured testing, bringing AI workflows closer to traditional software quality standards.

Next steps

Now that you understand both the conceptual model and the implementation process, learn how to build your first AI testing workflow using Datasets, Evaluations, and Versioning.
