Build your first AI testing workflow with Datasets and Evaluations
Build a working AI testing workflow that validates structured JSON outputs across multiple input variations.
This quickstart shows how to use Datasets and Evaluations to automatically verify that your agent consistently generates required fields, even when the topic changes, ensuring reliable, production-ready structured responses.

Prerequisites
Before you begin, make sure you have:
An API key from an LLM provider (for example, OpenAI, Anthropic, or Google).
The API key registered in Digibee as a Secret Key account. For details, see how to create a Secret Key account.
Familiarity with the conceptual guide “Agent testing with Datasets and Evaluations,” which explains the testing structure and terminology used in this quickstart.
Initial setup
Add the Agent Component to your pipeline immediately after the trigger and configure it as follows:
Model: Select your preferred model (for example, OpenAI – GPT-4o Mini).
Account: Click the gear icon next to the Model parameter, go to Account, and select the Secret Key account you created in Digibee.
Once the basic configuration is complete, you are ready to configure your tests.
Scenario
You are building an AI agent that converts retrieved technical information into structured JSON documentation.
This output will be consumed by downstream systems, so structural consistency is essential. Even a missing field can break deterministic integrations.
To ensure reliability, configure the Agent with the following messages and JSON schema:
System Message
Defines the agent’s role and tone:
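For illustration, a system message along these lines could be used (the exact wording below is an assumption, not prescribed by this quickstart):

```
You are a technical documentation writer. Convert the provided technical
information into clear, structured JSON documentation. Follow the JSON
schema exactly, never omit required fields, and use a neutral, precise tone.
```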
User Message
Defines the dynamic task and introduces a variable:
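As a sketch (the exact wording is an assumption), the user message could look like:

```
Generate structured JSON documentation about the following topic:
{{ message.topic }}. Return only the JSON object, with no additional text.
```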
The {{ message.topic }} variable allows you to simulate different semantic contexts without modifying the prompt structure. This makes it ideal for controlled testing across multiple scenarios.
JSON Schema
Define the output schema (learn more about how to turn AI responses into a structured JSON output):
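As an illustrative sketch, a schema with these characteristics could look like the following (field names other than description are assumptions):

```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string", "minLength": 1 },
    "description": { "type": "string", "minLength": 20 },
    "key_concepts": {
      "type": "array",
      "items": { "type": "string" },
      "minItems": 1
    }
  },
  "required": ["title", "description", "key_concepts"],
  "additionalProperties": false
}
```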
This schema enforces:
Required structural fields
Minimum content constraints
Strict property control (no unexpected fields)
Step-by-step
In the next steps, you will create structured tests to ensure that required fields — such as description — are always generated, regardless of the topic provided.
Create three Experiments in a Dataset
Create a new Dataset, which is a logical grouping of test scenarios, and name it Documentation Validator.
Inside this Dataset, create three Experiments to simulate different execution scenarios. Because your User Message contains the variable {{ message.topic }}, each Experiment can define a different value for it.
Use the following configuration:
Experiment 1: topic = Event-Driven Architecture
Experiment 2: topic = API Rate Limiting
Experiment 3: topic = Database Indexing
Each Experiment simulates a different semantic domain while keeping the prompt structure unchanged. This allows you to validate structural consistency across varied contexts.
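For example, Experiment 1 resolves the {{ message.topic }} variable with an input message like the following (the exact payload shape depends on your trigger; this is a sketch):

```json
{
  "topic": "Event-Driven Architecture"
}
```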
Create the Evaluation Rule
Now that your Dataset is configured, create an Evaluation — an automated rule that validates part of the model’s output — with the following configuration:
Eval Name: description_exists
JSONPath: $.body.description
Scorer Type: String
Variant: Not Empty
Considering that the AI is instructed to strictly follow the JSON Schema defined earlier, this rule verifies that the description field is present in the output JSON and contains a non-empty value.
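The Not Empty check is conceptually simple. A minimal sketch in Python (an illustration of the rule's logic, not Digibee's implementation) that extracts the value at $.body.description and verifies it is a non-empty string:

```python
import json

def not_empty_eval(execution_output: str, path: list[str]) -> bool:
    """Return True when the value at `path` exists and is a non-empty string.

    `path` is the JSONPath expressed as a list of keys, e.g.
    "$.body.description" -> ["body", "description"].
    """
    try:
        value = json.loads(execution_output)
    except json.JSONDecodeError:
        return False  # output was not valid JSON at all
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return False  # a field along the path is missing
        value = value[key]
    return isinstance(value, str) and value.strip() != ""

# Passes: description is present and non-empty.
not_empty_eval('{"body": {"description": "Explains event-driven patterns."}}',
               ["body", "description"])

# Fails: description is missing from the output.
not_empty_eval('{"body": {"title": "Event-Driven Architecture"}}',
               ["body", "description"])
```

Note that the rule fails not only when the field is absent but also when the model returns malformed JSON or a whitespace-only value, which is exactly the kind of structural regression the Evaluation is meant to surface.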
Attach the Evaluation to the Dataset
Once the Evaluation is created, click the three dots next to it and select Add to Dataset. Then choose the Documentation Validator Dataset you created earlier.
Because the selected operation is Not Empty, no expected value needs to be defined in the Experiments. The evaluation will automatically pass as long as the targeted field exists and contains a non-empty value.
Execute the Dataset
In the Datasets tab, select your Dataset and click Run.
When the Dataset runs:
All three Experiments are executed sequentially.
For each execution, the platform extracts the value at $.body.description and the evaluation verifies that the field is present and not empty.
Each Experiment is evaluated independently and marked as Passed or Failed.
This allows you to confirm that the description field is consistently generated across different topics.
You can check more detailed information about each execution in the Executions tab.
What this validates
This experiment confirms structural consistency across semantic variations.
Even though the topics differ significantly, the agent must always:
Respect the JSON schema
Populate all required fields
Produce a valid description string
If any topic results in a missing description, the evaluation will fail, immediately surfacing a structural regression.
Why this matters
LLM outputs are probabilistic. A prompt that works for one topic may degrade for another.
By testing multiple semantic contexts with a single structural rule, you ensure:
Output format stability
Schema compliance
Regression detection when prompts are modified
This is a simple but powerful example of how Datasets and Evaluations introduce measurable reliability into AI-driven workflows.