# Build your first AI testing workflow with Datasets, Evaluations, and Versioning

This quickstart shows how to use **Datasets**, **Evaluations**, and **Versioning** to automatically verify that your agent consistently generates structured outputs and keeps the generated content aligned with the input topic, ensuring reliable, production-ready responses.

<figure><img src="https://3750561495-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FaD6wuPRxnEQEsYpePq36%2Fuploads%2FvQJyObQa1NpCddYPSyhp%2Fqs-evaluations.gif?alt=media&#x26;token=d49bb88c-d876-4c36-ba08-0dc7f2253a14" alt=""><figcaption></figcaption></figure>

## **Prerequisites**

Before you begin, make sure you have:

* An API key from an LLM provider (for example, OpenAI, Anthropic, or Google).
* The API key registered in Digibee as a Secret Key account. For details, see [how to create a Secret Key account](https://app.gitbook.com/s/jvO5S91EQURCEhbZOuuZ/platform-administration/settings/accounts#secret-key).
* Read the conceptual guide [**“Agent testing with Datasets, Evaluations, and Versioning”**](https://app.gitbook.com/s/EKM2LD3uNAckQgy1OUyZ/connectors/ai-tools/llm/testing-your-agent) to understand the testing structure and terminology used in this quickstart.

## **Initial setup**

Add the **Agent Component** to your pipeline immediately after the trigger and configure it as follows:

* **Model:** Select your preferred model (for example, OpenAI – GPT-4o Mini).
* **Account:** Click the gear icon next to the Model parameter, go to **Account**, and select the Secret Key account you created in Digibee.

Once the basic configuration is complete, you are ready to configure your tests.

## **Scenario**

You are building an AI agent that converts retrieved technical information into structured JSON documentation.

This output will be consumed by downstream systems, so structural consistency and topic alignment are essential. Even a missing field can break deterministic integrations.

To ensure reliability, configure the Agent with the following messages and JSON schema:

#### **System Message**

Defines the agent’s role and tone:

{% code overflow="wrap" %}

```
You are a technical documentation generator. Always write in clear and concise English, using a professional but simple tone.
Your task is to transform raw information retrieved from external tools into well-structured documentation.
Ensure your output is consistent, accurate, and aligned with the requested format.
```

{% endcode %}

#### **User Message**

Defines the dynamic task and introduces a variable:

{% code overflow="wrap" %}

```
Use the information retrieved about the topic {{ message.topic }} to create a documentation section.
The documentation must include:
- A clear title.
- A concise description (2–3 sentences).
- At least three practical use cases.
- At least three best practices.
Format the response strictly following the provided JSON schema.
```

{% endcode %}

The `{{ message.topic }}` variable allows you to simulate different semantic contexts without modifying the prompt structure. This makes it ideal for controlled testing across multiple scenarios.
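Conceptually, this substitution works like simple template rendering. The sketch below is illustrative Python only, not Digibee's implementation; the payload shape (`{"topic": ...}`) and the `render` helper are assumptions for the example:

```python
# Illustrative sketch of how a {{ message.topic }} placeholder is filled.
# Not Digibee's implementation; the payload shape is an assumption.
USER_MESSAGE = (
    "Use the information retrieved about the topic {{ message.topic }} "
    "to create a documentation section."
)

def render(template: str, message: dict) -> str:
    """Replace {{ message.<key> }} placeholders with values from the payload."""
    for key, value in message.items():
        template = template.replace("{{ message.%s }}" % key, str(value))
    return template

prompt = render(USER_MESSAGE, {"topic": "API Rate Limiting"})
print(prompt)
```

Swapping only the `topic` value produces a new test scenario while the surrounding instructions stay byte-for-byte identical.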

#### **JSON Schema**

Define the output schema (learn more about [how to turn AI responses into a structured JSON output](https://docs.digibee.com/documentation/resources/quickstarts/turn-ai-into-structured-output)):

{% code overflow="wrap" expandable="true" %}

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "DocumentationSection",
  "type": "object",
  "required": ["title", "description", "use_cases", "best_practices"],
  "properties": {
    "title": {
      "type": "string",
      "description": "The title of the documentation section"
    },
    "description": {
      "type": "string",
      "description": "A concise description of the topic (2-3 sentences)"
    },
    "use_cases": {
      "type": "array",
      "description": "Practical use cases for the topic",
      "items": {
        "type": "string"
      },
      "minItems": 3
    },
    "best_practices": {
      "type": "array",
      "description": "Recommended best practices for the topic",
      "items": {
        "type": "string"
      },
      "minItems": 3
    }
  },
  "additionalProperties": false
}
```

{% endcode %}

This schema enforces:

* Required structural fields
* Minimum content constraints
* Strict property control (no unexpected fields)
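To build intuition for what these constraints reject, here is a minimal stdlib-only sketch of the three checks. A real JSON Schema validator covers far more (types, nested schemas, formats); this is for illustration only:

```python
# Minimal illustration of the three constraint types in the schema above.
# A real JSON Schema validator does much more; this sketch is for intuition only.
REQUIRED = {"title", "description", "use_cases", "best_practices"}
MIN_ITEMS = {"use_cases": 3, "best_practices": 3}

def check(doc: dict) -> list:
    errors = []
    # Required structural fields
    for field in REQUIRED - doc.keys():
        errors.append(f"missing required field: {field}")
    # Minimum content constraints
    for field, n in MIN_ITEMS.items():
        if len(doc.get(field, [])) < n:
            errors.append(f"{field} needs at least {n} items")
    # Strict property control (additionalProperties: false)
    for field in doc.keys() - REQUIRED:
        errors.append(f"unexpected field: {field}")
    return errors

bad = {"title": "Indexing", "use_cases": ["a", "b"], "notes": "extra"}
print(check(bad))
```

Running `check` on the `bad` document reports the missing `description` and `best_practices` fields, the short `use_cases` list, and the unexpected `notes` property.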

{% hint style="warning" icon="lightbulb-on" %}
**Tip:** Enable the **JSON Schema** guardrail in the Guardrails Settings. The Agent validates the model’s output against the defined schema and automatically reprompts the LLM if the response format is invalid. If the model continues returning an invalid output, the execution fails.
{% endhint %}
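The guardrail behaves like a validate-and-reprompt loop. The following conceptual sketch uses hypothetical `call_llm` and `is_valid` stand-ins (here stubbed with canned data); the retry count and reprompt wording are assumptions, not platform internals:

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: a real call would hit the model API.
    return ('{"title": "Demo", "description": "ok", '
            '"use_cases": ["a", "b", "c"], "best_practices": ["a", "b", "c"]}')

def is_valid(raw: str) -> bool:
    # Simplified check: parses the output and verifies required fields only.
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return {"title", "description", "use_cases", "best_practices"} <= doc.keys()

def run_with_guardrail(prompt: str, max_attempts: int = 3) -> dict:
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        if is_valid(raw):
            return json.loads(raw)
        prompt += "\nYour previous response did not match the schema. Try again."
    raise RuntimeError("Model kept returning invalid output; execution fails.")

doc = run_with_guardrail("Generate the documentation section.")
print(doc["title"])
```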

## **Step-by-step**

In the next steps, you will create structured tests to ensure that the generated `description` remains relevant to the topic being tested.

{% stepper %}
{% step %}

### Create three Test Cases in a Dataset

Create a new **Dataset**, which is a logical grouping of test scenarios, and name it **Documentation Validator**.

Inside this Dataset, create three **Test Cases** to simulate different execution scenarios. Because your User Message contains the variable `{{ message.topic }}`, each Test Case can define a different value for it.


Use the following configuration:

| Test Cases  | message.topic             |
| ----------- | ------------------------- |
| Test Case 1 | Event-Driven Architecture |
| Test Case 2 | API Rate Limiting         |
| Test Case 3 | Database Indexing         |

Each Test Case simulates a different semantic domain while keeping the prompt structure unchanged. This allows you to validate structural consistency across varied contexts.
{% endstep %}

{% step %}

### Create the Evaluation Rule

Now that your Dataset is configured, create an **Evaluation** — an automated rule that validates part of the model’s output — with the following configuration:

**Eval Name**\
`description_contains`

**JSONPath**\
`$.body.description`

**Scorer Type**\
String

**Operator**\
Contains

This rule verifies whether the `description` field contains a keyword related to the requested topic. This allows the test to confirm that the generated content is semantically aligned with the input topic.
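In plain terms, the rule extracts one value from the response and runs a substring check against it. The stdlib-only sketch below illustrates that logic; the response payload shape is an assumption, not Digibee's exact format:

```python
# Illustrative sketch of the description_contains evaluation:
# extract $.body.description, then apply a case-sensitive "Contains" check.
# The response payload shape here is an assumption.
def extract(response: dict, path: str):
    """Resolve a simple dotted JSONPath like $.body.description."""
    node = response
    for part in path.lstrip("$.").split("."):
        node = node[part]
    return node

def contains(actual: str, expected: str) -> bool:
    return expected in actual  # case-sensitive, like the platform operator

response = {"body": {"description": "Event-Driven Architecture decouples services."}}
value = extract(response, "$.body.description")
print(contains(value, "Event"))   # passes
print(contains(value, "event"))   # fails: capitalization differs
```

Note that the substring check distinguishes case, which is why the expected values you configure in the next step must match the model's capitalization.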
{% endstep %}

{% step %}

### Attach the Evaluation to the Dataset

Once the Evaluation is created, click the three dots next to it and select **Add to Dataset**. Then choose the Dataset created in Step 1.

To configure the expected results:

1. Open the **Documentation Validator** dataset in the **Datasets** tab.
2. Open each Test Case and scroll down to the **Evaluations** section.
3. Enter the desired value. For our Test Cases, we will add the following values:

| Test Cases  | Evaluation value |
| ----------- | ---------------- |
| Test Case 1 | Event            |
| Test Case 2 | API              |
| Test Case 3 | Database         |

These values let you validate whether each Test Case returns content on the requested subject.

{% hint style="info" %}
Keep in mind that the **“Contains”** operator is *case-sensitive*: it distinguishes uppercase from lowercase letters. Even if the expected term is present, the evaluation may fail if the capitalization does not match the configured value.
{% endhint %}
{% endstep %}

{% step %}

### Execute the Dataset

In the **Datasets** tab, select your Dataset and click **Run**.

When the Dataset runs:

* All three Test Cases are executed sequentially.
* For each execution, the platform extracts the value at `$.body.description`.
* The evaluation verifies whether the description contains the expected keyword defined in the Test Case.
* Each Test Case is evaluated independently and marked as **Passed** or **Failed**.

This allows you to confirm that the generated description remains semantically aligned with the topic across different inputs.

You can check more detailed information about each execution in the **Executions** tab.
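The run described above can be pictured as a loop over Test Cases. This hypothetical sketch uses a `call_agent` function as a stand-in for the actual pipeline execution, with a canned response for illustration:

```python
# Conceptual sketch of a Dataset run: each Test Case is executed and
# evaluated independently. call_agent is a hypothetical stand-in for
# the real pipeline execution.
TEST_CASES = [
    {"topic": "Event-Driven Architecture", "expected": "Event"},
    {"topic": "API Rate Limiting", "expected": "API"},
    {"topic": "Database Indexing", "expected": "Database"},
]

def call_agent(topic: str) -> dict:
    # Stand-in for the real agent call; returns a canned response.
    return {"body": {"description": f"{topic} is a core integration concept."}}

results = {}
for case in TEST_CASES:
    response = call_agent(case["topic"])
    description = response["body"]["description"]
    results[case["topic"]] = "Passed" if case["expected"] in description else "Failed"

print(results)
```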
{% endstep %}

{% step %}

### Compare different prompt configuration versions

Now that your Dataset validates both structural consistency and response quality, you can iteratively refine the prompt to resolve the issues surfaced by the validation rules. Once the prompt works as expected for the defined cases, modify it in a controlled way and measure the impact.

When generalization issues appear, save the current prompt version and compare its results with alternative prompt configurations.

#### **1. Save the baseline version**

Open the Agent Component and [save the current configuration as a version](https://app.gitbook.com/s/EKM2LD3uNAckQgy1OUyZ/connectors/ai-tools/llm#version-the-component) named `baseline`.

This version preserves the original System and User prompts and serves as your structural reference point.

#### **2. Modify the User Prompt to enforce stricter conciseness**

Replace the current User Message with the following:

{% code overflow="wrap" %}

```
Use the information retrieved about the topic {{ message.topic }} to generate a structured documentation section.
Requirements:
- Provide a short and precise title.
- Include at least three practical use cases. Each use case must be a single, direct sentence.
- Include at least three best practices. Each best practice must be a single, actionable sentence.
Avoid generic explanations. Do not add introductions or commentary outside the required fields.
Format the response strictly according to the provided JSON schema.
```

{% endcode %}

This modification removes the explicit instruction about the `description` field from the prompt while keeping it required in the JSON Schema.

As a result:

* The prompt no longer reinforces the field at the instruction level.
* The JSON Schema becomes the only constraint enforcing the presence of the `description` field, while the Evaluation rule validates its semantic content.

This increases constraint pressure and allows you to validate whether schema enforcement alone is sufficient to maintain structural stability and consistent responses across different topics.

#### **3. Save the new configuration**

Save this updated setup as a new version and tag it as `concise`.

You now have two comparable configurations:

* `baseline`: Prompt-level reinforcement of `description`.
* `concise`: Schema-level enforcement only.

#### **4. Re-run the same Dataset**

Go to the **Datasets** tab and run **Documentation Validator** again.

Because the Dataset, Test Cases, and Evaluation rule remain unchanged, any behavioral difference reflects only the prompt modification.

#### **What to observe**

Compare `baseline` and `concise`:

* Is the `description` field still generated with relevant content for the requested topic?
* Do all Test Cases pass the evaluation?
* Does the agent maintain consistent and accurate responses across different topics?
* Did removing prompt-level reinforcement introduce any structural or semantic regression?
* Is the output more concise while remaining fully schema-compliant?

If any Test Case fails, you have identified a regression caused exclusively by the prompt change.

This exercise demonstrates that Versioning is not limited to model swaps. It enables controlled experimentation with any Agent component configuration, including **prompts**, **model selection**, parameters such as **temperature** and **top K**, **JSON Schema**, **tools**, and **file uploads**.

This allows you to measure validation results and safely test new Test Cases without affecting production pipelines. Instead of assuming which prompt performs better, you validate agent consistency and response accuracy under identical testing conditions.
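As a mental model, comparing two versions amounts to tabulating pass rates over identical test conditions. The numbers below are purely hypothetical, not real run output:

```python
# Hypothetical pass/fail results for two versions over the same Dataset.
# These outcomes are illustrative only, not real run output.
runs = {
    "baseline": {"Test Case 1": True, "Test Case 2": True, "Test Case 3": True},
    "concise":  {"Test Case 1": True, "Test Case 2": False, "Test Case 3": True},
}

for version, outcomes in runs.items():
    passed = sum(outcomes.values())
    print(f"{version}: {passed}/{len(outcomes)} passed")
```

In a result like this, the failing Test Case under `concise` would point to a regression caused exclusively by the prompt change.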
{% endstep %}
{% endstepper %}

## **What this validates**

This experiment confirms structural consistency across semantic variations.

Even though the topics differ significantly, the agent must always:

* Respect the JSON schema
* Populate all required fields
* Produce a `description` that includes a keyword related to the topic

If any topic results in a missing `description`, the evaluation will fail, immediately surfacing a structural regression.

## **Why this matters**

LLM outputs are probabilistic. A prompt that works for one topic may degrade for another.

By testing multiple semantic contexts with a single structural rule, you ensure:

* Agent consistency across different inputs
* Response accuracy and relevance
* Output format stability and schema compliance
* Regression detection when prompts or configurations are modified

This is a simple but powerful example of how Datasets and Evaluations introduce measurable reliability into AI-driven workflows.
