# Build Agent tests: Datasets, Test Cases, and Evaluations

## **Overview**

The Agent Component includes a testing structure with **Datasets**, **Test Cases**, and **Evaluations**. These help you test different inputs in a controlled way and check if the outputs meet your requirements before going to production.

Use this structure when you want to:

* Check how accurate your AI agent is with different inputs.
* Compare different approaches, like prompt changes or new tools.
* Avoid regressions when updating prompts.
* Compare different Agent versions using the same tests.
* Validate changes before deploying to production.

## **Core concepts**

* **Dataset:** A group of related test scenarios organized as a single test suite. Each Dataset includes Test Cases and Evaluations.
* **Test Case:** A specific input setup that represents one execution scenario. It defines the values for the variables used in your prompt.
* **Evaluation:** A rule used to check a specific field or condition in the model’s output. It is linked to a Dataset and runs for each Test Case.

## **Prerequisites**

Before creating a Dataset, your System or User prompt must include at least one variable declared using [Double Braces syntax](https://docs.digibee.com/documentation/connectors-and-triggers/double-braces/overview), for example:

```
{{ message.topic }}
```

This tells the Platform that the value is dynamic and can be injected from a Test Case. Without Double Braces, values are treated as static text and cannot be configured or tested.
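
For instance, a User prompt sentence using this variable might look like the following (the wording is illustrative):

```
Summarize the key concepts of {{ message.topic }} for a technical audience.
```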

Once a Dataset exists, at least one Test Case must be configured for the component to run. To return to normal execution mode, remove all Datasets.

## **Step 1: Create a Dataset**

A Dataset acts as a test suite for your agent. Each Dataset can contain multiple Test Cases and Evaluations.

1. In the Agent Component, navigate to the **Datasets** tab under **Output Details**.
2. Click **Create Dataset** (or click **Select a Dataset** and then **Create a new Dataset** if one already exists).
3. Provide a name and confirm.

## **Step 2: Create Test Cases**

A Test Case provides values for the variables declared in your prompt, simulating a specific input scenario.

### **Example**

Given this User Prompt:

{% code overflow="wrap" %}

```
Use the information retrieved about {{ message.topic }} to create a documentation section.
The documentation must include:
- A clear title
- A concise description (2–3 sentences)
- At least three practical use cases
- At least three best practices
```

{% endcode %}

A Test Case would supply:

```
Event-Driven Architecture
```

At execution time, `{{ message.topic }}` is replaced with this value.
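
With this value, the first line of the prompt the model receives becomes:

{% code overflow="wrap" %}

```
Use the information retrieved about Event-Driven Architecture to create a documentation section.
```

{% endcode %}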

### **Creating a Test Case:**

1. Open the Dataset and click **New Test Case**.
2. Select the newly created Test Case.
3. Provide values for each detected variable.
4. Click **Save**.

You can create as many Test Cases as needed to simulate realistic input diversity. When running a Dataset, all Test Cases are executed independently, and each one consumes tokens separately.

<figure><img src="https://3591250690-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEKM2LD3uNAckQgy1OUyZ%2Fuploads%2F3cu5sJV6Hzcs6C9KT9cC%2Fexperiment-variable.gif?alt=media&#x26;token=f5ae5b4b-b933-40ae-b374-8c5069a4feee" alt=""><figcaption></figcaption></figure>

## **Step 3: Create Evaluations**

An Evaluation is a rule that automatically checks whether the output generated by your Agent meets a specific criterion. For example, whether a field is filled in, whether the response contains a certain word, or whether the tone is appropriate.

Instead of manually reviewing each output, you define the rule once and the system applies it automatically across your Test Cases.

### **Creating an Evaluation:**

1. Open the **Evaluations** tab.
2. Click **Add Evaluations**.
3. Fill in the fields described in the next section.

<figure><img src="https://3591250690-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEKM2LD3uNAckQgy1OUyZ%2Fuploads%2F1DgFUsRXI6WlbUNhtP2M%2Fcreate-eval.gif?alt=media&#x26;token=0888c21c-4814-4281-8e8c-d206850989e8" alt=""><figcaption></figcaption></figure>

### **Configuring the Evaluation fields**

#### **Eval Name**

A label to identify this evaluation rule. Choose something descriptive, like `title_not_empty` or `tone_check`.

{% hint style="warning" %}
The name cannot contain spaces. Use underscores or hyphens instead (for example, `check-description`).
{% endhint %}

#### **JSONPath: Telling the system where to look**

When your Agent generates a response, that response is stored internally as a structured object called `body`. JSONPath is the address you write to tell the system *which part* of that response you want to evaluate.

Think of it like pointing to a cell in a spreadsheet: instead of saying "column B, row 3," you write a path like `$.body.title` to mean "the `title` field inside `body`."

**The path always starts with `$.body`**, followed by the name of the field you want to check.

<details>

<summary><strong>When a JSON Schema is configured</strong></summary>

If your Agent is configured with a [JSON Schema](https://app.gitbook.com/s/aD6wuPRxnEQEsYpePq36/quickstarts/turn-ai-into-structured-output), its output is structured, meaning each piece of information is stored in a named field. In this case, you can target any specific field:

```
$.body.title
$.body.description
$.body.use_cases
```

Each path points to one field, and each field can have its own independent evaluation rule.
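
As a simplified sketch, if the Agent produced the documentation section from the earlier example, the object being evaluated might look roughly like this (field names and values are illustrative):

```
{
  "body": {
    "title": "Event-Driven Architecture",
    "description": "A pattern in which services communicate by producing and consuming events.",
    "use_cases": ["Real-time notifications", "Order processing", "Audit logging"]
  }
}
```

Here `$.body.title` selects the title string, while `$.body.use_cases` selects the whole list.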

</details>

<details>

<summary><strong>When no JSON Schema is configured</strong></summary>

If no [JSON Schema](https://app.gitbook.com/s/aD6wuPRxnEQEsYpePq36/quickstarts/turn-ai-into-structured-output) is configured, the Agent's entire output is treated as a single block of text, even if it looks like structured JSON in the interface. In this case, the system does not recognize individual fields, so the only valid path is:

```
$.body.text
```

This means you can only evaluate the response as a whole. Paths like `$.body.title` or `$.body.description` will return empty (`null`) because there are no named fields to point to.
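
In that case, the object being evaluated can be pictured roughly as follows (the content is illustrative):

```
{
  "body": {
    "text": "Event-Driven Architecture\n\nEvent-driven architecture is a pattern in which..."
  }
}
```

Evaluating `$.body.text` therefore checks the entire response as one string.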

{% hint style="info" %}
**In summary:** If you need to validate specific fields independently, you must configure a JSON Schema for your Agent. Without it, you can only validate the entire output at once.
{% endhint %}

</details>

#### **Scorer Type: How the evaluation will judge the output**

The Scorer Type defines *what kind of check* will be performed on the field you selected. There are two categories: deterministic scorers and the LLM scorer.

<details>

<summary><strong>Deterministic scorers: String, Number, Boolean, Array, Object</strong></summary>

These are used when the check has a clear, objective answer, with no ambiguity. The system compares the output against a rule and returns pass or fail automatically, without any interpretation.

Choose the type that matches what the field contains:

| Type        | Use when the field contains… | Example                             |
| ----------- | ---------------------------- | ----------------------------------- |
| **String**  | Text                         | A title, a description, a name      |
| **Number**  | A numeric value              | A score, a count, a price           |
| **Boolean** | True or false                | Whether a flag is enabled           |
| **Array**   | A list of items              | A list of tags or categories        |
| **Object**  | A nested structure           | An address with multiple sub-fields |

After choosing a type, you configure an **Operator**, which is the actual logic of the check. Available operators depend on the type, but common examples include:

* **Not Empty**: Verifies that the field has content (any content)
* **Contains**: Checks whether the field includes a specific word or phrase
* **Starts With**: Checks whether the field begins with a specific string
* **Not Equals**: Checks that the value is not equal to something specific

{% hint style="warning" %}
**Important limitation:** Deterministic scorers can only verify *whether* a field has content or matches a pattern, not *whether that content is actually good*. For example, `Not Empty` will pass even if the description says "aaa" or contains gibberish. To evaluate quality, relevance, or tone, use the LLM scorer instead.
{% endhint %}

When configuring an Operator, you do not set the expected value here; that value is defined individually for each Test Case, as covered in [**Setting expected values per Test Case**](#setting-expected-values-per-test-case).
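
As a reference, a deterministic Evaluation for the documentation example might be configured like this (names and values are illustrative):

```
Eval Name:   title_not_empty
JSONPath:    $.body.title
Scorer Type: String
Operator:    Not Empty
```

This rule only verifies that the field has content; to require a specific term, use a `Contains` operator and define the expected value in each Test Case.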

</details>

<details>

<summary><strong>LLM scorer</strong></summary>

Use the LLM scorer when the evaluation requires judgment that a simple rule cannot capture. This is the right choice for checking things like:

* Is the tone professional?
* Does the response actually answer the user's question?
* Is the information factually accurate?
* Is the content relevant to the context?

Instead of a fixed rule, you provide a prompt that instructs a language model to act as a judge. The model reads the Agent's output and decides whether it passes or fails based on your criteria.

Configure the following:

* **LLM Model:** The model that will act as judge. The selected model must support structured output (JSON Schema). Click **Open settings** to adjust parameters such as the account used, temperature, and other advanced settings.
* **Prompt:** This is where you define what "passing" means. You can use our available **Templates** for guidance or freely write clear, specific instructions for the judge model.

Guidelines for writing an effective prompt:

* **Be explicit about the pass/fail condition.** Avoid vague instructions like "check if it's good". Write exactly what passing requires.
* **Include examples** of what passing and failing responses look like.
* **List each condition separately** if multiple criteria must all be met.
* **Exclude irrelevant factors.** If formatting or length should not affect the result, say so explicitly; otherwise the model may factor them in.

**Example:**

```
You are evaluating whether the response directly addresses the user's question.
Pass if: the response answers the specific question asked.
Fail if: the response is generic, evasive, or answers a different question.
Do not penalize for formatting or length.
```

The LLM scorer will always return one of two verdicts: **Passed** or **Not Passed**. In [**Setting expected values per Test Case**](#setting-expected-values-per-test-case), you will configure which verdict is expected for each Test Case.

</details>

### **Attaching the Evaluation to a Dataset**

An Evaluation rule on its own doesn’t run automatically. It needs to be attached to a Dataset (the collection of Test Cases against which the Agent will be tested).

To attach an Evaluation:

1. Click the **three dots menu** next to the evaluation name.
2. Select **Add to Dataset** and choose the target Dataset.
3. Click **Add**.

Once attached, the Evaluation will be applied to every Test Case in that Dataset.

### **Setting expected values per Test Case**

Each Test Case can have its own expected outcome for a given Evaluation. This is where you define what "correct" looks like for that specific case.

To configure:

1. Open the Dataset and navigate into the Test Case.
2. Go to the **Evaluations** section.
3. Set the expected value based on the scorer type:

* **For String, Number, Boolean, Array, or Object scorers:** Enter the value the operator will check the output against for the test to pass. For example, if the operator is `Contains` and you're testing a product description, you might enter the product name as the expected value.
* **For the LLM scorer:** Select whether you expect the result to be **Passed** or **Not Passed** for this specific case. For example, if you're testing a known bad input, you might expect the verdict to be "Not Passed" — and the evaluation will succeed if the judge model agrees.
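
As an illustration, reusing the Test Case from Step 2, the expected values for that case might be set like this (the rule names are illustrative):

```
Test Case: Event-Driven Architecture
  title_contains_topic (String / Contains) → expected value: Event-Driven Architecture
  tone_check (LLM scorer)                  → expected result: Passed
```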

<figure><img src="https://3591250690-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEKM2LD3uNAckQgy1OUyZ%2Fuploads%2F6wN6Dd9UP4EbpC847wCl%2Fdefine-eval.gif?alt=media&#x26;token=53521a16-0c1c-456b-a582-dbc63deb1c9e" alt=""><figcaption></figcaption></figure>

## **Next steps**

Now that you know how to build tests, the next step is to [**run tests, analyze results, and manage versions**](https://docs.digibee.com/documentation/connectors-and-triggers/connectors/ai-tools/llm/results-analysis).


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.digibee.com/documentation/connectors-and-triggers/connectors/ai-tools/llm/testing-your-agent.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
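
For example, using curl (the question text is illustrative and must be URL-encoded):

```
curl "https://docs.digibee.com/documentation/connectors-and-triggers/connectors/ai-tools/llm/testing-your-agent.md?ask=How%20do%20I%20attach%20an%20Evaluation%20to%20a%20Dataset"
```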
