# Run Agent tests: Results, analysis, and versioning

This guide shows how to run tests, review results, and use both automated and manual evaluations to understand what’s working and what needs improvement. You’ll also learn how to save results, identify patterns with annotations and insights, and compare different agent versions using the same dataset.

{% hint style="warning" icon="books" %}
**Prerequisite:** Read [**Build Agent tests: Datasets, Test Cases, and Evaluations**](https://docs.digibee.com/documentation/connectors-and-triggers/connectors/ai-tools/llm/testing-your-agent) to learn how to set up tests.
{% endhint %}

## **Running tests**

To run the Test Cases in a Dataset:

1. Select the Dataset.
2. Click **Run**.

A recommended workflow:

1. Create Test Cases.
2. Run them to review raw outputs.
3. Define Evaluations based on the expected structure.
4. Run again to validate pass/fail results automatically.

After the run finishes, results appear in the **Executions** tab. Click an execution to see the details.

<figure><img src="https://3591250690-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEKM2LD3uNAckQgy1OUyZ%2Fuploads%2FDEeM7QzWzzlykvUjR0qA%2Fexecution-details.gif?alt=media&#x26;token=b4a57a83-1159-4889-9ade-3be404fb9fd9" alt=""><figcaption></figcaption></figure>

## **Interpreting results**

### **Trace results**

Each execution includes traces that show how the component behaved:

* **Configuration**: Provider and model settings.
* **Input**: Context, tools configuration, and retrieval status.
* **System**: The System Message sent to the model.
* **User**: The User Message sent to the model.
* **Tool Call**: Tool calls, arguments, and results.
* **Input/Output Guardrail**: Applied guardrails and their impact.

Use these traces to validate behavior and troubleshoot issues.

#### **Output**

The final result is returned in JSON format. You can inspect and query fields using JSONPath.
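As a sketch of what querying that JSON with JSONPath looks like, the snippet below resolves simple dot/bracket paths by hand. The output structure and field names here are invented for illustration; your Agent's actual output shape depends on its configuration.

```python
import json

# Hypothetical Agent output; real field names depend on your configuration.
output = json.loads("""
{
  "response": {
    "text": "Order 1042 has shipped.",
    "toolCalls": [{"name": "lookup_order", "status": "success"}]
  }
}
""")

def jsonpath_get(doc, path):
    """Resolve a simple dot/bracket JSONPath like $.response.toolCalls[0].name."""
    current = doc
    for part in path.lstrip("$.").replace("]", "").split("."):
        if "[" in part:
            key, index = part.split("[")
            current = current[key][int(index)]
        else:
            current = current[part]
    return current

print(jsonpath_get(output, "$.response.text"))             # → Order 1042 has shipped.
print(jsonpath_get(output, "$.response.toolCalls[0].name"))  # → lookup_order
```

In practice you would paste the execution's output into any JSONPath evaluator and use expressions like `$.response.text` directly.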

#### **Eval Logs**

Eval Logs show the evaluation results for each Test Case, including pass/fail status and configuration details.

### **Manual evaluation**

Manual evaluation lets domain experts and developers assess execution quality beyond automated checks. It captures qualitative observations like ambiguous reasoning, edge cases, and prompt design issues. It includes three tools: **ratings**, **annotations**, and **insights**.

#### **Ratings**

Each execution can be marked as positive (<i class="fa-thumbs-up">:thumbs-up:</i>) or negative (<i class="fa-thumbs-down">:thumbs-down:</i>). Focus on the output quality when rating. A successful execution can still receive a negative rating if the output is incorrect, incomplete, or not useful.

After rating, a **manual-evaluations** column is added to the **Executions** tab. It reflects your ratings and calculates a success rate for the Dataset.

**Example**

If a Dataset has three Test Cases:

* Two are rated positive
* One is rated negative

The result is a **67% manual success rate**. This is combined with automated scores in the **Total** column.
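The rate above is simple arithmetic over the ratings. A minimal sketch of that aggregation (the rating labels are illustrative, not the platform's internal representation):

```python
# Illustrative only: compute a manual success rate from thumbs-up/down ratings,
# mirroring how the manual-evaluations column aggregates them for a Dataset.
ratings = ["positive", "positive", "negative"]  # one rating per Test Case

positives = sum(1 for r in ratings if r == "positive")
success_rate = round(100 * positives / len(ratings))

print(f"Manual success rate: {success_rate}%")  # → Manual success rate: 67%
```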

<figure><img src="https://3591250690-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEKM2LD3uNAckQgy1OUyZ%2Fuploads%2FsT12UN68UzpouNcVGYz9%2Fen-manual-evals.gif?alt=media&#x26;token=ddbc4c18-89ab-466c-b0e5-7b7cd52ce4e6" alt=""><figcaption></figcaption></figure>

#### **Annotations**

Annotations let you record findings directly on:

* **Tools**
* **System Message**
* **User Message**
* **Guardrails**

To add one, open an execution and click **Add Annotation** below the relevant section.

Focus on explaining *why* something worked or failed. Clear, specific notes become more valuable over time.

A single annotation reflects one execution. Multiple annotations reveal patterns, such as recurring issues or misinterpretations. These patterns make Insights more useful.

**Examples**

*Tools*

> The tool returned irrelevant data because the index is outdated. The source needs to be refreshed.

*System Message*

> The rule rejected a valid output because `Flatten` wasn’t recognized as a synonym for `Merge`. The prompt needs more flexibility.

*User Message*

> The input is clear. The issue comes from the System Message, not this part.

#### **Insights**

After adding several annotations, go to the **Annotation insights** tab and click **Generate Insights**.

<figure><img src="https://3591250690-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEKM2LD3uNAckQgy1OUyZ%2Fuploads%2FBMKjVwJ3igQGe7sL8JKE%2FInsights.gif?alt=media&#x26;token=5f9b535c-31d6-4aaf-80e0-ababf11451ea" alt=""><figcaption></figcaption></figure>

The Insights view groups patterns across annotations and highlights:

* What’s working
* What needs attention
* Top issues
* Recommended fixes

This reduces the need to review each execution individually.

{% hint style="warning" icon="lightbulb-on" %}
For better results, annotate at least 5 to 10 Test Cases. More variety leads to more reliable insights.
{% endhint %}

### **Saving results**

All executions are stored in the **Executions** tab with the following retention rules:

* **Auto-save**: Executions are stored for 5 days.
* **Persistent storage**: Select executions and click **Save** to keep them permanently. Note that the pipeline itself must be saved before executions can be stored this way.

## **Versioning and iteration**

[Versioning](https://docs.digibee.com/documentation/connectors-and-triggers/connectors/ai-tools/llm/..#version-the-component) helps you test changes in a controlled way. When used with Datasets and Evaluations, it gives you a consistent setup where you can measure the impact of changes instead of relying on assumptions.

You can reuse the same Dataset across different Agent versions. This keeps the test conditions the same while only the configuration changes. For example:

* **v1** → Model: gpt-4o
* **v2** → Model: gpt-5
* **v3** → Updated System Message

#### **Recommended approach**

1. **Start simple**: Work on a single prompt without creating versions at first. Focus on fixing issues by improving instructions, structure, or constraints.
2. **Create a baseline**: When you need to compare results, save the current setup as a version. Use it as your main reference point.
3. **Test alternatives**: Try different configurations. Only save new versions when you actually need to compare them.
4. **Keep tests consistent**: Use the same Dataset across versions so results are comparable.
5. **Promote improvements**: If a version performs better than your baseline, make it your new main version.

Use the same approach when updating Datasets or evaluation rules: define a stable baseline, test under the same conditions, and promote changes only after they are validated.

## **Next steps**

Now that you understand both the conceptual model and the implementation process, learn how to [**build your first AI testing workflow using Datasets, Evaluations, and Versioning**](https://app.gitbook.com/s/aD6wuPRxnEQEsYpePq36/quickstarts/first-ai-testing-workflow).
