# Run Agent tests: Results, analysis, and versioning

This guide shows how to run tests, review results, and use both automated and manual evaluations to understand what’s working and what needs improvement. You’ll also learn how to save results, identify patterns with annotations and insights, and compare different agent versions using the same dataset.

{% hint style="warning" icon="books" %}
**Prerequisite:** Read [**Build Agent tests: Datasets, Test Cases, and Evaluations**](https://docs.digibee.com/documentation/connectors-and-triggers/connectors/ai-tools/llm/testing-your-agent) to learn how to set up tests.
{% endhint %}

## **Running tests**

To run the Test Cases in a Dataset:

1. Select the Dataset.
2. Click **Run**.

A recommended workflow:

1. Create Test Cases.
2. Run them to review raw outputs.
3. Define Evaluations based on the expected structure.
4. Run again to validate pass/fail results automatically.

After the run finishes, results appear in the **Executions** tab. Click an execution to see the details.

<figure><img src="https://3591250690-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEKM2LD3uNAckQgy1OUyZ%2Fuploads%2FDEeM7QzWzzlykvUjR0qA%2Fexecution-details.gif?alt=media&#x26;token=b4a57a83-1159-4889-9ade-3be404fb9fd9" alt=""><figcaption></figcaption></figure>

## **Interpreting results**

### **Trace results**

Each execution includes traces that show how the component behaved:

* **Configuration**: Provider and model settings.
* **Input**: Context, tools configuration, and retrieval status.
* **System**: The System Message sent to the model.
* **User**: The User Message sent to the model.
* **Tool Call**: Tool calls, arguments, and results.
* **Input/Output Guardrail**: Applied guardrails and their impact.

Use these traces to validate behavior and troubleshoot issues.

#### **Output**

The final result is returned in JSON format. You can inspect and query fields using JSONPath.
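As a sketch of what querying that JSON with JSONPath looks like, the snippet below resolves simple dot/bracket paths by hand. The output structure and field names here are invented for illustration; your Agent's actual output shape depends on its configuration.

```python
import json

# Hypothetical Agent output; real field names depend on your configuration.
output = json.loads("""
{
  "response": {
    "text": "Order 1042 has shipped.",
    "toolCalls": [{"name": "lookup_order", "status": "success"}]
  }
}
""")

def jsonpath_get(doc, path):
    """Resolve a simple dot/bracket JSONPath like $.response.toolCalls[0].name."""
    current = doc
    for part in path.lstrip("$.").replace("]", "").split("."):
        if "[" in part:
            key, index = part.split("[")
            current = current[key][int(index)]
        else:
            current = current[part]
    return current

print(jsonpath_get(output, "$.response.text"))             # → Order 1042 has shipped.
print(jsonpath_get(output, "$.response.toolCalls[0].name"))  # → lookup_order
```

In practice you would paste the execution's output into any JSONPath evaluator and use expressions like `$.response.text` directly.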

#### **Eval Logs**

Eval Logs show the evaluation results for each Test Case, including pass/fail status and configuration details.

### **Manual evaluation**

Manual evaluation lets domain experts and developers assess execution quality beyond automated checks. It captures qualitative observations like ambiguous reasoning, edge cases, and prompt design issues. It includes three tools: **ratings**, **annotations**, and **insights**.

#### **Ratings**

Each execution can be marked as positive (<i class="fa-thumbs-up">:thumbs-up:</i>) or negative (<i class="fa-thumbs-down">:thumbs-down:</i>). Focus on the output quality when rating. A successful execution can still receive a negative rating if the output is incorrect, incomplete, or not useful.

After rating, a **manual-evaluations** column is added to the **Executions** tab. It reflects your ratings and calculates a success rate for the Dataset.

**Example**

If a Dataset has three Test Cases:

* Two are rated positive
* One is rated negative

The result is a **67% manual success rate**. This is combined with automated scores in the **Total** column.
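The rate above is simple arithmetic over the ratings. A minimal sketch of that aggregation (the rating labels are illustrative, not the platform's internal representation):

```python
# Illustrative only: compute a manual success rate from thumbs-up/down ratings,
# mirroring how the manual-evaluations column aggregates them for a Dataset.
ratings = ["positive", "positive", "negative"]  # one rating per Test Case

positives = sum(1 for r in ratings if r == "positive")
success_rate = round(100 * positives / len(ratings))

print(f"Manual success rate: {success_rate}%")  # → Manual success rate: 67%
```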

<figure><img src="https://3591250690-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEKM2LD3uNAckQgy1OUyZ%2Fuploads%2FsT12UN68UzpouNcVGYz9%2Fen-manual-evals.gif?alt=media&#x26;token=ddbc4c18-89ab-466c-b0e5-7b7cd52ce4e6" alt=""><figcaption></figcaption></figure>

#### **Annotations**

Annotations let you record findings directly on:

* **Tools**
* **System Message**
* **User Message**
* **Guardrails**

To add one, open an execution and click **Add Annotation** below the relevant section.

Focus on explaining *why* something worked or failed. Clear, specific notes become more valuable over time.

A single annotation reflects one execution. Multiple annotations reveal patterns, such as recurring issues or misinterpretations. These patterns make Insights more useful.

**Examples**

*Tools*

> The tool returned irrelevant data because the index is outdated. The source needs to be refreshed.

*System Message*

> The rule rejected a valid output because `Flatten` wasn’t recognized as a synonym for `Merge`. The prompt needs more flexibility.

*User Message*

> The input is clear. The issue comes from the System Message, not this part.

#### **Insights**

After adding several annotations, go to the **Annotation insights** tab and click **Generate Insights**.

<figure><img src="https://3591250690-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEKM2LD3uNAckQgy1OUyZ%2Fuploads%2FBMKjVwJ3igQGe7sL8JKE%2FInsights.gif?alt=media&#x26;token=5f9b535c-31d6-4aaf-80e0-ababf11451ea" alt=""><figcaption></figcaption></figure>

The Insights view groups patterns across annotations and highlights:

* What’s working
* What needs attention
* Top issues
* Recommended fixes

This reduces the need to review each execution individually.

{% hint style="warning" icon="lightbulb-on" %}
For better results, annotate at least 5 to 10 Test Cases. More variety leads to more reliable insights.
{% endhint %}

### **Saving results**

All executions are stored in the **Executions** tab with the following retention rules:

* **Auto-save**: Executions are stored for 5 days.
* **Persistent storage**: Select executions and click **Save** to keep them permanently. Note that the pipeline itself must be saved before executions can be stored this way.

## **Versioning and iteration**

[Versioning](https://docs.digibee.com/documentation/connectors-and-triggers/connectors/ai-tools/llm/..#version-the-component) helps you test changes in a controlled way. When used with Datasets and Evaluations, it gives you a consistent setup where you can measure the impact of changes instead of relying on assumptions.

You can reuse the same Dataset across different Agent versions. This keeps the test conditions the same while only the configuration changes. For example:

* **v1** → Model: gpt-4o
* **v2** → Model: gpt-5
* **v3** → Updated System Message

#### **Recommended approach**

1. **Start simple**: Work on a single prompt without creating versions at first. Focus on fixing issues by improving instructions, structure, or constraints.
2. **Create a baseline**: When you need to compare results, save the current setup as a version. Use it as your main reference point.
3. **Test alternatives**: Try different configurations. Only save new versions when you actually need to compare them.
4. **Keep tests consistent**: Use the same Dataset across versions so results are comparable.
5. **Promote improvements**: If a version performs better than your baseline, make it your new main version.

Use the same approach when updating Datasets or evaluation rules: define a stable baseline, test under the same conditions, and promote changes only after they are validated.

## **Next steps**

Now that you understand both the conceptual model and the implementation process, learn how to [**build your first AI testing workflow using Datasets, Evaluations, and Versioning**](https://app.gitbook.com/s/aD6wuPRxnEQEsYpePq36/quickstarts/first-ai-testing-workflow).
