# Automate AI task validation with a quality gate

This quickstart shows you how to configure an **Agent Component** as an automated auditor to validate AI-generated tasks against a defined standard and how to diagnose its behavior after a run.

## **What you'll build**

You’ll configure an **Evaluator** Agent that verifies whether a list of tasks is correct before the pipeline continues.

In this example:

* A **Step Generator** creates tasks to consolidate sales data from multiple spreadsheets.
* The **Evaluator** compares those tasks with the expected ones and determines if they pass.

{% hint style="info" %}

#### **To keep this guide focused:**

* Generated tasks come from a **Dataset**.
* Expected tasks are defined directly in the **User Message**.

In a real scenario, both would come from external sources.
{% endhint %}

## **Prerequisites**

Before you begin, make sure you have:

* An API key from an LLM provider (for example, OpenAI, Anthropic, or Google).
* The API key registered in Digibee as a **Secret Key** account. For details, see [how to create a Secret Key account](https://app.gitbook.com/s/jvO5S91EQURCEhbZOuuZ/platform-administration/settings/accounts#secret-key).
* Familiarity with the conceptual guide [**“Agent testing with Datasets, Evaluations, and Versioning”**](https://app.gitbook.com/s/EKM2LD3uNAckQgy1OUyZ/connectors/ai-tools/llm/testing-your-agent), which explains the testing structure and terminology used in this quickstart.

## **Initial setup**

Add the **Agent Component** to your pipeline immediately after the trigger and configure it as follows:

* **Model:** Select `GPT-4o Mini` (recommended for this quickstart). This model was chosen because it tends to follow structured instructions consistently, which makes the evaluation results more predictable. Results may vary if you use a different model.
* **Account:** Click the gear icon next to the Model parameter, go to **Account**, and select the Secret Key account you created in Digibee.

Once the basic configuration is complete, configure the following messages and JSON Schema:

**System Message**

This version is intentionally strict. You'll see why this is a problem after the first run.

{% code overflow="wrap" %}

```
You are a strict compliance auditor. You must compare "Expected Tasks" with "Generated Tasks".

CRITICAL RULE: Only accept tasks that use the exact same terminology or very direct synonyms. If a task is phrased in a way that requires deep interpretation or technical jargon not present in the expected list, mark it as Missing.

You are not allowed to be "flexible". If in doubt, mark it as Not Passed.
Return JSON: {"reasoning": "...", "answer": "..."}.
```

{% endcode %}

**User Message**

The prompt includes both the Expected Tasks (hardcoded) and the `{{ message.agent1_output }}` variable, which is automatically injected from your Dataset at runtime.

{% code overflow="wrap" %}

```
### TASK EVALUATION REQUEST ###

**Context:** Consolidate quarterly sales data from three CSV files into a formatted master spreadsheet.

**Expected Tasks (Standard Protocol):**
1. Merge the 3 source CSV files into one primary worksheet tab.
2. Remove all rows containing duplicate Transaction IDs to ensure data integrity.
3. Convert all date columns to the DD/MM/YYYY format.
4. Format the "Total Value" column as local currency (BRL/R$).
5. Create a secondary tab containing a Pivot Table summarizing total sales per salesperson.

**AI-Generated Tasks (Agent 1 Output):**
{{ message.agent1_output }}

### EVALUATION INSTRUCTIONS ###
Compare the "AI-Generated Tasks" against the "Expected Tasks" provided above.
1. Identify **Matching** tasks (check for functional intent).
2. Identify **Missing** tasks (mandatory steps not found).
3. Identify **Extra** tasks (unnecessary steps).

Analyze carefully and return your result strictly as a structured JSON.
```

{% endcode %}

**JSON Schema**

Click the gear icon (⚙️) next to **Model**, enable **Use JSON Schema**, and define the schema below. This enforces a consistent output format so your pipeline can always parse the result reliably.

```json
{
  "type": "object",
  "properties": {
    "reasoning": { "type": "string" },
    "answer": { "type": "string", "enum": ["Passed", "Not Passed"] }
  },
  "required": ["reasoning", "answer"],
  "additionalProperties": false
}
```
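If you want to sanity-check the Agent's output downstream, the schema's constraints can be mirrored with a small stdlib-only Python check. This is a hypothetical helper for illustration, not part of the platform:

```python
# Minimal stand-in for the JSON Schema above: checks the required keys,
# the "answer" enum, and rejects extra properties.
import json

ALLOWED_ANSWERS = {"Passed", "Not Passed"}

def is_valid_verdict(raw: str) -> bool:
    """Return True if `raw` is a JSON object matching the Evaluator schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # "required" plus "additionalProperties": false means exactly these keys.
    if set(data.keys()) != {"reasoning", "answer"}:
        return False
    return isinstance(data["reasoning"], str) and data["answer"] in ALLOWED_ANSWERS

print(is_valid_verdict('{"reasoning": "All tasks match.", "answer": "Passed"}'))  # True
print(is_valid_verdict('{"answer": "Maybe"}'))  # False
```

Because the schema constrains `answer` to exactly `Passed` or `Not Passed`, any step after the Agent can branch on that field without defensive string parsing.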

## **Step-by-step**

In the next steps, you will create a mocked Dataset to simulate Agent 1's behavior and configure an Evaluation Rule to automate the verdict.

{% stepper %}
{% step %}

### Create two Test Cases in a Dataset

Create a new **Dataset** named `Task Accuracy Benchmark`. Inside it, create two **Test Cases** to simulate different Agent 1 behaviors.

Because your User Message contains the variable `{{ message.agent1_output }}`, each Test Case can inject a different generated output without modifying the prompt.

<figure><img src="https://3750561495-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FaD6wuPRxnEQEsYpePq36%2Fuploads%2FjDTMLVLFcwoZxK6wDn3M%2Fen-create-test-cases.gif?alt=media&#x26;token=fd1a3a57-2cfe-4f28-925d-ee2134208445" alt=""><figcaption></figcaption></figure>

<table><thead><tr><th width="134">Test Case name</th><th width="326">Value injected into message.agent1_output</th><th>What it tests</th></tr></thead><tbody><tr><td><strong>Semantic Consistency</strong></td><td><p><em>- Flatten the input trio of CSV files into a single unified array structure.</em> </p><p><em>- Enforce primary key uniqueness by purging all overlapping transaction records.</em> </p><p><em>- Apply an ISO-8601 derived mask to all temporal entries for a day-first display.</em> </p><p><em>- Standardize the float precision to Brazilian monetary specifications (R$).</em> </p><p><em>- Append a multidimensional summary worksheet to aggregate results per salesperson.</em></p></td><td>All 5 tasks are functionally correct, but described with different technical vocabulary. Tests whether the auditor resolves synonyms.</td></tr><tr><td><strong>Functional Omission</strong></td><td><p><em>- Merge the three provided documents into one main sheet.</em> </p><p><em>- Clean the database by deleting records with duplicate IDs.</em> </p><p><em>- Ensure all dates are set to the DD/MM/YYYY standard.</em> </p><p><em>- Format the financial columns to show the BRL symbol.</em> </p><p><em>- Change the font of the entire spreadsheet to Arial and make the headers bold.</em></p></td><td>The Pivot Table step is genuinely missing and there's one extra step. Tests whether the auditor catches both.</td></tr></tbody></table>
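Conceptually, each Test Case behaves like a template substitution: the Dataset value replaces the `{{ message.agent1_output }}` placeholder before the prompt reaches the model. A rough sketch of that idea (the platform's actual templating mechanism may differ):

```python
# Illustrative only: shows how a Test Case value could be injected into
# the User Message placeholder at runtime.
USER_MESSAGE = (
    "**AI-Generated Tasks (Agent 1 Output):**\n"
    "{{ message.agent1_output }}"
)

def inject(template: str, agent1_output: str) -> str:
    """Replace the placeholder with the Test Case's injected value."""
    return template.replace("{{ message.agent1_output }}", agent1_output)

prompt = inject(USER_MESSAGE, "- Merge the three provided documents into one main sheet.")
print(prompt)
```

This is why both Test Cases can reuse the same prompt: only the injected value changes between runs.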
{% endstep %}

{% step %}

### Add an Evaluation Rule

Create an **Evaluation** to automate the verdict after each run:

| Field           | Value                |
| --------------- | -------------------- |
| **Eval Name**   | `final_status_check` |
| **JSONPath**    | `$.body.answer`      |
| **Scorer Type** | `STRING`             |
| **Operator**    | `Exact`              |

Then:

1. Click the three dots next to the Evaluation and select **Add to Dataset**. Choose the `Task Accuracy Benchmark` Dataset.
2. Open the **Datasets** tab, access each Test Case created in Step 1, and set the **Evaluation Value** to `Passed` for both.

<figure><img src="https://3750561495-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FaD6wuPRxnEQEsYpePq36%2Fuploads%2FsHarPieOnIbByYs9H6NH%2Fen-create-evals-v1.gif?alt=media&#x26;token=2b880f0e-b9c2-4465-9fe7-37be548c2881" alt=""><figcaption></figcaption></figure>

{% hint style="warning" icon="vial" %}

#### **What does this check?**

After each run, the platform extracts the value at `$.body.answer` from the Agent output and compares it against `Passed` using exact matching. If the values match, the Test Case passes. If not, it fails.

The **Evaluation Value** for both Test Cases is set to `Passed` because the Evaluator is expected to approve the tasks in both cases. When the actual results differ from that, the Dataset will automatically flag the discrepancy.
{% endhint %}
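The check described above amounts to a JSONPath lookup followed by an exact string comparison. A stdlib-only approximation of extracting `$.body.answer` and applying the `Exact` operator (illustrative, not the platform's implementation):

```python
# Approximates the Evaluation: extract $.body.answer from the Agent
# output and compare it exactly against the expected Evaluation Value.
def evaluate(execution: dict, expected: str = "Passed") -> bool:
    """Return True when the extracted answer exactly matches `expected`."""
    answer = execution.get("body", {}).get("answer")  # $.body.answer
    return answer == expected

run = {"body": {"reasoning": "All five tasks match functionally.", "answer": "Not Passed"}}
print(evaluate(run))  # False -> the Test Case is flagged as Failed
```

Because the operator is `Exact`, even a near-miss such as `passed` (lowercase) would fail, which is why the JSON Schema's `enum` constraint matters.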
{% endstep %}

{% step %}

### Run and inspect the results

In the **Datasets** tab, select `Task Accuracy Benchmark` and click **Run**. Wait for both Test Cases to complete.

When the Dataset runs:

* Both Test Cases are executed sequentially.
* For each execution, the platform extracts the value at `$.body.answer`.
* The Evaluation checks whether it matches `Passed`.
* Each Test Case is marked as **Passed** or **Failed** independently.

<figure><img src="https://3750561495-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FaD6wuPRxnEQEsYpePq36%2Fuploads%2F763otj0o8dX6yRWRrRaK%2Fsemantic-consistency-output.png?alt=media&#x26;token=2f01e876-4781-4fbd-9b91-e32a2495bf0d" alt=""><figcaption></figcaption></figure>

Both will return `Not Passed`. The **Functional Omission** result is expected because a task is genuinely missing. Focus on **Semantic Consistency**: it fails even though all 5 tasks are correct. Open it in the **Executions** tab to inspect the output:

{% code overflow="wrap" %}

```json
{
  "reasoning": "The AI-Generated Tasks do not match the Expected Tasks due to the use of different terminology and technical jargon that requires interpretation. The terms used in the AI-Generated Tasks are not direct synonyms of those in the Expected Tasks, and crucial steps such as 'Create a secondary tab containing a Pivot Table summarizing total sales per salesperson' are missing. Therefore, all tasks from the AI output are marked as Missing.",
  "answer": "Not Passed"
}
```

{% endcode %}
{% endstep %}
{% endstepper %}

## **What did we learn?**

This is a **false negative**: all 5 tasks are functionally correct, but the Evaluator rejected them entirely.

It couldn't recognize that different terminology can describe the same task, for example that "a multidimensional summary worksheet" is a Pivot Table. Because the prompt compares tasks literally instead of semantically, even correct outputs fail.

## **Next steps**

You have successfully built and tested a **quality gate** and identified exactly why it is not yet ready for production. Ready to fix it? In the [**Optimize the AI auditor with semantic intelligence**](https://docs.digibee.com/documentation/resources/quickstarts/optimize-the-ai-auditor-with-semantic-intelligence) guide, you will version the System Prompt and implement semantic refinement.
