Automate AI task validation with a quality gate

Learn how to build an agent that checks a list of tasks and decides whether they meet the expected criteria.

This quickstart shows you how to configure an Agent Component as an automated auditor to validate AI-generated tasks against a defined standard and how to diagnose its behavior after a run.

What you'll build

You’ll configure an Evaluator Agent that verifies whether a list of tasks is correct before the pipeline continues.

In this example:

  • A Step Generator creates tasks to consolidate sales data from multiple spreadsheets.

  • The Evaluator compares those tasks with the expected ones and determines if they pass.


To keep this guide focused:

  • Generated tasks come from a Dataset

  • Expected tasks are defined directly in the User Message

In a real scenario, both would come from external sources.

Prerequisites

Before you begin, make sure you have:

  • A Digibee pipeline with a trigger where you can add the Agent Component.

  • A Secret Key account configured in Digibee with your model provider credentials (you will select it in the Account parameter below).

Initial setup

Add the Agent Component to your pipeline immediately after the trigger and configure it as follows:

  • Model: Select GPT-4o Mini (recommended for this quickstart). This model was chosen because it tends to follow structured instructions consistently, which makes the evaluation results more predictable. Results may vary if you use a different model.

  • Account: Click the gear icon next to the Model parameter, go to Account, and select the Secret Key account you created in Digibee.

Once the basic configuration is complete, configure the following messages and JSON Schema:

System Message

This version is intentionally strict. You'll see why this is a problem after the first run.
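The exact prompt isn't reproduced in this extract. As a rough, illustrative sketch, a deliberately strict version could read something like:

```
You are a quality auditor. Compare the Generated Tasks against the Expected Tasks.
A generated task only counts as a match if it describes the same step using the same terms as the expected task.
If every expected task has an exact match and there are no extra tasks, set "answer" to "Passed".
Otherwise, set "answer" to "Not Passed" and explain which tasks did not match.
Respond only with JSON that follows the configured schema.
```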

User Message

The prompt includes both the Expected Tasks (hardcoded) and the {{ message.agent1_output }} variable, which is automatically injected from your Dataset at runtime.
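The full prompt isn't shown here. A sketch of its shape is below; the five expected tasks are inferred from the test cases in the next section and may differ from the ones in the original guide:

```
Expected Tasks:
- Merge the three sales spreadsheets into a single sheet.
- Remove records with duplicate transaction IDs.
- Convert all dates to the DD/MM/YYYY format.
- Format the financial columns as Brazilian currency (R$).
- Add a Pivot Table summarizing results per salesperson.

Generated Tasks:
{{ message.agent1_output }}

Compare the Generated Tasks with the Expected Tasks and return your verdict.
```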

JSON Schema

Click the gear icon (⚙️) next to Model, enable Use JSON Schema, and define the schema below. This enforces a consistent output format so your pipeline can always parse the result reliably.
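The schema from the original guide isn't reproduced here. A minimal sketch consistent with the $.body.answer check used later could look like the following; the justification field is an illustrative addition, not something the guide requires:

```json
{
  "type": "object",
  "properties": {
    "answer": {
      "type": "string",
      "enum": ["Passed", "Not Passed"]
    },
    "justification": {
      "type": "string"
    }
  },
  "required": ["answer"],
  "additionalProperties": false
}
```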

Step-by-step

In the next steps, you will create a mocked Dataset to simulate Agent 1's behavior and configure an Evaluation Rule to automate the verdict.

1. Create two Test Cases in a Dataset

Create a new Dataset named Task Accuracy Benchmark. Inside it, create two Test Cases to simulate different Agent 1 behaviors.

Because your User Message contains the variable {{ message.agent1_output }}, each Test Case can inject a different generated output without modifying the prompt.

Test Case 1: Semantic Consistency

Value injected into message.agent1_output:

- Flatten the input trio of CSV files into a single unified array structure.
- Enforce primary key uniqueness by purging all overlapping transaction records.
- Apply an ISO-8601 derived mask to all temporal entries for a day-first display.
- Standardize the float precision to Brazilian monetary specifications (R$).
- Append a multidimensional summary worksheet to aggregate results per salesperson.

What it tests: All 5 tasks are functionally correct, but described with different technical vocabulary. Tests whether the auditor resolves synonyms.

Test Case 2: Functional Omission

Value injected into message.agent1_output:

- Merge the three provided documents into one main sheet.
- Clean the database by deleting records with duplicate IDs.
- Ensure all dates are set to the DD/MM/YYYY standard.
- Format the financial columns to show the BRL symbol.
- Change the font of the entire spreadsheet to Arial and make the headers bold.

What it tests: The Pivot Table step is genuinely missing and there is one extra step. Tests whether the auditor catches both.

2. Add an Evaluation Rule

Create an Evaluation to automate the verdict after each run:

  • Eval Name: final_status_check

  • JSONPath: $.body.answer

  • Scorer Type: STRING

  • Operator: Exact

Then:

  1. Click the three dots next to the Evaluation and select Add to Dataset. Choose the Task Accuracy Benchmark Dataset.

  2. Open the Datasets tab, access each Test Case created in Step 1, and set the Evaluation Value to Passed for both.


What does this check?

After each execution, this Evaluation reads the value at $.body.answer in the Agent's response and compares it with the expected Evaluation Value using an exact string match. Because you set that value to Passed, a Test Case only passes when the Evaluator returns exactly Passed.
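For reference, a response that satisfies the check would look roughly like the snippet below. This assumes the Agent's JSON output is exposed under body in the execution payload and follows the schema sketched earlier; the justification field is illustrative.

```json
{
  "body": {
    "answer": "Passed",
    "justification": "All expected tasks are present and correctly described."
  }
}
```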

3. Run and inspect the results

In the Datasets tab, select Task Accuracy Benchmark and click Run. Wait for both Test Cases to complete.

When the Dataset runs:

  • Both Test Cases are executed sequentially.

  • For each execution, the platform extracts the value at $.body.answer.

  • The Evaluation checks whether it matches Passed.

  • Each Test Case is marked as Passed or Failed independently.

Both Test Cases fail: the Evaluator returns Not Passed in each run. For Functional Omission this is expected, because a task is genuinely missing. Focus on Semantic Consistency: it fails even though all 5 tasks are correct. Open it in the Executions tab to inspect the output:
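The exact output depends on the model, but a Not Passed verdict for Semantic Consistency might look something like this (an illustrative example following the schema sketched earlier; the justification field is an assumption):

```json
{
  "answer": "Not Passed",
  "justification": "The generated tasks do not use the expected wording. For example, 'Append a multidimensional summary worksheet to aggregate results per salesperson' was not recognized as the expected Pivot Table step."
}
```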

What did we learn?

This is a false negative: all 5 tasks are functionally correct, but the Evaluator rejected them entirely.

It couldn't recognize equivalent terminology, for example that "a multidimensional summary worksheet" describes the same step as a Pivot Table. Because the prompt compares tasks literally instead of semantically, even correct outputs fail.

Next steps

You have successfully built and tested a quality gate and identified exactly why it is not yet ready for production. Ready to fix it? In the Optimize the AI auditor with semantic intelligence guide, you will version the System Prompt and implement semantic refinement.
