Automate AI task validation with a quality gate
Learn how to build an agent that checks a list of tasks and decides if they meet the expected criteria.
This quickstart shows you how to configure an Agent Component as an automated auditor to validate AI-generated tasks against a defined standard and how to diagnose its behavior after a run.
What you'll build
You’ll configure an Evaluator Agent that verifies whether a list of tasks is correct before the pipeline continues.
In this example:
A Step Generator creates tasks to consolidate sales data from multiple spreadsheets.
The Evaluator compares those tasks with the expected ones and determines if they pass.
To keep this guide focused:
Generated tasks come from a Dataset
Expected tasks are defined directly in the User Message
In a real scenario, both would come from external sources.
Prerequisites
Before you begin, make sure you have:
An API key from an LLM provider (for example, OpenAI, Anthropic, or Google).
The API key registered in Digibee as a Secret Key account. For details, see how to create a Secret Key account.
Read the conceptual guide “Agent testing with Datasets, Evaluations, and Versioning” to understand the testing structure and terminology used in this quickstart.
Initial setup
Add the Agent Component to your pipeline immediately after the trigger and configure it as follows:
Model: Select GPT-4o Mini (recommended for this quickstart). This model was chosen because it tends to follow structured instructions consistently, which makes the evaluation results more predictable. Results may vary if you use a different model.
Account: Click the gear icon next to the Model parameter, go to Account, and select the Secret Key account you created in Digibee.
Once the basic configuration is complete, configure the following messages and JSON Schema:
System Message
This version is intentionally strict. You'll see why this is a problem after the first run.
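The exact prompt text isn't reproduced on this page. A deliberately strict System Message along these lines illustrates the idea (the wording below is an assumption, not the platform's exact prompt):

```
You are a strict task auditor. Compare the Generated Tasks against the
Expected Tasks. A Generated Task only counts as a match if its wording is
identical to an Expected Task. If any Expected Task has no identical match,
or if any extra task is present, the verdict is "Not Passed". Otherwise it
is "Passed". Respond only in the required JSON format.
```

Note the literal-match requirement: this is the strictness that produces the false negative you will diagnose after the first run.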
User Message
The prompt includes both the Expected Tasks (hardcoded) and the {{ message.agent1_output }} variable, which is automatically injected from your Dataset at runtime.
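For illustration, a User Message following this pattern would work. The five expected tasks below are inferred from the Test Cases used later in this guide; treat the exact wording as an assumption:

```
Expected Tasks:
1. Merge the three CSV files into a single consolidated sheet.
2. Remove duplicate records based on the transaction ID.
3. Convert all dates to the DD/MM/YYYY format.
4. Format monetary columns using the Brazilian Real (R$) standard.
5. Create a Pivot Table summarizing total sales per salesperson.

Generated Tasks:
{{ message.agent1_output }}
```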
JSON Schema
Click the gear icon (⚙️) next to Model, enable Use JSON Schema, and define the schema below. This enforces a consistent output format so your pipeline can always parse the result reliably.
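The schema itself isn't shown on this page. A minimal version that enforces a Passed / Not Passed verdict plus a justification could look like this; the field names are an assumption, chosen so the verdict surfaces at the `$.body.answer` path used by the Evaluation Rule in this guide:

```json
{
  "type": "object",
  "properties": {
    "answer": {
      "type": "string",
      "enum": ["Passed", "Not Passed"]
    },
    "justification": {
      "type": "string"
    }
  },
  "required": ["answer", "justification"]
}
```

The `enum` constraint keeps the verdict to exactly two values, which is what makes exact string matching viable downstream.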
Step-by-step
In the next steps, you will create a mocked Dataset to simulate Agent 1's behavior and configure an Evaluation Rule to automate the verdict.
Create two Test Cases in a Dataset
Create a new Dataset named Task Accuracy Benchmark. Inside it, create two Test Cases to simulate different Agent 1 behaviors.
Because your User Message contains the variable {{ message.agent1_output }}, each Test Case can inject a different generated output without modifying the prompt.
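For example, a Test Case's input payload might inject the generated tasks like this (an illustrative structure; the real Dataset field layout may differ):

```json
{
  "agent1_output": "- Merge the three provided documents into one main sheet.\n- Clean the database by deleting records with duplicate IDs.\n- Ensure all dates are set to the DD/MM/YYYY standard."
}
```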

Semantic Consistency
- Flatten the input trio of CSV files into a single unified array structure.
- Enforce primary key uniqueness by purging all overlapping transaction records.
- Apply an ISO-8601 derived mask to all temporal entries for a day-first display.
- Standardize the float precision to Brazilian monetary specifications (R$).
- Append a multidimensional summary worksheet to aggregate results per salesperson.
All 5 tasks are functionally correct, but described with different technical vocabulary. Tests whether the auditor resolves synonyms.
Functional Omission
- Merge the three provided documents into one main sheet.
- Clean the database by deleting records with duplicate IDs.
- Ensure all dates are set to the DD/MM/YYYY standard.
- Format the financial columns to show the BRL symbol.
- Change the font of the entire spreadsheet to Arial and make the headers bold.
The Pivot Table step is genuinely missing and there's one extra step. Tests whether the auditor catches both.
Add an Evaluation Rule
Create an Evaluation to automate the verdict after each run:
Eval Name: final_status_check
JSONPath: $.body.answer
Scorer Type: STRING
Operator: Exact
Then:
1. Click the three dots next to the Evaluation and select Add to Dataset. Choose the Task Accuracy Benchmark Dataset.
2. Open the Datasets tab, access each Test Case created in Step 1, and set the Evaluation Value to Passed for both.

What does this check?
After each run, the platform extracts the value at $.body.answer from the Agent output and compares it against Passed using exact matching. If the values match, the Test Case passes. If not, it fails.
The Evaluation Value for both Test Cases is set to Passed because the Evaluator is expected to approve the tasks in both cases. When the actual results differ from that, the Dataset will automatically flag the discrepancy.
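Conceptually, the check is equivalent to this small sketch (illustrative Python, not Digibee's implementation; the sample output is hypothetical):

```python
import json

def evaluate(execution_output: str, expected: str = "Passed") -> bool:
    """Extract $.body.answer from the Agent output and compare it exactly."""
    payload = json.loads(execution_output)
    # JSONPath $.body.answer: walk into "body", then read "answer"
    actual = payload.get("body", {}).get("answer")
    # Operator "Exact": case-sensitive string equality
    return actual == expected

# A hypothetical Agent output for a rejected Test Case:
output = '{"body": {"answer": "Not Passed", "justification": "Wording differs."}}'
print(evaluate(output))  # False -> the Test Case is marked Failed
```

If the Agent's JSON Schema guarantees the `answer` field, the only two outcomes are an exact match (Passed) or a mismatch (Failed).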
Run and inspect the results
In the Datasets tab, select Task Accuracy Benchmark and click Run. Wait for both Test Cases to complete.
When the Dataset runs:
Both Test Cases are executed sequentially.
For each execution, the platform extracts the value at $.body.answer.
The Evaluation checks whether it matches Passed.
Each Test Case is marked as Passed or Failed independently.

Both Test Cases will return Not Passed. The Functional Omission failure is expected because a task is genuinely missing. Focus on Semantic Consistency: it fails even though all 5 tasks are functionally correct. Open it in the Executions tab to inspect the output:
What did we learn?
This is a false negative: all 5 tasks are functionally correct, but the Evaluator rejected them entirely.
It couldn't recognize that different terminology can describe the same operation, for example that "a multidimensional summary worksheet" is the same as a Pivot Table. Because the prompt compares tasks literally instead of semantically, even correct outputs fail.
Next steps
You have successfully built and tested a quality gate and identified exactly why it is not yet ready for production. Ready to fix it? In the Optimize the AI auditor with semantic intelligence guide, you will version the System Prompt and implement semantic refinement.