Optimize the AI auditor with semantic intelligence

Learn how to use Versioning to improve your evaluator agent, moving from rigid to semantic validation and comparing results with the same Dataset.

This is the second guide in a two-part series. Building on the Evaluator configured in Automate AI task validation with a quality gate, this guide shows you how to use Versioning to iterate on your auditor prompt and transform a rigid, literal evaluator into a Semantic Auditor that understands business intent, without losing its ability to catch real errors.

Prerequisites

Before you begin, make sure you have:

  • Completed the Automate AI task validation with a quality gate guide, where you configured the Evaluator Agent and ran the Task Accuracy Benchmark Dataset.

  • The Task Accuracy Benchmark Dataset available in your project with both Test Cases intact.

Scenario

In the Automate AI task validation with a quality gate guide, the Evaluator Agent returned a false negative on the Semantic Consistency test case: it rejected a set of fully correct tasks because they used technical vocabulary (Flatten, Purge, Append) instead of the exact terms in the Standard Protocol (Merge, Remove, Create).

The goal of this guide is to fix that behavior through prompt iteration, without changing the Dataset or the Evaluation Rule. Any difference in results will reflect only the prompt change.

Note:

Both guides use GPT-4o Mini. This model was chosen because it follows structured instructions consistently, which makes the evaluation results more predictable. If you use a different model, your results may vary.

Step-by-step

Step 1: Save the baseline version

Before making any changes, save the current Agent configuration as a named version. This preserves the original behavior as a reference point for comparison.

  1. Open the Agent Component (Evaluator).

  2. Navigate to the versions dropdown. If you have not saved any versions yet, it will be labeled Draft.

  3. Click Create new version, add the literal-auditor tag, and save the current configuration.

This is your baseline. Any regression introduced by the new prompt will be visible when you compare results against this version.

Step 2: Update the System Message

Replace the current System Message with the prompt below. This version shifts the evaluation logic from literal word-matching to functional intent mapping. The auditor will now accept tasks that achieve the same technical result, even if they use different terminology.

You are a Senior Quality Assurance Analyst specializing in semantic evaluation. Your goal is to compare "Expected Tasks" with "AI-Generated Tasks" by focusing on functional intent rather than literal word-matching.

### EVALUATION RULES:
1. CONCEPTUAL MATCHING: You must identify tasks that achieve the same technical result, even if they use complex jargon, technical synonyms, or different phrasing. 
   - Example: "Flatten input files" matches "Merge CSV files".
   - Example: "ISO-8601 derived mask" matches "Date formatting".
2. MISSING: A task is only considered present if a functionally equivalent action can be clearly identified in the generated list. Vague or partial coverage does not count as a match.
3. EXTRA: Identify tasks that add new functionality not requested in the "Expected" list.

### DECISION LOGIC:
- If all core requirements from the "Expected" list are addressed (even through technical synonyms), set 'answer' to "Passed".
- If any core requirement is functionally omitted, set 'answer' to "Not Passed".

### OUTPUT FORMAT:
Return a JSON object with:
- "reasoning": A brief explanation of how you mapped the synonyms to the expected tasks.
- "answer": "Passed" or "Not Passed".

Once updated, save this configuration as a new version named semantic-auditor.

You now have two comparable configurations:

  • literal-auditor: Keyword-based evaluation, no synonym tolerance.

  • semantic-auditor: Intent-based evaluation, accepts functional equivalents.
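Both versions share the same OUTPUT FORMAT contract, which makes their results directly comparable. If you ever consume these outputs outside the platform, a minimal validity check might look like the sketch below (a hypothetical helper, not a platform feature):

```python
import json

REQUIRED_KEYS = {"reasoning", "answer"}
VALID_ANSWERS = {"Passed", "Not Passed"}

def is_valid_auditor_output(raw: str) -> bool:
    """Return True if a raw model response matches the OUTPUT FORMAT contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and REQUIRED_KEYS <= data.keys()
        and data["answer"] in VALID_ANSWERS
    )

print(is_valid_auditor_output('{"reasoning": "All tasks mapped.", "answer": "Passed"}'))  # True
print(is_valid_auditor_output('{"answer": "Maybe"}'))                                     # False
```

A check like this catches malformed responses early, before the extraction step described in the next section has anything to parse.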

Step 3: Re-run the Dataset

Go to the Datasets tab, select Task Accuracy Benchmark, and click Run.

Because the Dataset, Test Cases, and Evaluation Rule remain unchanged, any behavioral difference reflects only the prompt modification.

When the Dataset runs:

  • Both Test Cases are executed sequentially against the updated Agent.

  • For each execution, the platform extracts the value at $.body.answer.

  • The Evaluation verifies whether it matches Passed.

  • Each Test Case is marked as Passed or Failed independently.
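In plain Python, the extraction-and-check step above looks roughly like this. The response shape is an assumption inferred from the $.body.answer path; the platform performs this internally:

```python
import json

def evaluation_outcome(execution_response: str) -> str:
    """Mimic the Evaluation Rule: extract $.body.answer and check it equals 'Passed'."""
    payload = json.loads(execution_response)
    answer = payload.get("body", {}).get("answer")  # equivalent of JSONPath $.body.answer
    return "Passed" if answer == "Passed" else "Failed"

response = '{"body": {"reasoning": "All synonyms mapped to expected tasks.", "answer": "Passed"}}'
print(evaluation_outcome(response))  # Passed
```

Note that a response of "Not Passed" (or a missing field) both resolve to a Failed Test Case; the rule only checks for an exact match on "Passed".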

Step 4: Compare the results

After the run completes, compare the outcomes across both versions:

| Test Case | literal-auditor | semantic-auditor | What it means |
| --- | --- | --- | --- |
| Semantic Consistency | ❌ Failed | ✅ Passed | False negative eliminated. Valid synonyms are now accepted. |
| Functional Omission | ❌ Failed | ❌ Failed | Correct behavior maintained. The missing Pivot Table task is still caught. |

The Semantic Consistency case now passes. Open the result to inspect the output:

{
  "reasoning": "The AI-Generated Tasks successfully cover all the core requirements from the Expected Tasks by using different terminology and technical jargon. Specifically, 'Flatten the input trio of CSV files into a single unified array structure' matches the merging of CSV files. 'Enforce primary key uniqueness by purging all overlapping transaction records' corresponds to removing duplicate Transaction IDs. 'Apply an ISO-8601 derived mask to all temporal entries for a day-first display' aligns with converting date columns to DD/MM/YYYY format. 'Standardize the float precision to Brazilian monetary specifications (R$)' matches formatting the 'Total Value' column as local currency. Finally, 'Append a multidimensional summary worksheet to aggregate results per salesperson' matches creating a Pivot Table for summarizing sales. There are no missing tasks, and no extra functionality was added. Therefore, all core requirements are addressed.",
  "answer": "Passed"
}

The Functional Omission case still returns Not Passed, which is the expected behavior. The updated prompt is flexible enough to resolve synonyms, yet still rigorous enough to detect a missing step. The Pivot Table requirement has no functional equivalent in the generated list, so it's correctly flagged as absent.

Note:

Making an auditor more semantically flexible introduces a risk: it might become too tolerant and stop catching real errors. The fact that Functional Omission still fails is your regression test, confirming that this didn't happen here.
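That version-to-version comparison can be expressed as a tiny regression check. A sketch in Python, with the outcomes hard-coded from the results of both runs:

```python
# Outcomes from the two runs, keyed by Test Case name.
baseline = {"Semantic Consistency": "Failed", "Functional Omission": "Failed"}   # literal-auditor
candidate = {"Semantic Consistency": "Passed", "Functional Omission": "Failed"}  # semantic-auditor

# A regression is any case that passed on the baseline but fails on the candidate.
regressions = [case for case in baseline
               if baseline[case] == "Passed" and candidate[case] == "Failed"]

# An improvement is the reverse: a baseline failure that now passes.
improvements = [case for case in baseline
                if baseline[case] == "Failed" and candidate[case] == "Passed"]

print("Regressions:", regressions)    # []
print("Improvements:", improvements)  # ['Semantic Consistency']
```

An empty regressions list is exactly what the Functional Omission result demonstrates: the new prompt gained tolerance without losing strictness.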

What this validates

This experiment confirms that prompt iteration with controlled versioning produces measurable, comparable results.

By keeping the Dataset and Evaluation Rule unchanged across both runs, you isolated the variable to the prompt alone. The outcome proves that:

  • The semantic-auditor prompt resolves technical synonyms without introducing false positives.

  • The quality gate remains strict where strictness is required. Genuine omissions are still caught.

  • The literal-auditor version is preserved as a rollback point if the new behavior introduces regressions in other scenarios.

Why this matters

Prompt engineering is an iterative, experimental process. Without versioning and a reusable test Dataset, each iteration is a blind change: you can't tell whether a new prompt is better or just different.

By combining Versioning with Datasets and Evaluations, you can:

  • Measure the impact of every prompt change against identical test conditions.

  • Detect regressions before they reach production.

  • Build a documented history of auditor behavior across configurations.

  • Promote only the versions that pass all test cases.

This is the foundation of a reliable, production-ready AI evaluation pipeline.
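A promotion gate can make that last point concrete: compare each version's outcomes against the expected result per Test Case (remember that Functional Omission is designed to show Failed). A rough sketch, with the version names and hard-coded outcomes taken from this guide:

```python
# Expected Dataset outcome per Test Case.
# Functional Omission is *supposed* to show Failed: the auditor must reject it.
expected = {"Semantic Consistency": "Passed", "Functional Omission": "Failed"}

def promotable(actual: dict) -> bool:
    """A version is promotable only if every Test Case matches its expected outcome."""
    return all(actual.get(case) == outcome for case, outcome in expected.items())

literal_auditor = {"Semantic Consistency": "Failed", "Functional Omission": "Failed"}
semantic_auditor = {"Semantic Consistency": "Passed", "Functional Omission": "Failed"}

print(promotable(literal_auditor))   # False
print(promotable(semantic_auditor))  # True
```

Framing the gate as expected-vs-actual, rather than "all green", is what lets a deliberately failing case like Functional Omission keep doing its job as a regression test.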
