# Optimize the AI auditor with semantic intelligence

This is the second guide in a two-part series. Building on the Evaluator configured in [**Automate AI task validation with a quality gate**](https://docs.digibee.com/documentation/resources/quickstarts/automate-ai-task-validation-with-a-quality-gate), this guide shows you how to use **Versioning** to iterate on your auditor prompt and transform a rigid, literal evaluator into a **Semantic Auditor** that understands business intent, without losing its ability to catch real errors.

## **Prerequisites**

Before you begin, make sure you have:

* Completed the [**Automate AI task validation with a quality gate**](https://docs.digibee.com/documentation/resources/quickstarts/automate-ai-task-validation-with-a-quality-gate) guide, where you configured the Evaluator Agent and ran the `Task Accuracy Benchmark` Dataset.
* The `Task Accuracy Benchmark` Dataset available in your project with both Test Cases intact.

## **Scenario**

In the **Automate AI task validation with a quality gate** guide, the Evaluator Agent returned a false negative on the **Semantic Consistency** test case: it rejected a set of fully correct tasks because they used technical vocabulary (`Flatten`, `Purge`, `Append`) instead of the exact terms in the Standard Protocol (`Merge`, `Remove`, `Create`).

The goal of this quickstart is to fix that behavior through prompt iteration, without changing the Dataset or the Evaluation Rule. Any difference in results will reflect only the prompt change.

{% hint style="info" %}
Both guides use `GPT-4o Mini`. This model was chosen because it follows structured instructions consistently, which makes the evaluation results more predictable. If you use a different model, your results may vary.
{% endhint %}

## **Step-by-step**

{% stepper %}
{% step %}

### Save the baseline version

Before making any changes, save the current Agent configuration as a named version. This preserves the original behavior as a reference point for comparison.

1. Open the **Agent Component** (Evaluator).
2. Navigate to the versions dropdown. If you have not saved any versions yet, it will be labeled **Draft**.
3. Click **Create new version**, add the `literal-auditor` tag, and save the current configuration.

<figure><img src="https://3750561495-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FaD6wuPRxnEQEsYpePq36%2Fuploads%2FjoNqMS5zmS6XEEM0xuQW%2Fen-create-v1.gif?alt=media&#x26;token=7dc16722-951b-4863-b494-69cf2b4cc790" alt=""><figcaption></figcaption></figure>

This is your baseline. Any regression introduced by the new prompt will be visible when you compare results against this version.
{% endstep %}

{% step %}

### Update the System Message

Replace the current System Message with the prompt below. This version shifts the evaluation logic from literal word-matching to **functional intent mapping**. The auditor will now accept tasks that achieve the same technical result, even if they use different terminology.

{% code overflow="wrap" %}

```
You are a Senior Quality Assurance Analyst specializing in semantic evaluation. Your goal is to compare "Expected Tasks" with "AI-Generated Tasks" by focusing on functional intent rather than literal word-matching.

### EVALUATION RULES:
1. CONCEPTUAL MATCHING: You must identify tasks that achieve the same technical result, even if they use complex jargon, technical synonyms, or different phrasing. 
   - Example: "Flatten input files" matches "Merge CSV files".
   - Example: "ISO-8601 derived mask" matches "Date formatting".
2. MISSING: A task is only considered present if a functionally equivalent action can be clearly identified in the generated list. Vague or partial coverage does not count as a match.
3. EXTRA: Identify tasks that add new functionality not requested in the "Expected" list.

### DECISION LOGIC:
- If all core requirements from the "Expected" list are addressed (even through technical synonyms), set 'answer' to "Passed".
- If any core requirement is functionally omitted, set 'answer' to "Not Passed".

### OUTPUT FORMAT:
Return a JSON object with:
- "reasoning": A brief explanation of how you mapped the synonyms to the expected tasks.
- "answer": "Passed" or "Not Passed".
```

{% endcode %}
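The OUTPUT FORMAT contract in the prompt above can also be enforced downstream. The sketch below is illustrative, not part of the platform: it parses an auditor reply and rejects anything that deviates from the two required fields (the field names come from the prompt; the function and sample reply are hypothetical).

```python
import json

def validate_auditor_output(raw: str) -> dict:
    """Parse the auditor's reply and enforce the OUTPUT FORMAT contract."""
    data = json.loads(raw)
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("missing or non-string 'reasoning' field")
    if data.get("answer") not in ("Passed", "Not Passed"):
        raise ValueError("'answer' must be 'Passed' or 'Not Passed'")
    return data

# Hypothetical reply matching the contract:
reply = '{"reasoning": "All expected tasks are covered via synonyms.", "answer": "Passed"}'
result = validate_auditor_output(reply)
```

A guard like this catches a model that drifts from the schema (for example, answering `"Yes"` instead of `"Passed"`) before the result reaches an Evaluation.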

Once updated, save this configuration as a **new version** named `semantic-auditor`.

<figure><img src="https://3750561495-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FaD6wuPRxnEQEsYpePq36%2Fuploads%2Foa3I0IeA0gkCfS97CCOP%2Fen-create-v2.gif?alt=media&#x26;token=5291829e-b12c-4f4f-98d8-63715e0ac8e6" alt=""><figcaption></figcaption></figure>

You now have two comparable configurations:

* `literal-auditor`: Keyword-based evaluation, no synonym tolerance.
* `semantic-auditor`: Intent-based evaluation, accepts functional equivalents.
{% endstep %}

{% step %}

### Re-run the Dataset

Go to the **Datasets** tab, select `Task Accuracy Benchmark`, and click **Run**.

Because the Dataset, Test Cases, and Evaluation Rule remain unchanged, any behavioral difference reflects only the prompt modification.

When the Dataset runs:

* Both Test Cases are executed sequentially against the updated Agent.
* For each execution, the platform extracts the value at `$.body.answer`.
* The Evaluation verifies whether it matches `Passed`.
* Each Test Case is marked as **Passed** or **Failed** independently.
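The extraction-and-comparison logic above can be sketched in a few lines. This is a minimal illustration of what the Evaluation does, not the platform's implementation: it walks the `$.body.answer` path by hand on an assumed response envelope and compares the value to `Passed`.

```python
import json

def evaluate_test_case(execution_body: str, expected: str = "Passed") -> bool:
    """Extract the value at $.body.answer and compare it to the expected value."""
    # The platform resolves the JSONPath $.body.answer; here we walk the
    # same path by hand on an illustrative response envelope.
    envelope = {"body": json.loads(execution_body)}
    answer = envelope.get("body", {}).get("answer")
    return answer == expected

# Illustrative auditor outputs for a passing and a failing case:
passed = evaluate_test_case('{"reasoning": "...", "answer": "Passed"}')
failed = evaluate_test_case('{"reasoning": "...", "answer": "Not Passed"}')
```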

<figure><img src="https://3750561495-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FaD6wuPRxnEQEsYpePq36%2Fuploads%2FKcu4QPZPyhdlhmXbnC7O%2Fsemantic-consistency-output-v2.png?alt=media&#x26;token=b9f7753c-b549-442a-ab06-b6b60b8e3da0" alt=""><figcaption></figcaption></figure>

{% endstep %}

{% step %}

### Compare the results

After the run completes, compare the outcomes across both versions:

| Test Case                | literal-auditor | semantic-auditor | What it means                                                              |
| ------------------------ | --------------- | ---------------- | -------------------------------------------------------------------------- |
| **Semantic Consistency** | ❌ Failed        | ✅ Passed         | False negative eliminated. Valid synonyms are now accepted.                |
| **Functional Omission**  | ❌ Failed        | ❌ Failed         | Correct behavior maintained. The missing Pivot Table task is still caught. |

The **Semantic Consistency** case now passes. Open the result to inspect the output:

{% code overflow="wrap" %}

```json
{
  "reasoning": "The AI-Generated Tasks successfully cover all the core requirements from the Expected Tasks by using different terminology and technical jargon. Specifically, 'Flatten the input trio of CSV files into a single unified array structure' matches the merging of CSV files. 'Enforce primary key uniqueness by purging all overlapping transaction records' corresponds to removing duplicate Transaction IDs. 'Apply an ISO-8601 derived mask to all temporal entries for a day-first display' aligns with converting date columns to DD/MM/YYYY format. 'Standardize the float precision to Brazilian monetary specifications (R$)' matches formatting the 'Total Value' column as local currency. Finally, 'Append a multidimensional summary worksheet to aggregate results per salesperson' matches creating a Pivot Table for summarizing sales. There are no missing tasks, and no extra functionality was added. Therefore, all core requirements are addressed.",
  "answer": "Passed"
}
```

{% endcode %}

The **Functional Omission** case still returns `Not Passed`. That's the expected behavior. The updated prompt is now flexible enough to resolve synonyms, but still rigorous enough to detect a missing step. The Pivot Table requirement has no functional equivalent in the generated list, so it's correctly flagged as absent.

{% hint style="info" %}
Making an auditor more semantically flexible introduces a risk: it might become too tolerant and stop catching real errors. The fact that **Functional Omission** still fails is your regression test confirming that didn't happen here.
{% endhint %}
{% endstep %}
{% endstepper %}

### **What this validates**

This experiment confirms that prompt iteration with controlled versioning produces measurable, comparable results.

By keeping the Dataset and Evaluation Rule unchanged across both runs, you isolated the variable to the prompt alone. The outcome proves that:

* The `semantic-auditor` prompt resolves technical synonyms without introducing false positives.
* The quality gate remains strict where strictness is required. Genuine omissions are still caught.
* The `literal-auditor` version is preserved as a rollback point if the new behavior introduces regressions in other scenarios.

### **Why this matters**

Prompt engineering is an iterative, experimental process. Without versioning and a reusable test Dataset, each iteration is a blind change: you can't tell whether a new prompt is better or just different.

By combining Versioning with Datasets and Evaluations, you can:

* Measure the impact of every prompt change against identical test conditions.
* Detect regressions before they reach production.
* Build a documented history of auditor behavior across configurations.
* Promote only the versions that pass all test cases.
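The promotion rule in the last bullet can be expressed as a small comparison between two runs. The sketch below uses the pass/fail outcomes from this guide; the dictionary names and structure are illustrative, not a platform API.

```python
# Pass/fail outcomes per Test Case for each version's Dataset run.
baseline = {"Semantic Consistency": False, "Functional Omission": False}   # literal-auditor
candidate = {"Semantic Consistency": True, "Functional Omission": False}   # semantic-auditor

# A regression is any case that passed under the baseline but fails under
# the candidate; an improvement is the reverse.
regressions = [case for case, ok in baseline.items() if ok and not candidate[case]]
improvements = [case for case, ok in candidate.items() if ok and not baseline[case]]

# Promote the candidate only if it introduces no regressions.
promote = not regressions
```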

This is the foundation of a reliable, production-ready AI evaluation pipeline.

