Optimize the AI auditor with semantic intelligence
Learn how to use Versioning to improve your evaluator agent: move from rigid, literal matching to semantic validation, then compare both versions against the same Dataset.
Prerequisites
Scenario
Step-by-step
2. Update the System Message
You are a Senior Quality Assurance Analyst specializing in semantic evaluation. Your goal is to compare "Expected Tasks" with "AI-Generated Tasks" by focusing on functional intent rather than literal word-matching.
### EVALUATION RULES:
1. CONCEPTUAL MATCHING: You must identify tasks that achieve the same technical result, even if they use complex jargon, technical synonyms, or different phrasing.
- Example: "Flatten input files" matches "Merge CSV files".
- Example: "ISO-8601 derived mask" matches "Date formatting".
2. MISSING: A task is only considered present if a functionally equivalent action can be clearly identified in the generated list. Vague or partial coverage does not count as a match.
3. EXTRA: Identify tasks that add new functionality not requested in the "Expected" list.
### DECISION LOGIC:
- If all core requirements from the "Expected" list are addressed (even through technical synonyms), set 'answer' to "Passed".
- If any core requirement is functionally omitted, set 'answer' to "Not Passed".
### OUTPUT FORMAT:
Return a JSON object with:
- "reasoning": A brief explanation of how you mapped the synonyms to the expected tasks.
- "answer": "Passed" or "Not Passed".
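Because the prompt above asks for a JSON object with exactly two fields, the reply can be checked before it is trusted as a verdict. The following is a minimal sketch in Python, not part of the platform: the helper names `parse_verdict` and `is_passed` are hypothetical, and it assumes the auditor's reply arrives as a raw JSON string.

```python
import json

# The only two verdicts the system message allows.
ALLOWED_ANSWERS = {"Passed", "Not Passed"}


def parse_verdict(raw: str) -> dict:
    """Parse the auditor's reply and check it follows the requested output format.

    Hypothetical helper: raises ValueError if the reply is not valid JSON,
    is not an object, or has a missing/invalid 'reasoning' or 'answer' field.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    reasoning = data.get("reasoning")
    answer = data.get("answer")

    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("'reasoning' must be a non-empty string")
    if not isinstance(answer, str) or answer not in ALLOWED_ANSWERS:
        raise ValueError(f"'answer' must be 'Passed' or 'Not Passed', got {answer!r}")

    return {"reasoning": reasoning, "answer": answer}


def is_passed(raw: str) -> bool:
    """Return True only when the auditor explicitly answered 'Passed'."""
    return parse_verdict(raw)["answer"] == "Passed"
```

Rejecting anything other than the two exact strings keeps a malformed or ambiguous reply from being silently counted as a pass.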
4. Compare the results
Test Case | literal-auditor | semantic-auditor | What it means
{
"reasoning": "The AI-Generated Tasks successfully cover all the core requirements from the Expected Tasks by using different terminology and technical jargon. Specifically, 'Flatten the input trio of CSV files into a single unified array structure' matches the merging of CSV files. 'Enforce primary key uniqueness by purging all overlapping transaction records' corresponds to removing duplicate Transaction IDs. 'Apply an ISO-8601 derived mask to all temporal entries for a day-first display' aligns with converting date columns to DD/MM/YYYY format. 'Standardize the float precision to Brazilian monetary specifications (R$)' matches formatting the 'Total Value' column as local currency. Finally, 'Append a multidimensional summary worksheet to aggregate results per salesperson' matches creating a Pivot Table for summarizing sales. There are no missing tasks, and no extra functionality was added. Therefore, all core requirements are addressed.",
"answer": "Passed"
}
What this validates
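One way to compare the two versions on the same Dataset is to line up their verdicts per test case and list where they diverge. The sketch below assumes the verdicts have already been collected into plain dictionaries keyed by a test case identifier. The function name `compare_versions`, the dictionary layout, and the case IDs in the usage lines are all hypothetical placeholders, not real results or a platform API.

```python
from collections import Counter


def compare_versions(literal: dict, semantic: dict) -> dict:
    """Compare verdicts from two auditor versions over the test cases they share.

    Both arguments map a test case ID to "Passed" or "Not Passed".
    Hypothetical helper for illustration only.
    """
    shared = sorted(literal.keys() & semantic.keys())
    # Cases where switching versions changed the verdict.
    changed = [case for case in shared if literal[case] != semantic[case]]
    return {
        "literal_counts": Counter(literal[case] for case in shared),
        "semantic_counts": Counter(semantic[case] for case in shared),
        "changed": changed,
    }


if __name__ == "__main__":
    # Placeholder verdicts, invented purely to show the call shape.
    literal_verdicts = {"case-a": "Not Passed", "case-b": "Not Passed"}
    semantic_verdicts = {"case-a": "Passed", "case-b": "Not Passed"}
    report = compare_versions(literal_verdicts, semantic_verdicts)
    print(report["changed"])
```

Restricting the comparison to shared test cases keeps the two versions on equal footing, which is the point of reusing the same Dataset.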
Why this matters

