Traditional LLM metrics don’t work for agent developers, and here’s why…
Unlike simple LLM applications, agents operate in dynamic environments and take actions that modify those environments.
Traditional metrics (like LLM-as-a-Judge) fall short because they only measure what the agent outputs (think answer correctness), not what it does.
For this reason, agentic AI applications need a new evaluation paradigm.
Picture a customer support agent handling a refund request. The correct response might be:
“𝘠𝘦𝘴, 𝘵𝘩𝘦 𝘳𝘦𝘧𝘶𝘯𝘥 𝘩𝘢𝘴 𝘣𝘦𝘦𝘯 𝘱𝘳𝘰𝘤𝘦𝘴𝘴𝘦𝘥!”
But does this guarantee the refund was processed correctly? It could just be a hallucination.
These metrics don’t capture the full picture. For many agentic applications, you can’t even define a single “correct” output: take web search, where the underlying content changes all the time.
🚀𝗔𝗴𝗲𝗻𝘁 𝗖𝗼𝗻𝘁𝗿𝗮𝗰𝘁𝘀, 𝗮 𝗻𝗲𝘄 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗽𝗮𝗿𝗮𝗱𝗶𝗴𝗺:
Inspired by formal methods, we’re introducing a new framework to measure and verify agentic systems: Agent Contracts.
Agent Contracts allow you to define:
• Module-Level Contracts: Specify the expected input-output relationship, preconditions, and postconditions of individual agent actions.
• Trace-Level Contracts: Capture the expected sequence of actions—the agent’s journey from start to finish.
Contracts are scenario-specific: they apply only when certain conditions are met, in this case when the user asks for a refund.
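To sketch what this could look like in practice, here is a minimal, hypothetical encoding of the two contract types in Python. The names (Step, ModuleContract, TraceContract) are illustrative assumptions, not our library’s actual API:

```python
from dataclasses import dataclass
from typing import Callable

# One recorded step in an agent trace: which tool/module ran,
# and with what inputs and outputs.
@dataclass
class Step:
    module: str
    inputs: dict
    outputs: dict

# Module-Level Contract: pre/postconditions over a single action.
@dataclass
class ModuleContract:
    module: str
    precondition: Callable[[Step], bool]
    postcondition: Callable[[Step], bool]

# Trace-Level Contract: the expected sequence of actions, plus the
# scenario condition under which the contract applies at all.
@dataclass
class TraceContract:
    applies_when: Callable[[list[Step]], bool]
    expected_sequence: list[str]
```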
🤖 𝗔 𝗖𝘂𝘀𝘁𝗼𝗺𝗲𝗿 𝗦𝘂𝗽𝗽𝗼𝗿𝘁 𝗘𝘅𝗮𝗺𝗽𝗹𝗲 𝗶𝗻 𝗔𝗰𝘁𝗶𝗼𝗻:
To make this more concrete, suppose we are developing a customer support agent and the user asks for a refund. Agent Contracts would define:
Module-Level Contract:
• Precondition: user asks for a refund
• Postcondition: the agent triggers the refund process (e.g., a database update).
Trace-Level Contract:
• The agent calls the GetOrder tool to retrieve the order details, then
• The agent uses the ProcessRefund tool with the correct order information, then
• The agent collects feedback from the customer post-refund.
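Reusing the hypothetical Step/ModuleContract/TraceContract classes sketched above, the refund scenario might be encoded and checked like this. The tool name CollectFeedback and the checking logic are assumptions for illustration, not the library’s real interface:

```python
# Scenario condition: the contract only applies if the user asked for a refund.
def is_refund_request(trace: list[Step]) -> bool:
    return any("refund" in s.inputs.get("user_message", "").lower() for s in trace)

# Trace-Level Contract for the refund journey
# (CollectFeedback is an assumed tool name).
refund_trace = TraceContract(
    applies_when=is_refund_request,
    expected_sequence=["GetOrder", "ProcessRefund", "CollectFeedback"],
)

# Module-Level Contract for the refund action itself.
refund_module = ModuleContract(
    module="ProcessRefund",
    precondition=lambda s: "order_id" in s.inputs,                 # we know which order
    postcondition=lambda s: s.outputs.get("status") == "refunded", # the update actually happened
)

def check(trace: list[Step], tc: TraceContract, mcs: list[ModuleContract]) -> bool:
    if not tc.applies_when(trace):
        return True  # contract is not relevant to this scenario
    # Trace-level check: expected actions occur in order (as a subsequence).
    modules = iter(s.module for s in trace)
    if not all(m in modules for m in tc.expected_sequence):
        return False
    # Module-level check: every matching step satisfies its pre/postconditions.
    return all(
        mc.precondition(s) and mc.postcondition(s)
        for s in trace for mc in mcs if s.module == mc.module
    )
```

One deliberate choice in this sketch: the expected sequence is checked as a subsequence rather than an exact match, which tolerates the extra intermediate steps real agent traces almost always contain.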
Think of it like this: Module contracts set the entry and exit rules for a room, while Trace contracts describe the full journey through the building.
💡𝗪𝗵𝘆 𝗧𝗵𝗶𝘀 𝗠𝗮𝘁𝘁𝗲𝗿𝘀:
By defining scenario-specific contracts, we can verify agent reliability, traceability, and correctness, even in complex, multi-step interactions.
This idea builds on what I studied during my PhD on agent reliability, tackling the challenge of evaluating AI systems in real-world settings.
🔥𝗝𝗼𝗶𝗻 𝗨𝘀 𝗶𝗻 𝘁𝗵𝗲 𝗕𝗲𝘁𝗮:
We’re building a library to enable Contract-based evaluation and observability for AI agents. If you’re working on dynamic agents and want to explore a new standard for measuring and verifying their performance, we’d love to hear from you.
#AIagents #formalmethods #LLMEvaluation #AIInnovation #AgentReliability #AIResearch #LLMOps