As LLM applications evolve into sophisticated agentic systems, function calling and multi-step reasoning have emerged as critical capabilities for building production-ready applications. Yet evaluating these complex behaviors remains one of the thorniest challenges in LLM development. Traditional LLM-as-judge approaches struggle with inconsistent scoring and poor correlation with real-world performance.
At Composo, we've developed a component-based evaluation framework that addresses these challenges head-on, leveraging our generative reward model technology to deliver deterministic, reliable evaluation metrics that teams can actually trust.
Modern agentic LLMs operate through multiple interconnected components: reasoning about what to do, choosing and formulating tool calls, executing those functions, and integrating the returns into a final response.
Effective evaluation requires analyzing each component individually while understanding how they work together as a system.
When evaluating LLM function calls specifically, it's crucial to recognize that we're actually evaluating two distinct LLM steps: the decision to formulate a particular tool call, and the use of the function's returns to generate a response.
The function execution itself sits between these steps but isn't part of the LLM evaluation challenge. Here's the complete pipeline:
A: The user's query
B: The LLM formulates a tool call
C: The function executes and returns data
D: The LLM generates a final response from the returns
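To make the A/B/C/D framing concrete, here is one way a single trace might be represented, using OpenAI-style chat messages purely as an illustration; the `get_weather` function, its arguments, and the message layout are assumptions, not a required format.

```python
# One pipeline trace, expressed as OpenAI-style chat messages (illustrative only).
pipeline_trace = [
    # A: the user's query
    {"role": "user", "content": "What's the weather in Paris this weekend?"},
    # B: the LLM formulates a tool call (function name plus arguments)
    {"role": "assistant", "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris", "days": 2}'},
    }]},
    # C: the function executes and returns data (not an LLM step, so not evaluated directly)
    {"role": "tool", "tool_call_id": "call_1",
     "content": '{"saturday": "18C, sunny", "sunday": "15C, rain"}'},
    # D: the LLM generates the final response from the returns
    {"role": "assistant", "content": "Expect sunshine and 18C on Saturday, with rain and 15C on Sunday."},
]
```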
In the pipeline above, Tool Call Formulation evaluates B given A (though in practice, we recommend providing A, B, C, and D to Composo for improved evaluation accuracy).
At this stage, we evaluate whether a function call was formulated correctly: in other words, were the right arguments used, and was appropriate information included within those arguments?
Composo's generative reward models can effectively evaluate formulation quality with criteria such as:
"Reward tool calls that only include information from the user's query"
Or for context utilization:
"Reward tool calls that extract and use all relevant entities mentioned in the user's query as appropriate parameters"
While these evaluations can't determine if the right function was called, they ensure technical correctness and proper parameter extraction—critical prerequisites for successful function execution.
Important Implementation Note: While this evaluation conceptually focuses on B given A, in practice you should provide Composo with all available information (A, B, C, and D) when running these evaluations. The additional context from function returns and final outputs significantly improves the evaluation's ability to assess whether parameters were formulated optimally.
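As a minimal sketch of how this stage might look in code: the criteria are plain strings, and the full trace is handed to a scoring callable. The `Scorer` type below is a stand-in for whatever evaluation call your Composo integration exposes, not Composo's actual client API.

```python
from typing import Callable, Dict, List

# Stand-in for your actual Composo scoring call: (messages, criterion) -> reward in [0, 1].
Scorer = Callable[[List[dict], str], float]

FORMULATION_CRITERIA = {
    "grounded_arguments": "Reward tool calls that only include information from the user's query",
    "context_utilization": (
        "Reward tool calls that extract and use all relevant entities mentioned in the "
        "user's query as appropriate parameters"
    ),
}

def score_formulation(trace: List[dict], scorer: Scorer) -> Dict[str, float]:
    # Conceptually this targets B given A, but the full trace (A, B, C and D) is passed
    # so the evaluator has the extra context recommended above.
    return {name: scorer(trace, criterion) for name, criterion in FORMULATION_CRITERIA.items()}
```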
With the benefit of hindsight, evaluators gain a significant information advantage. We can now evaluate the choice to use a tool (i.e. B, given A, C & D). We divide this into two metrics: tool relevance and tool sufficiency.
This is where evaluation truly shines. Because the evaluator can see what the function returned and how it was used, it has far more information than the original LLM had when choosing which tool to call. This can be implemented in Composo with criteria such as:
For tool relevance:
"Reward responses where the function calls retrieved information relevant to answering the user's query"
For tool sufficiency:
"Reward responses where the function calls retrieved all necessary information for comprehensive answer generation"
Or additionally, for tool relevance:
"Penalize responses where function calls retrieved superfluous or unnecessary information"
Furthermore, advanced criteria can capture domain-specific patterns, e.g. in medical contexts:
"Reward responses that retrieve local guidelines & drug formularies that are up to date when providing medical dosing information"
In finance or data analysis:
"Reward function calls that retrieve both current and historical data when the user asks about trends or changes"
For customer support:
"Reward function call sequences that first check user account status before attempting order-specific lookups"
Here we evaluate how well the LLM uses the function returns, similar to how we evaluate generation quality in RAG systems. This is the quality of D, given A & C.
Composo Implementation: These evaluations leverage our proven approaches from RAG evaluation:
For faithfulness:
"Reward responses where all claims are directly supported by the function returns without hallucination or speculation beyond the provided data"
For completeness:
"Reward responses that incorporate all relevant information from function returns needed to comprehensively answer the user's question"
For precision:
"Reward responses that include only the specific information from function returns that directly addresses the user's query, avoiding tangential details"
Here, independent of tool calls, we evaluate D given A across a range of quality dimensions. This evaluation ensures that regardless of how information was obtained, the final response meets quality standards.
Composo Implementation: Comprehensive output evaluation can span any custom dimension, including safety, tone, formatting, content quality, and domain-specific requirements (see the earlier examples for detailed criteria).
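Because this component is independent of tool calls, a sketch of it only needs the user query and the final answer. The dimension names below follow the list above, but the criterion wording and the message-filtering logic are illustrative assumptions rather than prescribed criteria.

```python
from typing import Callable, Dict, List

# Stand-in for your actual Composo scoring call: (messages, criterion) -> reward in [0, 1].
Scorer = Callable[[List[dict], str], float]

# Illustrative criteria only: write your own in the same "Reward ..." / "Penalize ..." style.
OUTPUT_QUALITY_CRITERIA = {
    "tone": "Reward responses that maintain a professional, helpful tone throughout",
    "formatting": "Reward responses that present information in a clear, well-structured format",
    "safety": "Penalize responses that give advice outside the assistant's stated scope",
}

def score_output_quality(trace: List[dict], scorer: Scorer) -> Dict[str, float]:
    # D given A: keep only user/assistant messages that carry text, dropping tool calls
    # and tool returns (a simplistic filter; adapt it to your message format).
    query_and_answer = [
        m for m in trace if m.get("role") in ("user", "assistant") and m.get("content")
    ]
    return {name: scorer(query_and_answer, c) for name, c in OUTPUT_QUALITY_CRITERIA.items()}
```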
For agentic systems that expose reasoning traces or thinking tokens, evaluating the quality of deliberation becomes crucial. This component evaluates the LLM's internal reasoning process before it makes decisions about tool usage or response generation.
Composo Implementation: Key evaluation criteria for reasoning quality include:
For logical coherence:
"Reward reasoning traces that follow clear logical steps without contradictions or unsupported leaps in logic"
For comprehensive consideration:
"Reward reasoning that systematically considers multiple approaches or solutions before settling on a final decision"
For acknowledging limitations:
"Reward reasoning that explicitly acknowledges when information is missing or uncertain and articulates what additional data would be helpful"
For goal alignment:
"Reward reasoning that maintains clear focus on the user's original objective throughout the deliberation process"
For efficiency:
"Penalize reasoning that includes repetitive thoughts or unnecessarily verbose deliberation that doesn't add value to the decision-making"
Additional domain-specific reasoning criteria might include:
In analytical contexts:
"Reward reasoning that breaks down complex problems into manageable sub-components before attempting to solve them"
In safety-critical applications:
"Reward reasoning that explicitly considers potential risks or failure modes before proposing actions"
While component-level evaluation provides detailed insights, agentic systems also require holistic evaluation through composite metrics that reveal system-wide patterns and failure modes.
Core System Metrics: At the highest level, agentic systems must be evaluated on their fundamental purpose, namely whether the agent achieved the user's goal and whether it did so efficiently.
These top-level metrics are essential but insufficient on their own—when an agent fails, you need to understand why. This is where composite metrics become invaluable.
Composite Metrics Approach: Rather than evaluating the entire agent as a monolith, effective system-level analysis combines component metrics to identify where breakdowns occur.
Example implementations in Composo would use full system-level metrics such as:
"Reward task completions where all required information was gathered and correctly integrated into the final response"
"Reward goal achievement that required the minimum necessary tool calls without redundant operations"
These system-level and composite metrics provide a bird's-eye view of your agentic system, helping you quickly identify whether issues stem from poor reasoning, incorrect tool choices, bad parameter formulation, or integration failures. By connecting goal success rates to component-level evaluations, you can rapidly diagnose and fix the root causes of agent failures.
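As a sketch of the composite idea: once each component has a score, a failed task can be pointed at its weakest component. The component names mirror the framework above, and taking the minimum is just one simple aggregation choice.

```python
from typing import Dict

def diagnose_failure(component_scores: Dict[str, float], goal_achieved: bool) -> str:
    """Point a failed task at its weakest evaluated component.

    Taking the minimum is one simple aggregation choice; weighted or rule-based
    schemes work just as well once you know which components matter most to you.
    """
    if goal_achieved:
        return "success"
    weakest = min(component_scores, key=component_scores.get)
    return f"likely root cause: {weakest} (score {component_scores[weakest]:.2f})"

# Example: the agent failed its task, and tool choice was the weakest link.
print(diagnose_failure(
    {
        "tool_call_formulation": 0.9,
        "tool_choice": 0.4,
        "response_integration": 0.8,
        "output_quality": 0.7,
    },
    goal_achieved=False,
))
```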
We'll be publishing more comprehensive research on system-level agent evaluation methodologies in the coming weeks, including advanced techniques for multi-agent systems and long-running agentic workflows.
Beyond the core evaluation components, you can gain deeper insights by analyzing the patterns in your Tool Choice metrics. Specifically, the relationship between tool sufficiency and tool relevance scores reveals critical optimization opportunities for your function calling system.
The tool sufficiency and tool relevance metrics from your Tool Choice evaluation naturally exist in tension: invoking tools more liberally improves sufficiency (no needed information is missed) but hurts relevance (more superfluous retrievals), while invoking them more conservatively does the opposite.
By analyzing the ratio and distribution of these metrics across your evaluation data, you can identify systematic patterns in how your LLM makes tool invocation decisions. This analysis reveals that LLMs manage tool invocation through implicit sensitivity thresholds—the point at which they decide a tool is worth calling.
Using your existing Tool Choice metrics, you can perform valuable threshold analysis:
Individual Tool Sensitivity Patterns: By examining sufficiency and relevance scores for each tool, you can identify tools that are over-triggered (high sufficiency but low relevance, called even when they add little) or under-triggered (high relevance but low sufficiency, skipped when they were actually needed).
Cross-Tool Dependencies: Analyzing metric patterns across multiple tools reveals dependencies between tools, such as sequences where one call is only useful after another has been made (as in the customer support example above).
Optimization Insights: This analysis of your evaluation metrics shows you where each tool's invocation threshold should sit.
The key insight is that, unlike other quality dimensions where you can improve both aspects simultaneously, tool invocation is inherently a classification decision: you're always trading sufficiency for relevance or vice versa. Your evaluation metrics help you find the optimal balance for your specific use case.
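A sketch of this threshold analysis on collected evaluation results: group relevance and sufficiency scores by tool and flag tools whose gap suggests they are over- or under-triggered. The record layout and the 0.1 gap are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

def tool_sensitivity_report(records: List[dict], gap: float = 0.1) -> Dict[str, str]:
    """Classify each tool's invocation behaviour from its evaluation scores.

    Each record is assumed to look like:
      {"tool": "get_weather", "relevance": 0.82, "sufficiency": 0.64}
    A relevance/sufficiency gap larger than `gap` suggests the invocation threshold
    is tuned too far in one direction (the 0.1 default is arbitrary).
    """
    by_tool: Dict[str, List[dict]] = defaultdict(list)
    for r in records:
        by_tool[r["tool"]].append(r)

    report = {}
    for tool, rs in by_tool.items():
        relevance = mean(r["relevance"] for r in rs)
        sufficiency = mean(r["sufficiency"] for r in rs)
        if sufficiency - relevance > gap:
            report[tool] = "over-triggered: invoked liberally, retrieves superfluous information"
        elif relevance - sufficiency > gap:
            report[tool] = "under-triggered: invoked conservatively, misses needed information"
        else:
            report[tool] = "balanced"
    return report
```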
For teams seeking to push beyond threshold tuning, there's an advanced approach that leverages your evaluation data to train custom tool invocation models. This method uses your historical evaluation results to teach a model which tools are actually useful in your specific context.
The Historical Learning Framework: pair each historical tool call with the evaluation scores it received, derive a usefulness label from those scores, and train a model that predicts whether invoking a given tool will pay off for a new query.
Benefits of the Historical Approach: unlike a static sensitivity threshold, the invocation decision is learned from how tools actually performed in your specific context, and it keeps improving as you accumulate more evaluation data.
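A sketch of the historical-learning idea under stated assumptions: each example pairs a user query with the tool that was called and a label derived from its historical relevance score, and a lightweight scikit-learn classifier learns to predict whether a candidate call is worth making. The featurization, the 0.7 labeling cutoff, and the gating step are all illustrative choices, not a prescribed method.

```python
from typing import List

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_invocation_model(queries: List[str], tools: List[str], relevance_scores: List[float]):
    """Train an 'is this tool worth calling for this query?' classifier.

    Labels come from historical relevance scores; 0.7 is an illustrative cutoff
    for 'this call turned out to be useful'. Concatenating the tool name with the
    query text is a deliberately simple featurization; embeddings are a natural upgrade.
    """
    texts = [f"{tool} || {query}" for query, tool in zip(queries, tools)]
    labels = [score >= 0.7 for score in relevance_scores]
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model

# Usage sketch: gate a candidate tool call on its predicted usefulness.
# model = build_invocation_model(historical_queries, historical_tools, historical_relevance)
# p_useful = model.predict_proba(["get_weather || What's the weather in Paris?"])[0][1]
# if p_useful > 0.5:
#     ...invoke the tool...
```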
Begin where evaluation provides the most immediate value and insights:
Response Integration Quality (D given A & C)
Output Quality (D given A)
Once basic evaluation is working, add powerful tool selection analysis alongside technical checks:
Tool Choice (B given A, C & D)
Tool Call Formulation (B given A)
For teams building sophisticated agentic systems:
Reasoning and Thinking Evaluation
System-Level Analysis
Sensitivity Threshold Tuning
Historical Learning Implementation
As LLM applications evolve into sophisticated agentic systems, evaluation must evolve too. By implementing our component-based evaluation framework—covering Tool Call Formulation, Tool Choice, Response Integration Quality, Output Quality, and Reasoning Evaluation—you gain comprehensive visibility into every aspect of your system's performance.
Start with Response Integration Quality and Output Quality evaluations that provide immediate value, then systematically add Tool Call Formulation checks and Tool Choice analysis. For agentic systems, incorporate Reasoning Evaluation and build composite metrics for system-level insights.
Use the patterns in your evaluation data to optimize tool invocation thresholds, and for advanced teams, train custom models based on your historical performance.
With Composo's generative reward models providing deterministic, reliable scoring at each component, you can build agentic applications with confidence.
Ready to transform your LLM evaluation? Try Composo Align today and experience the difference that deterministic, reliable evaluation can make for your agentic applications.