"The agent got complex search results with multiple data sources showing different GDP figures and metrics (growth rates vs. absolute values, different measurement methods), but it didn't compare or verify the different data sources - it essentially acted as a tool executor rather than an intelligent analyst."
What We're Building
In this guide, we'll take a LangGraph multi-agent workflow and evaluate its performance using Composo's agent evaluation framework. Our goal is to transform LangGraph's native message format into an OpenAI-compatible trace that Composo can analyze, then get detailed feedback on how well our agents performed.
What you'll end up with:
- OpenAI-formatted conversation traces from your LangGraph agents
- Quantitative scores (0-1) across 5 key agent performance dimensions
- Detailed explanations of where your agents excel and where they need improvement
Our Approach
We'll follow these key steps:
- Run a LangGraph workflow - Execute the multi-agent collaboration example and capture all events
- Convert to OpenAI format - Transform LangGraph messages and tools into OpenAI-compatible structures
- Fix format compliance - Clean up any formatting issues that prevent proper evaluation
- Evaluate with Composo - Run the trace through Composo's agent evaluation criteria
- Interpret results - Understand what the scores and explanations tell us about agent behavior
By the end, you'll have a clear picture of your agent's analytical capabilities, tool usage effectiveness, and goal-oriented behavior.
What You'll Learn
In this guide, you'll learn how to:
- Convert LangGraph messages and tools into a format Composo evals can ingest
- Trigger agent evaluations with Composo Align
- See Composo give live feedback on agent effectiveness
Setup
We'll work on evaluating the example from the LangGraph multi-agent collaboration tutorial.
You can find the finished code in this Google Colab.
Note: I had to change ChatAnthropic(model="claude-3-5-sonnet-latest") to ChatOpenAI(model="gpt-4.1") because LangGraph wouldn't accept my Anthropic API key (you might have better luck).
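If you hit the same issue, the swap is a one-line change. A minimal sketch, assuming the tutorial constructs the model as a variable named llm (adjust to wherever the model is created in your code):
from langchain_openai import ChatOpenAI

# Drop-in replacement for ChatAnthropic(model="claude-3-5-sonnet-latest")
llm = ChatOpenAI(model="gpt-4.1")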
Storing Events for Evaluation
Let's adjust the existing code to store the events for later reuse: graph.stream() returns a generator, which can only be consumed once, so we wrap it in list().
Change this:
events = graph.stream(
    {
        "messages": [
            (
                "user",
                "First, get the UK's GDP over the past 5 years, then make a line chart of it. "
                "Once you make the chart, finish.",
            )
        ],
    },
    # Maximum number of steps to take in the graph
    {"recursion_limit": 150},
)
To this:
events = list(graph.stream(
    {
        "messages": [
            (
                "user",
                "First, get the UK's GDP over the past 5 years, then make a line chart of it. "
                "Once you make the chart, finish.",
            )
        ],
    },
    # Maximum number of steps to take in the graph
    {"recursion_limit": 150},
))
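Because events is now a plain list, we can iterate over it as many times as we like. As a quick inspection sketch (assuming the default stream mode, where each event maps a node name to that node's state update and each update carries a "messages" key, as in this tutorial):
for event in events:
    for node_name, update in event.items():
        # Each update in this tutorial contains the node's new messages
        print(f"{node_name}: {len(update['messages'])} message(s)")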
Conversion to OpenAI Format
LangChain provides convenient functions for casting its native message and tool types into OpenAI-compatible structures. The conversion should work regardless of which provider produced the messages (OpenAI, Anthropic, Bedrock Converse, or VertexAI), though we have only tested it with OpenAI. See the LangChain documentation for more details.
Import Required Tools
from langchain_core.messages import convert_to_openai_messages
from langchain_core.utils.function_calling import convert_to_openai_tool
Choose Your Evaluation Strategy
We can either evaluate each agent separately or, since the chart_generator is given access to the research node's conversation history, evaluate the whole workflow in one go by evaluating just the chart_generator.
Let's convert both the messages and the tools to OpenAI format (both will be needed for Composo Align evaluation):
messages = convert_to_openai_messages(events[1]['chart_generator']['messages'])
tools = [convert_to_openai_tool(python_repl_tool), convert_to_openai_tool(tavily_tool)]
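If you want to sanity-check the conversion before evaluating, a quick look at the roles and content is enough (purely illustrative):
# Each converted entry should now be a plain dict using OpenAI roles
# ("user", "assistant", "tool") rather than LangChain message objects.
for m in messages:
    print(m["role"], "-", str(m.get("content"))[:60])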
Fix OpenAI Format Compliance
Before continuing, we need one small adjustment: LangChain includes a name field in its tool response messages, but that field is not part of OpenAI's official tool message schema, so the converted messages don't fully comply with the OpenAI format.
This creates invalid messages like:
{
    "role": "tool",
    "name": "tavily_search",
    "tool_call_id": "call_5HCSRU30DDx4J48vA7Gpus2q",
    "content": "..."
}
Therefore, before sending to Composo we need to remove the name field:
# Strip the non-standard "name" field from tool messages
for m in messages:
    if m["role"] == "tool":
        del m["name"]
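If you want to confirm the cleanup worked, a one-line check (purely illustrative) will do:
# No tool message should still carry the non-standard "name" field
assert all("name" not in m for m in messages if m["role"] == "tool")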
Evaluating with Composo
Composo recommends five criteria for agent evaluation:
- Exploration - Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty
- Exploitation - Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes
- Tool Use - Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls
- Goal Pursuit - Reward agents that work towards the goal specified by the user
- Faithfulness - Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation
Evaluating these metrics with Composo is as simple as executing the following snippet:
from composo import AsyncComposo, criteria

composo_client = AsyncComposo()
evaluations = await composo_client.evaluate(messages=messages, tools=tools, criteria=criteria.agent)

print("Evaluation results\n")
for result, criterion in zip(evaluations, criteria.agent):
    print(f"Criterion: {criterion}")
    print(f"Score: {result.score}")
    print(f"Explanation: {result.explanation}")
    print("-" * 40)
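The top-level await works in a notebook environment like the Colab linked above; in a standalone script you would wrap the call with asyncio.run. A sketch reusing the same client and inputs as the snippet above:
import asyncio

from composo import AsyncComposo, criteria

async def run_evaluation(messages, tools):
    # Same client construction and criteria as in the snippet above
    composo_client = AsyncComposo()
    return await composo_client.evaluate(messages=messages, tools=tools, criteria=criteria.agent)

evaluations = asyncio.run(run_evaluation(messages, tools))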
Example Results
This prints the following output:
Evaluation results
Criterion: Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty
Score: 0.72
Explanation: The assistant demonstrated good planning by using appropriate tools in sequence to accomplish the task, but missed an opportunity to show more initiative in extracting and verifying the data from the search results before proceeding to visualization.
----------------------------------------
Criterion: Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes
Score: 0.87
Explanation: The assistant demonstrated excellent planning capabilities by efficiently gathering information, adapting to new input, and executing a clear visualization strategy with predictable and successful outcomes. It used appropriate tools at each step and produced a high-quality result that fulfilled all requirements of the task.
----------------------------------------
Criterion: Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls
Score: 1.0
Explanation: The assistant demonstrated excellent tool usage, correctly operating both the search and Python tools according to their definitions, and effectively utilizing all relevant context from the tool calls and user inputs to complete the requested task.
----------------------------------------
Criterion: Reward agents that work towards the goal specified by the user
Score: 0.92
Explanation: The assistant demonstrated excellent goal-oriented behavior by systematically working through each step of the user's request. It used appropriate tools for each subtask (search for data gathering, Python for visualization), created a well-formatted and accurate chart based on the data, and recognized when the goal was complete. The assistant's approach was methodical and directly aligned with achieving the user's specified goal.
----------------------------------------
Criterion: Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation
Score: 1.0
Explanation: The assistant performed excellently by strictly adhering to the provided data and making no claims beyond what was directly supported by the source material and tool call returns. It focused solely on the visualization task without adding any unsupported commentary or interpretation.
Key Insights
Notably, Composo penalizes the agent on the exploration criterion for not investigating uncertainties or demonstrating creative problem-solving. Diving into the trace, the reason is clear: the agent received complex search results with multiple data sources showing different GDP figures and metrics (growth rates vs. absolute values, different measurement methods), but it didn't:
- Compare or verify the different data sources
- Investigate why sources might differ
It essentially acted as a tool executor rather than an intelligent analyst.
Summary
Evaluating LangGraph agents with Composo is straightforward: a small amount of plumbing code plus Composo's evaluation package is all it takes. Composo provides effective feedback on agent performance, helping you identify areas where your agents could be more thorough, analytical, and intelligent in their approach to complex tasks.
The evaluation framework gives you concrete, actionable insights into how well your agents are performing across critical dimensions like exploration, tool use, and goal pursuit - enabling you to build more sophisticated and reliable AI systems.