How It Works
Understand the execution model behind Agentest's agent simulation and evaluation.
The Simulation Loop
┌─ SimulatedUser (LLM) generates a message
│
├─ Agentest POSTs message to your agent endpoint
│
├─ Agent responds with tool_calls?
│ ├─ YES → MockResolver resolves each tool call
│ │ Results injected back → POST again
│ │ (loop until agent returns text)
│ └─ NO → Record the text response
│
├─ SimulatedUser sees agent's reply
│ ├─ Goal achieved? → stop
│ └─ Continue → next turn
│
└─ After all turns: evaluate with LLM-as-judge metricsAgentest owns the tool-call loop. It intercepts tool calls from the agent response, resolves them through your mocks, and sends results back. No instrumentation needed in your agent code — your agent just sees normal tool results.
Detailed Flow
1. Configuration Loading
When you run npx agentest run, Agentest:
- Discovers
agentest.config.ts(or.yaml) in your project root - Validates the configuration with Zod schemas
- Interpolates environment variables (
${VAR}syntax) - Sets up the LLM provider for simulation and evaluation
2. Scenario Discovery
Agentest finds all scenario files using glob patterns:
- Default:
**/*.sim.ts - Excludes:
node_modules/** - Uses jiti to import TypeScript files directly (no build step)
3. Simulation Execution
For each scenario, Agentest runs conversationsPerScenario independent conversations (default: 3).
Each conversation proceeds as follows:
Turn Loop
for (turn = 1; turn <= maxTurns; turn++) {
// 1. SimulatedUser generates a message
const userMessage = await simulatedUser.generateMessage(history)
// 2. Send to agent endpoint
let agentResponse = await agentClient.sendMessage([...history, userMessage])
// 3. Tool call resolution loop
while (agentResponse.tool_calls) {
const toolResults = []
for (const toolCall of agentResponse.tool_calls) {
// Resolve through mock
const result = mockResolver.resolve(toolCall.name, toolCall.arguments)
toolResults.push({ tool_call_id: toolCall.id, result })
}
// Inject results and call agent again
agentResponse = await agentClient.sendMessage([
...history,
agentResponse,
{ role: 'tool', content: toolResults }
])
}
// 4. Record the turn
history.push(userMessage, agentResponse)
// 5. Check if goal is met
const shouldStop = await simulatedUser.checkGoalCompletion(history)
if (shouldStop) break
}Key Points
- Mock state resets at the start of each conversation (so
sequence()mocks start fresh) - Tool calls are synchronous from the agent's perspective (one turn = multiple tool calls resolved)
- Simulated user decides when to stop based on goal completion
- Unmocked tools throw
AgentestErrorby default (or passthrough if configured)
4. Evaluation
After all conversations complete, Agentest evaluates each turn:
Per-Turn Metrics
For each turn, all configured metrics run in parallel (bounded by concurrency):
const metrics = await Promise.all([
helpfulness.score(input),
coherence.score(input),
relevance.score(input),
faithfulness.score(input),
verbosity.score(input),
agentBehaviorFailure.evaluate(input),
toolCallBehaviorFailure.evaluate(input), // only if turn has tool calls
...customMetrics.map(m => m.score(input)),
])Each metric receives:
interface ScoreInput {
goal: string
profile: string
knowledge: KnowledgeItem[]
userMessage: string
agentMessage: string
toolCalls: ToolCallRecord[]
history: Turn[]
turnIndex: number
}Per-Conversation Metrics
goal_completion runs once after the full conversation:
const goalCompletion = await goalCompletionMetric.score({
goal,
profile,
fullConversationHistory,
})Trajectory Matching
Trajectory assertions are deterministic (no LLM calls):
const trajectoryResult = trajectoryMatcher.match(
actualToolCalls,
scenario.assertions.toolCalls
)Checks:
- Match mode (
strict,unordered,contains,within) - Argument matching (
exact,partial,ignore)
Returns:
matched: booleanmissingCalls: string[]extraCalls: string[]orderingIssues: string[]
5. Error Deduplication
After evaluation, qualitative metric failures are grouped by root cause:
const uniqueErrors = await errorDeduplicator.deduplicate(allFailures)This LLM call produces:
interface UniqueError {
label: string // "Providing weather info without real-time access"
description: string // Detailed explanation
severity: 'low' | 'medium' | 'high' | 'critical'
occurrences: number
examples: TurnResult[]
}6. Pass/Fail Logic
A scenario passes when:
- No conversation threw an error
- All trajectory assertions matched
- No errors at or above
failOnErrorSeverity - All metric averages meet configured thresholds
A scenario fails if any condition is violated.
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ Runner │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Discovery │ │ Simulator │ │ Evaluator │ │
│ │ *.sim.ts │→ │ Multi-turn │→ │ Metrics + │ │
│ │ files │ │ loop │ │ Trajectory │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Reporters │ │
│ │ Console │ JSON │ GitHub Actions │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
↓ ↓ ↓
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Simulated │ │ Agent │ │ Judge LLM │
│ User (LLM) │ → │ Endpoint │ ← │ (Metrics) │
└─────────────┘ └─────────────┘ └─────────────┘
↑ ↑
│ │
┌─────────────────────────┘
│ MockResolver
│ (Tool calls → Mocks)
└─────────────────────────Concurrency Model
Agentest runs scenarios in parallel, bounded by the concurrency setting:
const results = await pLimit(config.concurrency).map(
scenarios,
scenario => simulator.runScenario(scenario)
)Within each scenario:
- Conversations run sequentially (to maintain conversation state isolation)
- Evaluation metrics run in parallel per turn (each metric is independent)
This balances throughput with resource usage. Typical settings:
concurrency: 20— good for cloud LLM APIs with high rate limitsconcurrency: 5— better for local LLMs or strict rate limits
Next Steps
- Configuration - Configure simulation and evaluation settings
- Scenarios - Write realistic test scenarios
- Mocks - Control tool behavior