Run evaluations to assess agent performance
Evaluations help you understand how well your AI agent responds to questions so you can improve its effectiveness. While the test feature lets you simulate a real conversation with your agent, evaluations allow you to generate responses for up to 50 questions at once. A large language model (LLM) judge will review the responses, determine whether each question was answered satisfactorily, and provide an overall resolution rate.
Evaluations are a fast and effective way to test how your agent responds to a variety of questions. They’re also useful to:
see how well different versions of your agent respond to questions
compare results after you’ve made changes to your agent, such as adding knowledge sources or guidance
Create a dataset
A dataset is a set of questions created for testing your customer service agent’s responses. You need to prepare your dataset as a CSV file with the following requirements:
The file must have only one column.
The file can contain up to 50 questions.
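For illustration, here's a minimal Python sketch that writes a dataset file meeting these requirements; the file name and sample questions are hypothetical:

```python
import csv

# Hypothetical test questions; a dataset can contain up to 50.
questions = [
    "How do I reset my password?",
    "What is your refund policy?",
    "Do you ship internationally?",
]

assert len(questions) <= 50, "A dataset can contain up to 50 questions."

# Write a single-column CSV, one question per row.
with open("questions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for question in questions:
        writer.writerow([question])
```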
To create a dataset:
Prepare a CSV file with your test questions.
In your agent settings, select Evaluation.
From the Datasets tab, select Create dataset.
Give your dataset a unique name and upload your CSV file.
Select Create.
Your dataset will appear on the page. You can expand it to see all of its questions and remove any you don't want.
Run an evaluation
Once you have a dataset, you can run an evaluation to see how your agent responds to your test questions.
To run an evaluation:
Go to the Evaluations tab.
Select the version of your agent you want to test.
Select the dataset you want to test your agent against.
Select Run evaluation.
You can run up to 5 evaluations at once across all customer experiences. When an evaluation is finished, it will appear in the table on the page.
Review evaluation results
After the evaluation is complete, you can review the results to see how your agent performed. The evaluation uses an LLM judge to review each response and determine whether the agent's answer resolves the question.
To review results, find the evaluation in the table and select View results.
You’ll see a summary of the agent’s performance, including:
The total number of questions, the number of resolved and unresolved questions, and the overall resolution rate.
A list of the questions with their resolution status.
Occasionally, an error may prevent the LLM judge from assessing a response. When this happens, the question is not included in the resolution rate calculation. To get a judgment for that question, you'll need to run a new evaluation.
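To make the arithmetic concrete, here's a small sketch of how a resolution rate works out when some responses can't be judged; the judgments below are hypothetical, and the product calculates this for you:

```python
# Hypothetical per-question judgments from an evaluation of 10 questions.
# "error" marks responses the LLM judge was unable to assess.
judgments = ["resolved", "resolved", "unresolved", "resolved", "error",
             "resolved", "unresolved", "resolved", "resolved", "error"]

judged = [j for j in judgments if j != "error"]       # errored questions are excluded
resolved = sum(1 for j in judged if j == "resolved")

resolution_rate = resolved / len(judged) if judged else 0.0
print(f"{resolved}/{len(judged)} resolved -> resolution rate of {resolution_rate:.0%}")
# 6/8 resolved -> resolution rate of 75%
```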
Review individual responses in detail
For each question, you can view the response your agent provided and the LLM judge’s reasoning for the resolution status.
To view these details, select the icon in the Review column. This takes you to the Conversation review page, where you can see the question and the agent's response.
In the Conversation details panel you can see the resolution status and the LLM judge’s reason for the status. The reason includes details about the agent’s response and how it addressed the question. Read more about conversation review.