Evaluate the performance of your Rovo agent
Once you create an agent, you can run it through a set of evaluation tools to measure how well it responds to prompts.
Evaluations can help you:
quickly spot issues and make improvements before you launch your agent
see how well different versions of agents respond to prompts
compare results after you’ve made changes to your agent
To evaluate your agent’s performance, you must:
upload a dataset
run an evaluation
review evaluation results
Upload a dataset
To evaluate your agent, you must upload a dataset, which is a set of prompts created for testing your agent’s responses. Prompts can be either questions you expect your customers to ask or instruction-style prompts that tell the agent to carry out a specific action.
A dataset must be in CSV format, with one column for prompts and an optional second column for expected responses. Your CSV can’t exceed 50 prompts.
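As a minimal sketch, a script like the following checks a dataset file against the constraints described above (one prompt column, an optional expected-response column, at most 50 prompts). The function name and column layout here are illustrative assumptions, not part of the product.

```python
import csv

MAX_PROMPTS = 50  # limit stated in this article

def validate_dataset(path):
    """Check a dataset CSV against the documented constraints.

    Returns the number of prompts if the file is valid,
    otherwise raises ValueError. Assumes the first row is a header.
    """
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    if not rows:
        raise ValueError("Dataset is empty")
    header, data = rows[0], rows[1:]
    # One column for prompts, plus an optional second column
    # for expected responses.
    if len(header) not in (1, 2):
        raise ValueError(f"Expected 1 or 2 columns, found {len(header)}")
    if len(data) > MAX_PROMPTS:
        raise ValueError(
            f"Dataset has {len(data)} prompts; the limit is {MAX_PROMPTS}"
        )
    if any(not row or not row[0].strip() for row in data):
        raise ValueError("Every row needs a non-empty prompt")
    return len(data)
```

Running a check like this locally before uploading can save a round trip if a file has too many rows or a stray empty prompt.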
To upload a dataset:
In your agent settings, select Evaluation from the sidebar navigation.
From the Dataset tab, select Create dataset. A modal will appear.
Give your dataset a name and upload your CSV file.
Select Create.
Your dataset will appear on the page. You can expand it to view all the prompts in the dataset and remove any unwanted ones.
Run an evaluation
Once you have a dataset, you can run an evaluation to see how your agent responds to your prompts.
To run an evaluation:
Go to the Evaluations tab.
Select a dataset.
Select an evaluation type:
Response accuracy: Test accuracy against the expected responses in the dataset.
Resolution rate: Test the rate at which the agent can resolve support requests.
Manual testing: Generate responses in bulk without scoring, so you can evaluate them manually.
Select Run evaluation.
You can run up to 3 evaluations at once.
Review evaluation results
After the evaluation is complete, you can review the results to see how your agent performed.
To review results, find the evaluation in the table and select View results.
Occasionally, an error may occur, and the LLM will be unable to judge a response. When this happens, the prompt will not be included in the resolution rate calculation. To get a judgment from the LLM, you’ll need to run a new evaluation.
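To illustrate the exclusion rule above, here is a small sketch of how a resolution rate could be computed over judged prompts only, dropping any the judge errored on. The exact formula the product uses isn’t documented here, so treat this as an assumption; the article only states that errored prompts are excluded from the calculation.

```python
def resolution_rate(results):
    """Resolution rate over judged prompts only.

    `results` maps each prompt to "resolved", "unresolved", or "error".
    Prompts the LLM judge couldn't score ("error") are excluded from
    the calculation, matching the behavior this article describes.
    The labels themselves are illustrative assumptions.
    """
    judged = [v for v in results.values() if v != "error"]
    if not judged:
        return None  # nothing was judged
    return judged.count("resolved") / len(judged)
```

So an agent that resolves 2 of 3 judged prompts scores the same whether or not a fourth prompt hit a judging error.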
Reviewing individual responses in detail
For each prompt, you can view the response your agent provided and the LLM judge’s reasoning for the score.
To view these details, select the icon in the Review column. This takes you to the Conversation review page, where you can see the prompt and the agent’s response. You can also download the results as a CSV file.
In the Conversation details panel, you can see the score and the LLM judge’s reason for the evaluation status. The reason includes details about the agent’s response and how it addressed the prompt.