Understand and manage flaky tests in Bitbucket Pipelines

The Tests feature is in open beta and is available to Bitbucket Pipelines customers on Standard and Premium plans.

A flaky test is one that doesn’t yield the same result every time it is run. Sometimes it passes, and sometimes it fails, even though nothing has changed in the code, the test, or the environment.

This instability typically arises from timing issues, shared state, external dependencies, or environmental quirks, rather than genuine bugs in your application.

Because you can’t trust whether a failure is real, these unstable tests are confusing, noisy, and time‑consuming for developers to investigate.

Why do flaky tests matter?

Over time, flaky tests can hide real bugs and allow broken behavior to reach your users, eroding confidence in your entire test suite. Flaky tests also slow teams down: developers re‑run pipelines to “get a green build,” spend time triaging non‑issues, and may start ignoring failures altogether.

As your CI/CD pipelines and test suites grow, the wasted time and compute add up, increasing costs and delaying releases. Left unchecked, this ultimately lowers the quality of the code you ship.

Manage flaky tests

The Tests feature in Bitbucket Pipelines lets you manually mark tests as flaky, filter for them, and then take action on them.

The Test summaries view aggregates per‑test data across the latest executions (failure rate, average duration, variance) for up to 250 runs within the last 90 days. Patterns like intermittent failures and high variance make flaky tests stand out without you having to dig through raw logs.
The Test executions view lets you drill into individual runs (build, commit, outcome, duration, last executed) so you can confirm whether a test is truly flaky or failing consistently.
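
To make the aggregation in the Test summaries view concrete, here is a minimal Python sketch of how per-test metrics could be computed from raw executions. The `Execution` record and `summarize` function are hypothetical names used for illustration, not part of any Bitbucket API, and the sketch assumes that "variance" refers to the variance of run duration. Only the 250-run cap and the 90-day window come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, pvariance

MAX_RUNS = 250     # cap on executions considered, as stated above
WINDOW_DAYS = 90   # look-back window in days, as stated above


@dataclass
class Execution:
    """Hypothetical record of one test run (not a Bitbucket API type)."""
    finished_at: datetime
    passed: bool
    duration_s: float


def summarize(executions, now):
    """Aggregate the most recent runs of one test into summary metrics."""
    cutoff = now - timedelta(days=WINDOW_DAYS)
    # Keep runs inside the window, newest first, capped at MAX_RUNS.
    recent = sorted(
        (e for e in executions if e.finished_at >= cutoff),
        key=lambda e: e.finished_at,
        reverse=True,
    )[:MAX_RUNS]
    if not recent:
        return None
    durations = [e.duration_s for e in recent]
    return {
        "runs": len(recent),
        "failure_rate": sum(not e.passed for e in recent) / len(recent),
        "avg_duration_s": mean(durations),
        # Assumption: variance is measured on duration, not on outcome.
        "duration_variance": pvariance(durations),
    }
```

A test with a failure rate that is neither close to 0 nor close to 1, or with a large duration variance, is the kind of pattern worth confirming in the Test executions view.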

View and mark tests as flaky

To view flaky tests in Bitbucket, go to the Tests page, then filter for Flaky. Tests marked as flaky show Flaky in the Test state column.

To mark a test as flaky:

From the Tests page, find the test.
In the Test state column, select the dropdown.
Select Flaky.

Once you identify and mark a test as flaky, you can fix the test immediately. However, for cases where that’s not possible, you can quarantine the test.

Read about getting started with tests in Pipelines.

Automatic flaky test detection

Bitbucket Pipelines automatically evaluates recent runs of each test to identify unstable (flaky) outcomes. Every time a new test execution completes, the system recalculates a flakiness score ranging from 0 to 100.

The score is based on how frequently the test outcome flips, that is, how often it switches from pass to fail or from fail to pass. Recent flips are weighted more heavily than older ones, so the score reflects the test's current stability rather than its stale history. A test that frequently alternates between passing and failing scores high (flaky), while a test that consistently passes or consistently fails scores low (stable). For example, a test scoring 100 is certainly flaky, while a test scoring 0 is not flaky at all.
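
The exact formula behind the score is not documented here. The following Python sketch shows one way a recency-weighted flip score could behave, purely as an illustration. The `flakiness_score` function and its `decay` parameter are assumptions, not the algorithm Bitbucket uses.

```python
def flakiness_score(outcomes, decay=0.9):
    """Return a 0-100 score from pass/fail booleans ordered oldest first.

    Illustrative formula only: the weighted share of adjacent run pairs
    whose outcome flipped, with newer pairs weighted more heavily.
    """
    pairs = list(zip(outcomes, outcomes[1:]))
    if not pairs:
        return 0.0  # fewer than two runs: no flip can be observed
    n = len(pairs)
    # The newest pair gets weight 1; each step back in time is
    # multiplied by `decay` (a hypothetical tuning parameter).
    weights = [decay ** (n - 1 - i) for i in range(n)]
    flipped = sum(w for w, (a, b) in zip(weights, pairs) if a != b)
    return 100.0 * flipped / sum(weights)


always_pass = [True] * 10
alternating = [True, False] * 5
old_flips = [True, False, True, True, True, True]
new_flips = [True, True, True, True, False, True]

print(flakiness_score(always_pass))   # lowest possible score
print(flakiness_score(alternating))   # highest possible score
print(flakiness_score(old_flips) < flakiness_score(new_flips))
```

Under this sketch, a test that consistently fails scores the same as one that consistently passes, because neither outcome ever flips. That matches the description above: the score measures instability, not failure rate.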

If a test's score exceeds the flakiness threshold (80 by default) and the test has at least 10 executions, the system automatically marks the test as flaky. Automatic flaky test detection is enabled for all repositories by default.
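
The marking rule can be read as a simple predicate. The sketch below is an assumption-laden illustration: the function and parameter names are hypothetical, and it assumes the comparison is strictly greater than the threshold, so a score of exactly 80 does not trigger marking.

```python
DEFAULT_THRESHOLD = 80        # default flakiness score threshold
DEFAULT_MIN_EXECUTIONS = 10   # default minimum number of executions


def should_auto_mark_flaky(
    score,
    executions,
    threshold=DEFAULT_THRESHOLD,
    min_executions=DEFAULT_MIN_EXECUTIONS,
    detection_enabled=True,
):
    """Return True when a test qualifies for automatic flaky marking.

    Assumption: the score must be strictly above the threshold.
    """
    return (
        detection_enabled
        and executions >= min_executions
        and score > threshold
    )


print(should_auto_mark_flaky(score=85, executions=10))  # True
print(should_auto_mark_flaky(score=85, executions=9))   # False: too few runs
print(should_auto_mark_flaky(score=80, executions=50))  # False: not above 80
```

The minimum-executions check prevents a new test with only a handful of runs from being marked on thin evidence.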

Access and view auto-detected flaky tests

Automatic flaky test detection is enabled by default for all repositories, with a default flakiness score threshold of 80 and a default minimum of 10 test executions.

To review auto-detected flaky tests:

On the left sidebar, select Tests.
Review each test’s flaky status and score. You may notice tests that have been automatically marked as flaky because their score has exceeded the default threshold of 80.
If you think a test has been marked incorrectly, you can override its status or execution strategy.

If you manually update a test's status or execution strategy, the system will not change either one for 15 days after that update. The flakiness score continues to be recalculated as usual during those 15 days.
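
As a rough illustration of this hold period, the following Python sketch checks whether the system would be allowed to change a test's status again. The function name is hypothetical, and treating the 15-day boundary as inclusive is an assumption.

```python
from datetime import datetime, timedelta

MANUAL_OVERRIDE_HOLD = timedelta(days=15)  # hold period stated above


def system_may_update(last_manual_update, now):
    """Return True if automatic detection may change status or strategy.

    `last_manual_update` is None when the test was never edited by hand.
    Assumption: updates resume once exactly 15 days have elapsed.
    """
    if last_manual_update is None:
        return True
    return now - last_manual_update >= MANUAL_OVERRIDE_HOLD


edited = datetime(2024, 1, 1)
print(system_may_update(edited, datetime(2024, 1, 10)))  # False
print(system_may_update(edited, datetime(2024, 1, 16)))  # True
```

Only the status and execution strategy are frozen in this sketch. The score itself would still be recalculated after every execution, as described above.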

Configure automatic flaky test and quarantine detection settings (Admin only)

As a repository administrator, you can adjust the flakiness score threshold and the other automatic detection settings. To open these settings, either select the Configure detection button in the upper-right corner of the Tests page, or select Repository settings, then Tests, and then Detection settings.

You can toggle and update the following automatic detection settings:

Auto-flaky test detection: Enables or disables automatic flaky test detection. This setting is enabled by default.
Auto-quarantine test detection: Enables or disables automatic quarantine test detection. This setting is not enabled by default.
Flakiness score threshold: For flaky detection, the score above which a test is marked as flaky.
Minimum Runs: For flaky detection, the minimum number of executions required before a test can be marked as flaky.
Flakiness score threshold: For quarantine detection, the score above which a test is marked as quarantined.
Minimum Runs: For quarantine detection, the minimum number of executions required before a test can be marked as quarantined.

Quarantine flaky tests

For more details, refer to Quarantining flaky tests.
