Recover missing node caused by split-brain in Bitbucket Data Center

Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.

Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product; however, they have not been tested. Support for Server* products ended on February 15, 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.

*Except Fisheye and Crucible

Summary

In a Data Center cluster, a split-brain issue can occur when nodes lose communication with each other and form their own smaller clusters. You will typically notice this when the logs report timed-out connections and some nodes disappear from the Clustering administration page, even though they are still handling requests.

Diagnosis

Here are a few items to check if you suspect a split-brain scenario, ordered from the most accessible indicator to broader instance behavior. View them as a collective, as any single investigation point on its own could be the symptom of an entirely different issue.

Node missing from Clustering page

If you suspect you have a split-brain scenario, navigate to Administration > Settings > Clustering and take note of the nodes listed.

  • If a node is missing, connect to it via the command line and ensure it is up and the Bitbucket service is running (a status-check sketch follows below).

  • If your configuration allows you to bypass the proxy/load balancer, connect to the node directly via the user interface (UI) and navigate to Administration > Settings > Clustering to identify the nodes listed.

If the missing node is up and running and its clustering page shows only itself or a subset of nodes, you are in split-brain mode.
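
If you prefer to check a missing node's health from the command line, the following is a minimal sketch that queries the node's /status health-check endpoint directly, bypassing the load balancer. It assumes Bitbucket's default HTTP port 7990 and uses a hypothetical node hostname; adjust both for your environment.

# Minimal sketch: confirm a node is up by querying its /status endpoint
# directly. Assumes the default HTTP port 7990; the hostname is a placeholder.
import json
import urllib.request

NODE = "http://node2.example.com:7990"  # hypothetical node address

try:
    with urllib.request.urlopen(f"{NODE}/status", timeout=10) as resp:
        body = json.load(resp)
        # A healthy node typically reports {"state": "RUNNING"}
        print(f"{NODE} -> HTTP {resp.status}, state={body.get('state')}")
except Exception as exc:
    print(f"{NODE} is not responding: {exc}")

If the node reports RUNNING here but is absent from the Clustering page of the other nodes, that strengthens the case for a split-brain.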

Inconsistent state and stale caches

The cluster will become inconsistent, because both the main cluster and the sub-cluster (split-brain) are now serving requests:

  • Logging changes may only apply to the sub-cluster

  • Pull request rescopes may execute only on the sub-cluster

    • Depending on when the split happened, this may cause rescopes to run multiple times

  • Crowd/LDAP sync may run simultaneously across both clusters, causing inconsistency with users and groups

  • User/group renames will become stale because of broken cache syncing

  • Deleted projects and repositories will still be visible on the other cluster

Application logs report heartbeat timeouts

The atlassian-bitbucket.log file (there is one on each node in your cluster) will report any communication issues with other nodes. Here is an example of heartbeat timeout warnings.

2023-01-19 10:14:48,808 WARN [hz.hazelcast.cached.thread-128] c.h.i.cluster.impl.MembershipManager [127.0.0.2]:5701 [bitbucket-cluster-name] [3.12.12] Member [127.0.0.1]:5701 - 52252540-6226-4e19-8e28-73f880aae99f is suspected to be dead for reason: Suspecting Member [127.0.0.1]:5701 - 52252540-6226-4e19-8e28-73f880aae99f because it has not sent any heartbeats since 2023-01-19 10:13:48.443. Now: 2023-01-19 10:14:48.806, heartbeat timeout: 60000 ms, suspicion level: 1.00
2023-01-19 10:14:50,501 WARN [hz.hazelcast.cached.thread-1] c.h.i.c.impl.ClusterHeartbeatManager [127.0.0.3]:5701 [bitbucket-cluster-name] [3.12.12] Suspecting Member [127.0.0.1]:5701 - 52252540-6226-4e19-8e28-73f880aae99f because it has not sent any heartbeats since 2023-01-19 10:13:48.410. Now: 2023-01-19 10:14:50.499, heartbeat timeout: 60000 ms, suspicion level: 1.00
2023-01-19 10:14:50,231 WARN [hz.hazelcast.cached.thread-18] c.h.i.c.impl.ClusterHeartbeatManager [127.0.0.4]:5701 [bitbucket-cluster-name] [3.12.12] Suspecting Member [127.0.0.1]:5701 - 52252540-6226-4e19-8e28-73f880aae99f because it has not sent any heartbeats since 2023-01-19 10:13:48.450. Now: 2023-01-19 10:14:50.229, heartbeat timeout: 60000 ms, suspicion level: 1.00

These are just a few of the issues you may experience. Overall, the cluster will be inconsistent, and cache synchronization to keep all nodes updated will break down.
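
To find these warnings quickly, here is a minimal sketch that scans a node's atlassian-bitbucket.log for the Hazelcast heartbeat messages shown above. The log path assumes a default home directory and should be adjusted to your $BITBUCKET_HOME.

# Minimal sketch: list Hazelcast heartbeat warnings from a node's log.
# The path below assumes a default Bitbucket home directory; adjust as needed.
from pathlib import Path

LOG = Path("/var/atlassian/application-data/bitbucket/log/atlassian-bitbucket.log")

markers = ("heartbeat timeout", "Suspecting Member", "suspected to be dead")
for line in LOG.read_text(errors="replace").splitlines():
    if any(marker in line for marker in markers):
        print(line)

Run it on each node: if different nodes suspect different members, the cluster has most likely partitioned.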

Cause

The cause is network partitioning: the network is split so that one set of nodes cannot see the other. This network failure can have many causes, and your network team must investigate internally to determine why the nodes lost communication. Common culprits include:

  • Multicast node discovery is enabled but the network prohibits multicast communication (common in cloud environments such as AWS and Azure).

  • Multiple network interfaces are present, which can affect both multicast capability and OS-level routing.

  • The firewall is blocking the incoming or outgoing ports for Hazelcast.
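
If a firewall is suspected, a basic reachability test from each node toward the others can confirm whether the Hazelcast port is open. The sketch below assumes port 5701 (the port shown in the log excerpts above) and uses placeholder member IPs.

# Minimal sketch: from one node, test TCP connectivity to the Hazelcast
# port on every other cluster member. The IPs are placeholders; 5701
# matches the port shown in the log excerpts above.
import socket

MEMBERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical node IPs
HAZELCAST_PORT = 5701

for host in MEMBERS:
    try:
        with socket.create_connection((host, HAZELCAST_PORT), timeout=5):
            print(f"{host}:{HAZELCAST_PORT} reachable")
    except OSError as exc:
        print(f"{host}:{HAZELCAST_PORT} NOT reachable: {exc}")

Run the test from every node; a partition typically shows up as one group of nodes unable to reach the other group.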

Solution

Recover from split-brain - Verify network connectivity works

  • Double-check that all node IPs are listed in the hazelcast.network.tcpip.members parameter in $BITBUCKET_HOME/shared/bitbucket.properties (for tcp_ip node discovery); a quick comparison sketch follows this list.

    • If using hazelcast.network.multicast=true (for multicast node discovery), verify that all the nodes use the same multicast address. Investigate with your networking team's multicast expert.

  • Restart the nodes that left the cluster one at a time, and ensure that each one rejoins the cluster (go to Administration > Settings > Clustering) before starting the next node.
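
To compare discovery settings across nodes, here is a minimal sketch that prints every hazelcast.* entry from a node's bitbucket.properties. The path assumes the default shared home location, so adjust it to your $BITbucket_HOME if it differs. Every node should report the same discovery mode and the same member list.

# Minimal sketch: print Hazelcast-related settings from bitbucket.properties
# so they can be compared across nodes. The path assumes a default shared home.
from pathlib import Path

PROPS = Path("/var/atlassian/application-data/bitbucket/shared/bitbucket.properties")

for raw in PROPS.read_text().splitlines():
    line = raw.strip()
    if line.startswith("hazelcast."):
        print(line)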
