Recover missing node caused by split-brain in Bitbucket Data Center
Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.
Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Summary
In a Data Center cluster, a split-brain scenario can occur when nodes lose communication with each other and form their own smaller clusters. You will typically notice this when the logs show timed-out connections and some nodes disappear from the Clustering administration page, even though they are still handling requests.
Diagnosis
The following checks help you diagnose a split-brain scenario, ordered from the most accessible indicator to instance behavior. Consider them collectively, as any single investigation point on its own could be caused by an entirely different issue.
Node missing from Clustering page
If you suspect you have a split-brain scenario, navigate to Administration > Settings > Clustering and take note of the nodes listed.
If a node is missing, connect to it via the command line and ensure it is up and the Bitbucket service is running.
If your configuration allows bypassing the proxy/load balancer, connect directly to the missing node via the user interface (UI) and navigate to Administration > Settings > Clustering to see which nodes it lists.
If the missing node is up and running and its Clustering page shows only itself or a subset of the nodes, the cluster is in a split-brain state.
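As a quick command-line check that the node is up and Bitbucket is responding, you can also query the node's status endpoint directly. This is a minimal sketch, assuming Bitbucket's default HTTP port 7990 and using <node-ip> as a placeholder for the missing node's address:
# Replace <node-ip> with the missing node's address; 7990 is Bitbucket's default HTTP port
curl -s http://<node-ip>:7990/status
A node that is up and serving requests typically responds with {"state":"RUNNING"}.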
Inconsistent state and stale caches
The cluster will become inconsistent because both the main cluster and the sub-cluster (split-brain) are now serving requests:
Changes to logging may apply only to the sub-cluster
Pull request rescopes may execute only on the sub-cluster
Depending on when the split happened, this may cause rescopes to run multiple times
Crowd/LDAP sync may run simultaneously across both clusters, causing inconsistency with users and groups
User/group renames will become stale because of broken cache syncing
Deleted projects and repositories will still be visible on the other cluster
Application logs report heartbeat timeouts
The atlassian-bitbucket.log file (there is one on each node in your cluster) will report any communication issues the node has with other nodes. Here is an example of a heartbeat timeout:
2023-01-19 10:14:48,808 WARN [hz.hazelcast.cached.thread-128] c.h.i.cluster.impl.MembershipManager [127.0.0.2]:5701 [bitbucket-cluster-name] [3.12.12] Member [127.0.0.1]:5701 - 52252540-6226-4e19-8e28-73f880aae99f is suspected to be dead for reason: Suspecting Member [127.0.0.1]:5701 - 52252540-6226-4e19-8e28-73f880aae99f because it has not sent any heartbeats since 2023-01-19 10:13:48.443. Now: 2023-01-19 10:14:48.806, heartbeat timeout: 60000 ms, suspicion level: 1.00
2023-01-19 10:14:50,501 WARN [hz.hazelcast.cached.thread-1] c.h.i.c.impl.ClusterHeartbeatManager [127.0.0.3]:5701 [bitbucket-cluster-name] [3.12.12] Suspecting Member [127.0.0.1]:5701 - 52252540-6226-4e19-8e28-73f880aae99f because it has not sent any heartbeats since 2023-01-19 10:13:48.410. Now: 2023-01-19 10:14:50.499, heartbeat timeout: 60000 ms, suspicion level: 1.00
2023-01-19 10:14:50,231 WARN [hz.hazelcast.cached.thread-18] c.h.i.c.impl.ClusterHeartbeatManager [127.0.0.4]:5701 [bitbucket-cluster-name] [3.12.12] Suspecting Member [127.0.0.1]:5701 - 52252540-6226-4e19-8e28-73f880aae99f because it has not sent any heartbeats since 2023-01-19 10:13:48.450. Now: 2023-01-19 10:14:50.229, heartbeat timeout: 60000 ms, suspicion level: 1.00
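To check whether a particular node is logging these warnings, you can search its log directly. A minimal sketch, assuming the default log location under $BITBUCKET_HOME/log:
# Search for Hazelcast heartbeat suspicion warnings on this node
grep -i "heartbeat" $BITBUCKET_HOME/log/atlassian-bitbucket.log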
These are just a few of the issues you may experience. Overall, the cluster will be inconsistent, and cache synchronization to keep all nodes updated will break down.
Cause
The cause is network partitioning: the network is split so that one set of nodes cannot see the other. This network failure can have many causes, and your network team must investigate internally to determine why the nodes lost communication. Common contributing factors include:
Multicast discovery is enabled, but the network prohibits multicast communication (for example, in AWS, Azure, and similar cloud environments).
Multiple network interfaces are present, which can affect multicast capability and lead to incorrect OS-level routing.
The firewall is blocking the incoming or outgoing ports for Hazelcast.
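To rule out a firewall or routing problem between nodes, you can test the Hazelcast port from one node to another. A minimal sketch, assuming the default Hazelcast port 5701 (the port shown in the log excerpts above) and using <other-node-ip> as a placeholder for a peer node's address:
# Test TCP connectivity to the Hazelcast port on a peer node
nc -vz <other-node-ip> 5701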
Solution
Recover from split-brain - Verify network connectivity works
Double-check that all node IPs are listed in $BITBUCKET_HOME/shared/bitbucket.properties for the parameter hazelcast.network.tcpip.members (for tcp_ip node discovery); see the example below.
If using hazelcast.network.multicast=true (for multicast node discovery), verify that all the nodes use the same multicast address. Investigate with your networking team's multicast expert.
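For reference, here is a minimal sketch of the relevant entries in $BITBUCKET_HOME/shared/bitbucket.properties for tcp_ip discovery. The node IPs are placeholders, and 5701 is the Hazelcast port shown in the log excerpts above; adjust both to match your environment:
# Every cluster node must appear in this list, and the list should be identical on every node
hazelcast.network.tcpip=true
hazelcast.network.tcpip.members=10.0.0.1:5701,10.0.0.2:5701,10.0.0.3:5701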
Restart the nodes that left the cluster one at a time, and ensure that each one rejoins the cluster (go to Administration > Settings > Clustering) before starting the next node.
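How you restart a node depends on how Bitbucket was installed on it. As a hedged example, assuming the bundled start/stop scripts in the Bitbucket installation directory (service names and script paths vary by installation):
# Run on the node that left the cluster, then confirm it appears on the Clustering page before moving to the next node
<Bitbucket installation directory>/bin/stop-bitbucket.sh
<Bitbucket installation directory>/bin/start-bitbucket.sh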