Nodes in the Bitbucket Data Center abruptly leave the cluster
Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.
Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Summary
After running for some time, nodes in a Bitbucket Data Center cluster unexpectedly leave the cluster.
Environment
Bitbucket Data Center 8.9.5
Diagnosis
The atlassian-bitbucket.log file on one of the nodes will contain the following error stack trace:
2024-01-25 14:32:36,133 WARN [hz.hazelcast.InvocationMonitorThread] c.h.s.i.o.impl.InvocationMonitor [172.XX.XX.7]:5701 [<Hazelcast-Group-Name>] [3.12.13] MonitorInvocationsTask delayed 93449 ms
2024-01-25 14:33:17,416 WARN [hz.hazelcast.cached.thread-4] c.a.s.i.h.OptimisticOutOfMemoryHandler OutOfMemoryError occurred attempting to continue operating as normal
java.lang.OutOfMemoryError: Java heap space
2024-01-25 14:33:17,416 ERROR [httpclient-io:thread-1] o.a.h.i.n.c.InternalHttpAsyncClient I/O reactor terminated abnormally
org.apache.http.nio.reactor.IOReactorException: I/O dispatch worker terminated abnormally
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:359)
    at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
    at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.OutOfMemoryError: Java heap space
2024-01-25 14:33:17,417 ERROR [httpclient-io:thread-1] o.a.h.i.n.c.InternalHttpAsyncClient I/O reactor terminated abnormally
org.apache.http.nio.reactor.IOReactorException: I/O dispatch worker terminated abnormally
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:359)
    at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
    at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.OutOfMemoryError: Java heap space
The other node(s) will have the following events in their atlassian-bitbucket.log file:
2024-01-25 14:32:03,010 WARN [hz.hazelcast.cached.thread-2] c.h.i.c.impl.ClusterHeartbeatManager [172.XX.XX.6]:5701 [<Hazelcast-Group-Name>] [3.12.13] Suspecting Member [172.XX.XX.7]:5701 - dfb29ffe-ffdc-49a4-b79b-c5dc065f5d1b because it has not sent any heartbeats since 2024-01-25 14:30:58.217. Now: 2024-01-25 14:32:02.145, heartbeat timeout: 60000 ms, suspicion level: 1.00
2024-01-25 14:32:03,010 WARN [hz.hazelcast.cached.thread-2] c.h.i.cluster.impl.MembershipManager [172.XX.XX.6]:5701 [<Hazelcast-Group-Name>] [3.12.13] Member [172.XX.XX.7]:5701 - dfb29ffe-ffdc-49a4-b79b-c5dc065f5d1b is suspected to be dead for reason: Suspecting Member [172.XX.XX.7]:5701 - dfb29ffe-ffdc-49a4-b79b-c5dc065f5d1b because it has not sent any heartbeats since 2024-01-25 14:30:58.217. Now: 2024-01-25 14:32:02.145, heartbeat timeout: 60000 ms, suspicion level: 1.00
2024-01-25 14:32:03,019 INFO [hz.hazelcast.event-4] c.a.s.i.c.HazelcastClusterService Node '/172.XX.XX.7:5701 (node2)' was REMOVED from the cluster. Updated cluster: [/172.XX.XX.6:5701 master this name='node1' uuid='70ff0ac5-9712-4c18-a05a-95bf50fc44f0' vm-id='855cfaec-fc0a-40b6-85cb-2459005b1980']
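To check several nodes for the same signature, the log entries above can be searched for programmatically. The following is a minimal sketch, not an Atlassian-provided tool: the regular expressions are derived only from the excerpts shown in this article, and the function names and the log path passed to scan_log are hypothetical.

```python
import re

# Patterns derived from the log excerpts in this article (assumption:
# your log lines use the same wording; adjust if they differ).
PATTERNS = {
    "out_of_memory": re.compile(r"java\.lang\.OutOfMemoryError"),
    "monitor_delayed": re.compile(r"MonitorInvocationsTask delayed (\d+) ms"),
    "member_suspected": re.compile(r"Suspecting Member \[([^\]]+)\]:(\d+)"),
    "node_removed": re.compile(r"Node '([^']+)' was REMOVED from the cluster"),
}


def classify_lines(lines):
    """Return (event_name, line) pairs for lines matching a known pattern."""
    events = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((name, line.rstrip("\n")))
                break  # report each line at most once
    return events


def scan_log(path):
    """Classify the lines of a log file; the path is supplied by the caller."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return classify_lines(handle)
```

Running scan_log against the atlassian-bitbucket.log of each node should show out_of_memory and monitor_delayed events on the affected node, and member_suspected and node_removed events on the others.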
Cause
The first stack trace occurs while Bitbucket synchronizes an external user directory whose user and group population is too large for the heap size currently configured in Bitbucket. With the heap exhausted, the Hazelcast thread on the affected node cannot send heartbeats to the other nodes. Once the heartbeat timeout (60000 ms in the log above) is exceeded, the other nodes in the cluster mark this node as dead and remove it from the cluster.
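The timing can be verified from the timestamps in the ClusterHeartbeatManager warning. The sketch below only reproduces that arithmetic; the helper name heartbeat_gap_ms is hypothetical, and the values are copied from the log excerpt in the Diagnosis section.

```python
from datetime import datetime

# Timestamp layout used inside the Hazelcast warning message.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def heartbeat_gap_ms(last_heartbeat, now):
    """Milliseconds between the last heartbeat received and the time of the check."""
    delta = datetime.strptime(now, FMT) - datetime.strptime(last_heartbeat, FMT)
    return round(delta.total_seconds() * 1000)


# Values copied from the warning logged by the surviving node.
gap = heartbeat_gap_ms("2024-01-25 14:30:58.217", "2024-01-25 14:32:02.145")
timeout_ms = 60000

print(gap, gap > timeout_ms)  # prints "63928 True"
```

The gap of 63928 ms exceeds the 60000 ms heartbeat timeout, which is why the member was suspected. The 93449 ms delay reported by MonitorInvocationsTask on the affected node is consistent with a stall of that length.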
Solution
Solution 1:
Limit the set of users and groups that Bitbucket synchronizes by defining an appropriate filter in the external directory's configuration in Bitbucket.
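As an illustration of such a filter, one common approach for LDAP directories is to restrict users to members of a single group. The sketch below builds that filter string. It assumes an Active Directory style schema in which user entries expose a memberOf attribute; the base filter, the group DN, and the function names are hypothetical examples, so verify the resulting filter against your own directory before using it.

```python
def escape_filter_value(value):
    """Escape characters that RFC 4515 reserves inside an LDAP filter value."""
    replacements = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}
    return "".join(replacements.get(ch, ch) for ch in value)


def restrict_to_group(base_filter, group_dn):
    """AND an existing LDAP filter with a memberOf clause for one group."""
    return f"(&{base_filter}(memberOf={escape_filter_value(group_dn)}))"


# Hypothetical values: replace with your current user filter and group DN.
user_filter = restrict_to_group(
    "(&(objectCategory=Person)(sAMAccountName=*))",
    "cn=bitbucket-users,ou=groups,dc=example,dc=com",
)
print(user_filter)
```

With a filter of this kind, only users who are members of the named group are synchronized, which reduces the amount of data held in the heap during synchronization.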
Solution 2:
Increase the heap size so that it can accommodate the large number of users and groups loaded during directory synchronization.
For further information on increasing the heap size in Bitbucket, refer to the Set heap size for Java on Bitbucket Server and Data Center article.
We do not recommend increasing the heap to a very high value as it may constrain the amount of memory available for Git processes, which may result in poor performance.