Bitbucket Data Center cluster node located in another subnet is not joining the cluster
Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.
Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Summary
Although the simplest and recommended approach is to deploy all cluster nodes within the same subnet, there is sometimes a requirement to deploy the cluster nodes across multiple subnets (e.g. during a migration between subnets, or to implement high availability at the network level).
If you have a complex network installation and multiple cluster nodes, in certain cases you may observe that a new node hangs in "Starting" status (http://<IP_of_the_node>/status) or is kicked out of the cluster immediately after joining. Alternatively, an already running node can be kicked out of the cluster with the error Unexpected bytes from remote node, closing socket.
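As a quick check during diagnosis, the node's status endpoint mentioned above can be polled programmatically. Below is a minimal sketch; the node IP is a placeholder, and it assumes the /status endpoint returns JSON with a state field (as Bitbucket's does):

```python
import json
import urllib.request

# Placeholder address -- substitute the IP of the node being diagnosed.
NODE_STATUS_URL = "http://10.103.137.232/status"

def node_state(url: str, timeout: float = 5.0) -> str:
    """Fetch the Bitbucket /status endpoint and return the reported state,
    e.g. "STARTING" or "RUNNING"."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["state"]
```

A node that is healthy and serving traffic reports RUNNING; a node stuck joining the cluster keeps reporting STARTING.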
Diagnosis
Before proceeding with the troubleshooting, please verify the following prerequisites:
The routing table between multiple subnets is correct.
Each node in the cluster can reach every other node by IP address on port 5701.
The node is correctly added to, and reachable from, the load balancer.
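The port-reachability prerequisite can be verified with a short script run from each node in turn. A minimal sketch, assuming the two node IPs from the log excerpts below (substitute your own addresses):

```python
import socket

# Placeholder node addresses -- replace with your cluster's actual IPs.
NODES = ["10.103.137.232", "10.200.34.176"]
HAZELCAST_PORT = 5701

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Call can_reach(node_ip, HAZELCAST_PORT) from every node against every other node; any False result points to a routing or firewall problem between the subnets rather than an application problem.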
If all of the prerequisites are met and the issue persists, please consult with your network administrator.
In addition, you may still see the node failing to join the cluster, with the following warning written to atlassian-bitbucket.log.
The logs of the impacted node:
2020-10-20 09:00:37,885 WARN [hz.hazelcast.InvocationMonitorThread] c.h.s.i.o.impl.Invocation [10.103.137.232]:5701 [stash-cluster] [3.11.1] Retrying invocation: Invocation{op=com.hazelcast.map.impl.operation.MapGetInvalidationMetaDataOperation{serviceName='hz:impl:mapService', identityHash=428345166, partitionId=-1, replicaIndex=0, callId=208097, invocationTime=1608273566453 (2020-12-18 06:39:26.453), waitTimeout=-1, callTimeout=600000}, tryCount=250, tryPauseMillis=500, invokeCount=110, callTimeoutMillis=600000, firstInvocationTimeMs=1608236246025, firstInvocationTime='2020-12-17 20:17:26.025', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 01:00:00.000', target=[10.200.34.176]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=8, connection=null}, Reason: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: [10.200.34.176]:5701, partitionId: -1, operation: com.hazelcast.map.impl.operation.MapGetInvalidationMetaDataOperation, service: hz:impl:mapService
The logs of any other node in the cluster:
2020-10-20 09:02:10,023 WARN [hz.hazelcast.cached.thread-1] com.hazelcast.nio.tcp.TcpIpAcceptor [10.200.33.241]:5701 [stash-cluster] [3.11.1] com.atlassian.stash.internal.cluster.NodeConnectionException: Unexpected bytes from remote node, closing socket
com.atlassian.stash.internal.cluster.NodeConnectionException: Unexpected bytes from remote node, closing socket
	at com.atlassian.stash.internal.cluster.DefaultClusterJoinManager.accept(DefaultClusterJoinManager.java:102)
	at com.atlassian.stash.internal.hazelcast.ClusterJoinSocketInterceptor.onAccept(ClusterJoinSocketInterceptor.java:49)
	at com.hazelcast.nio.NodeIOService.interceptSocket(NodeIOService.java:193)
	at com.hazelcast.nio.tcp.TcpIpAcceptor$AcceptorIOThread.configureAndAssignSocket(TcpIpAcceptor.java:291)
	at com.hazelcast.nio.tcp.TcpIpAcceptor$AcceptorIOThread.access$1400(TcpIpAcceptor.java:131)
	at com.hazelcast.nio.tcp.TcpIpAcceptor$AcceptorIOThread$1.run(TcpIpAcceptor.java:280)
	at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.lang.Thread.run(Thread.java:748)
	at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
	at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
	... 1 frame trimmed
Cause
There are two potential root causes for this issue:
1. If you see only "Unexpected bytes from remote node, closing socket" while all the nodes have successfully joined the cluster and are online, the warning can be caused by a security scanner reaching the node on port 5701, or by some other Hazelcast-enabled application that does not have the correct Hazelcast cluster group configuration.
2. If you see "Unexpected bytes from remote node, closing socket" in the logs and one of the nodes cannot join the cluster (or separate clusters are being formed in each subnet), the issue is caused by the MTU set on the network interface of the application node's OS not being supported by the gateway between the two or more subnets.
Solution
In the first scenario, the warning can be disregarded once you have identified the application trying to join the wrong cluster. The issue needs to be fixed on the remote application side, or the port can be excluded from the security scans.
In the second scenario, you need to determine an MTU value that is supported by both the application server OS and the OS of the gateway, then set it on the network interface of the application node. After the MTU value is set, restart Bitbucket on the node where the MTU was adjusted.