Recovering from a Data Center cluster split-brain
Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.
Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product; however, they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
This article applies to Confluence Data Center 5.8.5 or later.
Symptoms
A Confluence Data Center node will not start up, and you see the following message in the Confluence logs (<confluence-home>/logs/atlassian-confluence.log):
2014-08-15 15:23:00,023 ERROR [scheduler_Worker-6] [confluence.cluster.safety.ClusterPanicListener] onClusterPanicEvent Received a panic event, stopping processing on the node: Clustered Confluence: Database is being updated by an instance which is not part of the current cluster. You should check network connections between cluster nodes, especially multicast traffic.
2014-08-15 15:23:00,035 WARN [scheduler_Worker-6] [confluence.cluster.safety.ClusterPanicListener] onClusterPanicEvent com.atlassian.confluence.cluster.hazelcast.HazelcastClusterInformation@29f82619
2014-08-15 15:23:00,036 WARN [scheduler_Worker-6] [confluence.cluster.safety.ClusterPanicListener] onClusterPanicEvent Shutting down Quartz scheduler
This is known as a cluster split-brain (sometimes called a cluster panic) and can happen on any node. For example, if you restart a node, you may see the cluster split-brain message above on the same node or on a different node.
Background
The cluster safety mechanism is designed to ensure that Confluence cannot become inconsistent through updates made by one user not being visible to another. A failure of this mechanism is a fatal error in Confluence and is called a cluster split-brain. Because the cluster safety mechanism helps prevent data inconsistency whenever two copies of Confluence run against the same database, it is enabled in all instances of Confluence, not just Confluence Data Center.
How the cluster safety mechanism works
A scheduled task, ClusterSafetyJob, runs every 30 seconds. In a cluster, this job is run only on one of the nodes. The scheduled task operates on a safety number – a randomly generated number that is stored both in the database and in the distributed cache used across a cluster. It does the following:
Generate a new random number
Compare the existing safety numbers, if there is already a safety number in both the database and the cache.
If the numbers differ, publish a ClusterPanicEvent. Currently in Confluence, this causes the following to happen on each node in the cluster:
disable all access to the application
disable all scheduled tasks
In Confluence 5.5 and earlier, update the database safety number to a new value, which causes all nodes accessing the database to fail. From Confluence 5.6 onwards, the database safety number is not updated, allowing the other Confluence node(s) to continue processing requests.
If the numbers are the same or aren't set yet, update the safety numbers:
set the safety number in the database to the new random number
set the safety number in the cache to the new random number.
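In code form, the check amounts to a compare-and-set over the two stored numbers. The following is a minimal sketch of that logic only, not Confluence's actual implementation; all of the accessor methods and publishClusterPanicEvent() are hypothetical stand-ins.
import java.util.Random;

// Minimal sketch of the cluster safety check described above (illustration
// only, not Confluence's implementation). All accessors and
// publishClusterPanicEvent() are hypothetical stand-ins.
public class ClusterSafetySketch {
    private final Random random = new Random();

    void runClusterSafetyCheck() {
        long newNumber = random.nextLong();          // 1. generate a new random number
        Long inDatabase = getDatabaseSafetyNumber();
        Long inCache = getCacheSafetyNumber();

        // 2. if a safety number exists in both places, compare them
        if (inDatabase != null && inCache != null && !inDatabase.equals(inCache)) {
            // An instance outside this cluster has rotated the database number:
            // raise a panic, which disables access and scheduled tasks on each node.
            publishClusterPanicEvent();
            return;
        }

        // 3. numbers match (or are not set yet): rotate both to the new value
        setDatabaseSafetyNumber(newNumber);
        setCacheSafetyNumber(newNumber);
    }

    // Hypothetical stand-ins for database/cache access and event publishing:
    private Long getDatabaseSafetyNumber() { return null; }
    private Long getCacheSafetyNumber() { return null; }
    private void setDatabaseSafetyNumber(long n) { }
    private void setCacheSafetyNumber(long n) { }
    private void publishClusterPanicEvent() { }
}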
Diagnosis
Cluster split-brain can have a number of causes.
If confluence.cluster.join.type is set to multicast, you should:
Check that the network connectivity for multicast traffic is working between the nodes.
Check that the same multicast address is being used by all the nodes.
To determine the multicast address being used by a node, look in the Confluence logs (<confluence-home>/logs/atlassian-confluence.log) for the string Configuring Hazelcast with. For example:
2014-08-15 15:20:08,140 INFO [RMI TCP Connection(4)-127.0.0.1] [confluence.cluster.hazelcast.HazelcastClusterManager] configure Configuring Hazelcast with instanceName [nutella-buster], multicast address 238.150.128.250:54327, multicast TTL [1], network interfaces [fe80:0:0:0:0:0:0:1%1, 0:0:0:0:0:0:0:1, 127.0.0.1] and local port 580
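To confirm that multicast traffic actually flows between two hosts, you can run a small standalone probe on each node at the same time, pointed at the multicast address and port reported in the log line above. The class below is an illustration only (it is not an Atlassian tool): each instance sends one datagram and then prints everything it receives, so seeing the other node's probe confirms multicast connectivity (a node may also see its own probe).
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.nio.charset.StandardCharsets;

// Standalone multicast probe (illustration only, not an Atlassian tool).
// Usage: java MulticastProbe 238.150.128.250 54327
public class MulticastProbe {
    public static void main(String[] args) throws Exception {
        InetAddress group = InetAddress.getByName(args[0]);
        int port = Integer.parseInt(args[1]);

        try (MulticastSocket socket = new MulticastSocket(port)) {
            socket.setTimeToLive(1);   // match Confluence's default multicast TTL
            socket.joinGroup(group);

            // Send one probe so the other node has something to receive
            byte[] hello = ("probe from " + InetAddress.getLocalHost())
                    .getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(hello, hello.length, group, port));

            // Print whatever arrives (you may also see your own probe)
            byte[] buf = new byte[1024];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                System.out.println(packet.getAddress() + ": "
                        + new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8));
            }
        }
    }
}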
If confluence.cluster.join.type is set to tcp_ip, you should:
Check that all the nodes of the cluster can reach each other on the Hazelcast port (default 5801) with one of the commands below:
telnet <cluster.node.ip.address> 5801
nc <cluster.node.ip.address> 5801
curl <cluster.node.ip.address>:5801
nmap <cluster.node.ip.address> -p 5801
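If none of those tools are installed on the host, the same reachability test can be made from any machine with a JVM using a throwaway class like the one below (illustrative only; the arguments are placeholders for your node address and port).
import java.net.InetSocketAddress;
import java.net.Socket;

// Quick TCP reachability check for the Hazelcast port (illustration only).
// Usage: java PortCheck <cluster.node.ip.address> 5801
public class PortCheck {
    public static void main(String[] args) throws Exception {
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5_000); // 5 s timeout
            System.out.println("Reachable: " + host + ":" + port);
        } catch (Exception e) {
            System.out.println("NOT reachable: " + host + ":" + port + " (" + e + ")");
        }
    }
}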
Resolution
To recover from a cluster split-brain:
Verify that network connectivity between all the nodes is working.
Double-check the confluence.cluster.peers parameter (for tcp_ip node discovery) to confirm that all node IPs are listed, and the confluence.cluster.address parameter (for multicast node discovery) to confirm that the same multicast address is used by all the nodes, in confluence.cfg.xml (see the example excerpt after this list).
Double-check that the confluence.cluster.join.type parameter in confluence.cfg.xml is the same on all the nodes.
Restart the nodes that panicked one at a time, and ensure that each one rejoins the cluster (go to Administration > General Configuration > Clustering) before starting the next node.
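For reference, the relevant properties in confluence.cfg.xml look like the excerpt below. Only the form matching your join type applies; the IPs and multicast address shown are placeholders, and the values must match on every node.
<!-- For tcp_ip discovery: the same peer list on every node -->
<property name="confluence.cluster.join.type">tcp_ip</property>
<property name="confluence.cluster.peers">10.0.0.1,10.0.0.2,10.0.0.3</property>

<!-- For multicast discovery: the same multicast address on every node -->
<property name="confluence.cluster.join.type">multicast</property>
<property name="confluence.cluster.address">238.150.128.250</property>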