Cluster Cache Replication HealthCheck fails due to ConnectIOException: Exception creating connection

Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.

Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product; however, they have not been tested. Support for Server* products ended on February 15, 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.

*Except Fisheye and Crucible

Summary

Cluster Cache replication health check fails on one of the nodes and the nodes cannot communicate with each other to replicate the cache.

Name: Cluster Cache Replication
NodeId: null
Is healthy: false
Failure reason: The node node2 is not replicating
Severity: CRITICAL
Additional links: []

The following errors appear in the atlassian-jira.log:

2022-05-09 11:59:07,774+0200 localq-reader-14 INFO      [c.a.j.c.distribution.localq.LocalQCacheOpReader] [LOCALQ] [VIA-COPY] Checked exception: RecoverableFailure occurred when processing: LocalQCacheOp{cacheName='com.atlassian.jira.cluster.dbr.DBRMessage', action=PUT, key=DBR, value == null ? false, replicatePutsViaCopy=true, creationTimeInMillis=1652090341285} from cache replication queue: [queueId=queue_node1_4_164546f60261c7e4be0c5f5f9aaeec86_put, queuePath=C:\JIRA\JIRA-HOME\localq\queue_node1_4_164546f60261c7e4be0c5f5f9aaeec86_put], failuresCount: 1. Will not retry as this is a cache replicated by value. Removing from queue.
com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpSender$RecoverableFailure: java.rmi.ConnectIOException: Exception creating connection to: JiraNode1.prod; nested exception is:
    java.net.SocketTimeoutException: connect timed out
    at com.atlassian.jira.cluster.distribution.localq.rmi.LocalQCacheOpRMISender.send(LocalQCacheOpRMISender.java:88)
    at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpReader.run(LocalQCacheOpReader.java:96)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.rmi.ConnectIOException: Exception creating connection to: JiraNode1Prod.prod.lt.trans; nested exception is:
    java.net.SocketTimeoutException: connect timed out
    at java.rmi/sun.rmi.transport.tcp.TCPEndpoint.newSocket(Unknown Source)
    at java.rmi/sun.rmi.transport.tcp.TCPChannel.createConnection(Unknown Source)
    at java.rmi/sun.rmi.transport.tcp.TCPChannel.newConnection(Unknown Source)
    at java.rmi/sun.rmi.server.UnicastRef.newCall(Unknown Source)
    at com.atlassian.jira.cluster.distribution.localq.rmi.BasicRMICachePeerProvider.lookupRemoteCachePeer(BasicRMICachePeerProvider.java:65)
    ... 6 more
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.base/java.net.PlainSocketImpl.waitForConnect(Native Method)
    at java.base/java.net.PlainSocketImpl.socketConnect(Unknown Source)
    at java.base/java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
    at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
    at java.base/java.net.AbstractPlainSocketImpl.connect(Unknown Source)
    at java.base/java.net.Socket.connect(Unknown Source)

Environment

Jira (Data Center)

Diagnosis

The stack trace shows that there is an exception creating a connection from node2 to JiraNode1.prod. Telnet or netcat tests help identify whether there is a problem with network communication between the nodes.

From node2's server, run the commands below:

telnet <hostname or IP of node1> <ehcache.listener.port>
telnet <hostname or IP of node1> <ehcache.object.port>

Or,

nc -vnz -w 1 <IP of node1> <ehcache.listener.port>
nc -vnz -w 1 <IP of node1> <ehcache.object.port>

To identify the ehcache.listener.port and ehcache.object.port values, either check the cluster.properties file under $JIRA_HOME on the node, or use the default values documented in Ports used by Atlassian applications: https://confluence.atlassian.com/kb/ports-used-by-atlassian-applications-960136309.html
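For reference, the relevant portion of a cluster.properties file looks roughly like the following sketch; the node ID, hostname, and ports are illustrative (40001 and 40011 are the commonly used defaults for the listener and object ports):

jira.node.id = node2
ehcache.listener.hostName = JiraNode2.prod
ehcache.listener.port = 40001
ehcache.object.port = 40011

With those example values, the tests from node2 against node1 (assuming node1 resolves to the hypothetical address 10.0.0.11) would be:

telnet JiraNode1.prod 40001
telnet JiraNode1.prod 40011
nc -vnz -w 1 10.0.0.11 40001
nc -vnz -w 1 10.0.0.11 40011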


Depending on the number of nodes, administrators will have to perform additional network tests to confirm that the replication ports are bidirectionally open between all nodes and that replication can succeed.
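Where there are more than two nodes, a quick way to cover every combination is a small shell loop run from each node in turn. The hostnames and ports below are hypothetical; substitute the actual addresses of the other nodes and the ehcache.listener.port / ehcache.object.port values from cluster.properties:

# Run from each node; list the OTHER nodes' hostnames or IPs
for host in JiraNode1.prod JiraNode2.prod; do
  for port in 40001 40011; do   # ehcache.listener.port and ehcache.object.port
    nc -vz -w 1 "$host" "$port"
  done
done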

Also, run the following SQL queries against Jira's database:

SELECT * FROM clusternode;
SELECT * FROM clusternodeheartbeat;
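A rough sketch of what to look for in the results is shown below (column names and formats vary slightly between Jira versions and databases): each node_id should be unique, the ip value recorded for a node should be a hostname or address that the other nodes can actually resolve and reach, and the heartbeat timestamps should be recent for every node that is expected to be active.

SELECT node_id, node_state, ip FROM clusternode;
-- node_id | node_state | ip
-- node1   | ACTIVE     | JiraNode1.prod
-- node2   | ACTIVE     | JiraNode2.prod

SELECT node_id, heartbeat_time FROM clusternodeheartbeat;
-- heartbeat_time is typically stored as an epoch-millisecond value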

Cause

The nodes are unable to communicate with each other because of a network connectivity problem, which is why the telnet test fails.

Solution

  • If any of the telnet tests fail, the Jira administrator should work with the network/server administrators to ensure that the Jira nodes on the separate servers can communicate with each other. Once the underlying network issues are identified and resolved, the Jira nodes may need to be restarted.

  • On each machine, check whether there are entries in the /etc/hosts file for direct resolution of the cluster node hostnames, and determine whether those entries need to be modified or removed.

  • If the telnet test passes, check whether the hostname specified in cluster.properties matches the hostname value in the IP column of the clusternode table (see the illustrative example after this list).

  • Also check whether there is any whitespace around the hostname. If so, removing the whitespace will resolve the issue.

  • Finally, this problem can also occur if the value in /etc/hostname is not unique. Make sure that every node in the cluster has a unique hostname configured in /etc/hostname.
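As an illustration of the hostname check above: the hostname in cluster.properties (typically set via ehcache.listener.hostName) should line up with the value stored for that node in the clusternode table. All values below are hypothetical:

# $JIRA_HOME/cluster.properties on node1 (illustrative values)
jira.node.id = node1
ehcache.listener.hostName = JiraNode1.prod

-- Compare with the value recorded in the database for the same node
SELECT node_id, ip FROM clusternode WHERE node_id = 'node1';
-- node_id | ip
-- node1   | JiraNode1.prod   (must match, with no leading or trailing whitespace)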

Updated on April 8, 2025
