Jira Data Center node startup is slow, appearing as unresponsive due to index recovery

Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.

Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product; however, they have not been tested. Support for Server* products ended on February 15, 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.

*Except Fisheye and Crucible

Summary

Sometimes a Jira Data Center cluster node can take a significant amount of time to start up.

The purpose of this article is to provide a high-level overview of why this happens, as it is important to be able to distinguish an unresponsive node from a node that is simply taking a while to start up.

Environment

This can happen on any Jira version. While the slowness may look like a software failure, that is not necessarily the case, especially in the following scenarios:

Scenario A: A new node is added to the cluster

Scenario B: A node is started up after a prolonged period of downtime

Scenario C: A full locked re-index has been performed on one of the cluster nodes while some or all of the other cluster nodes were offline

Scenario D: There was an upgrade: the first node was upgraded, a full locked re-index was performed, then another node was upgraded, and that 'next' node appears unresponsive

Scenario E: There was an upgrade, and all the other cluster nodes were then upgraded and started up at once

There could be other scenarios not covered here. In all cases, please follow the Diagnosis section below to verify whether the node is still starting or is indeed unresponsive.

Diagnosis

After starting the node up, it might appear unresponsive and will show on the 'Clustering' page in Jira without Load / Memory / Uptime stats and with a non-ACTIVE node status. For example:

(Image: the Clustering page, where the last two nodes show no Load / Memory / Uptime stats and a non-ACTIVE status)

While the above may indicate a problem with the last two nodes, it could be a false positive. The node's actual state needs to be validated to confirm whether it is truly unresponsive before trying to fix it.

Step 1:

Monitor atlassian-jira.log (using tail -f, or on Windows a text editor capable of following the log in real time; Notepad++, for example, can somewhat help with this). You may find entries like the following:

2022-10-30 10:34:27,297-0400 NodeReindexServiceThread:thread-0 INFO [c.a.j.index.ha.DefaultNodeReindexService] [INDEX-REPLAY] Node re-index service is not running: currentNode.isClustered=true, notRunningCounter=24, paused=true, lastPausedStacktrace=java.lang.Throwable
    at com.atlassian.jira.index.ha.DefaultNodeReindexService.pause(DefaultNodeReindexService.java:218)
    at com.atlassian.jira.cluster.DefaultClusterManager.checkIndex(DefaultClusterManager.java:154)
    at com.atlassian.jira.startup.ClusteringLauncher.start(ClusteringLauncher.java:37)
    at com.atlassian.jira.startup.DefaultJiraLauncher.postDBActivated(DefaultJiraLauncher.java:166)
    at com.atlassian.jira.startup.DefaultJiraLauncher.lambda$postDbLaunch$2(DefaultJiraLauncher.java:144)
    at com.atlassian.jira.config.database.DatabaseConfigurationManagerImpl.doNowOrEnqueue(DatabaseConfigurationManagerImpl.java:307)
    at com.atlassian.jira.config.database.DatabaseConfigurationManagerImpl.doNowOrWhenDatabaseActivated(DatabaseConfigurationManagerImpl.java:202)

There may also be exceptions like this:

2022-10-30 19:30:45,747-0400 localhost-startStop-1 INFO [c.a.j.c.cache.ehcache.ReplicatorFactory] [Cache-replication] Cache replication using default cache replicator: LocalQCacheReplicator
2022-10-30 19:30:46,528-0400 localq-reader-1 INFO [c.a.j.c.distribution.localq.LocalQCacheOpReader] [LOCALQ] [VIA-INVALIDATION] Checked exception: RecoverableFailure occurred when processing:
LocalQCacheOp{cacheName='com.atlassian.jira.project.util.CachingProjectKeyStore.cache', action=REMOVE, key=com.atlassian.cache.impl.ReferenceKey@0, value == null ? true, replicatePutsViaCopy=false, creationTimeInMillis=1667157551614}
from cache replication queue: [queueId=queue_i041027585b8d8472b_2_b859c71de5933aa473f5b97b9042cc8c, queuePath=/var/atlassian/application-data/jira/localq/queue_i041027585b8d8472b_2_b859c71de5933aa473f5b97b9042cc8c], failuresCount: 1. Will retry indefinitely.
com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpSender$RecoverableFailure: java.rmi.ConnectIOException: Exception creating connection to: XX.XX.XX.XX; nested exception is:
java.net.NoRouteToHostException: No route to host (Host unreachable)

There could be other exceptions as well, and while they may seem to indicate a problem, that is not necessarily the case. It is important to review what exactly Jira is doing by capturing a set of CPU and thread dumps as described in Step 2 below.
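If you prefer to watch the log from a terminal, here is a minimal monitoring sketch. The log location assumes the default $JIRA_HOME layout, and the grep pattern simply matches the kinds of entries shown above; adjust both to your environment.

# Sketch only: follow the application log and surface index-recovery related entries.
# The path below is an assumed default; adjust it if your log location differs.
tail -f "$JIRA_HOME/log/atlassian-jira.log" \
  | grep --line-buffered -E 'INDEX-REPLAY|DefaultNodeReindexService|DefaultIndexRecoveryManager|LOCALQ'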

Step 2:

Capture a set of CPU and thread dumps (for example, 6 of each, 10 seconds apart; see the linked article for details) while the node appears to be hanging. Once captured, try to identify long-running and top CPU-consuming threads (feel free to reach out to Atlassian Support at this stage for assistance with this).
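A minimal capture sketch on Linux, assuming the JDK's jstack is on the PATH and that the pgrep pattern and output file names (which are hypothetical) are adjusted to your environment:

# Sketch only: capture 6 rounds of per-thread CPU usage and thread dumps, 10 seconds apart.
# JIRA_PID and the file names are assumptions; adapt them to your setup.
JIRA_PID=$(pgrep -f jira | head -n 1)   # adjust the pattern to match your Jira java process
for i in $(seq 1 6); do
  top -b -H -p "$JIRA_PID" -n 1 > "jira_cpu_usage.$i.txt"   # per-thread CPU snapshot
  jstack -l "$JIRA_PID" > "jira_threads.$i.txt"             # thread dump
  sleep 10
done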

If the node is still starting, threads like the following will be evident across all of the thread dumps taken:

22:11:22 - localhost-startStop-1
State: RUNNABLE
CPU usage: 20.0%
Running for: 3:18.51
Waiting for
  This thread is not waiting for notification on any lock
Locks held
  This thread holds [0x14df612940, 0x100082ce78, 0x100082ce78]
Stack trace
  java.net.SocketInputStream.socketRead0(java.base@11.0.13/Native Method)
  java.net.SocketInputStream.socketRead(java.base@11.0.13/Unknown Source)
  java.net.SocketInputStream.read(java.base@11.0.13/Unknown Source)
  java.net.SocketInputStream.read(java.base@11.0.13/Unknown Source)
  sun.security.ssl.SSLSocketInputRecord.read(java.base@11.0.13/Unknown Source)
  sun.security.ssl.SSLSocketInputRecord.readHeader(java.base@11.0.13/Unknown Source)
  sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(java.base@11.0.13/Unknown Source)
  sun.security.ssl.SSLSocketImpl.readApplicationRecord(java.base@11.0.13/Unknown Source)
  sun.security.ssl.SSLSocketImpl$AppInputStream.read(java.base@11.0.13/Unknown Source)
  org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:161)
  org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:128)
  org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:113)
  org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:73)
  org.postgresql.core.PGStream.receiveChar(PGStream.java:443)
  org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2069)
  org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:322)
  org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:481)
  org.postgresql.jdbc.PgStatement.execute(PgStatement.java:401)
  org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:164)
  org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:114)
  org.apache.commons.dbcp2.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:83)
  org.apache.commons.dbcp2.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:83)
  com.atlassian.jira.ofbiz.sql.PreparedStatementWrapper.executeQuery(PreparedStatementWrapper.java:42)
  com.atlassian.jira.diagnostic.connection.DiagnosticPreparedStatement.lambda$executeQuery$5(DiagnosticPreparedStatement.java:59)
  com.atlassian.jira.diagnostic.connection.DiagnosticPreparedStatement$$Lambda$2896/0x00000017c3e35440.execute(Unknown Source)
  com.atlassian.diagnostics.internal.platform.monitor.db.DefaultDatabaseDiagnosticsCollector.recordExecutionTime(DefaultDatabaseDiagnosticsCollector.java:70)
  com.atlassian.jira.diagnostic.connection.DatabaseDiagnosticsCollectorDelegate.recordExecutionTime(DatabaseDiagnosticsCollectorDelegate.java:55)
  com.atlassian.jira.diagnostic.connection.DiagnosticPreparedStatement.executeQuery(DiagnosticPreparedStatement.java:59)
  org.ofbiz.core.entity.jdbc.SQLProcessor.executeQuery(SQLProcessor.java:527)
  org.ofbiz.core.entity.GenericDAO.createEntityListIterator(GenericDAO.java:881)
  org.ofbiz.core.entity.GenericDAO.selectListIteratorByCondition(GenericDAO.java:861)
  org.ofbiz.core.entity.GenericDAO.selectByAnd(GenericDAO.java:733)
  org.ofbiz.core.entity.GenericHelperDAO.findByAnd(GenericHelperDAO.java:166)
  org.ofbiz.core.entity.GenericDelegator.findByAnd(GenericDelegator.java:913)
  org.ofbiz.core.entity.GenericDelegator.findByAnd(GenericDelegator.java:891)
  org.ofbiz.core.entity.GenericDelegator.findByAnd(GenericDelegator.java:860)
  com.atlassian.jira.ofbiz.DefaultOfBizDelegator.findByAnd(DefaultOfBizDelegator.java:83)
  com.atlassian.jira.ofbiz.WrappingOfBizDelegator.findByAnd(WrappingOfBizDelegator.java:62)
  com.atlassian.jira.association.NodeAssociationStoreImpl.getAssociations(NodeAssociationStoreImpl.java:238)
  com.atlassian.jira.association.NodeAssociationStoreImpl.getAssociationsFromSource(NodeAssociationStoreImpl.java:214)
  com.atlassian.jira.association.NodeAssociationStoreImpl.getSinkIdsFromSource(NodeAssociationStoreImpl.java:178)
  com.atlassian.jira.issue.index.indexers.impl.BaseFieldIndexer.indexDependentEntities(BaseFieldIndexer.java:223)
  com.atlassian.jira.issue.index.indexers.impl.FixForVersionsIndexer.addIndex(FixForVersionsIndexer.java:23)
  com.atlassian.jira.issue.index.indexers.FieldIndexer.addIndex(FieldIndexer.java:114)
  com.atlassian.jira.issue.index.indexers.FieldIndexerWithStats.addIndex(FieldIndexerWithStats.java:58)
  com.atlassian.jira.issue.index.DefaultIssueDocumentFactory$Builder.add(DefaultIssueDocumentFactory.java:323)
  com.atlassian.jira.issue.index.DefaultIssueDocumentFactory$Builder.addAll(DefaultIssueDocumentFactory.java:299)
  com.atlassian.jira.issue.index.DefaultIssueDocumentFactory.getDocument(DefaultIssueDocumentFactory.java:134)
  com.atlassian.jira.issue.index.DefaultIssueDocumentFactory.lambda$createDocuments$1(DefaultIssueDocumentFactory.java:119)
  com.atlassian.jira.issue.index.DefaultIssueDocumentFactory$$Lambda$3215/0x00000017c3fc7c40.accept(Unknown Source)
  java.util.Collections$SingletonList.forEach(java.base@11.0.13/Unknown Source)
  com.atlassian.jira.issue.index.DefaultIssueDocumentFactory.createDocuments(DefaultIssueDocumentFactory.java:118)
  com.atlassian.jira.issue.index.DefaultIssueDocumentFactory.createDocument(DefaultIssueDocumentFactory.java:94)
  com.atlassian.jira.issue.index.DefaultIssueIndexer$DefaultDocumentCreationStrategy.get(DefaultIssueIndexer.java:1310)
  com.atlassian.jira.issue.index.DefaultIssueIndexer$IssuesOperation.createDocument(DefaultIssueIndexer.java:981)
  com.atlassian.jira.issue.index.DefaultIssueIndexer$EntityOperation.perform(DefaultIssueIndexer.java:829)
  com.atlassian.jira.issue.index.DefaultIssueIndexer.lambda$null$4(DefaultIssueIndexer.java:517)
  com.atlassian.jira.issue.index.DefaultIssueIndexer$$Lambda$3084/0x00000017c3f63440.get(Unknown Source)
  com.atlassian.jira.index.SimpleIndexingStrategy.apply(SimpleIndexingStrategy.java:7)
  com.atlassian.jira.index.SimpleIndexingStrategy.apply(SimpleIndexingStrategy.java:5)
  com.atlassian.jira.issue.index.DefaultIssueIndexer.lambda$perform$5(DefaultIssueIndexer.java:515)
  com.atlassian.jira.issue.index.DefaultIssueIndexer$$Lambda$3079/0x00000017c3f64c40.consume(Unknown Source)
  com.atlassian.jira.index.ha.NullAwareIssueIdsIssueIterable.foreach(NullAwareIssueIdsIssueIterable.java:27)
  com.atlassian.jira.issue.index.DefaultIssueIndexer.perform(DefaultIssueIndexer.java:489)
  com.atlassian.jira.issue.index.DefaultIssueIndexer.reindexIssues(DefaultIssueIndexer.java:352)
  com.atlassian.jira.issue.index.DefaultIndexManager.lambda$reIndexIssues$5(DefaultIndexManager.java:661)
  com.atlassian.jira.issue.index.DefaultIndexManager$$Lambda$3078/0x00000017c3f64840.get(Unknown Source)
  com.atlassian.jira.issue.index.DefaultIndexManager.executeWithIndexLock(DefaultIndexManager.java:919)
  com.atlassian.jira.issue.index.DefaultIndexManager.reIndexIssues(DefaultIndexManager.java:661)
  jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.13/Native Method)
  jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.13/Unknown Source)
  jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.13/Unknown Source)
  java.lang.reflect.Method.invoke(java.base@11.0.13/Unknown Source)
  com.atlassian.jira.config.component.SwitchingInvocationHandler.invoke(SwitchingInvocationHandler.java:38)
  com.sun.proxy.$Proxy39.reIndexIssues(Unknown Source)
  com.atlassian.jira.index.ha.DefaultIndexRecoveryManager.reindexIssueByIds(DefaultIndexRecoveryManager.java:345)
  com.atlassian.jira.index.ha.DefaultIndexRecoveryManager.reindexOutdatedEntities(DefaultIndexRecoveryManager.java:326)
  com.atlassian.jira.index.ha.DefaultIndexRecoveryManager.reindexWithVersionCheckEntitiesUpdatedInTheLast(DefaultIndexRecoveryManager.java:266)
  com.atlassian.jira.index.ha.DefaultIndexRecoveryManager.reindexWithVersionCheckEntitiesUpdatedInTheLast(DefaultIndexRecoveryManager.java:225)
  com.atlassian.jira.index.ha.DefaultIndexRecoveryManager.recoverIndexFromBackup(DefaultIndexRecoveryManager.java:185)
  com.atlassian.jira.index.DefaultIndexFetcher.recoverIndexFromMostRecentSnapshot(DefaultIndexFetcher.java:109)

The three key giveaways here are:

  • Thread name: localhost-startStop-1, which means the node is still in the 'starting' state

  • Thread state: RUNNABLE, which means the thread is actively doing work rather than being stuck

  • The stack trace references com.atlassian.jira.index.ha.DefaultIndexRecoveryManager, which means the node is going through the index recovery process (see details below, and the quick check after this list)
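As a quick, rough check, you can search each captured dump for these markers. The jira_threads.*.txt file names follow the hypothetical capture sketch in Step 2; use whatever names your dumps actually have.

# Sketch only: list which thread dump files contain the startup and index-recovery markers above.
grep -l "localhost-startStop-1" jira_threads.*.txt
grep -l "DefaultIndexRecoveryManager" jira_threads.*.txt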

Step 3 (optional, but recommended to be on the safe side):

Navigate to the local index folder on the node, usually $JIRA_HOME/caches/IndexesV1, and pick one of the folders, for example 'issues'. List the contents of the folder with the 'ls -lah' command (or DIR on Windows), redirecting the output to a file such as ls1.txt (for example, ls -lah $JIRA_HOME/caches/IndexesV1/issues > ls1.txt). Wait for some time (a minute or so), then repeat the command, this time redirecting the output to ls2.txt. Once both files have been generated, compare them (diff on Linux, or fc on Windows). A sketch of the whole check is shown below.
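A minimal sketch of the check on Linux, assuming the default local index location under $JIRA_HOME:

# Sketch only: confirm the index files are still being written to.
# The path below is an assumed default; adjust it if your index location differs.
ls -lah "$JIRA_HOME/caches/IndexesV1/issues" > ls1.txt
sleep 60
ls -lah "$JIRA_HOME/caches/IndexesV1/issues" > ls2.txt
diff ls1.txt ls2.txt   # any output means the files changed, i.e. indexing is progressing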

If the contents of the folder have changed, Jira is simply still starting up and no action is required.

Cause

If you are going through one of the scenarios listed under the Environment section and a node takes a while to start up, the slowness is not abnormal; it is expected behaviour, and it is not unusual for a newly added node to take a while to start. Performance is relative: on a given Jira version it could take hours for a cluster node to start up, while another customer on the same version with a smaller data set and a different environment might see 3 minutes. As an example, 1 hour to start a new node in a Jira Data Center cluster with 11 million issues is not unusual. The node needs to retrieve up-to-date indexes from either the $SHARED_HOME folder or directly from a working node in the cluster.

At a high level, a new node first needs to transfer the index snapshot file over the network (if the snapshot exists), then extract it locally, and finally catch up with the rest of the nodes by updating its indexes directly from the database.
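To get a sense of what the node has to work with, you can check whether a recent index snapshot exists in the shared home. The export/indexsnapshots location below is an assumption based on a common default layout; adjust it to your setup.

# Sketch only: list the most recent index snapshots in the shared home (assumed default path).
ls -lah "$SHARED_HOME/export/indexsnapshots" 2>/dev/null | tail -n 5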

There are also edge cases where multiple nodes start up at the same time, which could result in a 'race' condition whereby the last node started requests an index snapshot from a node that has itself just started and still needs to recover its own indexes. This can aggravate the slowness as well.

Solution

If you are going through one of the scenarios listed above, the solution is simply to wait a reasonable amount of time for the node to complete its startup sequence. If the node still doesn't start up, review the articles linked under the Cause section and contact Atlassian Support for assistance.

ℹ️ For upgrades, please also avoid starting all nodes at the same time. Ideally, start one node, wait for it to start and join the cluster, then proceed with the next node until the cluster is fully upgraded.

Further details on potential bugs and documentation that could be contributing to the slowness:

Reference KBs:

Updated on April 8, 2025
