Troubleshooting disk I/O bottlenecks in Jira Data Center

Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.

Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.

*Except Fisheye and Crucible

Problem

Troubleshooting disk performance bottlenecks can be tricky in large and complex environments. In certain scenarios, the problem is not easily noticeable, and several side effects might appear before the actual root cause can be identified. Some metrics related to cache replication have been available since Jira 7.6, and some indexing-specific metrics (index-writer-stats) were added in Jira 8.10. Starting in Jira 8.12, additional metrics have been added to the logs that help troubleshoot these kinds of problems. These metrics are part of the new Document-based replication in Jira Data Center functionality.

We have observed severe disk I/O problems with customers using SSDs in a RAID-1 array with Dell controllers (in this case, PERC 6/i). Using the usual tools to evaluate disk performance, we didn't identify any problems; in fact, the results were fairly good, with an average throughput of 500 MB/s.

However, nasty problems were observed in Jira, such as:

  • Cluster index replication synchronisation problem

  • A Service Desk-related custom field change being triggered, such as an SLA update, causing the node's index to be locked for several minutes due to the poor throughput available to process the reindex operations

  • During peak hours, the same index-locking situation was observed. When bulk operations happened, such as releases, project imports, or bulk issue creation, the situation usually led the environment to crash, requiring the cluster to be restarted

In the atlassian-jira.log file, errors such as the one below were seen when bulk updates were triggered:

2020-11-11 10:46:44,929-0300 http-nio-8080-exec-177 url:/jira/servicedesk/portal/12/create/22 username:atlassian [c.a.j.issue.index.DefaultIndexManager] Wait attempt timed out - waited 90000 milliseconds
com.atlassian.jira.issue.index.IndexException: Wait attempt timed out - waited 90000 milliseconds

The index across the cluster nodes was locked for ~30 minutes due to the reindex operation requested by bulk updates.

This problem became more noticeable after an upgrade to Jira 8.11 and was aggravated after a further upgrade to Jira 8.12. Even before that, though, searching the index was known to be slow, and cluster index synchronization was constantly falling behind by several minutes.

The background for this is that in Jira 8.10 and below (down to Jira 8.0), adding or updating the index was an in-memory operation: documents were added to the Lucene buffer, which was flushed asynchronously. As of Jira 8.11, however, Jira performs a conditional update before every write, which consists of a search in the index to check whether it already contains a document with a higher version; performing such searches may trigger an index flush, which is a disk operation. This guarantees index consistency when there are concurrent index updates, and although it might slightly decrease index performance, it resolved many indexing inconsistency bugs that were constantly seen in large Jira Data Center environments.

DBR, introduced in Jira 8.12, added more indexing events (documents being added to the index) across the cluster. This aggravated any pre-existing performance problems, as the disks were not able to handle the additional load.

Troubleshooting

For all metrics below, Jira records snapshot and total statistics. The snapshot stats cover the last ~5 minutes and the total stats cover the time since the node was started. Check the last entry for each metric, because new metrics are logged every 5 minutes.

If you're troubleshooting an ongoing Jira performance problem, it's best to look at the total stats, as the snapshot might be skewed by a heavy operation that was being processed in Jira at that time.
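As an illustration, here is a minimal shell sketch to pull only the most recent entry for a given metric. It assumes atlassian-jira.log is in the current directory; adjust the path to your Jira log directory:

# Print the most recent LOCALQ "total stats" entry; swap the first grep pattern
# (e.g. for "index-writer-stats" or "DBR") to inspect the other metrics described below
grep "LOCALQ" atlassian-jira.log | grep "total stats" | tail -1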

Cache Replication via Invalidation Stats Logs (available since Jira 7.6)

These metrics relate to cache replication via invalidation, but they can also be read as real disk performance statistics.

We can look at the timeToAddMillis metric, which is the time to store a cache replication message in a local store before sending it to the other node. In this case, it really is disk performance, as there is no OS buffer involved. In healthy clusters, we expect the average to be less than 1 millisecond.

See an example of a cache replication log

["LOCALQ"]["VIA-INVALIDATION"] "Cache replication queue stats per node":atl-node1 total stats:{ "timestampMillis":1605104041350, "nodeId":"atl-node2", "queueSize":0, "startQueueSize":0, "startTimestampMillis":1604096513275, "startMillisAgo":1007528075, "closeCounter":0, "addCounter":803366, "droppedOnAddCounter":0, "criticalAddCounter":0, "criticalPeekCounter":0, "criticalRemoveCounter":0, "peekCounter":0, "peekOrBlockCounter":803369, "removeCounter":803366, "backupQueueCounter":0, "closeErrorsCounter":0, "addErrorsCounter":0, "peekErrorsCounter":0, "peekOrBlockErrorsCounter":0, "removeErrorsCounter":0, "backupQueueErrorsCounter":0, "lastAddTimestampMillis":1605103205923, "lastAddMillisAgo":835427, "lastPeekTimestampMillis":0, "lastPeekMillisAgo":0, "lastPeekOrBlockTimestampMillis":1605103205924, "lastPeekOrBlockMillisAgo":835426, "lastRemoveTimestampMillis":1605103205937, "lastRemoveMillisAgo":835413, "lastBackupQueueTimestampMillis":0, "lastBackupQueueMillisAgo":0, "timeToAddMillis":{ "count":803366, "min":2, "max":783, "sum":11871332, "avg":14, "distributionCounter":{ "10":8996, "20":770471, "50":23344, "100":279, "9223372036854775807":276 } }, "timeToPeekMillis":{ "count":0, "min":0, "max":0, "sum":0, "avg":0, "distributionCounter":{ } }, "timeToPeekOrBlockMillis":{ "count":803369, "min":0, "max":38161261, "sum":10047103930, "avg":12506, "distributionCounter":{ } }, "timeToRemoveMillis":{ "count":803366, "min":0, "max":518, "sum":10561931, "avg":13, "distributionCounter":{ "10":21124, "20":771005, "50":10772, "100":251, "9223372036854775807":214 } }, "timeToBackupQueueMillis":{ "count":0, "min":0, "max":0, "sum":0, "avg":0, "distributionCounter":{ } }, "staleCounter":0, "sendCounter":803364, "droppedOnSendCounter":0, "timeToSendMillis":{ "count":803364, "min":0, "max":1345, "sum":355865, "avg":0, "distributionCounter":{ "10":802038, "20":251, "50":363, "100":447, "200":207, "500":51, "1000":6, "5000":1 } }, "sendRuntimeExceptionCounter":1, "sendCheckedExceptionCounter":0, "sendNotBoundExceptionCounter":4 }

Use this grep command to get the averages from the log

grep -o -E 'VIA-INVALIDATION.*snapshot stats.*timeToAddMillis":{"count":[0-9]+,"min":[0-9]+,"max":[0-9]+,"sum":[0-9]+,"avg":[0-9]+' atlassian-jira.log | grep -oE "[0-9]+$"
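To quickly compare the extracted averages against the expected sub-millisecond threshold, the same pipeline can be extended with a small awk summary (a sketch, assuming a standard awk is available on the node):

grep -o -E 'VIA-INVALIDATION.*snapshot stats.*timeToAddMillis":{"count":[0-9]+,"min":[0-9]+,"max":[0-9]+,"sum":[0-9]+,"avg":[0-9]+' atlassian-jira.log | grep -oE "[0-9]+$" | awk '{ sum += $1; if ($1 > max) max = $1 } END { if (NR) printf "samples=%d mean_of_avgs=%.1f max_avg=%d (healthy average: below 1 ms)\n", NR, sum / NR, max }'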

Index Writer Stats Logs (issue stats) (available since Jira 8.10)

These metrics are related to indexer performance. Each index is updated by a single thread, and these metrics describe the statistics of all direct index write operations for the given index.

For the ISSUE index, we can look at the updateDocumentsWithVersionMillis metric, which measures the time to update the Lucene index with a single issue (a single document). We expect the average to be less than 10 milliseconds, and 99% of updates to take less than 100 milliseconds. In a nutshell, this number shows how many index updates the cluster can sustain. For example, if the average is 10 milliseconds, Jira can update the issue index at most 100 times per second. If it is 110 milliseconds, however, we can have only about 9 issue reindex operations per second in the whole cluster. Considering that a single issue creation can reindex the issue 3 to 6 times (the latter for JSD/JSM issues), Jira would only be able to create about 1 issue per second without affecting the index queue (i.e. delaying other index updates arriving from user requests). This also impacts other issue update operations, such as bulk updates, which can trigger far more than 1 issue index update per second.

Suppose an update that affects 2000 issues is requested and is replicated across the nodes. If our time to write to the index is 110 milliseconds, we would be able to update the index for only about 9 issues per second, and it would take close to 4 minutes to process the queue for that replication operation alone (see the back-of-the-envelope calculation after this list). Meanwhile, users would still be triggering operations in the UI, which usually led to two scenarios:

  1. Duplicate issues being created, as users hit the create button multiple times while Jira was hanging when trying to update the index

  2. Slowness being reported by users, as Jira was unable to acquire the read lock for the index, with several errors in the logs about failing to acquire the read lock for more than 30 seconds
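The arithmetic in this example can be reproduced with a few lines of shell. This is a back-of-the-envelope sketch; the 110-millisecond average and the 2000-issue bulk update are the assumed values from the scenario above:

avg_update_millis=110                                                  # average updateDocumentsWithVersionMillis
issues_in_bulk=2000                                                    # size of the bulk operation
updates_per_second=$(( 1000 / avg_update_millis ))                     # ~9 index updates per second
queue_drain_seconds=$(( issues_in_bulk * avg_update_millis / 1000 ))   # ~220 seconds, close to 4 minutes
echo "index can absorb ~${updates_per_second} updates/s; a ${issues_in_bulk}-issue bulk needs ~${queue_drain_seconds}s"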

See an example of an index writer stats log

INFO["c.a.jira.index.WriterWithStats"]["index-writer-stats"] "ISSUE":"total stats":{ "addDocumentsMillis":{ "count":0, "min":0, "max":0, "sum":0, "avg":0, "distributionCounter":{ } }, "deleteDocumentsMillis":{ "count":502, "min":0, "max":5, "sum":62, "avg":0, "distributionCounter":{ } }, "updateDocumentsMillis":{ "count":0, "min":0, "max":0, "sum":0, "avg":0, "distributionCounter":{ } }, "updateDocumentConditionallyMillis":{ "count":0, "min":0, "max":0, "sum":0, "avg":0, "distributionCounter":{ } }, "updateDocumentsWithVersionMillis":{ "count":2125, "min":0, "max":923, "sum":238331, "avg":112, "distributionCounter":{ "10":85268, "100":132996, "500":29132, "1000":4652, "5000":10, "10000":0, "30000":0, "60000":0 } }, "updateDocumentsWithVersionSize":{ "count":252058, "min":1, "max":1, "sum":252058, "avg":1, "distributionCounter":{ "1":252058, "10":0, "100":0, "1000":0, "10000":0 } }, "replaceDocumentsWithVersionMillis":{ "count":0, "min":0, "max":0, "sum":0, "avg":0, "distributionCounter":{ "10":0, "100":0, "500":0, "1000":0, "5000":0, "10000":0, "30000":0, "60000":0 } }, "replaceDocumentsWithVersionSize":{ "count":0, "min":0, "max":0, "sum":0, "avg":0, "distributionCounter":{ "1":0, "10":0, "100":0, "1000":0, "10000":0 } }, "optimizeMillis":{ "count":0, "min":0, "max":0, "sum":0, "avg":0, "distributionCounter":{ } }, "closeMillis":{ "count":0, "min":0, "max":0, "sum":0, "avg":0, "distributionCounter":{ } }, "commitMillis":{ "count":15110, "min":0, "max":1227, "sum":1727819, "avg":114, "distributionCounter":{ } } }, "index writer version cache stats":{ "put":251520, "get":252058, "getFound":191259, "clear":502 }

Use this grep command to get the averages from the log

grep -o -E 'updateDocumentsWithVersionMillis":{"count":[0-9]+,"min":[0-9]+,"max":[0-9]+,"sum":[0-9]+,"avg":[0-9]+' atlassian-jira.log | grep -oE "[0-9]+$"

DBR Receiver Stats Logs (available since Jira 8.12)

Starting with Jira 8.12, there are new metrics related to the DBR functionality that can help with this troubleshooting. The DBR receiver logs describe the part responsible for receiving DBR messages containing the Lucene document(s) from other nodes. For more information about how DBR works, see Document-based replication in Jira Data Center.

The snapshot stats cover the last ~5 minutes and the total stats cover the time since the Jira application was started on that node.

For these stats, we can look at the processDBRMessageUpdateWithRelatedIndexInMillis metric, as it measures the partial time to process the DBR message locally - the time to perform a conditional update of the local index with the documents from the DBR message. In healthy clusters, we expect the average to be less than 10 milliseconds.

See an example of a DBR receiver log

["DBR"]["RECEIVER"] "total stats period":PT282H15M27.996S, "data":{ "receiveDBRMessage":195939, "receiveDBRMessageUpdate":18937, "receiveDBRMessageUpdateWithRelated":177002, "receiveDBRMessageDelayedInMillis":{ "count":195939, "min":2, "max":10225, "sum":38425076, "avg":196, "distributionCounter":{ "500":189797, "1000":824, "1500":360, "2000":261, "3000":381, "4000":347, "5000":361, "9223372036854775807":3608 } }, "skipDBRMessageWhenIndexNotAvailable":0, "skipDBRMessageWhenIndexReplicationPaused":0, "processDBRMessageUpdateIssueIndex":0, "processDBRMessageUpdateCommentIndex":18260, "processDBRMessageUpdateWorklogIndex":677, "processDBRMessageUpdateBytes":{ "count":18937, "min":809, "max":52332, "sum":28856470, "avg":1523, "distributionCounter":{ "0":0, "1000":598, "50000":18336, "100000":3, "500000":0, "1000000":0, "2000000":0, "5000000":0, "10000000":0 } }, "processDBRMessageUpdateSerializeInMillis":{ "count":18937, "min":0, "max":127, "sum":163, "avg":0, "distributionCounter":{ "0":18923, "1":7, "5":5, "10":0, "50":1, "100":0, "500":1, "1000":0 } }, "processDBRMessageUpdateIndexInMillis":{ "count":18937, "min":0, "max":1101, "sum":136421, "avg":7, "distributionCounter":{ "0":1516, "1":4250, "5":2789, "10":8290, "50":1996, "100":25, "500":48, "1000":21, "9223372036854775807":2 } }, "processDBRMessageUpdateErrors":0, "processDBRMessageUpdateWithRelatedIssueIndex":{ "count":177001, "min":1, "max":1, "sum":177001, "avg":1, "distributionCounter":{ "0":0, "1":177001 } }, "processDBRMessageUpdateWithRelatedCommentIndex":{ "count":177001, "min":0, "max":1090, "sum":1127869, "avg":6, "distributionCounter":{ "0":99000, "1":30338, "10":43573, "100":2883, "9223372036854775807":1207 } }, "processDBRMessageUpdateWithRelatedWorklogIndex":{ "count":177001, "min":0, "max":18, "sum":2798, "avg":0, "distributionCounter":{ "0":175336, "1":1105, "10":554, "100":6 } }, "processDBRMessageUpdateWithRelatedChangesIndex":{ "count":177001, "min":1, "max":707, "sum":802138, "avg":4, "distributionCounter":{ "0":0, "1":62899, "10":100910, "100":13059, "9223372036854775807":133 } }, "processDBRMessageUpdateWithRelatedBytes":{ "count":177001, "min":148202, "max":1476974, "sum":38540709056, "avg":217742, "distributionCounter":{ "0":0, "1000":0, "50000":0, "100000":0, "500000":175898, "1000000":425, "2000000":678, "5000000":0, "10000000":0 } }, "processDBRMessageUpdateWithRelatedSerializeInMillis":{ "count":177001, "min":2, "max":367, "sum":1117486, "avg":6, "distributionCounter":{ "0":0, "1":0, "5":44638, "10":127280, "50":4862, "100":150, "500":71, "1000":0 } }, "processDBRMessageUpdateWithRelatedIndexInMillis":{ "count":177001, "min":0, "max":94874, "sum":344333681, "avg":1945, "distributionCounter":{ "0":158, "1":9, "5":41213, "10":6664, "50":63143, "100":6424, "500":38848, "1000":11506, "9223372036854775807":9036 } }, "processDBRMessageUpdateWithRelatedErrors":0, "maxErrorsSample":10, "processDBRMessageUpdateErrorsSample":{ }, "processDBRMessageUpdateWithRelatedErrorsSample":{ } }

Use this grep command to get the averages from the log

grep -o -E 'processDBRMessageUpdateWithRelatedIndexInMillis":{"count":[0-9]+,"min":[0-9]+,"max":[0-9]+,"sum":[0-9]+,"avg":[0-9]+' atlassian-jira.log | grep -oE "[0-9]+$"
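To see how many of the logged averages breach the 10-millisecond expectation, the same extraction can be fed into awk (a sketch under the same assumptions as the grep above):

grep -o -E 'processDBRMessageUpdateWithRelatedIndexInMillis":{"count":[0-9]+,"min":[0-9]+,"max":[0-9]+,"sum":[0-9]+,"avg":[0-9]+' atlassian-jira.log | grep -oE "[0-9]+$" | awk '$1 > 10 { over++ } END { printf "%d of %d samples above the 10 ms healthy threshold\n", over + 0, NR }'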

For baselines: in one customer environment experiencing the problem detailed here, we were seeing the following statistics:

  • timeToAddMillis - 15 milliseconds average

  • updateDocumentsWithVersionMillis - 165 milliseconds average

After fixing the I/O problems, the above statistics dropped to:

  • timeToAddMillis - 0 millisecond average

  • updateDocumentsWithVersionMillis - 5 milliseconds average

You can use a JSON formatter to get a more human-readable output. If you have Homebrew installed, you can install jsonpp and have the JSON formatted directly in your terminal, as in the example below:

cat atlassian-jira.log | grep "LOCALQ" | grep INVALIDATION | grep total | tail -1 | cut -d' ' -f 22 | jsonpp
{
    "timestampMillis": 1605104041350,
    "nodeId": "atl-node2",
    "queueSize": 0,
    "startQueueSize": 0,
    "startTimestampMillis": 1604096513275,
    "startMillisAgo": 1007528075,
    "closeCounter": 0,
    .....
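If Homebrew and jsonpp aren't available on the node, the same pipeline can use Python's built-in formatter instead. This is a sketch assuming python3 is installed and that the JSON payload is the 22nd space-separated field, as in the example above:

grep "LOCALQ" atlassian-jira.log | grep INVALIDATION | grep total | tail -1 | cut -d' ' -f 22 | python3 -m json.tool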

Resolution

The problem was resolved after improving the overall disk performance of the cluster. For Jira Data Center nodes, disk performance is crucial for the overall health of the cluster. The local disk used by the nodes holds data that can be rebuilt in case of failure; High Availability is guaranteed by the cluster, and critical data is stored in the database or the shared home.

Therefore, you're not required to use RAID arrays, which might impact the node's performance. Also, if old RAID controllers are used, you might be impacted by disk performance problems even with very fast SSDs, depending on the size and number of files being written during high-load hours: files can be buffered in the controller's memory, but when the controller flushes the data to disk, the operation takes time.

We also recommend reviewing the logs above when assessing cluster replication problems, because, as seen here, the Disk Performance test doesn't always reveal the bottleneck in complex environments.
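When validating disk performance, it also helps to measure small synchronous writes rather than only sequential throughput, since small fsynced writes are closer to what the index and the LocalQ store generate. Below is a hedged example using fio; the path /var/atlassian/application-data/jira is an assumption for the local home directory and should be adjusted for your environment:

# 4k random writes with an fsync after each write, run for 60 seconds against the Jira local home
fio --name=jira-small-writes --directory=/var/atlassian/application-data/jira --rw=randwrite --bs=4k --size=256m --numjobs=1 --iodepth=1 --fsync=1 --runtime=60 --time_based --group_reporting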

This problem is also mentioned in Background reindex is slow after upgrading to Jira 8.10 and later.

Updated on April 17, 2025
