Confluence Unresponsive due to stuck Workbox Notifications threads
Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.
Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Summary
Confluence becomes unresponsive and returns HTTP 503 errors when accessed through the load balancer. The issue persists even after Confluence is restarted.
In addition, stuck thread detection messages like the following are logged in catalina.out:
WARNING [ContainerBackgroundProcessor[StandardEngine[Standalone]]] org.apache.catalina.valves.StuckThreadDetectionValve.notifyStuckThreadDetected Thread "http-nio-8090-exec-147" (id=5169) has been active for 63,774 milliseconds (since 11/6/18 7:42 PM) to serve the same request for https://example.atlassian.com/rest/mywork/latest/status/notification/count?_=1541562126288 and may be stuck (configured threshold for this StuckThreadDetectionValve is 60 seconds). There is/are 185 thread(s) in total that are monitored by this Valve and may be stuck.
java.lang.Throwable
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1385)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1413)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1397)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:394)
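To gauge how many of these warnings point at the Workbox endpoint, the warnings can be tallied per request path. The following is a minimal sketch in Python, assuming the log lines follow the format of the sample message above. The function name `summarize_stuck_threads` and the regular expression are illustrative and are not part of any Atlassian tooling.

```python
import re
from collections import Counter

# Pattern derived from the sample StuckThreadDetectionValve warning above.
# Assumption: other warnings in the log use the same wording.
STUCK_RE = re.compile(
    r'Thread "(?P<thread>[^"]+)" \(id=(?P<id>\d+)\) has been active for '
    r'(?P<ms>[\d,]+) milliseconds .*? to serve the same request for (?P<url>\S+)'
)


def summarize_stuck_threads(log_lines):
    """Count stuck-thread warnings per request URL, ignoring the query string."""
    counts = Counter()
    for line in log_lines:
        match = STUCK_RE.search(line)
        if match:
            # Drop the cache-busting query string (e.g. "?_=1541562126288").
            path = match.group("url").split("?", 1)[0]
            counts[path] += 1
    return counts


# Example usage (the path to catalina.out depends on your installation):
# with open("catalina.out", encoding="utf-8", errors="replace") as log:
#     for path, count in summarize_stuck_threads(log).most_common(5):
#         print(count, path)
```

If most warnings resolve to /rest/mywork/latest/status/notification/count, that is consistent with the Workbox Notifications pattern described in this article.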
Diagnosis
Confluence is configured with a maximum of 200 HTTP threads:
<Confluence-Installation>/conf/server.xml
<Connector port="8090"
...
maxThreads="200"
...
/>
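To confirm the configured thread limit without reading the file by hand, the connector settings can be extracted programmatically. This is a small sketch, assuming server.xml is well-formed XML; the function name `connector_max_threads` is hypothetical. When the attribute is absent, the sketch falls back to 200, which is Tomcat's documented default for maxThreads.

```python
import xml.etree.ElementTree as ET


def connector_max_threads(server_xml_text):
    """Return {port: maxThreads} for every <Connector> element in server.xml."""
    root = ET.fromstring(server_xml_text)
    limits = {}
    for connector in root.iter("Connector"):
        port = connector.get("port", "unknown")
        # Tomcat uses 200 when maxThreads is not set on the connector.
        limits[port] = int(connector.get("maxThreads", "200"))
    return limits


# Example usage (replace the path with your own installation directory):
# with open("conf/server.xml", encoding="utf-8") as f:
#     print(connector_max_threads(f.read()))
```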
Thread dumps taken during the incident window show a large number of long-running HTTP threads, most of them stuck waiting on an SSL connection. Specifically, these RUNNABLE threads are serving Workbox Notifications requests:
Thread counts per status
$ grep -A1 -e '-exec-' conf_threads.* | grep State | sort | uniq -c
200 conf_threads.1541570464.txt- java.lang.Thread.State: RUNNABLE
10 conf_threads.1541570464.txt- java.lang.Thread.State: WAITING (parking)
200 conf_threads.1541570479.txt- java.lang.Thread.State: RUNNABLE
10 conf_threads.1541570479.txt- java.lang.Thread.State: WAITING (parking)
200 conf_threads.1541570490.txt- java.lang.Thread.State: RUNNABLE
10 conf_threads.1541570490.txt- java.lang.Thread.State: WAITING (parking)
200 conf_threads.1541570501.txt- java.lang.Thread.State: RUNNABLE
10 conf_threads.1541570501.txt- java.lang.Thread.State: WAITING (parking)
200 conf_threads.1541570512.txt- java.lang.Thread.State: RUNNABLE
10 conf_threads.1541570512.txt- java.lang.Thread.State: WAITING (parking)
200 conf_threads.1541570523.txt- java.lang.Thread.State: RUNNABLE
10 conf_threads.1541570523.txt- java.lang.Thread.State: WAITING (parking)
Example RUNNABLE thread
http-nio-8090-exec-200 - priority:5 - threadId:0x00007f6f9e2e6800 - nativeId:0x355e - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
- locked <0x00000004f564e008> (a java.lang.Object)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1385)
- locked <0x00000004f564e018> (a java.lang.Object)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1413)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1397)
...
at com.atlassian.sal.core.net.HttpClientRequest.executeAndReturn(HttpClientRequest.java:103)
at com.atlassian.plugins.rest.module.jersey.JerseyRequest.executeAndReturn(JerseyRequest.java:131)
at com.atlassian.applinks.core.auth.ApplicationLinkRequestAdaptor.execute(ApplicationLinkRequestAdaptor.java:58)
at com.atlassian.applinks.oauth.auth.OAuthRequest.execute(OAuthRequest.java:58)
at com.atlassian.mywork.host.service.AppLinkHelper.execute(AppLinkHelper.java:64)
at com.atlassian.mywork.host.service.AppLinkHelper.execute(AppLinkHelper.java:42)
at com.atlassian.mywork.host.service.ClientServiceImpl.verifyAuth(ClientServiceImpl.java:171)
at com.atlassian.mywork.host.service.ClientServiceImpl.verifyAuth(ClientServiceImpl.java:136)
...
at com.atlassian.mywork.host.service.LocalNotificationServiceImpl.loadCount(LocalNotificationServiceImpl.java:385)
at com.atlassian.mywork.host.service.LocalNotificationServiceImpl.lambda$_getCount$2(LocalNotificationServiceImpl.java:375)
at com.atlassian.mywork.host.service.LocalNotificationServiceImpl$$Lambda$1096/27140524.get(Unknown Source)
...
- locked <0x0000000438ab9e90> (a org.apache.tomcat.util.net.NioChannel)
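To check how many HTTP threads in a dump are inside Workbox code, the dump can be scanned for threads whose stack contains a `com.atlassian.mywork.` frame, the package seen in the trace above. The sketch below assumes a jstack-style text dump in which each thread entry starts with a line containing the thread name and entries are separated by blank lines. The function name and its parameters are illustrative.

```python
import re


def count_threads_with_frame(dump_text, name_marker="-exec-",
                             frame_prefix="com.atlassian.mywork."):
    """Return (http_threads, http_threads_in_frame_prefix) for one thread dump."""
    total = 0
    matching = 0
    # Assumption: thread entries are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", dump_text):
        lines = block.strip().splitlines()
        if not lines or name_marker not in lines[0]:
            continue  # not an HTTP connector thread
        total += 1
        if any(line.strip().startswith("at " + frame_prefix) for line in lines[1:]):
            matching += 1
    return total, matching


# Example usage with one of the dump files named in this article:
# with open("conf_threads.1541570464.txt", encoding="utf-8") as f:
#     print(count_threads_with_frame(f.read()))
```

If the second number is close to the first, nearly all HTTP threads are occupied by Workbox Notifications requests.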
Cause
This issue can occur when an application linked via Application Link (for example, Jira) is down. Each request from Confluence's Workbox Notifications waits to connect to the linked application. As these waiting requests accumulate, they occupy the HTTP threads, leaving Confluence without enough free threads to serve users and leading to performance degradation.
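The mechanism can be illustrated with a simplified model. The sketch below is not Tomcat's connector implementation; it uses a small Python thread pool to stand in for the HTTP thread pool, and an event that is never set while the "linked application" is down. All names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def demo_pool_exhaustion(max_threads=4, wait=0.2):
    """Show that blocked requests starve an ordinary request of worker threads."""
    linked_app_responds = threading.Event()  # stays unset while the linked app is down

    def notification_count_request():
        linked_app_responds.wait()  # blocks like the stuck SSL handshake
        return "count"

    def ordinary_page_request():
        return "page"

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Fill every worker with a request that waits on the linked application.
        for _ in range(max_threads):
            pool.submit(notification_count_request)
        page = pool.submit(ordinary_page_request)
        try:
            page.result(timeout=wait)
            served_while_stuck = True
        except FutureTimeout:
            served_while_stuck = False
        # Simulate the linked application coming back (or the link being removed).
        linked_app_responds.set()
        served_after_release = page.result(timeout=5) == "page"
    return served_while_stuck, served_after_release
```

With the defaults, the call returns (False, True): the ordinary request cannot be served while all workers are blocked, and it completes once they are released.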
Solution
Make sure that the linked application is configured correctly on the Linking to Another Application page.
Ensure that Confluence can reach all linked instances (via Application Link) over HTTPS. You can check this by using httpclienttest.
If the other end (e.g. Jira) is down or no longer in service, remove the application link. If you are unable to remove the application link from the UI due to performance issues, follow Alternatives method of deleting application links under Confluence to remove it.
After resolving the issue above, restart Confluence so that the stuck threads are released and become available to users again.
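As a quick first look before running httpclienttest, the TLS handshake to a linked application can be probed with a timeout from the Confluence server. This is a rough sketch and not a replacement for httpclienttest, since it uses Python's trust store rather than the Java truststore that Confluence uses. The function name and the host `jira.example.com` are placeholders.

```python
import socket
import ssl


def check_tls_handshake(host, port=443, timeout=5.0):
    """Return 'ok', 'timeout', or 'error: <reason>' for a TLS handshake to host:port."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # The handshake happens inside wrap_socket and honours the socket timeout.
            with context.wrap_socket(raw, server_hostname=host):
                return "ok"
    except socket.timeout:
        # The peer accepted the connection but never answered, or the connection
        # attempt itself timed out, which matches the stuck threads in this article.
        return "timeout"
    except OSError as exc:
        # Covers refused connections, DNS failures, and certificate errors.
        return f"error: {exc}"


# Example usage (placeholder host name):
# print(check_tls_handshake("jira.example.com", 443, timeout=5.0))
```

A "timeout" result suggests the same condition the stuck threads are hitting: the connection is accepted or routed, but the handshake never completes.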