Bamboo Data Center NodeAliveWatchdog shuts down Bamboo during DB scheduled backups
Platform Notice: Data Center Only - This article only applies to Atlassian apps on the Data Center platform.
Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Summary
Bamboo Data Center shuts down with a message in <bamboo-home>/logs/atlassian-bamboo.log stating it could not refresh the state in the DB.
Environment
Bamboo Data Center 8.0 and later.
Diagnosis
The <bamboo-home>/logs/atlassian-bamboo.log file contains a message similar to:
2023-03-23 06:17:46,556 ERROR [scheduler_Worker-6] [NodeAliveWatchdog] Current node failed to refresh its state in DB within last 3 minutes. This node will now go down
Cause
The Bamboo NodeAliveWatchdog monitors whether the database can be read from and written to. If the database is unavailable or read-only for more than 3 minutes, the node shuts down so that the cold standby node, if one is available, can take over.
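To confirm that a shutdown was triggered by the watchdog, the log can be scanned for the message shown under Diagnosis. The following is a minimal sketch, not an Atlassian tool: the regular expression is derived only from the sample line in this article (the wording may differ between Bamboo versions), and the function name and the second sample line are hypothetical.

```python
import re

# Pattern built from the sample log line quoted in this article.
# Assumption: other Bamboo versions may word the message differently.
WATCHDOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ERROR .*"
    r"\[NodeAliveWatchdog\] Current node failed to refresh its state in DB"
)


def find_watchdog_shutdowns(lines):
    """Return the timestamps of lines that match the watchdog shutdown message."""
    hits = []
    for line in lines:
        match = WATCHDOG_PATTERN.match(line)
        if match:
            hits.append(match.group("timestamp"))
    return hits


sample = [
    # Line taken from the Diagnosis section of this article.
    "2023-03-23 06:17:46,556 ERROR [scheduler_Worker-6] [NodeAliveWatchdog] "
    "Current node failed to refresh its state in DB within last 3 minutes. "
    "This node will now go down",
    # Hypothetical unrelated line, included only to show that it is skipped.
    "2023-03-23 06:17:47,001 INFO [main] [ExampleClass] unrelated message",
]

print(find_watchdog_shutdowns(sample))
```

To use it against a real log, pass the open file object for <bamboo-home>/logs/atlassian-bamboo.log instead of the sample list.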
Solution
Prior to Bamboo 9.5
If you expect your database to be unavailable for more than 3 minutes, you can increase or disable the NodeAliveWatchdog timeout by adding a Bamboo system property. For example, the property below sets the timeout to 5 minutes.
-Dbamboo.node.alive.watchdog.timeout=5
A value greater than 0 is the timeout in minutes. A value of 0 disables the check, which stops Bamboo from shutting down while it cannot get database connections. Disabling the check is not recommended, because it only masks a potentially serious underlying issue.
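The way the timeout value is read (0 disables the check, any larger number is a duration in minutes) can be illustrated with a short sketch. This is only a model of the rule stated in this article, not Bamboo's implementation; the function names are hypothetical, and rejecting negative values is an assumption, since the article does not say how they are handled.

```python
PROPERTY_NAME = "bamboo.node.alive.watchdog.timeout"  # property named in this article


def describe_watchdog_timeout(value):
    """Explain what a given property value means, per the rule in this article."""
    minutes = int(value)
    if minutes < 0:
        # Assumption: the article does not describe negative values.
        raise ValueError("timeout must be 0 or a positive number of minutes")
    if minutes == 0:
        return "watchdog check disabled"
    return f"node shuts down after {minutes} minute(s) without refreshing its state in the DB"


def jvm_argument(minutes):
    """Format the value as a JVM system property argument."""
    return f"-D{PROPERTY_NAME}={minutes}"


print(jvm_argument(5))
print(describe_watchdog_timeout(5))
print(describe_watchdog_timeout(0))
```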
Bamboo 9.5 and later
You can disable the health check that causes the instance to shut down, and increase the node lock and cluster heartbeat timeout values, with the properties below:
-Dbamboo.node.alive.watchdog.enabled=false -Dbamboo.primary.node.lock.timeout.seconds=600 -Dbamboo.cluster.heartbeat.alive.timeout.seconds=600
With these values, the nodes and the cluster remain active for up to 10 minutes, after which Bamboo shuts down if the DB is still unavailable. You can set the timeouts to a higher value if you expect the DB to be down for longer. This stops Bamboo from shutting down while it cannot get database connections, but it is not a recommended approach, because it only masks a potentially serious underlying issue.
-Dbamboo.node.alive.watchdog.enabled: when enabled, the watchdog monitors the database for read and write ability, that is, it checks whether the DB is unavailable or read-only.
-Dbamboo.cluster.heartbeat.alive.timeout.seconds: the duration (in seconds) after which a node is considered dead if no heartbeat is received. Default 300 seconds.
-Dbamboo.primary.node.lock.timeout.seconds: how long (in seconds) the secondary nodes wait before they take over the primary role. Default 120 seconds. A high value is not recommended in a warm standby setup, because it delays the takeover by secondary nodes.
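As an illustration of how the three properties fit together, the sketch below assembles the argument string for a chosen timeout. It is a hedged example rather than an official helper: the function name is hypothetical, and using one value for both timeouts simply mirrors the 600-second example in this article.

```python
# Defaults as stated in this article (in seconds).
DEFAULT_PRIMARY_LOCK_TIMEOUT = 120
DEFAULT_HEARTBEAT_TIMEOUT = 300


def build_jvm_args(timeout_seconds, watchdog_enabled=False):
    """Return the three JVM arguments, using one value for both timeouts."""
    return [
        f"-Dbamboo.node.alive.watchdog.enabled={str(watchdog_enabled).lower()}",
        f"-Dbamboo.primary.node.lock.timeout.seconds={timeout_seconds}",
        f"-Dbamboo.cluster.heartbeat.alive.timeout.seconds={timeout_seconds}",
    ]


# Reproduces the 10-minute example from this article.
print(" ".join(build_jvm_args(600)))
```

Keeping both timeouts in a single parameter avoids accidentally setting them to different values.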
Bamboo 10.0 - 10.2.7 (single-node)
Users can encounter a scenario where their single node becomes secondary due to a DB outage or maintenance.
The problem arises from a mismatch between the default values of bamboo.primary.node.lock.timeout.seconds (120 seconds) and bamboo.cluster.heartbeat.alive.timeout.seconds (300 seconds).
In this scenario, the database becomes available again after the primary lock timeout has elapsed, which causes the primary lock scheduler job to stop.
However, if the DB becomes available before the heartbeat alive timeout is reached (but after the primary lock timeout has elapsed), Bamboo does not shut down and, as a result, the primary node is incorrectly set to secondary.
To resolve this, we recommend either of the following:
matching the two timeout values
upgrading to 10.2.8 or later
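The timing window described for Bamboo 10.0 - 10.2.7 can be sketched as a simple classification of the outage duration against the two timeouts. This is a simplified model of the behavior described in this article, not Bamboo's actual logic: the function name is hypothetical, the handling of the exact boundary values is an assumption, and the watchdog itself is left out.

```python
def classify_outage(outage_seconds, primary_lock_timeout=120, heartbeat_timeout=300):
    """Classify a DB outage on a single node, using the article's default timeouts."""
    if outage_seconds < primary_lock_timeout:
        # DB returned before the primary lock timeout elapsed.
        return "node stays primary"
    if outage_seconds < heartbeat_timeout:
        # DB returned after the lock timeout but before the heartbeat timeout.
        return "node incorrectly becomes secondary"
    # Heartbeat timeout reached while the DB was still unavailable.
    return "node shuts down"


# With the mismatched defaults, a 200-second outage falls into the problem window.
print(classify_outage(200))

# With matching timeout values, the problem window is empty.
print(classify_outage(200, primary_lock_timeout=300, heartbeat_timeout=300))
```

In this model, matching the two values removes the range of outage durations that leaves the node running as secondary.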