Autoscaler for Runners on Kubernetes

Managing numerous self-hosted runners can be time-consuming and often requires custom tooling. The Autoscaler for Runners on Kubernetes is designed to improve the orchestration and management of runners hosted in Kubernetes.

The Autoscaler is a service deployed into an existing Kubernetes cluster that the customer administers and manages. Once deployed, the autoscaler automatically scales the number of Linux runners depending on the jobs available to be executed.

Getting Started

The scaling policy is configurable via configuration files, which allow the autoscaler to poll the Bitbucket Cloud Runner API and automatically increase or decrease the number of runners available to receive builds.

Create your Base64 password

Choose one of the authentication methods, either an app password or OAuth, and encode the credentials in base64. These values will be used in the installation steps below.

Example of how to encode in base64:

For OAuth:

```shell
echo -n $BITBUCKET_OAUTH_CLIENT_ID | base64
echo -n $BITBUCKET_OAUTH_CLIENT_SECRET | base64
```

For an app password:

```shell
echo -n $BITBUCKET_USERNAME | base64
echo -n $BITBUCKET_APP_PASSWORD | base64
```

Install the Kubernetes resources

  1. Clone the git repository:

```shell
git clone git@bitbucket.org:bitbucketpipelines/runners-autoscaler.git
```

  2. Go to the kustomize folder:

```shell
cd kustomize
```

  3. Configure the files in the values folder.

    1. In the runner_config.yaml file:

      1. Set up the workspace UUID.

      2. If you are using repository runners, set up the repository UUID as well.

      3. Read the comments and review the other parameters.

    2. In the kustomization.yaml file:

      1. If you’re using OAuth:

        1. Uncomment the code under Option 1 in the Secret section and set the value fields of the /data/bitbucketOauthClientId and /data/bitbucketOauthClientSecret paths to the base64-encoded values generated previously.

        2. Uncomment the code under Option 1 in the Deployment section.

      2. If you’re using an app password:

        1. Uncomment the code under Option 2 in the Secret section and set the value fields of the /data/bitbucketUsername and /data/bitbucketAppPassword paths to the base64-encoded values generated previously.

        2. Uncomment the code under Option 2 in the Deployment section.

  4. Verify the generated output:

```shell
kubectl kustomize values
```

  5. Apply it:

```shell
kubectl apply -k values
```

  6. Verify that the runner-controller pod is working:

```shell
kubectl logs -f -l app=runner-controller -n bitbucket-runner-control-plane
```

  7. Verify that runners are being created:

```shell
kubectl logs -l runner_uuid -c runner
```

Your Kubernetes cluster will need to have a horizontal node autoscaler configured. We recommend using a tool optimized for large batch or job-based workloads, such as escalator. Refer to the escalator deployment docs for more details.

If you are using AWS, you'll need to use the aws/escalator-deployment-aws.yaml deployment instead of the one you would normally use.

Configuring Kubernetes nodes

In the job configuration map, you will notice there's a nodeSelector, which means the nodes that the runners will be running on need to have a label that matches it. In AWS EKS, this can be configured via EKS Managed Node Groups.

For example, the job template in the config map contains the label customer=shared, so the Kubernetes node should be labeled to match:

```shell
kubectl label nodes <kubernetes-node> customer=shared
```

Note: This label must also match the one configured in the escalator config map.
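Putting the pieces together, the nodeSelector in the runner job template and the node label must agree. A minimal sketch of the relevant fragment (the field nesting is illustrative; only the customer=shared label comes from the example above):

```yaml
# Fragment of a runner job template (illustrative nesting):
spec:
  template:
    spec:
      nodeSelector:
        customer: shared  # must match the node label and the escalator config map
```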

Configuring Memory & CPU resources (optional)

Inside the config map template, you will notice that the resources tag is defined. The memory and CPU limits can be configured according to your needs, although the default configuration of 2Gi of memory and 1000m CPU should work for most scenarios.

Consider the density of runners on an instance. For example, with an 8 GB instance size, it might not be worth allocating 4Gi per runner: that takes slightly more than half of the allocatable memory and therefore allows only one runner pod per instance. We currently recommend a minimum of 1Gi of memory per runner.
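The resources tag described above might be tuned like this, assuming the standard Kubernetes resource-requirements layout (the exact placement inside the config map template may differ):

```yaml
resources:
  requests:
    memory: "2Gi"   # default; at least 1Gi per runner is recommended
    cpu: "1000m"
  limits:
    memory: "2Gi"
    cpu: "1000m"
```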

Configuring Constants (optional)

Inside the config map template, you will notice that the constants tag is defined. The runner_api_polling_interval defines how often the runner autoscaler controller fetches data from the Bitbucket API. The default value is 600 seconds, but this can be changed according to your needs. We don’t recommend setting this value lower than 120 seconds, because that can lead to requests being throttled.

Another useful parameter is runner_cool_down_period, which determines how long to wait before scaling down a recently created runner. We recommend at least 300 seconds.
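The two constants discussed above might be set as follows. The nesting under a constants key follows the description in this section; check the template comments for the exact layout:

```yaml
constants:
  runner_api_polling_interval: 600  # seconds; keep >= 120 to avoid API throttling
  runner_cool_down_period: 300      # seconds a new runner is protected from scale-down
```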

Using the autoscaler cleaner

The Runner Autoscaler Cleaner is the application configured in the cleaner deployment that automatically cleans up (deletes) unhealthy runners and their linked jobs.

No custom configuration is currently required prior to launching the Autoscaler Cleaner; however, configuration options may be added in the future.

Configuration example and explanation

With the initial setup of the runners autoscaler tool, we have the following infrastructure:

  • 1 k8s node

  • 1 runner online with 1 k8s job

  • escalator pod to create additional k8s nodes if required

escalator configMap:

```yaml
data:
  nodegroups_config.yaml: |
    node_groups:
      - name: "pipes-runner-node"
        ...
        scale_up_threshold_percent: 70
        max_nodes: 5
```

With the escalator configMap provided above, runner jobs will be scheduled onto the k8s node until its CPU utilization exceeds 70%, at which point escalator will provision additional k8s nodes (up to max_nodes).

How does the Autoscaler decide to scale up or down?

The way the autoscaler works is based on the following calculation:

```
runners scale threshold value = BUSY_ONLINE_RUNNERS / ALL_ONLINE_RUNNERS
```

The autoscaler then compares it with scale_up_threshold and scale_down_threshold present in the provided configuration file.

If the runners scale threshold value is greater than scale_up_threshold, most "ONLINE" runners are BUSY (executing a pipelines job), and new runners will be created. If it is less than scale_down_threshold, most "ONLINE" runners are IDLE, and the count of online runners will be decreased toward the configured minimum.

How quickly the count of runners increases or decreases can be tuned via the scale_up_multiplier and scale_down_multiplier values in the configuration file.

Finally, the desired count of runners is calculated by the autoscaler as:

```
desired count of runners = ALL_ONLINE_RUNNERS * scale_up_multiplier    # scale up case
or
desired count of runners = ALL_ONLINE_RUNNERS * scale_down_multiplier  # scale down case
```
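The scale-up/scale-down decision described above can be sketched in Python. The threshold and multiplier defaults, the minimum, and the clamp at the 100-runner API limit are illustrative values for this sketch, not the tool's actual defaults:

```python
import math


def desired_runner_count(busy_online, all_online,
                         scale_up_threshold=0.8,
                         scale_down_threshold=0.2,
                         scale_up_multiplier=1.5,
                         scale_down_multiplier=0.5,
                         min_runners=1, max_runners=100):
    """Sketch of the autoscaler's scale decision (illustrative defaults)."""
    if all_online == 0:
        return min_runners  # nothing online yet: start at the minimum
    threshold = busy_online / all_online
    if threshold > scale_up_threshold:
        # most online runners are BUSY: scale up
        desired = math.ceil(all_online * scale_up_multiplier)
    elif threshold < scale_down_threshold:
        # most online runners are IDLE: scale down
        desired = math.floor(all_online * scale_down_multiplier)
    else:
        desired = all_online  # within the band: leave the count unchanged
    # never exceed the configured maximum or drop below the minimum
    return max(min_runners, min(desired, max_runners))
```

For example, with 9 of 10 online runners busy and a scale_up_threshold of 0.8, the sketch returns ceil(10 * 1.5) = 15.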

Troubleshooting and limitations

  • The Kubernetes autoscaler has only been tested on Linux runners.

  • The autoscaler tool creates and manages the runners' lifecycle to provide the number of runners needed to support the available builds based on load. It will continue up to the maximum value declared in the configuration file or the Bitbucket API limit on runners (currently 100).

  • The maximum value of 100 is under review; our team will continue to work on it to provide a fix in a future release.

Support Model

The Kubernetes autoscaler is a tool that requires customers to utilize their existing competency with Kubernetes; it is an option for advanced users running self-hosted runners. Challenges with networking, custom infrastructure configuration, and other issues are outside the scope of Bitbucket Support. You can view and discuss issues with the community group: Bitbucket Pipelines: Runner Autoscaler for Kubernetes.

As with Bitbucket Pipes integrations, the source code of bitbucketpipelines/runners-autoscaler is public, and the intention is that you can fork and adjust the code to suit your needs as required.

Please note that support for the Runners Autoscaler is outside of our support scope, and we will not be able to provide consulting services related to your custom configuration. If you have a feature request in mind or have identified a bug, you are welcome to share it in the community group, where we will track it.

Additional Help