
buildkitd: Adjust resource requests/limits and scaling behavior

Dduvall requested to merge review/adjust-buildkitd-hpa-and-resources into main

Reduced the memory resource request to roughly one third of the allocatable memory on nodes in the storage-optimized node pool, in the hope of slightly decoupling pod scaling, which is quick, from node pool scaling, which is quite slow. At any given time, the node pool is now more likely to be overprovisioned, leaving room for a degree of responsive HPA scaling.
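As a rough illustration of the sizing, the container's resources might look like the fragment below. The figures are assumptions made up for the example, not values from this change: it supposes a node with about 60Gi of allocatable memory, so a one-third request comes to about 20Gi.

```yaml
# Sketch only: the 60Gi allocatable figure and the resulting 20Gi request
# are hypothetical; substitute the real allocatable memory of the
# storage-optimized nodes.
resources:
  requests:
    memory: 20Gi  # ~1/3 of an assumed 60Gi allocatable, so up to 3 pods fit per node
```

With a request of this size, a new pod can usually be scheduled onto an existing node with spare capacity instead of waiting for the node pool to grow.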

Removed the memory utilization metric from the HPA because buildkitd aggressively uses and holds onto memory. TCP connections alone seem likely to be a better metric for overall saturation. We'll see.
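A minimal sketch of what the remaining metric could look like in an `autoscaling/v2` HPA spec follows. The metric name `buildkitd_tcp_connections` and the target of 4 connections per pod are hypothetical placeholders, and the sketch assumes the connection count is exposed as a per-pod custom metric.

```yaml
# Sketch only: metric name and target value are assumptions, not taken
# from this change. Note there is no memory Resource metric any more.
metrics:
  - type: Pods
    pods:
      metric:
        name: buildkitd_tcp_connections  # hypothetical per-pod custom metric
      target:
        type: AverageValue
        averageValue: "4"  # assumed target of average connections per pod
```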

Added a behavior policy to the HPA to mitigate flapping. Scaling up now uses a very small stabilization window to avoid scaling immediately in response to connection probes.

Scaling down is much more conservative, using a large stabilization window of 10 minutes and a pod reduction policy of 1 per minute to avoid scaling down prematurely. The thought here is that CI workflows come in the shape of large-ish pipelines, and image builds beget image builds, especially when you consider failures and retries.
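The scale-up and scale-down behavior described above can be sketched as the following `behavior` block for an `autoscaling/v2` HPA. The scale-down values follow the description (a 10 minute window and 1 pod per minute); the 30 second scale-up window is an assumed stand-in for "very small", since the exact value is not stated here.

```yaml
# Sketch only: the scaleUp window is an assumed value.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 30   # assumption: small window to ride out connection probes
  scaleDown:
    stabilizationWindowSeconds: 600  # 10 minutes, as described
    policies:
      - type: Pods
        value: 1           # remove at most 1 pod ...
        periodSeconds: 60  # ... per minute
```

During scale-down the HPA acts on the highest replica recommendation seen within the stabilization window, so a brief lull between pipeline stages should not remove pods that a retry or follow-up build would need moments later.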

Bug: T327416
