
buildkitd: Scale based on active incoming TCP connections

Dduvall requested to merge review/buildkitd-connection-saturation-metric into main

Propagate the standard app.kubernetes.io/instance label, which Helm charts set to the release name, to Envoy metrics.
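As a rough illustration of how the release label ends up on Envoy series (the actual scrape configuration lives outside this MR, and the job name and port are assumptions), a Prometheus pod-scrape job can copy the pod label onto every scraped metric via a relabel rule:

```yaml
# Hypothetical Prometheus scrape job for Istio sidecar (Envoy) stats.
# The relabel rules copy the pod's app.kubernetes.io/instance label onto each
# scraped series so prometheus-adapter can match on it later.
- job_name: envoy-sidecar-stats
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Only scrape pods that carry the Helm release label.
    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
      regex: (.+)
      action: keep
    # Attach the release name to the scraped metrics.
    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
      target_label: app_kubernetes_io_instance
    - source_labels: [__meta_kubernetes_pod_name]
      target_label: pod
    - source_labels: [__meta_kubernetes_namespace]
      target_label: namespace
```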

Configure prometheus-adapter to discover any envoy_cluster_upstream_cx_active metrics for Helm-deployed services and make them available to autoscalers.
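A minimal prometheus-adapter discovery rule along these lines could look like the following; the exact label names depend on how the scrape config exposes the release label, so treat this as a sketch rather than the deployed configuration:

```yaml
# prometheus-adapter custom metrics rule (sketch): discover the Envoy
# active-connection gauge for any pod carrying the Helm release label and
# expose it through the custom metrics API for HPAs to consume.
rules:
  - seriesQuery: 'envoy_cluster_upstream_cx_active{app_kubernetes_io_instance!="",namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)$"
      as: "${1}"
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```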

Configure the DestinationRule for buildkitd to cap active TCP connections at 10 per pod, and configure the HPA to scale at a much lower average of 4. Note that this should mitigate TCP connection failures but won't prevent them: if an influx of connections occurs and overflows the limit, client connections will be closed prematurely. We should investigate whether there is a reliable way to have Istio/Envoy reroute these connections, but there doesn't seem to be one for opaque TCP connections. If there isn't, we can look into implementing preflight checks on the client side for available connections with some reasonable backoff.
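For reference, the two pieces described above look roughly like this; the host, workload kind, and replica bounds are illustrative assumptions rather than the exact values in the chart:

```yaml
# Istio DestinationRule (sketch): Envoy allows at most 10 active TCP
# connections per buildkitd backend; connections beyond that are refused.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: buildkitd
spec:
  host: buildkitd  # assumed service name
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 10
---
# HPA (sketch): scale buildkitd so the per-pod average of active upstream
# connections stays around 4, well below the 10-connection hard cap.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: buildkitd
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment  # assumed workload kind
    name: buildkitd
  minReplicas: 1      # assumed bounds
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: envoy_cluster_upstream_cx_active
        target:
          type: AverageValue
          averageValue: "4"
```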

Note that the Helm chart now specifies appProtocol: tcp on the service port to force Istio's protocol selection to opaque TCP instead of gRPC or HTTP/2. This is because buildkit is highly session-based, and routing subsequent requests of the same session to different backends breaks builds. Connections must only be limited at the TCP layer.
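The relevant part of the Service is just the port declaration; the port number below is an assumption (buildkitd is commonly run listening on tcp://0.0.0.0:1234):

```yaml
# Kubernetes Service port (sketch): appProtocol: tcp makes Istio treat the
# traffic as opaque TCP rather than selecting gRPC/HTTP2, so an entire buildkit
# session stays pinned to a single backend connection.
apiVersion: v1
kind: Service
metadata:
  name: buildkitd
spec:
  selector:
    app.kubernetes.io/name: buildkitd
  ports:
    - name: buildkitd
      port: 1234        # assumed buildkitd listen port
      targetPort: 1234
      protocol: TCP
      appProtocol: tcp
```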

Bug: T327416

