1. 27 Jan, 2023 1 commit
    • buildkitd: Adjust resource requests/limits and scaling behavior · 37c289c6
      Dduvall authored
      Reduced the memory resource request to roughly 1/3 of the allocatable
      memory on nodes in the storage-optimized node pool, in the hope of
      slightly decoupling pod scaling, which is quick, from node pool
      scaling, which is quite slow. At any given time, the node pool is now
      more likely to be overprovisioned, which leaves room for responsive
      HPA scaling.
      
      Removed the memory utilization metric from the HPA because buildkitd
      aggressively uses and holds onto memory. Active TCP connections alone
      seem likely to be a better metric for overall saturation. We'll see.
      
      Added a behavior policy to the HPA to mitigate flapping. Scaling up now
      uses a very small stabilization window to avoid scaling immediately in
      response to connection probes.
      
      Scaling down is much more conservative, using a large stabilization
      window of 10 minutes and a pod reduction policy of 1 per minute to avoid
      scaling down prematurely. The thought here is that CI workflows come in
      the shape of large-ish pipelines, and image builds beget image builds,
      especially when you consider failures and retries.
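
The scale-down behavior described above can be sketched as a small simulation. This is a simplified model, not the HPA controller itself: it assumes one recommendation per minute (the real controller evaluates more often), applies scale-up immediately, and uses hypothetical names throughout. Only the 10 minute window and the 1 pod per minute policy come from the commit message.

```python
from collections import deque

STABILIZATION_MINUTES = 10        # scale-down stabilization window (from the commit message)
MAX_PODS_REMOVED_PER_MINUTE = 1   # pod reduction policy (from the commit message)


def simulate_scale_down(start_replicas, recommendations):
    """Return the replica count after each one-minute tick.

    `recommendations` holds the raw desired replica count for each minute.
    This is an illustrative model with hypothetical names, not controller code.
    """
    window = deque(maxlen=STABILIZATION_MINUTES)
    replicas = start_replicas
    history = []
    for raw in recommendations:
        window.append(raw)
        # Scale-down may only go as low as the highest recent recommendation.
        stabilized = max(window)
        if stabilized < replicas:
            # Remove at most one pod per tick, never dropping below the target.
            replicas = max(stabilized, replicas - MAX_PODS_REMOVED_PER_MINUTE)
        elif raw > replicas:
            # Assumption: scale-up takes effect immediately in this sketch.
            replicas = raw
        history.append(replicas)
    return history
```

With `simulate_scale_down(5, [5] + [1] * 12)`, the count holds at 5 for ten ticks and then drops by one pod per tick, which is the conservative shape the commit is after when a pipeline goes quiet and then retries.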
      
      Bug: T327416
  2. 26 Jan, 2023 2 commits
    • buildkitd: Fix HPA label selectors by splitting envoy cluster_name · 98140488
      Dduvall authored
      Kubernetes did not like the label selector on `cluster_name` that
      contained '|' delimited information.
      
      Configure the "envoy-stats-monitor" `PodMonitor` to relabel scraped
      metrics and split the `cluster_name` label into distinct host/port/name
      labels.
      
      Reconfigure HPA to select on `cluster_name: "inbound"` and
      `cluster_port: "1234"` to only sum active connections to buildkitd.
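
The effect of the relabeling can be illustrated with a short sketch. It assumes the Envoy cluster name has four `|`-delimited fields in the order direction, port, subset, host (for example `inbound|1234||`). Only `cluster_name` and `cluster_port` are named in the commit message; `cluster_subset`, `cluster_host`, and the function names are hypothetical, and the actual `PodMonitor` relabeling rules are not shown here.

```python
import re

# Assumed field order: direction|port|subset|host.
CLUSTER_NAME_PATTERN = re.compile(r"^([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)$")


def split_cluster_name(labels: dict) -> dict:
    """Return a copy of `labels` with `cluster_name` split into separate labels."""
    match = CLUSTER_NAME_PATTERN.match(labels.get("cluster_name", ""))
    if match is None:
        # Leave names without the expected shape untouched.
        return dict(labels)
    name, port, subset, host = match.groups()
    return {
        **labels,
        "cluster_name": name,
        "cluster_port": port,
        "cluster_subset": subset,  # hypothetical label name
        "cluster_host": host,      # hypothetical label name
    }


def selects_buildkitd(labels: dict) -> bool:
    """Mimic the HPA selector: inbound connections on port 1234 only."""
    return (
        labels.get("cluster_name") == "inbound"
        and labels.get("cluster_port") == "1234"
    )
```

After splitting, every selector value is a plain string with no `|` in it, which is what lets the HPA match on the labels.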
      
      Bug: T327416
    • buildkitd: Scale based on active incoming TCP connections · 92330271
      Dduvall authored
      Propagate the standard `app.kubernetes.io/instance` label, which Helm
      sets to the release name, to Envoy metrics.
      
      Configure prometheus-adapter to discover any
      `envoy_cluster_upstream_cx_active` metrics for Helm-deployed services
      and make them available to autoscalers.
      
      Configure the `DestinationRule` for buildkitd to cap active TCP
      connections per pod at 10, and configure the HPA to scale at a much
      lower average of 4. Note that this should mitigate TCP connection
      failures but won't prevent them. If an influx of connections occurs
      and overflows the rule, client connections will be closed prematurely.
      We should investigate whether there is a reliable way to have
      Istio/Envoy reroute these connections, but there doesn't seem to be
      one for opaque TCP connections. If there isn't, we can look into
      implementing preflight checks on the client side for available
      connections with some reasonable backoff.
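
The gap between the HPA target of 4 and the per-pod cap of 10 can be made concrete with a little arithmetic. The sketch below uses the standard average-value calculation (total metric divided by the target, rounded up) and ignores the controller's tolerance band. The function names and the `min_replicas` default are hypothetical.

```python
import math

MAX_CONNECTIONS_PER_POD = 10  # DestinationRule cap (from the commit message)
TARGET_AVG_CONNECTIONS = 4    # HPA target average (from the commit message)


def desired_replicas(total_active_connections: int, min_replicas: int = 1) -> int:
    """Replica count an average-value target would ask for (tolerance ignored)."""
    wanted = math.ceil(total_active_connections / TARGET_AVG_CONNECTIONS)
    return max(min_replicas, wanted)


def spare_capacity(replicas: int, total_active_connections: int) -> int:
    """Connections the current pods could still accept in total before the cap."""
    return replicas * MAX_CONNECTIONS_PER_POD - total_active_connections
```

For instance, 8 active connections call for 2 replicas, which together could accept 12 more connections before the cap is reached. That headroom is only a total: if connections are not spread evenly, a single pod can still hit its cap, and a burst that arrives before new pods are ready can still overflow the rule, as the message notes.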
      
      Note that Helm's chart now specifies `appProtocol: t...
  3. 25 Jan, 2023 2 commits
  4. 24 Jan, 2023 4 commits
  5. 19 Jan, 2023 6 commits
  6. 18 Jan, 2023 5 commits
  7. 13 Jan, 2023 2 commits
  8. 12 Jan, 2023 4 commits
  9. 11 Jan, 2023 3 commits
  10. 16 Dec, 2022 2 commits
  11. 15 Dec, 2022 2 commits
  12. 13 Dec, 2022 3 commits
  13. 11 Dec, 2022 3 commits
  14. 09 Dec, 2022 1 commit