- 27 Jan, 2023 1 commit
-
-
Dduvall authored
Reduced memory resource request to ~ 1/3rd of the allocatable memory on nodes in the storage-optimized node pool with the hopes to decouple slightly pod scaling which is quick from node pool scaling which is quite slow. At any given time, it's more likely the node pool will be overprovisioned to allow for a degree of responsive HPA scaling. Removed the memory utilization metric from the HPA due to buildkitd aggressively using and holding onto memory. TCP connections alone seems like it will be a better metric for overall saturation. We'll see. Added behavior policy to the HPA to mitigate flapping. Scaling up now uses a very small stabilization window to reduce immediately scaling due to connection probes. Scaling down is much more conservative, using a large stabilization window of 10 minutes and a pod reduction policy of 1 per minute to avoid scaling down prematurely. The thought here is that CI workflows come in the shape of large-ish pipelines, and image builds beget image builds, especially when you consider failures and retries. Bug: T327416
-
- 26 Jan, 2023 2 commits
-
-
Dduvall authored
Kubernetes did not like the label selector on `cluster_name`, which contained '|'-delimited information. Configure the "envoy-stats-monitor" `PodMonitor` to relabel scraped metrics, splitting the `cluster_name` label into distinct host/port/name labels. Reconfigure the HPA to select on `cluster_name: "inbound"` and `cluster_port: "1234"` so it only sums active connections to buildkitd.

Bug: T327416
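One plausible shape for that relabeling, assuming Envoy's `direction|port|subset|host` cluster naming; the exact regex and resulting label names may differ from what the chart actually uses:

```yaml
podMetricsEndpoints:
  - path: /stats/prometheus
    port: http-envoy-prom
    metricRelabelings:
      # Pull the port out of e.g. "inbound|1234||" into its own label.
      - sourceLabels: [cluster_name]
        regex: '([^|]+)\|(\d+)\|[^|]*\|.*'
        action: replace
        targetLabel: cluster_port
        replacement: '$2'
      # (a similar rule can populate a cluster_host label from the last segment)
      # Finally reduce cluster_name itself to the direction, e.g. "inbound".
      - sourceLabels: [cluster_name]
        regex: '([^|]+)\|(\d+)\|[^|]*\|.*'
        action: replace
        targetLabel: cluster_name
        replacement: '$1'
```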
-
Dduvall authored
Propagate the standard `app.kubernetes.io/instance` label, which Helm sets to the release name, to Envoy metrics. Configure prometheus-adapter to discover any `envoy_cluster_upstream_cx_active` metrics for Helm-deployed services and make them available to autoscalers. Configure the `DestinationRule` for buildkitd to cap active TCP connections per pod at 10, and configure the HPA to scale at a much lower average of 4.

Note this should mitigate TCP connection failures, but it won't prevent them. If an influx of connections overflows the rule, client connections will be closed prematurely. We should investigate whether there is a reliable way to have Istio/Envoy reroute these connections, but it doesn't seem there is for opaque TCP connections. If there isn't a way, we can look into implementing preflight checks on the client side for available connections, with some reasonable backoff. Note that the Helm chart now specifies `appProtocol: t...
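A sketch of the two pieces described above, with assumed resource and host names; the connection cap lives on the Istio `DestinationRule`, while the HPA targets the adapter-exposed Envoy metric:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: buildkitd                                    # hypothetical name
spec:
  host: buildkitd.gitlab-runner.svc.cluster.local    # hypothetical service host
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 10                           # hard cap on active TCP connections
---
# Corresponding HPA metric block (assuming prometheus-adapter exposes the
# Envoy gauge as a per-pod metric).
metrics:
  - type: Pods
    pods:
      metric:
        name: envoy_cluster_upstream_cx_active
      target:
        type: AverageValue
        averageValue: "4"                            # scale well below the 10-connection cap
```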
-
- 25 Jan, 2023 2 commits
- 24 Jan, 2023 4 commits
-
-
Dduvall authored
Install a `PodMonitor` to monitor all Istio/Envoy proxies and gather their metrics into the Prometheus instance deployed by kube-prometheus-stack. Note this is necessary since we're using kube-prometheus-stack rather than the Prometheus provided by Istio. Limit concurrent buildkitd connections to 4 for now using a `DestinationRule`. We'll get the connection saturation metric working and suss out a better limit later.

Bug: T327416
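Roughly the shape of such a `PodMonitor`, based on Istio's documented integration with the Prometheus operator; the label selector and scrape interval here are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: envoy-stats-monitor
  labels:
    release: kube-prometheus-stack   # must match the Prometheus podMonitorSelector
spec:
  namespaceSelector:
    any: true
  selector:
    matchExpressions:
      - key: istio-prometheus-ignore
        operator: DoesNotExist
  podMetricsEndpoints:
    - path: /stats/prometheus
      port: http-envoy-prom          # Envoy's merged-metrics port (15090) on the sidecar
      interval: 15s
```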
-
Jeena Huneidi authored
-
Dduvall authored
Bug: T327416
-
Dduvall authored
-
- 19 Jan, 2023 6 commits
-
-
Dduvall authored
Bug: T327416
-
Dduvall authored
-
Dduvall authored
Since we're relying on auto-upgrades of Kubernetes, we should not specify an exact version but instead use the latest patch version of a given major/minor version prefix. Otherwise, every auto-upgrade will be followed by a downgrade on the next deployment. See https://registry.terraform.io/providers/digitalocean/digitalocean/latest/docs/resources/kubernetes_cluster#auto-upgrade-example
-
Dduvall authored
Istio should give us greater control and observability for services running in the cluster, and it supports gRPC services. We're hoping this will give us connection-level metrics and traffic management for buildkitd, which we can use to scale and to limit max connection concurrency.

Bug: T327416
-
Jeena Huneidi authored
max 10 buildkitd replicas
-
Jeena Huneidi authored
-
- 18 Jan, 2023 5 commits
-
-
Jeena Huneidi authored
-
Ahmon Dancy authored
Avoid a division by zero error when computing the gitlab_runner_jobs_builds_concurrent_saturation metric.
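One way to guard a ratio metric like this against a zero denominator; the metric names and recording-rule form below are assumptions, not necessarily what the repo uses:

```yaml
groups:
  - name: gitlab-runner-saturation
    rules:
      - record: gitlab_runner_jobs_builds_concurrent_saturation
        # clamp_min() keeps the denominator at >= 1, so an idle runner (or a
        # zero concurrent limit) can no longer cause a division by zero.
        expr: |
          sum(gitlab_runner_jobs)
            /
          clamp_min(sum(gitlab_runner_concurrent), 1)
```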
-
Ahmon Dancy authored
This reverts commit efb264c7
-
Add descriptions for do_token, access_id, and secret_key and mark do_token and secret_key as sensitive.
-
Jelto authored
Bug: T326815
-
- 13 Jan, 2023 2 commits
-
-
Ahmon Dancy authored
-
Ahmon Dancy authored
If a prior helm install/upgrade is interrupted for a variety of reasons (control-C'd, some kind of communication problem with the Kubernetes API server, whatever), it is very likely that the next attempt to perform a helm install/upgrade will result in the following error:

Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

This situation can be resolved by rolling back the prior failed operation. The simplest way to do this is with:

helm rollback <release>
-
- 12 Jan, 2023 4 commits
-
-
Ahmon Dancy authored
from 3500m.
-
Ahmon Dancy authored
Set buildkitd requests/limits to 3500m cpu and 25Gi memory. Nodes in the storage-optimized pool currently have 3900m cpu and 29222Mi memory allocatable, so this configuration should ensure that each buildkitd pod gets its own node. Set minReplicas to 3 to help quickly absorb a surge of jobs.
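In chart-values terms this is roughly the following; the key names depend on the buildkitd chart and are assumed here, while the figures are the ones stated above:

```yaml
resources:
  requests:
    cpu: 3500m
    memory: 25Gi
  limits:
    cpu: 3500m
    memory: 25Gi
minReplicas: 3   # absorb a surge of jobs without waiting on node provisioning
```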
-
- 11 Jan, 2023 3 commits
-
-
Ahmon Dancy authored
1.24.4-do.0 is no longer available. Advance to 1.24.8-do.0.
-
According to Terraform documentation, it is bad practice to set kubernetes provider (or any provider?) attributes using resource attributes. See https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs#stacking-with-managed-kubernetes-cluster-resources and https://itnext.io/terraform-dont-use-kubernetes-provider-with-your-cluster-resource-d8ec5319d14a

Doing so can result in strange errors like the following:

Error: Get "http://localhost/api/v1/namespaces/gitlab-runner": dial tcp [::1]:80: connect: connection refused
  with kubernetes_namespace.gitlab_runner,
  on cluster.tf line 100, in resource "kubernetes_namespace" "gitlab_runner":
  100: resource "kubernetes_namespace" "gitlab_runner" {

Using a data source to retrieve k8s cluster attributes from DO solves this problem. See https://registry.terraform.io/providers/digitalocean/digitalocean/latest/docs/data-sources/kubernetes_cluster
-
Dduvall authored
Testing out use of a simple upper concurrency value to autoscale GitLab runners.
-
- 16 Dec, 2022 2 commits
-
- 15 Dec, 2022 2 commits
-
-
Jelto authored
The tag for Trusted Runners was changed to trusted instead of protected, which should be more self-explanatory. To execute jobs on Trusted Runners, the new tag is now required.

Bug: T325069
-
Dduvall authored
Per the recent refactoring of kokkuri. See kokkuri@48b42680
-
- 13 Dec, 2022 3 commits
-
-
Dduvall authored
Moved `image` up as its former position truncated `runners`
-
This change explicitly pins the gitlab-runner version tag in gitlab-runners/values.yaml. This value can be used to perform version upgrades and is needed so SRE can bump the version for security upgrades. See also https://gitlab.com/gitlab-org/charts/gitlab-runner/blob/main/values.yaml#L14
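A sketch of the pinned value; depending on the chart version this is either a single image string or a registry/image/tag map, and the tag shown here is purely illustrative:

```yaml
# gitlab-runners/values.yaml
image: gitlab/gitlab-runner:alpine-v15.7.1   # hypothetical pinned tag; bump deliberately for security upgrades
```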
-
Ahmon Dancy authored
Use smaller, less generous cpu/mem requests/limits for executor, helper, and service pods. The new figures should allow us to pack more pods onto nodes. This is particularly useful for kokkuri-based workflows, where the heavy lifting is done by the buildkit pods, which have more generous requests/limits.
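These knobs live in the Kubernetes executor section of the runner config embedded in values.yaml; a sketch with placeholder figures, not the actual values from this change:

```yaml
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        # Build (executor) containers
        cpu_request = "500m"
        memory_request = "512Mi"
        # Helper containers
        helper_cpu_request = "100m"
        helper_memory_request = "128Mi"
        # Service containers
        service_cpu_request = "100m"
        service_memory_request = "128Mi"
```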
-
- 11 Dec, 2022 3 commits
-
-
Dduvall authored
-
Dduvall authored
Per the production registry's config, let's disable `client_max_body_size` to allow large image layer pushes.
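Assuming the registry is fronted by ingress-nginx, the equivalent knob is the proxy-body-size annotation, where "0" disables nginx's `client_max_body_size` check entirely; a sketch, since the actual chart wiring may differ:

```yaml
ingress:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"   # no limit on uploaded image layer size
```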
-
Dduvall authored
The old variables should be removed once the kokkuri refactor has merged. See kokkuri!20
-
- 09 Dec, 2022 1 commit
-
-
Dduvall authored
-