
_do_run - retry runtime errors

ToolforgeClient configures max_retries on the requests session with Retry(total=10, backoff_factor=0.5); this handles transient HTTP error statuses (413, 429, 503) for most methods (notably not POST).
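
For context, this is roughly how such a session is set up with requests/urllib3 (a minimal sketch, not the actual ToolforgeClient code):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()
    # With no status_forcelist, urllib3 only retries the 413/429/503
    # statuses when the response carries a Retry-After header, and only
    # for the default idempotent methods (HEAD, GET, PUT, DELETE,
    # OPTIONS, TRACE) -- notably not POST.
    retries = Retry(total=10, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("https://", adapter)
    session.mount("http://", adapter)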

Unfortunately this does not catch a lot of the transient errors we experience between internal APIs, which get raised to the user along the lines of:

Long status:
  Got exception: Failed run for component celery-worker: HTTPSConnectionPool(host='api.svc.tools.eqiad1.wikimedia.cloud', port=30003): Read timed out. (read timeout=20)

Runs:
  celery-worker(failed): HTTPSConnectionPool(host='api.svc.tools.eqiad1.wikimedia.cloud', port=30003): Read timed out. (read timeout=20)

There are 2 important 'phases' in a deployment (do_deploy):

  1. Build target images
  2. Update job configs

If requests to the builds-api in step 1 fail, this may leave an inconsistent state (the target image may have been overwritten by builds-api while the job was not restarted), but it does not have severe consequences.

If requests to the jobs-api in step 2 fail, a number of things can happen:

  1. If there was only one job (component), there are now no jobs.
  2. If there are multiple components (jobs) and the one that failed is not the first, we are left in an inconsistent state: one job is missing (deleted), and the components (jobs) specified after the failure are not reloaded (see the sketch after this list).
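
As a rough illustration of the second failure mode (the actual do_deploy code differs, and the names here are hypothetical), if each job is updated by deleting and recreating it in sequence, an exception partway through the loop leaves one job deleted and the remaining components untouched:

    def update_job_configs(jobs_api, components):
        # Hypothetical simplification of phase 2 of do_deploy.
        for component in components:
            # Delete-then-recreate: if the create call raises (e.g. a
            # requests ReadTimeout), this job stays deleted and the
            # components after it are never reloaded.
            jobs_api.delete_job(component.name)
            jobs_api.create_job(component.spec)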

For the end user, there is not much that can be done except create a new deployment. This can have additional side effects, such as re-building the images (due to builds-api retention), causing an extended outage.

This change adds a context manager which knows how to retry HTTP-based failures, and then "wraps" the important functions with it.

While perhaps not the most ideal method (it has no concept of the business logic), it is a simple change that should improve the lives of users without re-designing how components are deployed.
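
The implementation itself is not reproduced here; as a minimal sketch of the idea (the function name, retry count, backoff, and exception list are all assumptions, and a plain decorator is shown instead of the context manager used in the change), a wrapper along these lines retries a call on transient requests exceptions:

    import functools
    import time

    import requests

    # Exceptions treated as transient; the real list may differ.
    RETRYABLE_EXCEPTIONS = (
        requests.exceptions.ConnectionError,
        requests.exceptions.Timeout,
    )

    def retry_on_http_errors(attempts=3, backoff=0.5):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(1, attempts + 1):
                    try:
                        return func(*args, **kwargs)
                    except RETRYABLE_EXCEPTIONS:
                        if attempt == attempts:
                            raise
                        # Exponential backoff between attempts.
                        time.sleep(backoff * 2 ** (attempt - 1))
            return wrapper
        return decorator

Functions like _do_run would then be wrapped with it, so a single read timeout no longer aborts the whole deployment.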

Bug: T403175
