_do_run - retry runtime errors
ToolforgeClient configures max_retries on the requests session
with Retry(total=10, backoff_factor=0.5), which handles transient HTTP
error statuses (413, 429, 503) for most methods (notably not POST).
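For reference, that wiring is roughly equivalent to this minimal
sketch (make_session is an illustrative name, not the actual
ToolforgeClient code):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    def make_session() -> requests.Session:
        session = requests.Session()
        # Even without a status_forcelist, urllib3 retries 413/429/503
        # responses that carry a Retry-After header
        # (Retry.RETRY_AFTER_STATUS_CODES), but only for methods in its
        # default allow-list, which excludes POST.
        retry = Retry(total=10, backoff_factor=0.5)
        adapter = HTTPAdapter(max_retries=retry)
        session.mount("https://", adapter)
        session.mount("http://", adapter)
        return session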
Unfortunately, that retry configuration does not catch many of the transient errors we experience between internal APIs, which get raised to the user along the lines of:
    Long status:
    Got exception: Failed run for component celery-worker: HTTPSConnectionPool(host='api.svc.tools.eqiad1.wikimedia.cloud', port=30003): Read timed out. (read timeout=20)
    Runs:
    celery-worker(failed): HTTPSConnectionPool(host='api.svc.tools.eqiad1.wikimedia.cloud', port=30003): Read timed out. (read timeout=20)
There are two important phases in a deployment (do_deploy):
1. Build target images
2. Update job configs
If requests to the builds-api in step 1 fail, we may be left in an inconsistent state (the target image may have been overwritten by the builds-api while the job was not restarted), but the consequences are not severe.
If requests to the jobs-api in step 2 fail, a number of things can happen (see the sketch after this list):
- If there was only one job (component), there are now no jobs
- If there are multiple components (jobs) and the one that failed is not the first, we are left in an inconsistent state: one job is missing (deleted), and the components (jobs) specified after the failure are not reloaded
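To make those failure modes concrete, the flow is roughly the
following (a hypothetical outline with stub functions; the real
do_deploy differs):

    def build_image(component): ...   # builds-api call (stub)
    def delete_job(component): ...    # jobs-api call (stub)
    def create_job(component): ...    # jobs-api call (stub)

    def do_deploy(components):
        # Phase 1: build target images.
        for component in components:
            build_image(component)
        # Phase 2: update job configs. A read timeout between the
        # delete and the create leaves this job deleted, and any
        # later components are never reloaded.
        for component in components:
            delete_job(component)
            create_job(component)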
For the end user, there is not much to be done except to create a new deployment. This can have additional side effects, such as re-building the images (due to builds-api retention), causing an extended outage.
This change adds a context manager that knows how to retry HTTP-based failures and then "wraps" the important functions with it.
While perhaps not the most ideal method (it has no concept of the business logic), it is a simple change that should improve life for users without redesigning how components are deployed.
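As a rough sketch of what such a retrying context manager could look
like (every name and parameter here is an illustrative assumption, not
the actual patch), one option is a per-attempt context manager driven
by a generator, so transient requests exceptions are swallowed until
the retry budget runs out:

    import time
    import requests

    # Transient failures worth retrying; Timeout covers the
    # "Read timed out" errors quoted above.
    TRANSIENT = (requests.exceptions.ConnectionError,
                 requests.exceptions.Timeout)

    class _Attempt:
        def __init__(self, number, last):
            self.number = number
            self.last = last
            self.failed = False

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc, tb):
            if exc_type is None or not issubclass(exc_type, TRANSIENT):
                return False  # success, or a non-retriable error
            if self.last:
                return False  # out of retries: let it propagate
            self.failed = True
            time.sleep(0.5 * 2 ** (self.number - 1))  # simple backoff
            return True  # suppress the error; the loop retries

    def retrying(attempts=5):
        """Yield per-attempt context managers until one succeeds."""
        for number in range(1, attempts + 1):
            attempt = _Attempt(number, last=(number == attempts))
            yield attempt
            if not attempt.failed:
                return

Wrapping an important function then looks like this (update_job is a
hypothetical jobs-api call):

    for attempt in retrying(attempts=5):
        with attempt:
            update_job(component)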
Bug: T403175