Make pod and pvc watcher threads resilient to errors (!26) · Merge requests · repos / releng / k8s-pvc-cleaner

Ahmon Dancy requested to merge main-I0fa97c028ea4b2797b3b6e26dadc8eb841cb391a into main Feb 27, 2024

Sometimes the Kubernetes server will drop a connection associated with
an open watch (probably after some timeout period). This was causing
pvc-cleaner to terminate regularly, preventing it from properly
tracking idle PVCs.

This commit makes the following changes in an attempt to deal with this:

- Set _request_timeout=60 in the Watch.stream() call. If no event
  is received within that timeout period, the watch will raise a
  timeout exception. We catch this exception and restart the watch
  when it happens. This results in a fresh query to the k8s API
  server on a regular basis, reducing the likelihood of server-side
  idle timeouts.
- Also catch urllib3.exceptions.ProtocolError and treat it the same
  way as the timeout (restart the watch). This is the exception that
  was seen in production.
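
The restart-on-error behavior described above might look roughly like the sketch below. It is not the code from this merge request: the helper name watch_forever, its parameters, and the short back-off are assumptions made for illustration. The exception types to restart on are passed in by the caller so that the loop itself depends only on the standard library.

```python
import logging
import time

log = logging.getLogger(__name__)


def watch_forever(open_stream, handle_event, restart_on,
                  should_stop=lambda: False, backoff_seconds=1.0):
    """Consume events from a watch, reopening it whenever it fails.

    open_stream:   callable returning a fresh iterable of events
    handle_event:  callable invoked once per event
    restart_on:    tuple of exception types that trigger a restart
    should_stop:   hypothetical hook so callers (and tests) can end the loop
    """
    while not should_stop():
        try:
            # Each pass issues a fresh watch request to the API server.
            for event in open_stream():
                handle_event(event)
        except restart_on as exc:
            # Timeouts and dropped connections are expected; log and retry.
            log.info("watch interrupted (%r), restarting", exc)
            time.sleep(backoff_seconds)


# Hypothetical wiring with the kubernetes Python client (not executed here).
# ReadTimeoutError is assumed to be the timeout exception in question, and
# handle_pod_event is a placeholder for the caller's own handler:
#
#   import urllib3
#   from kubernetes import client, watch
#
#   v1 = client.CoreV1Api()
#   watch_forever(
#       lambda: watch.Watch().stream(v1.list_pod_for_all_namespaces,
#                                    _request_timeout=60),
#       handle_pod_event,
#       restart_on=(urllib3.exceptions.ReadTimeoutError,
#                   urllib3.exceptions.ProtocolError),
#   )
```

Taking the stream factory and the event handler as arguments is one way the code shared by watch_pvcs and watch_pods could be factored out, since only those two pieces differ between the watchers.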

Also, factor out code in common between watch_pvcs and watch_pods.

Bug: T351478

Edited Feb 27, 2024 by Ahmon Dancy
