Make pod and pvc watcher threads resilient to errors

Ahmon Dancy requested to merge main-I0fa97c028ea4b2797b3b6e26dadc8eb841cb391a into main

Sometimes the Kubernetes server will drop a connection associated with
an open watch (probably after some timeout period). This was causing
pvc-cleaner to terminate regularly, preventing it from properly
tracking idle PVCs.

This commit makes the following changes in an attempt to deal with this:

  • Set _request_timeout=60 in the Watch.stream() call. If no event
    is received within that timeout period, the watch will raise a
    timeout exception. We catch this exception and restart the watch
    when it happens. This results in a fresh query to the k8s api
    server on a regular basis, reducing the likelihood of server-side
    idle timeouts.

  • Also catch urllib3.exceptions.ProtocolError and treat it the same
    way as the timeout (restart the watch). This is the exception that
    was seen in production.

Also, factor out the code shared by watch_pvcs and watch_pods.
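
The restart behaviour described above can be sketched roughly as follows.
This is a minimal illustration, not the code from this merge request: the
helper name run_watch, its parameters, and the event handlers in the
comments are hypothetical. The kubernetes wiring is shown only in
comments so the block runs on the standard library alone. It also assumes
the timeout surfaces as urllib3.exceptions.ReadTimeoutError; the
description above does not name the timeout exception.

```python
import time
from typing import Any, Callable, Iterable, Optional, Tuple, Type


def run_watch(
    open_stream: Callable[[], Iterable[Any]],
    handle_event: Callable[[Any], None],
    restart_on: Tuple[Type[BaseException], ...],
    max_restarts: Optional[int] = None,
    pause_seconds: float = 0.0,
) -> int:
    """Consume events from open_stream(), reopening it after expected errors.

    Returns the number of restarts performed. With max_restarts=None the
    loop never returns, which is what a long-running watcher thread wants.
    Exceptions not listed in restart_on propagate to the caller.
    """
    restarts = 0
    while True:
        try:
            for event in open_stream():
                handle_event(event)
        except restart_on:
            # Expected failure (idle timeout or dropped connection):
            # fall through and open a fresh stream.
            pass
        if max_restarts is not None and restarts >= max_restarts:
            return restarts
        restarts += 1
        if pause_seconds:
            time.sleep(pause_seconds)


# Hypothetical wiring with the kubernetes client (assumed names, not taken
# from this merge request):
#
#   from kubernetes import client, watch
#   import urllib3
#
#   v1 = client.CoreV1Api()
#   errors = (urllib3.exceptions.ReadTimeoutError,   # assumed timeout type
#             urllib3.exceptions.ProtocolError)
#
#   def watch_pods():
#       w = watch.Watch()
#       run_watch(lambda: w.stream(v1.list_pod_for_all_namespaces,
#                                  _request_timeout=60),
#                 handle_pod_event, errors)
#
#   def watch_pvcs():
#       w = watch.Watch()
#       run_watch(
#           lambda: w.stream(
#               v1.list_persistent_volume_claim_for_all_namespaces,
#               _request_timeout=60),
#           handle_pvc_event, errors)
```

Passing the stream factory and the exception tuple as arguments lets both
watchers share one loop, and makes the restart logic easy to exercise
with a fake stream that raises on demand.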

Bug: T351478

Edited by Ahmon Dancy
