test_k8s/dumpsv1: tolerate completed pods with the OOMKilled status

We've observed pods in the following status:

mediawiki-afwiki-sql-xml-dump-metahistorybz2dump-v7fv07c   2/3     OOMKilled   0          4h36m

This happened after a memory-hungry PHP process was OOM-killed by the kernel (and restarted by the dump worker). While the overall dump job completed, the `OOMKilled` status prevented Airflow from considering the pod as completed.

We add a bit of custom logic to tolerate this state. We also increase the limits/requests ratio, and re-wire LARGE_WIKI_POD_RESOURCES and DEFAULT_POD_RESOURCES to the actual container resource requests, a link that had somehow disappeared.
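As an illustration of what "tolerating" this state could look like, here is a minimal sketch. It is not the code from this merge request: the function name `is_tolerable_completion` and the container name `base` are hypothetical, and the check assumes the dump container ends with exit code 0 while Kubernetes reports the termination reason `OOMKilled`. The input mirrors the `containerStatuses[].state.terminated` fields (`exitCode`, `reason`) of the Kubernetes pod status, passed as plain dictionaries.

```python
from typing import Any, Mapping, Sequence

# Termination reasons accepted as a successful end of the job (assumption for
# this sketch: "OOMKilled" is acceptable only together with exit code 0).
TOLERATED_REASONS = {"Completed", "OOMKilled"}


def is_tolerable_completion(
    container_statuses: Sequence[Mapping[str, Any]],
    container_name: str = "base",  # hypothetical name of the dump container
) -> bool:
    """Return True if the named container terminated with exit code 0 and a
    tolerated reason, even when that reason is OOMKilled."""
    for status in container_statuses:
        if status.get("name") != container_name:
            continue
        terminated = (status.get("state") or {}).get("terminated")
        if terminated is None:
            # The container is still running or waiting: not completed yet.
            return False
        return (
            terminated.get("exitCode") == 0
            and terminated.get("reason") in TOLERATED_REASONS
        )
    # No container with that name was found in the pod status.
    return False


# Example input: the dump container finished successfully, but a child
# process was OOM-killed along the way.
example_statuses = [
    {"name": "base", "state": {"terminated": {"exitCode": 0, "reason": "OOMKilled"}}},
    {"name": "sidecar", "state": {"running": {}}},
]
print(is_tolerable_completion(example_statuses))
```

A container that was OOM-killed and exited with a non-zero code is still treated as failed, so genuine out-of-memory failures are not masked.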

Signed-off-by: Balthazar Rouberol <brouberol@wikimedia.org>
Bug: T391510

Edited by Brouberol
