
Increase default timeout of sensor for mediawiki_wikitext_history dag.

Mforns requested to merge fix-mediawiki-wikitext-history into main

Below is the troubleshooting email that motivated this MR:

The sensor for mediawiki_wikitext_history has timed out.
It was waiting for the pages_meta_history_xml_dump success file in /wmf/data/raw/mediawiki/dumps/pages_meta_history/20230401/_SUCCESS

However, the data for the page history dumps has not been successfully imported (yet).
Although the number of wiki directories of the dump for 2023-04 matches other months,
the overall size of the data is much smaller, see the sizes for the last couple months:

January: 5368 GB
February: 5437 GB
March: 5498 GB
April: 3797 GB

As I understand it, this is a historical dump, so its size should grow every month.
The fact that April is still at 3797 GB therefore means the import of the dump is incomplete.
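
The size comparison above can be sketched as a quick check. This is only an illustration: the helper name `looks_incomplete` is hypothetical, and the numbers are the ones listed in this email, not read from HDFS.

```python
# Monthly dump sizes in GB, copied from the list above.
sizes_gb = {"January": 5368, "February": 5437, "March": 5498, "April": 3797}


def looks_incomplete(sizes, month, previous_month):
    """Return True if a month's dump is smaller than the previous month's.

    Assumption: a historical dump only grows, so a smaller size
    suggests the import has not finished yet.
    """
    return sizes[month] < sizes[previous_month]


# Fraction of March's size that April has reached so far.
april_vs_march = sizes_gb["April"] / sizes_gb["March"]

print(looks_incomplete(sizes_gb, "April", "March"))
print(f"April is at {april_vs_march:.0%} of March's size")
```

With these numbers, April is at roughly 69% of March's size, which is consistent with a partially imported dump.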

I checked the systemd timer that is running the imports and its status is active.
The last successful run was today, a couple hours ago.
Looking at the script (refinery/bin/import-mediawiki-dumps) it seems it runs daily and adds any available data to HDFS.
So the data is added incrementally over a set of daily runs of the script.
Moreover, the _SUCCESS files of the previous months were added around the 16th, 17th or 18th of the month.
I also just found a comment on the old Oozie coordinator that says:
            <!--
                Use action actual time as SLA base, since it's the time used
                to compute timeout
                Job is waiting for the month data to be present which happen
                roughly the 17th of the month due to big wikis dump generation.
                Waiting for 24 days after new-month start should be enough.
            -->

This makes me think that there has been no issue so far.
And since this is the first month that this job runs in Airflow,
we must have misconfigured the sensor to fail before it should.
The sensor should be able to wait for 24 days (same as SLA) at least.
But it is configured to time out after 7 days. This must be the default, since we don't set that value anywhere.
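
As a rough sketch of the mismatch, the two timeouts can be compared in seconds, which is the unit Airflow sensors use for their `timeout` argument. The constant names, the `covers` helper and the `sensor_kwargs` dict are hypothetical and only illustrate the arithmetic. The 17-day figure is taken from the Oozie comment above, and it is assumed here that the sensor starts waiting at the beginning of the month.

```python
from datetime import timedelta

# Current behaviour: the sensor gives up after 7 days.
DEFAULT_SENSOR_TIMEOUT = timedelta(days=7)
# Wait time used by the old Oozie coordinator.
REQUIRED_SENSOR_TIMEOUT = timedelta(days=24)
# The _SUCCESS file usually lands around the 17th of the month.
TYPICAL_SUCCESS_DELAY = timedelta(days=17)


def covers(timeout, delay):
    """Return True if the sensor would still be waiting when the data lands."""
    return timeout >= delay


# Hypothetical keyword arguments for the sensor, with the timeout in seconds.
sensor_kwargs = {"timeout": REQUIRED_SENSOR_TIMEOUT.total_seconds()}

print(covers(DEFAULT_SENSOR_TIMEOUT, TYPICAL_SUCCESS_DELAY))
print(covers(REQUIRED_SENSOR_TIMEOUT, TYPICAL_SUCCESS_DELAY))
print(sensor_kwargs["timeout"])
```

Under these assumptions a 7-day timeout expires well before the data usually arrives, while a 24-day timeout (2073600 seconds) leaves about a week of margin.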
