WD sitelink segments DAG + add published datasets to output targets
Contributor checklist
- I have written tests for this DAG that will be merged into data-engineering/airflow-dags/tests/wmde
- I have run the above tests locally as outlined in the tests section of the Airflow DAGs project readme
- I have tested the jobs for this DAG in my local database using the process defined in wmde/analytics/hql/airflow-jobs/wd_item_sitelink_segments/_test_weekly
- I have tested the included DAGs in my local database using the process outlined in TEST_AIRFLOW_DAGS.md and the test variable files provided for each DAG
  - All jobs have passed and we do have CSVs for each of the processes in the /tmp directory
  - I've checked the exported CSVs, and their values are consistent with expectations
- All Hive tables and HDFS directories that are needed by the included DAG jobs have been created
  - Hive
    - wmde.wd_item_sitelink_segments_weekly
  - HDFS
    - /wmf/tmp/analytics/wmde/wd_item_sitelink_segments_weekly
    - /wmf/tmp/analytics/wmde/wd_rest_api_metrics_monthly
    - /wmf/data/published/datasets/wmde/analytics/wd_item_sitelink_segments_weekly
    - /wmf/data/published/datasets/wmde/analytics/wd_rest_api_metrics_monthly
Description
This MR covers two separate tasks. One adds the published datasets directory as an output target of an existing DAG (T361203); it is included here alongside the work for a new DAG that also requires this target (T362849).
For T362849, the DAG for computing segments of Wikidata items based on their connection to sitelinks was completed, along with the jobs for this DAG in wmde/analytics/hql/airflow-jobs/wd_item_sitelink_segments. This work also includes targeting the published datasets directory.
For T361203 the task was to add the published datasets directory as a target of the Wikidata REST API metrics DAG that was introduced in MR#631.
Test outputs
I ran the following query on my local database:
    SELECT
        *
    FROM
        andrewtavis_wmde.wd_item_sitelink_segments_weekly
    ;
with the results being:
| week | sitelink_items | sitelink_item_targets | all_other_items |
|---|---|---|---|
| 2024-05-20 | 32289723 | 3026363 | 73269083 |
The tests for both DAGs check that the paths are correct and that each DAG has three tasks.
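For reviewers, the checks described above have roughly the following shape. This is a minimal, self-contained sketch and not the actual test code: `StubTask`, `StubDag`, `find_problems`, the `output_path` field, and the task ids are all hypothetical stand-ins so the example runs without Airflow. The two paths are the ones listed in the checklist; the DAG id is assumed to match the table name.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Paths taken from the contributor checklist in this MR.
TMP_DIR = "/wmf/tmp/analytics/wmde/wd_item_sitelink_segments_weekly"
PUBLISHED_DIR = (
    "/wmf/data/published/datasets/wmde/analytics/wd_item_sitelink_segments_weekly"
)


@dataclass
class StubTask:
    """Hypothetical stand-in for an Airflow task; not a real Airflow class."""

    task_id: str
    output_path: Optional[str] = None


@dataclass
class StubDag:
    """Hypothetical stand-in for an Airflow DAG, exposing only dag_id and tasks."""

    dag_id: str
    tasks: List[StubTask] = field(default_factory=list)


def find_problems(dag, expected_task_count, expected_paths):
    """Return a list of human-readable problems; an empty list means the DAG passes."""
    problems = []
    if len(dag.tasks) != expected_task_count:
        problems.append(
            f"expected {expected_task_count} tasks, found {len(dag.tasks)}"
        )
    # Collect every path that some task writes to, ignoring tasks without one.
    actual_paths = {task.output_path for task in dag.tasks if task.output_path}
    for path in expected_paths:
        if path not in actual_paths:
            problems.append(f"no task writes to {path}")
    return problems


# The DAG id and task ids below are invented for illustration only.
example_dag = StubDag(
    dag_id="wd_item_sitelink_segments_weekly",
    tasks=[
        StubTask("compute_segments"),
        StubTask("export_csv", output_path=TMP_DIR),
        StubTask("publish_csv", output_path=PUBLISHED_DIR),
    ],
)

assert find_problems(example_dag, 3, [TMP_DIR, PUBLISHED_DIR]) == []
```

Returning a list of problems rather than failing on the first one means a single run reports both a wrong task count and any missing target path.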
I'm unsure how to test targeting the published datasets directory locally without running the DAG, but would be happy to do this and send a link to the results!