WD sitelink pops dag + add published datasets to output targets

Contributor checklist

  • I have written tests for this DAG that will be merged into data-engineering/airflow-dags/tests/wmde
  • I have locally run the above tests as outlined in the tests section of the Airflow DAGs project readme
  • I have tested the jobs for this DAG in my local database using the process defined in wmde/analytics/hql/airflow-jobs/wd_item_sitelink_segments/_test_weekly
  • I have tested the included DAGs in my local database using the process outlined in TEST_AIRFLOW_DAGS.md and the test variable files provided for each DAG
  • All Hive tables and HDFS directories that are needed by the included DAG jobs have been created
    • Hive
      • wmde.wd_item_sitelink_segments_weekly
    • HDFS
      • /wmf/tmp/analytics/wmde/wd_item_sitelink_segments_weekly
      • /wmf/tmp/analytics/wmde/wd_rest_api_metrics_monthly
      • /wmf/data/published/datasets/wmde/analytics/wd_item_sitelink_segments_weekly
      • /wmf/data/published/datasets/wmde/analytics/wd_rest_api_metrics_monthly

Description

This MR covers two separate tasks. One adds the published datasets directory as a target of an existing DAG (T361203); it is included here because the new DAG (T362849) requires the same target.

For T362849, the DAG that computes segments of Wikidata items based on their connection to sitelinks was completed, along with its jobs in wmde/analytics/hql/airflow-jobs/wd_item_sitelink_segments. This DAG also targets the published datasets directory.

For T361203, the published datasets directory was added as a target of the Wikidata REST API metrics DAG that was introduced in MR#631.

Test outputs

I ran the following query on my local database:

SELECT
    *

FROM
    andrewtavis_wmde.wd_item_sitelink_segments_weekly
;

with the results being:

week        sitelink_items  sitelink_item_targets  all_other_items
2024-05-20  32289723        3026363                73269083

The tests for both DAGs check that the paths are correct and that each DAG has three tasks.
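
As a rough illustration of what these checks amount to, here is a minimal sketch in plain Python. It is not the test code from this MR: DagSummary and check_dag are hypothetical helpers, the task names are placeholders, and the actual tests would inspect the DAG objects loaded by Airflow rather than a hand-built summary. Only the directory roots and dataset name are taken from the checklist above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Directory roots copied from the checklist in this MR.
TMP_ROOT = "/wmf/tmp/analytics/wmde"
PUBLISHED_ROOT = "/wmf/data/published/datasets/wmde/analytics"


@dataclass(frozen=True)
class DagSummary:
    """Hypothetical stand-in for the DAG attributes a test would inspect."""

    dataset: str
    task_ids: Tuple[str, ...]
    output_paths: Tuple[str, ...]


def check_dag(summary: DagSummary, expected_task_count: int = 3) -> List[str]:
    """Return a list of problems; an empty list means the DAG passes."""
    problems = []
    if len(summary.task_ids) != expected_task_count:
        problems.append(
            f"expected {expected_task_count} tasks, found {len(summary.task_ids)}"
        )
    # Each DAG should write to both the temporary and the published directory.
    for root in (TMP_ROOT, PUBLISHED_ROOT):
        expected_path = f"{root}/{summary.dataset}"
        if expected_path not in summary.output_paths:
            problems.append(f"missing output path {expected_path}")
    return problems


# Task names below are placeholders, not the names used in the real DAG.
weekly = DagSummary(
    dataset="wd_item_sitelink_segments_weekly",
    task_ids=("task_1", "task_2", "task_3"),
    output_paths=(
        f"{TMP_ROOT}/wd_item_sitelink_segments_weekly",
        f"{PUBLISHED_ROOT}/wd_item_sitelink_segments_weekly",
    ),
)
assert check_dag(weekly) == []
```

Returning a list of problems instead of failing on the first one means a single run reports both a wrong task count and a missing target directory.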

I'm unsure how to test targeting the published datasets directory locally without running the DAG, but I would be happy to run it and send a link to the results!

Related task(s)

Edited by Andrew McAllister (WMDE)
