Refactor image-matching project structure. (!19) · Merge requests · Gmodena / platform-airflow-dags

Gmodena requested to merge T292748-refactor-ima into multi-project-dags-repo Dec 20, 2021

Merge request for https://phabricator.wikimedia.org/T292748

ima_v2 should match the convention introduced by the cookiecutter template.

Changes to ima.py allow execution in cluster mode. After experimenting with replacing Hive with the spark-sql shell, I rolled back to using Hive for IMAv1: some semantic differences make the final export fail. We should consider dropping that export anyway, since it breaks convention (data should be stored on HDFS/Swift instead). IMAv2 does use spark-sql under the hood.

@clarakosi maybe we could add a "sink to search's HDFS" task, but this could be done as separate work.

GitHub CI:

https://github.com/gmodena/wmf-platform-airflow-dags/actions/runs/1826422560

Integration tests:

IMA v1 successful run: http://localhost:8600/tree?dag_id=image-suggestion-etl-pipeline&root=
IMA v2 successful run: http://localhost:8600/tree?dag_id=image-matching-v2_dag&root= (actual run was triggered on 2022-01-10, but the Airflow scheduler backfilled for 2022-01-09)
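
For anyone puzzled by the date mismatch above: Airflow labels a scheduled run with the start of the interval it covers and only triggers it once that interval has ended. A minimal sketch of that arithmetic, assuming a daily schedule (the schedule of the IMA v2 DAG is not stated here, and `earliest_trigger` is a hypothetical helper, not an Airflow API):

```python
from datetime import datetime, timedelta


def earliest_trigger(logical_date: datetime,
                     interval: timedelta = timedelta(days=1)) -> datetime:
    """Return the earliest time a scheduled run for `logical_date` can start.

    Assumption: a fixed-length schedule interval (daily by default).
    The run is labelled with the start of the interval and is triggered
    only after the interval has fully elapsed.
    """
    return logical_date + interval


# A run labelled 2022-01-09 cannot start before 2022-01-10.
run_label = datetime(2022, 1, 9)
trigger_time = earliest_trigger(run_label)
```

Under that assumption, a run started on 2022-01-10 showing up under 2022-01-09 is expected behaviour.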

To access the Airflow UI you'll need to set up a tunnel with: ssh -t -N -L8600:127.0.0.1:8600 an-airflow1003.eqiad.wmnet
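
If you check these runs often, the tunnel command and the tree-view links can be generated rather than retyped. A rough sketch, using only the host, port, and DAG ids quoted above; `tunnel_command` and `tree_url` are hypothetical helper names and not part of this repo:

```python
from urllib.parse import urlencode


def tunnel_command(host: str, local_port: int = 8600,
                   remote_port: int = 8600) -> list[str]:
    """Build the ssh argument list that forwards the Airflow UI port."""
    return ["ssh", "-t", "-N",
            f"-L{local_port}:127.0.0.1:{remote_port}", host]


def tree_url(dag_id: str, local_port: int = 8600) -> str:
    """Build the tree-view URL for a DAG as seen through the tunnel."""
    query = urlencode({"dag_id": dag_id, "root": ""})
    return f"http://localhost:{local_port}/tree?{query}"


cmd = tunnel_command("an-airflow1003.eqiad.wmnet")
urls = [tree_url(d) for d in ("image-suggestion-etl-pipeline",
                              "image-matching-v2_dag")]
```

The argument list in `cmd` can be passed to `subprocess.Popen` to hold the tunnel open while the links in `urls` are visited.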

Edited Feb 10, 2022 by Gmodena
