Refactor image-matching project structure.

Gmodena requested to merge T292748-refactor-ima into multi-project-dags-repo

Merge request for https://phabricator.wikimedia.org/T292748

ima_v2 should match the convention introduced by the cookiecutter template.

The changes to ima.py allow execution in cluster mode. After some experimentation with replacing Hive with the spark-sql shell, I rolled back to using Hive for IMAv1: some semantic differences between the two make the final export fail. We should consider dropping it anyway, since it breaks convention (data should be stored on HDFS/Swift instead). IMAv2 does use spark-sql under the hood.

@clarakosi maybe we could add a "sink to search's HDFS" task, but this could be done as separate work.

Github CI:

Integration tests:

To access the Airflow UI, you'll need to set up a tunnel with: ssh -t -N -L8600:127.0.0.1:8600 an-airflow1003.eqiad.wmnet
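For reference, here is a minimal sketch of the tunnel command above with its flags spelled out. The variable names and the `tunnel_cmd` helper are hypothetical and only there for illustration; the host and port are the ones given in this description.

```shell
# Host and port as given in the merge request description.
AIRFLOW_HOST="an-airflow1003.eqiad.wmnet"
AIRFLOW_PORT=8600

# Hypothetical helper that prints the tunnel command.
#   -t  force pseudo-terminal allocation
#   -N  do not execute a remote command (port forwarding only)
#   -L  forward the local port to 127.0.0.1:<port> as seen from the remote host
tunnel_cmd() {
  printf 'ssh -t -N -L%s:127.0.0.1:%s %s\n' \
    "$AIRFLOW_PORT" "$AIRFLOW_PORT" "$AIRFLOW_HOST"
}

# Print the command for inspection; copy and run it to open the tunnel.
tunnel_cmd
```

While the tunnel is open, the UI should be reachable from a local browser on port 8600, presumably at http://localhost:8600 (this assumes the webserver speaks plain HTTP on that port).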

Edited by Gmodena