Refactor image-matching project structure.
Merge request for https://phabricator.wikimedia.org/T292748
ima_v2 should match the convention introduced by the cookiecutter template.
Changes introduced to ima.py are to allow execution in cluster mode. After some experiments with replacing hive with the spark-sql shell, I rolled back to using Hive for IMAv1. Some semantic differences make the final export fail. We should consider dropping it anyway, since it breaks convention (data should be stored on HDFS/Swift instead).
IMAv2 does use spark-sql under the hood.
@clarakosi maybe we could add a "sink to search's HDFS" task, but this could be done as separate work.
Github CI:
Integration tests:
- IMA v1 successful run: http://localhost:8600/tree?dag_id=image-suggestion-etl-pipeline&root=
- IMA v2 successful run: http://localhost:8600/tree?dag_id=image-matching-v2_dag&root= (the actual run was triggered on 2022-01-10, but the Airflow scheduler backfilled for 2022-01-09)
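
For reviewers wondering why the IMA v2 run is dated 2022-01-09: a minimal sketch, assuming the DAG runs on a daily schedule (the schedule is not stated in this description). Airflow runs a DAG once its interval has fully elapsed, so a run started on 2022-01-10 covers the previous day. The function name and the 09:30 trigger time are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta
from typing import Tuple


def latest_completed_daily_interval(now: datetime) -> Tuple[datetime, datetime]:
    """Return (start, end) of the most recent daily interval that has fully elapsed.

    Assumption: a midnight-to-midnight daily schedule.
    """
    # The interval ends at the most recent midnight before `now`...
    end = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # ...and starts one day earlier.
    start = end - timedelta(days=1)
    return start, end


# Hypothetical trigger time on 2022-01-10.
start, end = latest_completed_daily_interval(datetime(2022, 1, 10, 9, 30))
print(start.date(), end.date())  # 2022-01-09 2022-01-10
```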
To access the Airflow UI you'll need to set up a tunnel with:
ssh -t -N -L8600:127.0.0.1:8600 an-airflow1003.eqiad.wmnet
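
Before opening the links above, it can help to confirm that the tunnel is actually listening locally. The following is a small sketch using only the Python standard library. The helper name `tunnel_is_up` is hypothetical, and it only checks that something accepts TCP connections on 127.0.0.1:8600, not that the service behind it is Airflow.

```python
import socket


def tunnel_is_up(host: str = "127.0.0.1", port: int = 8600, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is listening on the given address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if tunnel_is_up():
        print("Tunnel is up: open http://localhost:8600 in a browser.")
    else:
        print("Nothing is listening on 127.0.0.1:8600; start the ssh tunnel first.")
```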