
run IMA data pipeline on an-airflow.

Gmodena requested to merge update-ima-project-wip into multi-project-dags-repo

This MR is a PoC of porting an existing Airflow DAG, and its dependency management, to the Platform Engineering an-airflow dev instance (which provides a Scheduler and a LocalExecutor). The goals of this PoC are:

  1. Identify a possible DAG repo structure, and define software dependency boundaries.
  2. Port an existing DAG and data pipeline (image-matching) to our dev environment.
  3. Build atop Gitlab CI capabilities to improve iteration speed.

These changes are about project layout and development workflow. The MR does not aim to:

  • Replace production deployment practices and tooling.
  • Refactor the current IMA DAG. This will be carried out in a separate task.

Repo structure

This repo structure matches the layout of AIRFLOW_HOME on an-airflow1003.eqiad.wmnet.

  • ${AIRFLOW_HOME}/dags contains DAGs for all projects. Each DAG schedules a data pipeline. No business logic is contained in the DAGs.
  • ${AIRFLOW_HOME}/<project-name> contains the data pipeline's codebase and business logic, configuration files, and resources.

In this MR <project-name> is image-matching.
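
For illustration, the resulting tree looks roughly like the sketch below. Only the paths mentioned above come from this MR; the DAG file name and the conf/ directory are hypothetical.

```
.
├── dags/
│   └── ima.py                 # schedules the image-matching pipeline (file name is illustrative)
├── image-matching/
│   ├── Makefile               # venv/test targets
│   ├── requirements.txt       # third-party and upstream Python dependencies
│   ├── conf/                  # pipeline configuration and resources (illustrative)
│   └── venv.tar.gz            # conda-packed runtime, assembled by make venv
└── .gitlab-ci.yml
```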

Dependencies and Python runtime

Only the DAG and the data pipeline business logic code are kept in this repo. We treat everything else as an external dependency.

ImageMatching algorithm

What follows is a PoC and request for comments on finding common standards for interacting with, and distributing, code developed by partner teams at WMF.

We depend on a Jupyter notebook developed by the Research team. This code lives outside of the image-matching DAG, and we treat it like an upstream dependency. This dependency is declared in image-matching/requirements.txt and installed in the project conda environment. Effectively, we treat "upstream" code as just another Python dependency, like pandas, numpy, and sklearn.
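
As a sketch of what this looks like in practice, a requirements file along these lines would pull in both third-party libraries and the upstream algorithm. The distribution name and version pin below are assumptions, not the actual contents of image-matching/requirements.txt:

```
# image-matching/requirements.txt (illustrative)
pandas
numpy
scikit-learn
# Upstream image-matching algorithm, published as a wheel by the Research team
# (distribution name and version are hypothetical)
imagematching==0.1.0
```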

A companion PoC MR (https://gitlab.wikimedia.org/gmodena/ImageMatching/-/merge_requests/32) contains the upstream code refactored and packaged as a wheel.

Conda environment

For Python and PySpark based projects, a conda virtual env is provided at ${AIRFLOW_HOME}/image-matching/venv.tar.gz. This is the runtime for all Python binaries and PySpark jobs triggered by the DAG.
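
As a minimal sketch of how a DAG could consume this runtime, the snippet below uses the standard conda-pack/spark-submit pattern. The operator choice, paths, and entry point are assumptions for illustration, not the actual IMA DAG:

```python
# dags/ima_sketch.py -- illustrative only, not the actual image-matching DAG.
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # airflow.operators.bash_operator on Airflow 1.10

AIRFLOW_HOME = os.environ.get("AIRFLOW_HOME", ".")
VENV = f"{AIRFLOW_HOME}/image-matching/venv.tar.gz"

with DAG(
    dag_id="image_matching_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    # Ship the conda-packed env alongside the job and point PySpark at the
    # interpreter it unpacks; this is the usual conda-pack + --archives pattern.
    run_pipeline = BashOperator(
        task_id="run_image_matching",
        bash_command=(
            f"PYSPARK_PYTHON=./venv/bin/python spark-submit "
            f"--master yarn --archives {VENV}#venv "
            f"{AIRFLOW_HOME}/image-matching/pipeline.py"  # entry point name is hypothetical
        ),
    )
```

The point of this pattern is that the worker's system Python is never used: each task runs against the vendored environment shipped with the artifact.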

The conda env is assembled by make targets (venv, test) in image-matching/Makefile. Targets can be triggered by CI (see .gitlab-ci.yml) or from any host. conda envs are assembled in an x86_64-linux Docker container. This allows "cross-compilation" of the venv on non-Linux hosts (e.g. a developer's macOS laptop). Native conda can be used instead by setting the SKIP_DOCKER=true flag.

For example, make venv SKIP_DOCKER=true will build a venv.tar.gz conda-packed environment using the local conda install instead of a container. This is useful when we want to build a venv (either for vendoring a runtime or for running tests) on a GitLab runner (which is itself a Docker container), or when the build host does not provide Docker.

CI

Every successful build generates a platform-airflow-dags.tar.gz artifact. This archive packages the repo itself and includes the assembled virtual environments.

Integration commands are delegated to top-level make targets (e.g. make test). A full CI pipeline can be triggered by invoking make deploy-local-build. This will run tests, assemble image-matching/venv.tar.gz, assemble platform-airflow-dags.tar.gz, and "deploy" the build artifact to the worker's AIRFLOW_HOME.

Gitlab CI

CI is triggered on every push. The current CI pipeline is intentionally minimal: it delegates to the top-level make targets described above.

Shipping code to the Airflow scheduler/executor

We don't have an automated way to deploy code from GitLab. Instead, the following workflow needs to happen every time we want to ship DAGs to the Airflow worker.

  • Copy DAGs, venvs, and project files to the worker.
  • Impersonate the analytics-platform_eng user on the Airflow worker and move the files to AIRFLOW_HOME.

As a shortcut, top-level make deploy and deploy-local-build targets are provided. Both automate the steps above and are meant to be run from a developer's laptop. Ideally we could replace this ad-hoc logic with scap or some other battle-tested deployment tool/host.

