T295360 datapipeline scaffolding

This merge request adds a cookiecutter template to scaffold new data pipelines as described in

This template provides

  • Integration with our tox config (mypy/flake8/pytest)
  • A PySpark job template
  • A pytest template for pyspark code
  • An Airflow DAG template to help users get started.
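To illustrate the split between a PySpark job and its pytest coverage, here is a minimal sketch of how such a job module could be laid out. It is not the generated template itself: the names `parse_args`, `transform` and `main`, the `--input`/`--output` flags, and the Parquet input are all assumptions made for illustration.

```python
"""Hypothetical job module; names and flags are illustrative only."""
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the job's command-line arguments (flag names are assumed)."""
    parser = argparse.ArgumentParser(description="Example PySpark job")
    parser.add_argument("--input", required=True, help="Path to the input dataset")
    parser.add_argument("--output", required=True, help="Path to write results to")
    return parser.parse_args(argv)


def transform(df):
    """Pure DataFrame-to-DataFrame step, kept separate so a test can call it."""
    return df.dropDuplicates()


def main(argv: Optional[List[str]] = None) -> None:
    """Wire arguments, Spark session, and the transformation together."""
    # Imported lazily so parse_args() and transform() can be imported
    # without a Spark installation.
    from pyspark.sql import SparkSession

    args = parse_args(argv)
    spark = SparkSession.builder.appName("example-job").getOrCreate()
    try:
        df = spark.read.parquet(args.input)
        transform(df).write.mode("overwrite").parquet(args.output)
    finally:
        spark.stop()
```

Keeping `transform` free of session setup means a pytest test can build a small DataFrame from a local Spark session, call `transform` on it, and assert on the result without touching real input paths.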

Structure changes

The project directory largely follows image-matching's structure. Notable changes are:

  • Python code has been moved under pyspark
  • Python code is pip-installable. This lets us package dependencies at build time and eases Spark deployment (e.g. we don't need to pass each module with options like --files; imports are resolved from the venv).

How to test

Check out the T295360-datapipeline-scaffolding branch and run

A new data pipeline can be created with:

make datapipeline

This will generate a new directory for pipeline code under:


And install an Airflow dag template under


From the top-level directory, you can now run make test-dags. The command checks that dags/ is a valid Airflow DAG. The output should look like this:

make test-dags

---------- coverage: platform linux, python 3.7.11-final-0 -----------
Name                 Stmts   Miss  Cover
dags/factory/           70      3    96%
dags/                   49      5    90%
dags/                   20      0   100%
dags/                   19      0   100%
TOTAL                  158      8    95%

=========================== 8 passed, 8 warnings in 12.75s ===========================
______________________________________ summary ____________
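For readers wondering what a "valid DAG" check can look like, the sketch below shows one common approach: load the DAG folder with Airflow's `DagBag` and fail on any import error. This is an assumption about the kind of check involved, not a copy of what `make test-dags` runs; the helper names `format_import_errors` and `assert_dags_import_cleanly` and the default `dags` folder are hypothetical.

```python
from typing import Dict


def format_import_errors(import_errors: Dict[str, str]) -> str:
    """Render DagBag import errors as one 'file: error' line per entry."""
    return "\n".join(
        f"{path}: {error}" for path, error in sorted(import_errors.items())
    )


def assert_dags_import_cleanly(dag_folder: str = "dags") -> None:
    """Fail if any file under dag_folder cannot be loaded as an Airflow DAG.

    The folder name is an assumption; adjust it to the project layout.
    """
    # Imported lazily so this module can be imported without Airflow installed.
    from airflow.models import DagBag

    dag_bag = DagBag(dag_folder=dag_folder, include_examples=False)
    # import_errors maps each failing file path to its error message.
    assert not dag_bag.import_errors, format_import_errors(dag_bag.import_errors)
    assert dag_bag.dags, f"no DAGs found under {dag_folder}"
```

A pytest test can simply call `assert_dags_import_cleanly()`; when a DAG file fails to import, the assertion message lists each offending file with its error.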
