T295360 datapipeline scaffolding
This merge request adds a cookiecutter template to scaffold new data pipelines as described in https://phabricator.wikimedia.org/T295360.
This template provides:
- Integration with our tox config (mypy/flake8/pytest)
- A PySpark job template
- A pytest template for PySpark code
- An Airflow DAG template to help users get started (a minimal sketch of such a DAG follows this list).
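For illustration, the generated Airflow DAG template might look roughly like the sketch below; the `dag_id`, owner, schedule, and Airflow 2 import paths are assumptions, not the template's exact contents:

```python
# Hypothetical sketch of a scaffolded DAG; the generated template may differ.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator  # assumes Airflow 2 import paths

default_args = {
    "owner": "your_team",  # placeholder a cookiecutter variable would fill in
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="your_data_pipeline",
    default_args=default_args,
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Placeholder task; a real pipeline would spark-submit the PySpark job instead.
    run_pipeline = BashOperator(
        task_id="run_pipeline",
        bash_command="echo 'spark-submit your_data_pipeline job here'",
    )
```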
## Structure changes
The project directory largely follows `image-matching`'s structure. Notable changes are:
- Python code has been moved under `pyspark`.
- Python code is pip installable. This lets us package dependencies at build time and eases Spark deployment: we don't need to pass each module individually (e.g. `--files schema.py`), since imports will be resolved from the venv. A minimal packaging sketch follows this list.
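For context, packaging along these lines is roughly what makes the `pyspark` code pip installable; the package name, version, and layout below are assumptions, not the template's actual metadata:

```python
# Hypothetical minimal setup.py; the template's real packaging may differ.
from setuptools import find_packages, setup

setup(
    name="your_data_pipeline",  # assumed placeholder filled in by the template
    version="0.1.0",
    # Assumes the Python modules live under the pyspark/ directory.
    packages=find_packages(where="pyspark"),
    package_dir={"": "pyspark"},
    install_requires=[
        # Runtime dependencies get resolved and packaged at build time.
    ],
)
```

Once the code is installed into a virtual environment, the whole venv can be shipped to the cluster rather than passing modules one by one with `--files`.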
## How to test
Check out the `T295360-datapipeline-scaffolding` branch. A new data pipeline can be created with:

```
make datapipeline
```
This will generate a new directory for pipeline code under `your_data_pipeline`, and install an Airflow DAG template under `dags/your_data_pipeline_dag.py`.
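Since the scaffolding is a cookiecutter template, `make datapipeline` presumably wraps a cookiecutter invocation. A roughly equivalent call through cookiecutter's Python API might look like this (the template path and context variables are assumptions):

```python
# Hypothetical equivalent of `make datapipeline`; template path and variables are assumptions.
from cookiecutter.main import cookiecutter

cookiecutter(
    "datapipeline_template/",  # assumed path to the cookiecutter template in the repo
    no_input=True,             # take defaults instead of prompting interactively
    extra_context={"pipeline_name": "your_data_pipeline"},
)
```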
From the top-level directory, you can now run `make test-dags`. The command will check that `dags/your_data_pipeline_dag.py` is a valid Airflow DAG.
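For reference, the validity check is most likely a DAG import test along these lines; this is a minimal sketch using pytest and Airflow's `DagBag`, and the assertions and `dag_id` are assumptions rather than the repository's actual test:

```python
# Hypothetical DAG import test; the real test-dags implementation may differ.
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    # Parse every file under dags/ without loading Airflow's example DAGs.
    return DagBag(dag_folder="dags", include_examples=False)


def test_dags_import_without_errors(dag_bag):
    # Any syntax error or broken import in a DAG file shows up here.
    assert dag_bag.import_errors == {}


def test_your_data_pipeline_dag_is_loaded(dag_bag):
    # The scaffolded DAG should be discoverable by its dag_id (assumed name).
    assert dag_bag.get_dag("your_data_pipeline") is not None
```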
The output should look like this:
```
make test-dags

---------- coverage: platform linux, python 3.7.11-final-0 -----------
Name                                    Stmts   Miss  Cover
-----------------------------------------------------------
dags/factory/sequence.py                   70      3    96%
dags/ima.py                                49      5    90%
dags/similarusers-train-and-ingest.py      20      0   100%
dags/your_data_pipeline_dag.py             19      0   100%
-----------------------------------------------------------
TOTAL                                     158      8    95%

=========================== 8 passed, 8 warnings in 12.75s ===========================
______________________________________ summary ____________
```