
T295360 datapipeline scaffolding

Merged Gmodena requested to merge T295360-datapipeline-scaffolding into multi-project-dags-repo Nov 26, 2021

This merge request adds a cookiecutter template to scaffold new data pipelines as described in https://phabricator.wikimedia.org/T295360.

This template provides:

  • Integration with our tox config (mypy/flake8/pytest)
  • A PySpark job template
  • A pytest template for PySpark code
  • An Airflow DAG template to help users get started
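For context, the tox integration for a generated pipeline could look something like the fragment below. This is an illustrative sketch only; env names, paths, and the exact tool invocations are assumptions, not the template's actual config.

```ini
# Hypothetical tox.ini fragment; env names and paths are illustrative.
[tox]
envlist = lint,mypy,pytest
skipsdist = True

[testenv:lint]
deps = flake8
commands = flake8 pyspark/ dags/

[testenv:mypy]
deps = mypy
commands = mypy pyspark/

[testenv:pytest]
deps = pytest
commands = pytest tests/
```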

Structure changes

The project directory largely follows image-matching's structure. Notable changes are:

  • Python code has been moved under pyspark
  • Python code is pip installable. This lets us package dependencies at build time and eases Spark deployment (e.g. we don't need to pass each module with --files schema.py; imports will be resolved from the venv).
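To illustrate the pip-installable point, packaging typically boils down to a small setup.py along these lines. The package name, version, and dependency list below are placeholders, not what the cookiecutter template actually generates.

```python
# setup.py — illustrative sketch only; name, version, and requirements
# are assumptions, not the template's actual metadata.
from setuptools import find_packages, setup

setup(
    name="your_data_pipeline",
    version="0.1.0",
    package_dir={"": "pyspark"},              # code lives under pyspark/
    packages=find_packages(where="pyspark"),  # discover modules to package
    install_requires=["pyspark"],             # deps resolved at build time
)
```

With the package installed into a virtualenv shipped alongside the job, Spark resolves imports from the venv instead of requiring a per-file --files argument for every module.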

How to test

Check out the T295360-datapipeline-scaffolding branch. A new data pipeline can be created with:

make datapipeline                                                                                                       

This will generate a new directory for pipeline code under:

your_data_pipeline                                                                                                      

It will also install an Airflow DAG template under:

dags/your_data_pipeline_dag.py                                                                                          

From the top-level directory, you can now run make test-dags. The command will check that dags/your_data_pipeline_dag.py is a valid Airflow DAG. The output should look like this:

make test-dags

---------- coverage: platform linux, python 3.7.11-final-0 -----------
Name                                    Stmts   Miss  Cover
-----------------------------------------------------------
dags/factory/sequence.py                   70      3    96%
dags/ima.py                                49      5    90%
dags/similarusers-train-and-ingest.py      20      0   100%
dags/your_data_pipeline_dag.py             19      0   100%
-----------------------------------------------------------
TOTAL                                     158      8    95%

=========================== 8 passed, 8 warnings in 12.75s ===========================
______________________________________ summary ____________
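The validity check performed by make test-dags can be sketched roughly as follows. This is a simplified stand-in using plain importlib (the real suite presumably loads DAG files via Airflow's DagBag and asserts there are no import errors); dag_imports_cleanly is a hypothetical helper, not part of this repo.

```python
# Simplified sketch of a DAG import check; the actual test-dags target
# presumably relies on Airflow's DagBag. dag_imports_cleanly is hypothetical.
import importlib.util
from pathlib import Path


def dag_imports_cleanly(dag_path: str) -> bool:
    """Return True if the DAG module imports without raising."""
    path = Path(dag_path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)  # syntax and import errors surface here
    except Exception:
        return False
    return True
```

A DAG file that fails to import (bad syntax, missing dependency) would fail this check before ever reaching the Airflow scheduler.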
Edited Dec 16, 2021 by Gmodena