Commit d2a6f0e9 authored by Gmodena

Merge branch 'archive-repo' into 'multi-project-dags-repo'

Deprecate and archive repo

See merge request !33
parents 1057b2e3 25480522
Pipeline #1904 failed with stages in 0 seconds
[![Project Status: Concept – Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept.](](
# platform-airflow-dags
This repo contains data pipelines operationalised by the Generated Data Platform team.
You can reach out to us at
* TODO: Add wikitech url
* TODO: Add irc channel
* Slack: [#data-platform-value-stream](
# Requirements
Tools provided by this repository require [Docker](
# Data pipelines
> […] a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. […]
A Generated Datasets Platform pipeline is made up of two components:
1. Project-specific tasks and data transformations that operate on inputs (sources) and produce outputs (sinks). We depend on Apache Spark for elastic compute.
2. An [Airflow DAG](, a thin orchestration layer that composes and executes tasks.
Data pipelines are executed on Hadoop. Elastic compute is provided by Spark (jobs are deployed in cluster mode). Scheduling and orchestration are delegated to Apache Airflow. Currently we support Python-based projects. Scala support is planned.
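The "connected in series" idea from the definition above can be sketched in plain Python. This is only an illustration and not code from this repository: `run_pipeline`, `drop_empty_titles` and `add_title_length` are hypothetical names, and plain lists of dicts stand in for the Spark datasets a real project would use.

```python
from functools import reduce
from typing import Callable, Iterable, List

# A pipeline element takes a list of records and returns a list of records.
Step = Callable[[List[dict]], List[dict]]


def run_pipeline(source: List[dict], steps: Iterable[Step]) -> List[dict]:
    """Apply each step in order; the output of one step feeds the next."""
    return reduce(lambda data, step: step(data), steps, source)


# Hypothetical transformations, for illustration only.
def drop_empty_titles(rows: List[dict]) -> List[dict]:
    return [row for row in rows if row.get("title")]


def add_title_length(rows: List[dict]) -> List[dict]:
    return [{**row, "title_length": len(row["title"])} for row in rows]


if __name__ == "__main__":
    source = [{"title": "Airflow"}, {"title": ""}, {"title": "Spark"}]
    sink = run_pipeline(source, [drop_empty_titles, add_title_length])
    print(sink)
```

In this sketch the list of steps plays the role of the orchestration layer: it only decides the order of execution, while the transformations themselves hold the project logic.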
This repo used to contain experiments and spike work for data pipeline tooling provided by the [Generated Data
team. This software is currently not maintained, and archived for historical reasons.
## Create a new data pipeline
The project has been rebranded to better capture its new scope, and has moved to
Clone this repo and create a dev branch with:
## Migrate to the new repo
If you cloned or forked this repo you'll need to update its `origin`.
cd platform-airflow-dags
git checkout -b your_data_pipeline_branchname
git remote set-url origin
A new data pipeline can be created with:
You can then rebase on the new origin with:
make datapipeline
git fetch origin
This will generate a new directory for pipeline code under:
And install an Airflow dag template under
git checkout main
git rebase origin/main
## Repo layout
This repository follows a [monorepo]( strategy. Its structure matches the layout of `AIRFLOW_HOME` on the [an-airflow1003.eqiad.wmnet]( Airflow instance.
* `dags` contains [Airflow DAGs]( for all projects. Each DAG schedules a data pipeline. No business logic is contained in the DAG.
* `tests/` contains the `dags` validation test suite. Project-specific tests are implemented under `<project-name>`.
* `<project-name>` directories contain tasks and data transformations. For an example, see `image-matching`.
## Deployment
DAGs are currently deployed and scheduled on [an-airflow1003.eqiad.wmnet]( This service has no SLO and is meant for development and experimentation use.
The following command will run code checks and deploy data pipelines:
make deploy-local-build
### Deploy a new pipeline
Deployment pipelines are declared in the `TARGET` variable in `Makefile`.
To deploy a new pipeline, append its project directory name to `TARGET`.
For example, if a new pipeline has been created as `my_new_datapipeline`, the new
`TARGET` list would look like the following:
TARGET := "image-matching my_new_datapipeline"
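If you prefer to script this edit, the following is a minimal sketch. It assumes the `Makefile` declares `TARGET` on a single line, quoted and space-separated exactly as in the example above. `append_target` is a hypothetical helper and not part of this repository's tooling.

```python
import re

# Matches a single-line assignment such as: TARGET := "image-matching"
TARGET_LINE = re.compile(r'^(TARGET\s*:=\s*")([^"]*)(")[ \t]*$', re.MULTILINE)


def append_target(makefile_text: str, project: str) -> str:
    """Return makefile_text with `project` appended to TARGET, if missing."""
    match = TARGET_LINE.search(makefile_text)
    if match is None:
        raise ValueError("no TARGET assignment found")
    projects = match.group(2).split()
    if project in projects:
        return makefile_text  # already listed; nothing to do
    projects.append(project)
    joined = " ".join(projects)
    new_line = match.group(1) + joined + match.group(3)
    # Splice the rewritten line back in, leaving the rest of the file untouched.
    return makefile_text[: match.start()] + new_line + makefile_text[match.end():]
```

The function works on the file contents as a string, so reading and writing `Makefile` is left to the caller. Calling it twice with the same project name leaves the text unchanged.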
# CI & code checks
We favour test-driven development with `pytest`, lint with `flake8` and type check with `mypy`. We encourage, but do not yet enforce, the use of `isort` and `black` for formatting code. We log errors and information messages with the Python `logging` library.
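As a rough illustration of these conventions, here is a small typed function that logs through the standard `logging` module, with a `pytest`-style test next to it. The logger name `my_new_datapipeline` and the function `parse_count` are hypothetical and are not taken from any project in this repository.

```python
import logging

# Hypothetical project name, used only to label log records.
logger = logging.getLogger("my_new_datapipeline")


def parse_count(raw: str) -> int:
    """Parse an integer count from a string, logging the outcome."""
    try:
        value = int(raw)
    except ValueError:
        # Log the failure, then let the caller decide how to handle it.
        logger.error("could not parse count from %r", raw)
        raise
    logger.info("parsed count %d", value)
    return value


def test_parse_count() -> None:
    # pytest collects functions whose names start with `test_`.
    assert parse_count("42") == 42
```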
## Code checks
We enforce code checks at DAG and project level.
### DAG validation
DAG validation tests live under the top-level `tests` directory. They can be triggered with
`make test_dags`.
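The validation suite itself is not reproduced here. As a simplified stand-in, the sketch below assumes that a useful first check is that every Python file under `dags` parses without syntax errors. It uses only the standard library and does not import Airflow, so it is weaker than a real DAG import test. `find_syntax_errors` is a hypothetical helper.

```python
import ast
from pathlib import Path
from typing import Dict


def find_syntax_errors(dags_dir: str) -> Dict[str, str]:
    """Map each unparsable .py file under dags_dir to its error message."""
    errors: Dict[str, str] = {}
    for path in sorted(Path(dags_dir).rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors


def test_dags_parse() -> None:
    # pytest-style test; "dags" is the directory named in the repo layout.
    assert find_syntax_errors("dags") == {}
```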
### Project checks
The following commands can be executed at top level (they'll be invoked for all projects),
or inside a single project directory (they'll be triggered for that project only):
* `make lint` triggers project linting.
* `make mypy` triggers type checking.
* `make test` triggers unit/integration tests.
All targets are configured with [tox](
By default, code checks are executed inside a Docker container that provides a [Conda
Python]( distribution. They can be run "natively" by passing `SKIP_DOCKER=true`. For example:
make test SKIP_DOCKER=true
## CI
This project does not currently have GitLab runners available. As an interim solution,
we mirror to GitHub and run CI atop a `build` Action. `build` is triggered on every push to any branch.